[00:10:31] AaronSchulz: yt? [00:16:00] springle: yup [00:16:08] I already figured out the issue from earlier [00:16:13] those unknown db errors [00:16:14] oh [00:16:16] :) [00:16:20] cool, jobrunners? [00:17:58] springle: https://gerrit.wikimedia.org/r/#/c/169964/ [00:18:30] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [00:18:30] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [00:19:00] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [00:19:31] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 47, cluster_name: production-logstash-eqiad, relocating_shards: 1, active_shards: 94, initializing_shards: 0, number_of_data_nodes: 3 [00:20:19] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 42, down: 0, shutdown: 0 [00:20:30] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 47: active_shards: 94: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [00:22:05] AaronSchulz: excellent :) [00:30:58] hi, https://office.wikimedia.org/wiki/Data_access says "Request access to stat1 from ops if you're working with EventLogging data." I used to have access to stat1 when I worked on E3, now I think I need access to stat1001 for Flow EL data [00:31:14] spagewmf: Access requests need an RT ticket :) [00:32:11] Reedy thx. Is RT superseded by Phabricator yet? [00:32:17] nope [00:38:27] I try to create an RT ticket, get "You have no permission to create tickets in that queue. [00:38:30] No details " [00:40:12] Requests: ops-requests@rt.wikimedia.org [00:42:18] spagewmf: just use ops-request queue, doesn't matter via web ui or mail [00:43:47] mutante: Tickets > New ticket goes to https://rt.wikimedia.org/SelfService/Create.html?Queue=1 , no choice of a queue. (I'm logged in to RT) [00:44:41] spagewmf: the problem is the "SelfService" part in the URL, https://rt.wikimedia.org/Ticket/Create.html?Queue=15 [00:45:13] that probably means you are logged in as an email address [00:45:26] (03PS8) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [00:46:12] mutante: yes I logged in as spage@wikimedia.org with a password from 2012. I didn't choose that queue, I just followed a link. [00:46:38] spagewmf: that's not the right user. which link did you follow [00:46:58] your user should just be a nickname, not an email address [00:47:11] otherwise this is your autocreated user without permissions [00:47:21] that you got from mailing RT in the past [00:48:21] mutante: that sounds right, I think it let me view tickets that "the machinery" created for me.
I'll e-mail ops-requests, thanks [00:49:27] "the machinery" = email, yea [00:51:04] spagewmf: this is why "Autocreated when added as a watcher" [00:51:20] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 327 seconds [00:51:21] spagewmf: you probably never had an actual RT user [00:51:56] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 353 seconds [00:52:09] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: puppet fail [00:52:16] if it weren't for phabricator, i would say to request one [00:52:37] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:06] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [01:11:10] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [01:25:23] (03PS1) 10Chmarkine: lists - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169978 [01:28:21] (03PS2) 10Chmarkine: lists - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169978 (https://bugzilla.wikimedia.org/72072) [01:33:32] (03PS1) 10Tim Landscheidt: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/169981 [01:37:20] (03PS1) 10Ori.livneh: Periodically restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/169982 [01:44:16] (03Abandoned) 10Chmarkine: lists - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169978 (https://bugzilla.wikimedia.org/72072) (owner: 10Chmarkine) [02:21:56] (03CR) 10Tim Starling: [C: 031] Periodically restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/169982 (owner: 10Ori.livneh) [02:22:35] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%): [02:38:53] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 333 MB (3% inode=73%): [03:07:02] (03CR) 10Ori.livneh: [C: 032] Periodically restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/169982 (owner: 10Ori.livneh) [03:48:12] periodically restart it? [03:48:34] what about in-flight requests? [03:55:27] TimStarling: ^ [03:55:57] does it not have a graceful restart mechanism? [03:56:07] I don't know [03:56:23] we plan on restarting it after every scap, so it will need one even without that cron job [03:56:34] I doubt that this exists now [03:57:45] doesn't look like it [03:58:47] # When `hhvm.server.graceful_shutdown_wait` is set to a positive [03:58:48] # integer, HHVM will perform a graceful shutdown on SIGHUP. [03:58:48] kill signal HUP [03:59:09] /etc/hhvm/fcgi.ini:hhvm.server.graceful_shutdown_wait [04:01:16] 5 seconds doesn't seem all that long [04:03:10] also, there is a window between shutdown & start during which apache would throw a 50x (503 most likely) [04:03:35] because apache is in the middle, pybal won't have its socket closed and so it will have to resort to polling to depool [04:03:38] true [04:04:09] finally, doesn't this mean the JIT would have to warm up again? [04:04:22] yes [04:04:26] that's the point of restarting it after scap [04:04:40] otherwise the JIT "cache" grows forever [04:05:25] paravoid: are you saying that it's desirable for pybal to depool the server during this time? [04:05:57] I'm still trying to figure out the effects of this tbh [04:06:26] so I'm not saying anything yet [04:06:43] [fluorine:/a/mw-log] $ grep -c 'Segmentation fault' apache2.log [04:06:43] 1999 [04:06:56] (that's zend) [04:07:10] and?
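For reference, a minimal sketch of how the graceful-restart pieces quoted above fit together. The fcgi.ini path and the upstart "kill signal HUP" stanza come from the log; the restart behaviour is inferred from those two facts, not from a documented procedure:

    # "kill signal HUP" in the upstart job means a plain restart delivers
    # SIGHUP, which triggers HHVM's graceful shutdown: it drains in-flight
    # FastCGI requests for up to the configured wait before exiting.
    grep graceful_shutdown_wait /etc/hhvm/fcgi.ini   # 5 (seconds) at this point in the log
    sudo service hhvm restart

The 503 window discussed above is the gap between that shutdown completing and the new process accepting connections again.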
[04:07:32] so it's not going to be the end of the world to serve a few 503s when a server restarts, though it's obviously something we should fix [04:07:52] yes, 503s happen right now [04:08:05] they're indicative of a problem though, they're not just there by design [04:08:18] there's also [04:08:33] also I'm pretty sure there are connections that live for > 5s [04:08:40] so at least that part is wrong [04:08:42] ideally it would depool in advance, right? [04:08:52] before it stops accepting connections [04:08:57] yes [04:09:09] but for that, pybal needs an API [04:09:21] in our traditional stack [04:09:30] pybal keeps a long-running TCP connection open [04:09:59] when you kill apache, the connection tears down gracefully and pybal immediately detects that and depools instantly [04:10:16] instantly as in milliseconds [04:10:21] well yes :) [04:10:44] we could apache2ctl graceful-stop ; restart hhvm ; start apache [04:11:17] I've said in the past that we (HHVM team) should just write pybal code [04:11:49] yeah, i'm up for that [04:11:53] e.g. pybal could have a TCP server and we could send a depool message to it [04:12:18] we've had several ideas on how to add some level of instrumentation to pybal [04:12:29] they've all been on the back burner though [04:12:32] etcd was popular, iirc [04:12:47] yeah, that was my idea from a while back [04:13:11] i don't love that it uses http long polling to watch keys [04:14:04] it would be quite nice to have an init script that waits for pybal to acknowledge that it has depooled before shutting down the server [04:14:25] yes, we could do that [04:15:02] and then you'd still have to wait for all inflight requests to be over before actually killing the server [04:15:08] I'd suggest a much higher timeout than 5s [04:15:27] sure, make it 30 [04:15:40] fortunately, most downloads/large files (e.g. a video) are being served from swift [04:16:03] (typically, although MW stays in the path for private wikis IIRC) [04:17:38] /etc/init/hhvm-graceful.conf: task; start on stopping hhvm RESULT=ok [04:17:45] http://manpages.ubuntu.com/manpages/utopic/en/man7/stopping.7.html [04:18:41] varnish hopefully buffers the result from MediaWiki [04:18:55] how do you mean? [04:19:36] well, squid certainly used to download from MediaWiki as fast as MediaWiki could supply data, even for private requests [04:19:58] then it would close the MW connection and continue to send data to the end user [04:20:28] I thought varnish did the same [04:20:36] it does, yes [04:20:44] unless streaming is enabled [04:20:49] which for MW is not [04:21:06] or return (pipe) [04:21:18] yes, but not in this case [04:21:24] so you won't have connections open for 30s due to a user streaming a meeting video from office.wikimedia.org over dialup [04:21:51] nod [04:23:20] hm, varnish probably does keep the connection open and pipelines though [04:23:47] one connection != one request [04:25:40] TimStarling: why did we set retry=0 for apache's mod_proxy again? [04:25:49] that's correct [04:26:00] well, for the general case, not sure about HHVM [04:26:02] otherwise if you restart hhvm then it will serve 503s for 30 seconds [04:26:06] yeah, that [04:26:33] there should probably not be keepalive between varnish and MW [04:27:15] why are we going through all this trouble again? [04:27:18] are we giving up on leaks?
:) [04:27:29] see the bug [04:27:33] we fixed one leak, it was fun, took a week or so [04:27:48] heh, ori said "it was not fun" [04:27:53] different perspectives ;) [04:28:01] fun for some [04:28:12] well, i will know what i'm doing the next time around [04:28:20] fixing more leaks would also be fun but would take more weeks, and we probably aren't leaking much memory anymore [04:28:27] there was a lot of hysterical running around for me this time [04:28:59] so not only should the app server depool itself before hhvm is stopped, it should also wait for the warmup requests regiment to complete before repooling itself [04:29:01] that would be nice [04:29:23] yes that's why I asked early in the conversation about warmup [04:30:35] it'll be fine [04:30:45] while we're still in this deployment phase, maybe we should let it run for a while and see if there are actually any leaks? [04:30:56] meanwhile in two hours, plus or minus 15 minutes, mod_passenger on palladium will crash and this channel will get a screenful of puppet failure alerts [04:31:25] I don't see how this is relevant [04:31:39] unless you want to point out we're incompetent or something :) [04:31:59] no, but we tend to gravitate to interesting problems rather than urgent ones [04:32:25] we could leave it and see if there are more leaks [04:32:36] if that was the case we would have fixed palladium [04:32:44] but if the answer is yes then the solution is probably going to be to restart hhvm periodically ;) [04:32:48] i think we're still leaking [04:32:56] a lot less than before, but still [04:33:53] I don't find the palladium issues more urgent than what happens in our production stack affecting real users fwiw [04:34:13] well, see the 2k segfaults from zend then [04:34:41] again, I don't see the point of this [04:35:05] ori: I don't think that is a productive line of argument [04:35:42] if you were proposing to introduce a zend change that was known to segfault and throw 503s periodically, I don't think we'd be discussing this at all [04:35:43] all right. i didn't mean to be jerky. [04:36:19] i don't think we should live with periodic restarts as the status quo [04:36:47] but even with a restart every 24-48hrs, that's a big enough window for diffing heap profiles and getting down to the bottom of leaks [04:36:59] I'm not necessarily saying we shouldn't restart [04:37:02] paravoid: you know for many years we had a daily restart for apache because of zend memory leaks [04:37:18] at one time it was even hourly IIRC [04:37:29] I just want to understand the implications of that and figure out how to make our stack deal with that [04:37:33] I didn't know that [04:37:43] was it then that mark added the idle tcp connection check to pybal? :) [04:38:22] the idle connection check was a great idea, I don't think it was motivated by any particular problem [04:38:31] :) [04:38:34] tcp canary [04:39:27] or https://en.wikipedia.org/wiki/Dead_Hand_(nuclear_war) [04:39:44] I think it was introduced not long after pybal itself, so mark was still thinking generally about ways to improve it [04:40:04] before pybal we had a sad little php script written by me, and it wasn't very easy to do event-driven things in it [04:40:35] pybal needs some love [04:40:52] it's amazing how little engineering time we're giving it compared to how much we're relying on it [04:40:54] i'm looking at the code, it's pretty nice actually [04:41:09] paravoid: what does it need?
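What such a depool-aware restart could look like, per the init-script idea above. This is purely hypothetical shell: pybal had no depool API at the time (that is the whole point of the discussion), so pybal-depool/pybal-pool are placeholder commands and the warmup URL is an assumption:

    host=$(hostname -f)
    pybal-depool "$host"     # hypothetical: tell the LB to stop sending traffic
    sleep 30                 # let in-flight requests drain (the 30s suggested above)
    sudo service hhvm restart
    for i in $(seq 5); do    # crude JIT warmup before taking traffic again
        curl -s -o /dev/null http://localhost/wiki/Main_Page
    done
    pybal-pool "$host"       # hypothetical: repool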
[04:41:17] other than what was already identified above [04:41:32] dunno, we've had various ideas over the years, I don't think they were documented anywhere [04:41:51] so I guess the answer is... "a bug tracker"? [04:42:38] but anyway, the implications of this change are not /just/ related to pybal [04:43:29] e.g. depending on how large that 503-window is, we may also start getting icinga alerts [04:43:47] how long does it take for HHVM to boot up anyway? [04:43:52] it's very short [04:44:17] if it's short then we can just set the retry count appropriately [04:45:01] a single appserver going down is not a thing that needs urgent attention, it's only really urgent if the service address goes down [04:45:22] true [04:45:41] for icinga, that's the difference between "alert on IRC" and page [04:46:12] so a single appserver wouldn't page us but it would print spam here [04:46:22] only CRITICALs get printed here though [04:46:25] i don't think so [04:46:39] icinga has a multiple attempts mechanism as well [04:46:39] we have icinga alerts configured for it and normal service restarts haven't triggered them [04:46:55] you can say "if the check fails 3 consecutive times, then alert" [04:47:00] usually we do that [04:47:48] !log enabled heap profiling on mw1189 [04:47:55] Logged the message, Master [04:48:43] so aiui from this conversation, hhvm doesn't take much time to boot (so the stop/start 503 window is small), but it takes a longer while to be good performance-wise because of the JIT warmup [04:49:15] so that "being slow window" is longer and ideally our loadbalancing should also account for it and not send requests that way [04:49:53] it's still shorter than i expected, i'd say ~5 seconds though i haven't measured it precisely [04:50:44] but yeah, having the loadbalancer give the machine a break would be nice [04:50:59] it shouldn't be that fast [04:51:22] and it would come with other benefits, like the ability to do staggered deployments that abort mid-way and roll back if some error threshold is crossed [04:51:30] it can't be warming up the JIT in 5s [04:52:04] the apache graphs in ganglia might tell us [04:55:14] i can't find a clear indication [04:56:21] well, we can tail -f api.log | grep mw114 and look at the response time as we restart it [04:56:53] * ori tries [05:21:38] (03PS8) 10Faidon Liambotis: mail: remove secondary MX role from sodium [puppet] - 10https://gerrit.wikimedia.org/r/143887 [05:21:50] (03CR) 10Faidon Liambotis: [C: 032] mail: remove secondary MX role from sodium [puppet] - 10https://gerrit.wikimedia.org/r/143887 (owner: 10Faidon Liambotis) [05:27:52] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:41] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:44] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66275 bytes in 0.176 second response time [05:29:02] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:32:15] <_joe_> wat? [05:32:22] PROBLEM - Exim SMTP on sodium is CRITICAL: Connection refused [05:32:44] grumble [05:33:22] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.015 sec. response time [05:34:43] i didn't touch mw1023 or mw1027 [05:34:53] was that the auto-restart?
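The measurement ori proposes above, roughly as it would be run on the log host (the /a/mw-log path appears earlier in the discussion; which api.log field carries the response time is not shown, so this just isolates the restarted server's lines):

    tail -f /a/mw-log/api.log | grep mw1114
    # in another terminal: restart hhvm on mw1114, then watch how long
    # response times stay elevated while the JIT re-warms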
[05:34:54] <_joe_> mw1027 is down in ganglia [05:34:59] <_joe_> but the machine is ok [05:35:07] meh that change's broken [05:35:32] (the sodium one I mean) [05:37:00] <_joe_> paravoid: I guess you're still awake [05:37:10] I woke up two hours ago [05:37:17] <_joe_> oh [05:37:22] <_joe_> very early :( [05:37:41] PROBLEM - Exim SMTP on sodium is CRITICAL: Connection refused [05:37:46] damn it, gnuplot [05:37:50] i want a vertical line [05:38:10] <_joe_> ori: mw1023 was the restart [05:38:16] <_joe_> not sure about mw1027 [05:39:09] <_joe_> ori: I will revert temporarily that periodic restart thing - for now I'd like to see when/if performance degrades [05:40:13] okay [05:40:22] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time [05:40:42] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 66275 bytes in 0.163 second response time [05:41:40] <_joe_> mmmh quite strange indeed [05:43:55] TimStarling: you're handy with gnuplot, right? any idea why the following doesn't give me a vertical line for the restart? script: https://dpaste.de/ngpJ/raw graph: http://i.imgur.com/U0SieyI.png [05:44:41] (03PS1) 10Faidon Liambotis: Revert "mail: remove secondary MX role from sodium" [puppet] - 10https://gerrit.wikimedia.org/r/169986 [05:44:46] <_joe_> ori: I can help you with gnuplot maybe [05:45:02] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "mail: remove secondary MX role from sodium" [puppet] - 10https://gerrit.wikimedia.org/r/169986 (owner: 10Faidon Liambotis) [05:45:04] not really, I used it a bit in my PhD, a decade ago, not much since then [05:45:44] <_joe_> I stopped using it a few years later [05:46:02] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.013 sec. response time [05:46:24] <_joe_> ori: and I have no idea [05:46:36] <_joe_> why your set arrow doesn't work [05:47:20] data's at http://people.wikimedia.org/~ori/mw1114-restart-1414645218.tsv.gz (no private data, just timestampms) [05:47:25] bbiab [05:47:38] <_joe_> I rarely worked in gnuplot with time data where I converted unix epochs - maybe that's the problem? [05:47:49] <_joe_> (arrow not understanding the timeformat) [05:48:52] <_joe_> I'll bbl, I have a family to wake up and feed :) [06:02:44] _joe_: remember that if you revert that patch, you have to ensure => absent it, otherwise the cronjob will stay [06:03:55] heheh: http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/ [06:04:02] Most popular words for PHP developers: [06:04:13] entries 5 & 6: [06:04:18] localisation, translatewiki [06:04:25] Nikerabbit: ^ [06:04:42] what is "wp" there? [06:04:48] wordpress [06:04:50] ah [06:06:45] localisation and translatewiki are up there because of translation updater bot (e.g. https://github.com/wikimedia/mediawiki/commit/605e69ffedee38b77bfc14b0bdda199f30819b2d) [06:07:04] yeah I figured that [06:07:55] under "Most popular words for C developers" there's "free" [06:11:12] what the hell [06:11:31] Coren: around? [06:11:40] so labstore1001 is https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Labs+NFS+cluster+eqiad [06:11:44] saturated a bit [06:11:46] I investigate it [06:11:48] end up in toollabs [06:12:03] then I see a process called "vi" writing with 80-100M/s [06:12:06] 51716 733 65.3 68.8 3209664 2789432 ? D Oct29 527:08 vi languas.txt [06:12:15] someone is a fast typist [06:12:16] and then.. 
[06:12:17] root@tools-dev:/data/project/icelab# ls -lah /data/project/icelab/.final_mg.txt.swp [06:12:21] -rw-r--r-- 1 tools.icelab tools.icelab 3.5T Oct 30 06:11 /data/project/icelab/.final_mg.txt.swp [06:12:31] ...and has a lot to say [06:13:00] !?! [06:13:07] seriously [06:13:09] just kidding, i have no idea what that is [06:13:19] no I know [06:13:32] I'm just in awe [06:25:07] <_joe_> paravoid: ROTFL [06:25:52] RECOVERY - Disk space on ocg1001 is OK: DISK OK [06:26:33] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:26:54] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:28:23] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail [06:30:54] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:54] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:01] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:13] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:35] <_joe_> it's weird. 
mod_passenger time happening one hour before [06:34:59] <_joe_> btw, the most awesome thing is vi handling a 3.5 TB file [06:35:08] <_joe_> I never thought it could [06:41:33] RECOVERY - check if salt-minion is running on ocg1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:41:42] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:45:32] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:45:42] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:03] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:42] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:46:45] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:47:03] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:13] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:47:19] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:47:33] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:47:52] i was off by two minutes [06:48:03] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:48:03] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:48:03] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:48:33] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 3 failures [06:49:42] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:37] lol ori, I'd say l10n-bot gamed the system ;) 
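A quick way to catch runaway editor swap files like the 3.5 TB one above before they saturate the NFS server (a sketch; the mount point is the toollabs path from the log):

    find /data/project -name '.*.swp' -size +1G -exec ls -lah {} +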
[06:51:38] _joe_: it's always at this time [06:56:02] DST? :) [06:58:17] that's cron.daily isn't it? [06:58:24] 06:26 [06:58:34] Debian's cron.daily default [07:00:02] (03PS1) 10Matanya: access: give spage access to stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/169990 [07:00:10] [Thu Oct 30 06:25:58 2014] [notice] Graceful restart requested, doing restart [07:00:13] well duh [07:00:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:00:54] that's logrotate isn't it [07:01:20] # Rotate puppetmaster passenger (apache) logs [07:01:20] /var/log/apache2/*.log { [07:01:23] postrotate [07:01:23] /usr/sbin/service apache2 graceful [07:01:53] uhm... [07:04:55] so what is this mod_passenger issue you've talked about before? [07:05:07] I vaguely recall someone saying something about a crash? [07:05:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:05:34] _joe_: ? [07:06:34] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:06:42] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:07:30] <_joe_> paravoid: sorry, just back [07:08:01] <_joe_> so, whenever apache is gracefully restarted, precise's version of mod passenger will not be happy [07:08:06] <_joe_> and crash some requests [07:08:33] <_joe_> so any hosts compiling in that moment [07:08:41] <_joe_> will have some catalog failures [07:10:11] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:10:15] <_joe_> I think they offered the version that gracefully restarted correctly as a paid option [07:15:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:15:54] (03PS1) 10Faidon Liambotis: puppetmaster: restart apache2 once on logrotate [puppet] - 10https://gerrit.wikimedia.org/r/169992 [07:16:00] this should help, perhaps even fix it [07:16:32] root@palladium:/var/log# lastcomm | grep apache2ctl |grep 06:25 |wc -l [07:16:35] 14 [07:16:46] (03CR) 10Faidon Liambotis: [C: 032] puppetmaster: restart apache2 once on logrotate [puppet] - 10https://gerrit.wikimedia.org/r/169992 (owner: 10Faidon Liambotis) [07:17:31] <_joe_> meh [07:18:12] <_joe_> that's why I favor using file_line and not messing with distro scripts whenever not needed [07:18:52] debian's logrotate has this [07:19:04] <_joe_> I know [07:19:08] <_joe_> I never thought we didn't [07:19:23] it's not the distro's fault if someone had a NIH moment and wrote their own logrotate instead of copying Debian's and changing weekly->daily [07:19:30] <_joe_> yes [07:19:44] <_joe_> as I said, file_line in puppet fixes that [07:20:01] <_joe_> (the NIH syndrome) [07:20:08] oh right, I misread you, sorry [07:20:18] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:20:43] so what might have happened is [07:21:01] apache kept getting gracefully restarted very quickly and while passenger was initializing [07:21:07] <_joe_> yes [07:21:19] <_joe_> so that's why we had so many failures [07:21:24] yeah possibly [07:21:26] <_joe_> instead of 1-2 at max [07:21:44] at the same time, there is another orthogonal issue here [07:21:54] which is that we shouldn't get an alert for a single puppet failure [07:22:06] <_joe_> well, it depends [07:22:14] <_joe_> for misc servers, maybe [07:22:28] I mean, it can happen [07:22:32] <_joe_> because of the
horrible puppet antipattern we so often experience on icinga [07:22:37] we also do apt-get update during the puppet run, that can fail too [07:22:44] <_joe_> yes [07:22:52] <_joe_> I've seen that causing a shower of alerts [07:23:18] I hadn't, but I would guess as much [07:25:38] <_joe_> !log raising the weight of mw1114 in the api pool to test the throughput it can withstand [07:25:43] Logged the message, Master [07:26:46] (03PS1) 10KartikMistry: Beta: Enable ca-pt and pt-ca language pairs in CX [puppet] - 10https://gerrit.wikimedia.org/r/169993 [07:31:33] (03CR) 10Santhosh: [C: 031] Beta: Enable ca-pt and pt-ca language pairs in CX [puppet] - 10https://gerrit.wikimedia.org/r/169993 (owner: 10KartikMistry) [07:49:48] (03PS9) 10Qgil: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [07:56:07] (03CR) 10Faidon Liambotis: "Note that this is unconditional, which may present some security issues. On our production stack, we use an explicit whitelist of servers/sub" [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [08:04:44] <_joe_> !log doing the same with mw1189, to see how different appserver generations respond [08:04:52] Logged the message, Master [08:17:06] <_joe_> !log powercycling mw1189, enabling hyperthreading [08:17:15] Logged the message, Master [08:21:44] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Enable ca-pt and pt-ca language pairs in CX [puppet] - 10https://gerrit.wikimedia.org/r/169993 (owner: 10KartikMistry) [08:37:04] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 1 failures [08:38:53] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [08:42:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [08:45:31] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:55:12] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:57:45] PROBLEM - Host 208.80.154.50 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:58:02] <_joe_> Error: /Stage[main]/Puppetmaster::Passenger/Service[puppetmaster]: Could not evaluate: Could not find init script or upstart conf file for 'puppetmaster' [08:58:04] RECOVERY - Host 208.80.154.50 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [08:58:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [09:03:01] _joe_: already fixes [09:03:03] fixed* [09:03:22] more precisely...
already caused the problem and fixed it as well [09:31:44] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM but we need to verify it will cause no problems on hosts having this class before merging" [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [09:32:11] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [09:35:35] (03CR) 10Alexandros Kosiaris: [C: 032] Fix bacula rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/169679 (owner: 10Alexandros Kosiaris) [09:35:53] (03PS2) 10Alexandros Kosiaris: Modularize backups.pp [puppet] - 10https://gerrit.wikimedia.org/r/169680 [09:36:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Ran in puppet compiler, it was a noop" [puppet] - 10https://gerrit.wikimedia.org/r/169680 (owner: 10Alexandros Kosiaris) [09:36:31] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:38:10] (03Abandoned) 10Alexandros Kosiaris: puppetmaster's pid ensured absent [puppet] - 10https://gerrit.wikimedia.org/r/166516 (owner: 10Alexandros Kosiaris) [09:48:33] <_joe_> !log load testing the hhvm appserver pool as well [09:48:41] Logged the message, Master [10:10:58] (03PS1) 10Filippo Giunchedi: syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 [10:11:15] I missed 170000 by two reviews :( [10:12:34] <_joe_> he [10:12:40] on for 180000 ! [10:13:09] * godog busy while preparing 10k puppet commits [10:13:11] <_joe_> let's automagically create 10000 PS's [10:13:31] <_joe_> godog: "retab" everything [10:13:40] hahha introduce tabs! [10:13:45] * godog grabs the transmogrifier [10:15:01] <_joe_> !log load test ended [10:15:09] Logged the message, Master [10:39:27] (03CR) 10Alexandros Kosiaris: [C: 032] syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 (owner: 10Filippo Giunchedi) [10:50:28] (03PS1) 10Alexandros Kosiaris: Add a description parameter to ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170003 [10:55:51] (03PS1) 10Alexandros Kosiaris: Add descriptions to DNS ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/170004 [11:05:51] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:14] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:07:00] <_joe_> mmmh something restarted hhvm here [11:11:12] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 66295 bytes in 0.288 second response time [11:11:36] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [11:13:52] _joe_: getting db errors [11:13:58] is it related ^ [11:14:09] <_joe_> matanya: are you on hhvm? [11:14:14] yes [11:14:15] <_joe_> no, I don't think so [11:14:32] function: WikiPage::updateRevisionOn [11:14:39] 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.22) [11:15:59] <_joe_> matanya: I found you in the logs :P [11:16:51] if you will query me in the logs, the query will time out :P [11:18:31] (03CR) 10Filippo Giunchedi: hiera: mediawiki-based backend for labs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/168984 (owner: 10Giuseppe Lavagetto) [11:23:55] <_joe_> thanks godog [11:24:25] np _joe_, I'm not a ruby expert by any stretch of imagination though [11:24:45] <_joe_> and you know about me :P [11:25:53] <_joe_> combining ruby with the mw api is... 
fun [11:26:17] yeah, I 've been meaning to take a look at that patch as well [11:26:33] my first reaction was to run away when I saw all the ruby [11:27:10] <_joe_> akosiaris: well, any hiera backend should be a ruby script. Unless we build a python daemon and just make ruby send queries to it :P [11:27:20] hmmm [11:27:26] that sounds intriguing :-) [11:27:27] <_joe_> (I considered the option when writing nuyaml) [11:28:29] (03CR) 10JanZerebecki: [C: 04-1] "Use ssl_ciphersuite function instead?" [puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [11:34:46] (03PS1) 10KartikMistry: Beta: Regression fix: Enable ca-pt and pt-ca [puppet] - 10https://gerrit.wikimedia.org/r/170006 [11:38:40] <_joe_> !log depooling mw1030 and mw1031 for reimaging as hhvm appservers [11:38:46] Logged the message, Master [11:46:36] (03PS1) 10Alexandros Kosiaris: Add unpuppetized ganglia swift views [puppet] - 10https://gerrit.wikimedia.org/r/170007 [11:50:53] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Regression fix: Enable ca-pt and pt-ca [puppet] - 10https://gerrit.wikimedia.org/r/170006 (owner: 10KartikMistry) [11:51:19] akosiaris: most of those swift views depend on logtailer which isn't there anymore, I'll take a closer look after lunch! [11:51:41] :-( [11:51:48] you are breaking my heart.... [11:51:56] anyway thanks godog [11:52:14] I know, sorry about that akosiaris didn't want to rain on your parade :) [12:05:57] Oh, nice, the virtualization cluster is no longer over 100 % load, makes the graphs prettier https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Virtualization+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=load_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [12:08:27] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [12:18:12] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not restart hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/170009 [12:22:15] (03PS1) 10Alexandros Kosiaris: ganglia web: avoid using tmpfs on trusty [puppet] - 10https://gerrit.wikimedia.org/r/170010 [12:24:13] PROBLEM - DPKG on mw1031 is CRITICAL: Connection refused by host [12:24:23] PROBLEM - Disk space on mw1031 is CRITICAL: Connection refused by host [12:24:23] PROBLEM - nutcracker port on mw1031 is CRITICAL: Connection refused by host [12:24:43] PROBLEM - RAID on mw1030 is CRITICAL: Connection refused by host [12:24:45] PROBLEM - nutcracker process on mw1031 is CRITICAL: Connection refused by host [12:24:53] PROBLEM - puppet last run on mw1031 is CRITICAL: Connection refused by host [12:24:53] PROBLEM - HHVM processes on mw1031 is CRITICAL: Connection refused by host [12:25:00] PROBLEM - check configured eth on mw1030 is CRITICAL: Connection refused by host [12:25:03] PROBLEM - check if dhclient is running on mw1030 is CRITICAL: Connection refused by host [12:25:23] PROBLEM - check if salt-minion is running on mw1030 is CRITICAL: Connection refused by host [12:25:24] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:25:45] PROBLEM - DPKG on mw1030 is CRITICAL: Connection refused by host [12:25:45] PROBLEM - RAID on mw1031 is CRITICAL: Connection refused by host [12:25:56] PROBLEM - Disk space on mw1030 is CRITICAL: Connection refused by host [12:25:56] PROBLEM - nutcracker port on mw1030 is CRITICAL: Connection refused by host [12:26:03] PROBLEM - nutcracker process on mw1030 is CRITICAL: Connection refused by host [12:26:04] PROBLEM - check configured eth on mw1031 is CRITICAL: 
Connection refused by host [12:26:16] PROBLEM - puppet last run on mw1030 is CRITICAL: Connection refused by host [12:26:17] PROBLEM - HHVM processes on mw1030 is CRITICAL: Connection refused by host [12:26:25] PROBLEM - check if dhclient is running on mw1031 is CRITICAL: Connection refused by host [12:26:37] PROBLEM - check if salt-minion is running on mw1031 is CRITICAL: Connection refused by host [12:32:43] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not page ops with hhvm processes errors. [puppet] - 10https://gerrit.wikimedia.org/r/170011 [12:32:47] RECOVERY - HHVM processes on mw1030 is OK: PROCS OK: 1 process with command name hhvm [12:33:25] RECOVERY - check if dhclient is running on mw1030 is OK: PROCS OK: 0 processes with command name dhclient [12:33:35] RECOVERY - check if salt-minion is running on mw1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:34:05] RECOVERY - DPKG on mw1030 is OK: All packages OK [12:34:05] RECOVERY - RAID on mw1031 is OK: OK: no RAID installed [12:34:06] RECOVERY - RAID on mw1030 is OK: OK: no RAID installed [12:34:06] RECOVERY - nutcracker process on mw1031 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:34:06] RECOVERY - Disk space on mw1030 is OK: DISK OK [12:34:06] RECOVERY - nutcracker port on mw1030 is OK: TCP OK - 0.000 second response time on port 11212 [12:34:15] RECOVERY - HHVM processes on mw1031 is OK: PROCS OK: 1 process with command name hhvm [12:34:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: do not page ops with hhvm processes errors. [puppet] - 10https://gerrit.wikimedia.org/r/170011 (owner: 10Giuseppe Lavagetto) [12:34:48] RECOVERY - nutcracker process on mw1030 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:34:49] RECOVERY - check configured eth on mw1031 is OK: NRPE: Unable to read output [12:34:49] RECOVERY - check configured eth on mw1030 is OK: NRPE: Unable to read output [12:34:58] RECOVERY - check if dhclient is running on mw1031 is OK: PROCS OK: 0 processes with command name dhclient [12:34:58] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [12:35:08] RECOVERY - DPKG on mw1031 is OK: All packages OK [12:35:11] RECOVERY - check if salt-minion is running on mw1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:35:38] RECOVERY - Disk space on mw1031 is OK: DISK OK [12:35:39] RECOVERY - nutcracker port on mw1031 is OK: TCP OK - 0.000 second response time on port 11212 [12:35:58] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures [12:36:39] (03PS1) 10Hashar: contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) [12:38:48] PROBLEM - NTP on mw1030 is CRITICAL: NTP CRITICAL: Offset unknown [12:40:50] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on contint puppetmaster in labs." [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [12:41:38] HASHAR [12:41:46] he's gone ;( [12:45:19] RECOVERY - NTP on mw1030 is OK: NTP OK: Offset -0.009510040283 secs [12:45:48] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:45:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
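What the "unmerged changes" check above is complaining about: commits fetched onto a puppetmaster but not yet merged into the live checkout. A rough manual equivalent, using the directory from the alert text (the actual check script isn't shown in the log):

    cd /var/lib/git/operations/puppet
    git fetch
    git log --oneline HEAD..origin/production   # anything listed is unmerged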
[12:46:50] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:46:59] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:47:04] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia web: avoid using tmpfs on trusty [puppet] - 10https://gerrit.wikimedia.org/r/170010 (owner: 10Alexandros Kosiaris) [12:47:48] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:47:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:47:58] _joe_: I merged your change as well [12:48:10] <_joe_> akosiaris: he sorry [12:48:22] no worries, just letting you know [12:48:23] <_joe_> I am doing too many things at once [12:48:54] !log enabled puppet on uranium [12:48:59] Logged the message, Master [12:49:05] let's see if we will still have ganglia in 5 mins :-) [12:49:05] * mark enables Hyperthreading on _joe_ [12:51:32] <_joe_> !log rebooting mw1030 and mw1031 to use the updated kernel [12:51:37] Logged the message, Master [12:52:43] hmmm so uranium could use rrdcached me thinks [12:53:38] PROBLEM - Host mw1031 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:48] PROBLEM - Host mw1030 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:54:09] RECOVERY - Host mw1031 is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms [12:54:10] RECOVERY - Host mw1030 is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [12:57:01] (03PS1) 10Alexandros Kosiaris: Enable ganglia diskstat plugin on ganglia::web [puppet] - 10https://gerrit.wikimedia.org/r/170017 [12:57:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Enable ganglia diskstat plugin on ganglia::web [puppet] - 10https://gerrit.wikimedia.org/r/170017 (owner: 10Alexandros Kosiaris) [12:59:21] (03PS2) 10Alexandros Kosiaris: Add a description parameter to ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170003 [13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T1300). [13:06:06] K4 can't really be awake right now [13:09:22] (03PS1) 10Giuseppe Lavagetto: HAT: move mw1030 and mw1031 [puppet] - 10https://gerrit.wikimedia.org/r/170021 [13:10:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] HAT: move mw1030 and mw1031 [puppet] - 10https://gerrit.wikimedia.org/r/170021 (owner: 10Giuseppe Lavagetto) [13:39:46] (03PS1) 10Ottomata: Fix for stat1002 host in rsync job for aggregate datasets [puppet] - 10https://gerrit.wikimedia.org/r/170024 [13:40:27] (03CR) 10Ottomata: [C: 032 V: 032] Fix for stat1002 host in rsync job for aggregate datasets [puppet] - 10https://gerrit.wikimedia.org/r/170024 (owner: 10Ottomata) [13:50:44] (03CR) 10Alexandros Kosiaris: [C: 032] Add a description parameter to ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170003 (owner: 10Alexandros Kosiaris) [13:51:23] (03PS2) 10Alexandros Kosiaris: Add descriptions to DNS ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/170004 [13:51:31] (03CR) 10Alexandros Kosiaris: [C: 032] Add descriptions to DNS ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/170004 (owner: 10Alexandros Kosiaris) [14:02:11] (03CR) 10Ottomata: [C: 031] access: give spage access to stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [14:13:43] ottomata: can you explain to me the logic of access request tickets ? [14:14:12] ? 
[14:14:26] i can see them, prepare a patch, and then no longer have permission once they are moved from ops-request to "somethingelse" ? [14:14:31] oh [14:14:35] nope, i cannot explain that to you at all [14:14:40] i have no idea why it is that way [14:14:44] oh no, it's easy to explain [14:14:46] 'RT is shit'? [14:14:49] there are permissions on other queues [14:14:51] i do not know about them [14:15:03] it is the third time in a week or so [14:15:14] bugzilla security issues, for example. Once they get moved to the hidden security area, the ones who filed 'em can still see 'em [14:15:36] but not those who reply ? [14:15:45] that makes zero sense to me [14:16:13] RT is for email [14:17:18] You should consider the web interface as an edge-case tool [14:17:29] That's my understanding [14:18:01] Nemo_bis: I don't think you get emails after they are queue-moved [14:18:15] so i'll just add admin-cc to all tickets [14:21:12] naw, you can reply to emails you get from any queue [14:26:19] matanya: late reply but from what Daniel told me - it is so managers or ops can discuss candidates in private [14:26:53] makes sense, and i know that, but still confusing [14:27:32] Well how would you like to be told you can't have shell because x and y of several people in a public ticket? :p [14:27:40] *off [14:27:59] Public in the sense of rt users [14:28:24] i wouldn't care, tbh. but whatever [14:28:35] Yeah [14:31:48] (03PS1) 10Matanya: access: grant reedy access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170035 [14:33:04] <_joe_> !log pooling mw1031/2 in the hhvm appservers pool [14:33:14] Logged the message, Master [14:37:17] !log powering down elastic1003-1006 to replace ssds [14:37:23] Logged the message, Master [14:40:01] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:05] (03PS1) 10Alexandros Kosiaris: Actually use view_name on ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170037 [14:40:52] PROBLEM - Host elastic1005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:41:02] PROBLEM - Host elastic1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:31] PROBLEM - Host elastic1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:41] <^d> cmjohnson: Heh, was just wondering if we'd started yet this morning :) [14:42:58] (03PS1) 10Giuseppe Lavagetto: HHVM: get 10% of anonymous traffic. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 [14:43:00] (03PS1) 10Giuseppe Lavagetto: HHVM: get 15% of anonymous traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 [14:43:02] (03PS1) 10Giuseppe Lavagetto: HHVM: get 25% of anonymous traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170041 [14:43:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "not for now."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 (owner: 10Giuseppe Lavagetto) [14:43:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "not for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170041 (owner: 10Giuseppe Lavagetto) [14:44:11] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=72%): [14:49:42] Well, sports fans, who's up for SWAT today [14:50:28] I see aude has a patch in, nobody else though [14:50:41] Aw, who added UTC-5 to the deployment calendar, that's cute [14:50:54] it's a small patch [14:51:46] (03PS2) 10Giuseppe Lavagetto: mediawiki: do not restart hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/170009 [14:53:05] OH, whoa, it's added dynamically, crazypants [14:53:13] Anyway, I'll do SWAT [14:53:20] <^d> I am actually awake. [14:53:24] I'm staring down a nasty rebase conflict and I could use the break [14:53:59] yeah, I get UTC+5.5 [14:54:04] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: do not restart hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/170009 (owner: 10Giuseppe Lavagetto) [14:54:12] rebase conflict best conflict [14:55:34] YuviPanda: Not when it's because of file-wide style changes [14:55:43] I've stopped even looking at the conflict [14:55:51] well, still better than a conflict over Oil, for example... [14:56:04] I just delete <<<<< HEAD and =====, then try to figure out where the new stuff is supposed to be in reality [14:56:04] and people who make file wide style changes should be asked to rebase open active patches too [14:56:13] YEAH, JAMES_f [14:56:22] But no, I don't mind [14:56:47] I should open a connection to tin [14:57:02] (03PS1) 10Alexandros Kosiaris: Fix typo with backup::set [puppet] - 10https://gerrit.wikimedia.org/r/170045 [14:57:24] aude: Only wmf6? [14:57:28] Not that I'm complaining [14:57:29] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo with backup::set [puppet] - 10https://gerrit.wikimedia.org/r/170045 (owner: 10Alexandros Kosiaris) [14:57:34] yes, only affects test.wikidata [14:58:05] OK then [14:58:37] I'll get a head start on Jenkins [15:00:04] manybubbles, anomie, ^d, marktraceur, aude: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T1500). Please do the needful. [15:00:13] * marktraceur is on it already, damn jouncebot [15:00:52] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:01:23] I wish we could put mediawiki/core patches that are in wmf/* branches into this channel [15:01:39] * marktraceur looks at YuviPanda expectantly [15:01:50] ah, hmm. that's fairly trivial to do... [15:02:03] but I'm off this week! Someone else do it! it's just config, we already do that for betalabs patches [15:02:06] (03CR) 10Filippo Giunchedi: [C: 031] HHVM: get 10% of anonymous traffic. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 (owner: 10Giuseppe Lavagetto) [15:02:12] I don't think it's that useful in -dev anyway [15:02:18] I might do it in between stupid rebases [15:02:44] yay :) [15:03:03] YuviPanda: I'd also like grrrit-wm to ping the +2er when it all gets merged finally, but that's harder. :) [15:03:24] YuviPanda: If you're off this week, why are you hanging around here? [15:03:29] We're terrible [15:03:36] marktraceur: that already happens, no? when jenkins-bot merges.. [15:03:40] marktraceur: owner is highlighted [15:03:44] marktraceur: I'm in the himalayas! 
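marktraceur's conflict routine above, spelled out as a sketch:

    git rebase origin/master   # hits the file-wide style-change conflict
    # in each conflicted file, delete the <<<<<<< HEAD / ======= / >>>>>>>
    # markers and re-place the new hunk where it belongs in the restyled file
    git add -A
    git rebase --continue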
[15:03:45] Owner, but not reviewer [15:03:49] Neato [15:04:07] aaah, yeah... [15:04:15] marktraceur: it's cold out at night, and i'm in a village. [15:04:23] marktraceur: so sitting around on a stone table with a bunch of other geeks [15:04:27] I hope you brought enough t-shirts [15:04:50] marktraceur: as usual I didn't [15:04:54] Oh, you'd love my new locale. It's 5 celsius outside and I have the patio door open [15:05:07] (only a little bit though) [15:05:13] Oh, merged [15:05:17] yay [15:05:39] marktraceur: heh, I plan on visiting [15:05:42] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:05:58] YuviPanda: You can wait until summer, which is much better [15:07:21] Syncing [15:07:27] ok [15:07:35] !log marktraceur Synchronized php-1.25wmf6/extensions/Wikidata/: [SWAT] [wmf6] Fix edit link for aliases (duration: 00m 12s) [15:07:36] Aaaaand test, aude [15:07:42] Logged the message, Master [15:07:46] looks good [15:07:47] thanks! [15:07:51] !log moving shards off of elastic1015 and elastic1016 so we can replace their hard drives/turn on hyper threading [15:07:56] Logged the message, Master [15:08:21] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:08:29] Sweet [15:08:34] I declare SWAT closed [15:08:50] ohi [15:08:52] manybubbles: [15:08:53] :) [15:09:07] hi! [15:09:17] (03PS2) 10Filippo Giunchedi: syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 [15:09:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 (owner: 10Filippo Giunchedi) [15:09:29] I'm in and out a bit this morning because I'm making up for the day off I was only half able to take yesterday :) [15:11:02] hah, aye [15:11:16] what's the servers status/what should I work on today? [15:13:24] !log upgrading PHP on mw1113 to php5_5.3.10-1ubuntu3.15+wmf1 [15:13:30] Logged the message, Master [15:14:03] (03CR) 10Mark Bergsma: [C: 032] HHVM: get 10% of anonymous traffic. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 (owner: 10Giuseppe Lavagetto) [15:14:11] (03Merged) 10jenkins-bot: HHVM: get 10% of anonymous traffic. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 (owner: 10Giuseppe Lavagetto) [15:14:20] oops [15:15:51] <^d> ottomata: Well cmjohnson took 3-6 down already for ssd replacement. Nik was saying we could drain 15-16 next. [15:15:54] ^d ottomata manybubbles elastic1003-6 is in installer now...a few more mins [15:16:00] <^d> Sweet, thx [15:16:02] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:16:18] <^d> I'll go ahead and drain shards from those, actually. [15:16:38] i didn't think of the +2 effects here ;) [15:17:08] <^d> Oh he beat me to it. [15:17:33] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [15:17:43] <^d> cmjohnson, ottomata: 1015 and 1016 are next, Nik's already draining their shards now. 
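One standard way to "drain" Elasticsearch nodes as described above is allocation filtering; the log doesn't show the exact command, so this is a sketch using the node names from it (ES 1.x API of the era):

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._name": "elastic1015,elastic1016"
      }
    }'
    # shards migrate off the excluded nodes; clear the setting to repool them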
[15:18:22] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [15:18:24] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:41] RECOVERY - Host elastic1004 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [15:18:44] (03PS2) 10Reedy: Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 [15:18:49] (03CR) 10Reedy: [C: 032] Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 (owner: 10Reedy) [15:18:51] PROBLEM - DPKG on mw1113 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:18:52] RECOVERY - Host elastic1006 is UP: PING OK - Packet loss = 0%, RTA = 4.39 ms [15:19:04] (03Merged) 10jenkins-bot: Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 (owner: 10Reedy) [15:21:58] ok cool, ja lemme know when 1003-1006 are ready for puppetization [15:22:20] <_joe_> Reedy: if you're going to merge on tin, please let me know [15:22:31] ottomata: I just caught the tail end of that (stupid me not having a bouncer) - I imagine you are waiting on cmjohnson telling you that? [15:22:32] !log reedy Synchronized docroot and w: Fix dbtree caching (duration: 00m 15s) [15:22:33] <_joe_> as we have the 10% hhvm ramp-up waiting [15:22:37] Logged the message, Master [15:22:55] yup [15:23:07] _joe_ i can help do some app servers today [15:23:13] <_joe_> ottomata: cool [15:23:14] can you send me that link to that etherpad with instructions? [15:23:26] <_joe_> http://etherpad.wikimedia.org/p/app-server-upgrade [15:24:32] PROBLEM - Host elastic1003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:26:17] ottomata: all yours...all the old stuff on palladium has been removed but plz check [15:26:22] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [15:26:32] <^d> manybubbles: I've gotta get you on my bouncer. [15:26:33] <^d> :) [15:26:55] _joe_: wmf-reimage? [15:27:07] <_joe_> ottomata: on palladium, yes [15:27:16] ^d: if you've got spare [15:27:18] <_joe_> it's a script that does a few things in your place [15:27:26] _joe_ it isn't on my path? [15:27:27] where is it? [15:27:39] can you guys put that on wikitech instead of etherpad btw? :) [15:29:16] _joe_, should I do 1032-1034 now? [15:29:25] !log oblivian Synchronized wmf-config/CommonSettings.php: Serving 10% of anons with HHVM (duration: 00m 06s) [15:29:31] Logged the message, Master [15:29:41] <_joe_> ottomata: 2 should be enough for today [15:29:55] 1032,1033? [15:30:11] should I do them one at a time? [15:30:40] <_joe_> I do them in parallel, but the first time, do one at a time if you're not confident with the procedure [15:30:53] <_joe_> I did 5 in parallel the other week :P [15:31:46] ha, ok [15:32:23] " October 27: All logged-in users served by HHVM" https://wikitech.wikimedia.org/wiki/Deployments [15:32:36] This didn't happen, did it? I see edits without HHVM tag [15:32:36] _joe_, where is wmf-reimage? [15:32:45] Nemo_bis: No, it didn't [15:33:07] <_joe_> Nemo_bis: no it did not [15:33:23] cmjohnson: wait, which elastic servers am I doing right now? [15:33:30] just double checking so I don't do something dumb [15:33:32] <_joe_> Nemo_bis: we hit a big roadblock and we restarted deploying recently. We should've updated that [15:33:32] 1003-1006 [15:33:38] <_joe_> ottomata: palladium [15:33:57] 4 servers cmjohnson? [15:34:01] _joe_, where on palladium?
[15:34:03] it isn't on my path [15:34:09] ok tanks [15:34:11] thanks [15:34:13] <_joe_> ottomata: on the root's path for sure [15:34:21] (tanks are no good) [15:34:24] <_joe_> /usr/local/bin/ [15:34:25] ah in /usr/local/bin, hm, dunno why that's not on my path [15:34:35] wait, it is... [15:34:39] <_joe_> ottomata: as super user [15:34:41] ah [15:34:44] not executable as me [15:34:44] ok [15:34:51] <_joe_> on purpose [15:34:52] <_joe_> :) [15:34:54] put it in sbin then [15:34:58] ottomata: yep 4 servers...do you want me to do it? [15:35:02] naw i can do it [15:35:07] already there, just double checking [15:35:16] <_joe_> well, not properly a system script [15:36:12] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:39:23] oh, wmf-reimage just does all the salt/puppet key stuff automatically [15:39:23] cool [15:39:54] !log starting to reimage mw1032 [15:40:02] Logged the message, Master [15:42:00] _joe_, i'm editing the pybal file [15:42:08] vim says it is open by someone else too :) [15:42:15] s'ok? [15:42:18] <_joe_> ottomata: which is me right now, sorry [15:42:21] <_joe_> 1 sec [15:42:32] k [15:42:32] <_joe_> go on [15:43:08] !log Going to upgrade Zuul and monitor the result over the next hour. [15:43:10] hello ;) [15:43:12] _joe_, what is the process with pybal configs now? [15:43:13] Logged the message, Master [15:43:16] <_joe_> hashar! [15:43:17] do we git commit every change? [15:43:42] there's no !log in jabber :) [15:43:46] <_joe_> ottomata: you can do a commit of all our changes at the end of your day [15:43:47] * greg-g waves to hashar [15:43:52] ook! [15:44:18] greg-g: hey I can use dologmsg on tin.eqiad.wmnet though :] [15:44:34] * greg-g thought bubbles "maybe we should have a non-IRC non-bastion !log alternative" [15:44:52] i was tempted to greet hashar in gerrit today [15:45:29] <^d> greg-g: You can edit [[SAL]] by hand :) [15:45:45] (03PS1) 10Ottomata: Reimage mw1032 as appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/170052 [15:45:47] ^d: less wiki editing the better :P [15:46:03] * greg-g just committed heresy [15:46:46] * mark puts a real SWAT team on the deployment calendar, to greg's location [15:46:54] haha [15:47:04] (03CR) 10Ottomata: [C: 032] Reimage mw1032 as appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/170052 (owner: 10Ottomata) [15:49:33] ^d, cmjohnson, 1003-1006 are coming up...i think. 1003 is taking a while to do some elasticsearch stuff [15:49:34] HMMM [15:49:39] [elastic1003] failed to send join request to master [15:49:45] <^d> ruh roh [15:49:46] [[elastic1001][ [15:50:16] <^d> I see it joined tho. [15:50:19] hm, i see it now [15:50:26] <^d> I think we're ok [15:50:28] reason [org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task. [15:50:31] that is the last message though [15:50:34] in production...log [15:51:15] <^d> Ah, [2014-10-30 15:49:49,938][INFO ][discovery.zen ] [elastic1001] received a join request for an existing node [[elastic1003][OjlwXz3DQNuxWkhpN9YWAA][elastic1003][inet[/10.64.0.110:9300]]{rack=A3, row=A, master=false}] [15:51:24] (03CR) 10Anomie: Only add the "oauthadmin" group on the central wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [15:51:30] hm. [15:51:38] <^d> no, it's ok. [15:51:40] <^d> all of them did that. [15:51:43] <^d> i think. [15:52:18] <^d> manybubbles: we ok? [15:52:31] (03CR) 10Hoo man: "Good catch... doh.
Will follow-up in a second" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [15:54:02] (03PS1) 10Hoo man: Declare $wgGroupPermissions global in anon function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170056 [15:54:46] anomie: https://gerrit.wikimedia.org/r/170056 [15:54:48] feel free to deploy whenever you want [15:54:52] I can also do it now, if urgent [15:55:03] <^d> ottomata: Yeah, I think we're cool. [15:55:06] man... that this happens twice to me in two weeks (not declaring globals) [15:55:07] <^d> Going to unban 3-6. [15:55:10] cool. [15:55:15] yeah they look normal now [15:55:34] hoo: Probably not that urgent, as long as no one wants their OAuth consumers approved. [15:55:46] heh... who would want that anyway :D [15:58:57] I think I'll just push it now, it's trivial [16:00:04] awight, AndyRussG: Dear anthropoid, the time has come. Please deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T1600). [16:00:05] (03CR) 10Hoo man: [C: 032] "Trivial fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170056 (owner: 10Hoo man) [16:00:12] (03Merged) 10jenkins-bot: Declare $wgGroupPermissions global in anon function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170056 (owner: 10Hoo man) [16:00:24] (03PS1) 10Giuseppe Lavagetto: HHVM: fix monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/170060 [16:00:52] !log hoo Synchronized wmf-config/CommonSettings.php: Fix oauthadmin (duration: 00m 09s) [16:00:58] Logged the message, Master [16:02:56] * hoo away again [16:03:08] !log Stopping zuul [16:03:12] Logged the message, Master [16:04:31] !log restarted Zuul with upgraded version ( wmf-deploy-20140924-1..wmf-deploy-20141030-1 ) [16:04:36] Logged the message, Master [16:05:32] (03PS1) 10Ottomata: Disable test kafkatee webstatscollector instance to troubleshoot missing lines [puppet] - 10https://gerrit.wikimedia.org/r/170061 [16:05:39] lets see what happens [16:06:28] (03CR) 10Ottomata: [C: 032] Disable test kafkatee webstatscollector instance to troubleshoot missing lines [puppet] - 10https://gerrit.wikimedia.org/r/170061 (owner: 10Ottomata) [16:07:33] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100% [16:11:22] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [16:16:30] ^d: back - everything look ok? [16:16:37] <^d> Yeah [16:16:54] <^d> cmjohnson: 1015 and 1016 are empty and ready whenever. [16:17:08] (03PS3) 10Reedy: Add robots.txt rewrite rule where wiki is public [puppet] - 10https://gerrit.wikimedia.org/r/147487 [16:17:18] ^d thanks...getting them now [16:17:32] !log powering off elastic1015-16 to replace ssds [16:17:39] Logged the message, Master [16:19:00] ^d: cool. [16:19:05] ottomata: elastic1003 doesn't have ht [16:20:02] PROBLEM - Host elastic1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:18] manybubbles...are you sure? 
i changed the setting [16:20:41] PROBLEM - Host elastic1016 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:46] ottomata: elastic1006 doesn't have noatime [16:21:09] cmjohnson: I'm sure - yeah - half the number of cores in top that I expect [16:21:16] (03PS1) 10Hashar: contint: Zuul status learned to query a single change [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) [16:22:01] (03PS2) 10Hashar: contint: Zuul status learned to query a single change [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) [16:22:16] !log moving shards off of elastic1003 and elastic1006 so they can be restarted. elastic1003 need hyperthreading and elastic1006 needs noatime. [16:22:22] Logged the message, Master [16:22:44] (03PS4) 10Filippo Giunchedi: import debian/ directory [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 [16:24:01] (03PS2) 10Reedy: Make apple-touch-icon.png configurable via touch.php [puppet] - 10https://gerrit.wikimedia.org/r/147488 [16:25:32] oh why does misc varnish suddenly cache Zuul status page :/ [16:26:54] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [16:28:31] !log awight Synchronized php-1.25wmf5/extensions/CentralNotice: push CentralNotice updates (duration: 00m 11s) [16:28:37] Logged the message, Master [16:28:49] (03CR) 10Filippo Giunchedi: "let's shoot for monday (nov 3rd)" [puppet] - 10https://gerrit.wikimedia.org/r/153764 (owner: 10Hashar) [16:31:55] !log awight Synchronized php-1.25wmf6/extensions/CentralNotice: push CentralNotice updates (duration: 00m 09s) [16:32:00] Logged the message, Master [16:32:52] (03CR) 10Filippo Giunchedi: [C: 04-1] "unfortunately I think most (all?) of those metrics were based off swift logtailer which is no more (swift native statsd has taken its plac" [puppet] - 10https://gerrit.wikimedia.org/r/170007 (owner: 10Alexandros Kosiaris) [16:33:10] (03PS1) 10Jgreen: add codfw fundraising hosts [dns] - 10https://gerrit.wikimedia.org/r/170072 [16:34:39] (03CR) 10Filippo Giunchedi: [C: 04-1] Add robots.txt rewrite rule where wiki is public (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/147487 (owner: 10Reedy) [16:35:14] (03CR) 10Reedy: [C: 04-1] ". -> \. too" [puppet] - 10https://gerrit.wikimedia.org/r/147488 (owner: 10Reedy) [16:36:55] <_joe_> Reedy: please use an Include for robots.txt, favicon, etc [16:37:05] <_joe_> didn't I do that already, btw? [16:37:13] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:37:27] (03PS1) 10Andrew Bogott: Replace libmariadbclient-dev with libmysqlclient-dev. [puppet] - 10https://gerrit.wikimedia.org/r/170073 [16:37:48] _joe_: Really, most of the configs just want refactoring further so they all use the common incl [16:38:10] !log Zuul status page is freezing because the status.json is being cached :-/ [16:38:16] Logged the message, Master [16:38:17] (03CR) 10Jgreen: [C: 032 V: 031] add codfw fundraising hosts [dns] - 10https://gerrit.wikimedia.org/r/170072 (owner: 10Jgreen) [16:38:27] <_joe_> Reedy: I'll take a look tomorrow [16:38:36] !log Upgraded kibana to v3.1.1 via Trebuchet [16:38:42] Logged the message, Master [16:39:53] (03CR) 10coren: Replace libmariadbclient-dev with libmysqlclient-dev. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:40:09] (03PS2) 10Giuseppe Lavagetto: HHVM: get 15% of anonymous traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 [16:40:20] well that's not good :/ kibana 3.1.1 works fine in beta but not so much in prod [16:40:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "Let's go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 (owner: 10Giuseppe Lavagetto) [16:41:07] (03CR) 10Yuvipanda: Replace libmariadbclient-dev with libmysqlclient-dev. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:41:28] (03PS3) 10Reedy: Make apple-touch-icon.png configurable via touch.php [puppet] - 10https://gerrit.wikimedia.org/r/147488 [16:41:30] (03PS4) 10Reedy: Add robots.txt rewrite rule where wiki is public [puppet] - 10https://gerrit.wikimedia.org/r/147487 [16:41:45] RECOVERY - Host elastic1015 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [16:42:01] bd808: ruby issue? (kibana uses ruby, right?) [16:42:14] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [16:42:36] No, it's all browser js. Maybe something with the apache reverse proxy. Looking now [16:44:03] PROBLEM - RAID on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:14] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [16:44:14] PROBLEM - puppet last run on elastic1016 is CRITICAL: Connection refused by host [16:44:14] PROBLEM - puppet last run on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:23] PROBLEM - check if dhclient is running on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:23] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.13:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:44:24] PROBLEM - DPKG on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:24] PROBLEM - Disk space on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:34] PROBLEM - SSH on elastic1015 is CRITICAL: Connection timed out [16:44:34] PROBLEM - check if salt-minion is running on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:35] PROBLEM - check configured eth on elastic1016 is CRITICAL: Connection refused by host [16:44:35] PROBLEM - check if dhclient is running on elastic1016 is CRITICAL: Connection refused by host [16:44:38] !log oblivian Synchronized wmf-config/CommonSettings.php: Serving 15% of anons with HHVM (ludicrous speed!) 
(duration: 00m 16s) [16:44:45] Logged the message, Master [16:44:47] PROBLEM - check if salt-minion is running on elastic1016 is CRITICAL: Connection refused by host [16:44:53] PROBLEM - SSH on elastic1016 is CRITICAL: Connection refused [16:44:53] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.12:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:44:54] PROBLEM - DPKG on mw1016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:44:54] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [16:45:03] PROBLEM - check configured eth on elastic1015 is CRITICAL: Timeout while attempting connection [16:45:04] PROBLEM - Disk space on elastic1016 is CRITICAL: Connection refused by host [16:45:05] sorry i didn't ack these problems b4 [16:45:23] PROBLEM - RAID on elastic1016 is CRITICAL: Connection refused by host [16:45:34] RECOVERY - SSH on elastic1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:45:42] <_joe_> Reedy: kibana is php + js [16:46:02] https://github.com/elasticsearch/kibana [16:46:04] <_joe_> ottomata: are you reimaging mw1032 right now? [16:46:07] JavaScript 93.2% CSS 5.2% Ruby 1.5% Shell 0.1% [16:46:20] I'm not going completely mad at least [16:46:35] <_joe_> Reedy: uhm I was sure it was part php [16:46:52] !log Reverted kibana to e317bc6 [16:46:57] Logged the message, Master [16:47:20] _joe_: v1 had php in it, v2 switched to ruby, v3 is all browser js [16:47:25] _joe_ yes [16:47:33] v4 (in beta now) brings ruby back :( [16:47:34] ah puppet finished [16:47:45] running puppet again with salt key signed [16:47:54] RECOVERY - SSH on elastic1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:48:00] <_joe_> bd808: v4 will be a swift ios-only app [16:48:06] ottomata 1015-6 are yours [16:48:13] ^d ^^^ [16:49:01] bd808: oh it does? :( [16:49:02] heh. They are adding a sinatra ruby app back in v4 to do the reverse proxy to elasticsearch. Apparently setting up apache/nginx as a proxy is too hard or something. [16:49:11] LOL [16:49:23] computers are hard [16:49:23] I did not read that .... [16:49:41] paravoid: https://github.com/elasticsearch/kibana/tree/master is the 4.0 beta code now. [16:50:29] can this shit just be worked around? [16:50:32] greg-g: hey, I'm going over my deploy window a hair in order to kick some SWAT stuff for marktraceur [16:50:41] (03CR) 10Mark Bergsma: [C: 032] Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 (owner: 10Ottomata) [16:50:46] This seems to be most of what the sinatra app does for now -- https://github.com/elasticsearch/kibana/blob/master/src/server/routes/proxy.rb [16:50:57] (03PS1) 10Alexandros Kosiaris: Introduce postgres role for labsdb1004, labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/170077 [16:51:25] But there is a /plugins/ route too that maybe will do something fancy in the future [16:52:32] ooo. maybe the new version isn't broken. Looks like we have 0 events logged since 00:00Z :( [16:52:51] * bd808 shakes fist at logstash [16:53:11] (03PS4) 10Ottomata: Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 [16:53:43] (03CR) 10Ottomata: [C: 032] Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 (owner: 10Ottomata) [16:54:05] (03PS2) 10Andrew Bogott: Replace libmariadbclient-dev with libmysqlclient-dev.
[puppet] - 10https://gerrit.wikimedia.org/r/170073 [16:54:58] !log uploaded php5_5.3.10-1ubuntu3.15+wmf1 on apt.wikimedia.org [16:55:05] Logged the message, Master [16:56:14] marxarelli: ok yr patches are merged, deploying now... [16:56:40] ori: ping [16:56:44] sorry, marktraceur ^^ [16:57:01] marktraceur: can I just sync-file for that? [16:57:08] awight: Should be able to [16:57:08] PROBLEM - NTP on elastic1015 is CRITICAL: NTP CRITICAL: No response from NTP server [16:57:15] It's just the one file, IIRC [16:57:19] k [16:58:03] PROBLEM - NTP on elastic1016 is CRITICAL: NTP CRITICAL: No response from NTP server [16:58:15] !log restarted logstash on logstash1001. No events logged since 00:00Z [16:58:20] Logged the message, Master [16:58:21] !log restarted logstash on logstash1002. No events logged since 00:00Z [16:58:21] (03CR) 10coren: [C: 031] "That should work." [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:58:27] Logged the message, Master [16:58:28] ^d, we have 2 more nodes to do, right? [16:58:52] !log awight Synchronized php-1.25wmf5/includes/specials/SpecialUpload.php: Parse 'upload_source_url' message on SpecialUpload (duration: 00m 11s) [16:58:57] Logged the message, Master [16:59:20] <^d> ottomata: 9-12. [16:59:22] I'm asking a commonist to test, awight [16:59:35] !log restarted logstash on logstash1003. No events logged since 00:00Z [16:59:42] Logged the message, Master [16:59:52] (03CR) 10Andrew Bogott: [C: 032] Replace libmariadbclient-dev with libmysqlclient-dev. [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:59:55] marktraceur: not quite... deployed... [17:00:24] Oh [17:00:26] !log awight Synchronized php-1.25wmf6/includes/specials/SpecialUpload.php: Parse 'upload_source_url' message on SpecialUpload (duration: 00m 10s) [17:00:29] marktraceur: ok, now it's deployed for 1.25wmf5 and 6 [17:00:31] Logged the message, Master [17:00:32] Sweet [17:00:37] It only needed wmf5 to be testable [17:00:41] Commons is on wmf5 still [17:00:55] awight: s'ok, thanks for the heads up (cc marktraceur ) [17:01:20] !log Logs on logstash1003 showed "Failed to flush outgoing items " on shutdown. Maybe something not quite right about elasticsearch_http plugin? [17:01:25] Logged the message, Master [17:04:33] awight, greg-g, s\ says it works [17:04:40] Oh, it's Steinsplitter, OK [17:04:58] <^d> ottomata: Is 1006 noatime'd now? [17:06:03] marktraceur: greg-g: okay, doffing my deploy hat! [17:06:06] thanks [17:09:59] yes [17:10:03] shoudl be [17:10:20] Lots and lots of db failures in logstash. "Connection error: Unknown error (10.64.16.29)" [17:10:41] Like 2500 in the last 15 minutes [17:11:18] awight: ^ [17:11:27] marktraceur: ^ [17:11:34] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:38] !log Upgraded kibana to v3.1.1 again. Better testing now that logstash is working. [17:11:44] Logged the message, Master [17:11:54] <_joe_> mmmmh [17:11:55] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:57] Couldn't have been us...could it? We just changed a message to be parsed [17:12:04] <_joe_> lemme see what's happening on mw1023 [17:12:14] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 2 failures [17:12:16] <_joe_> this is the second time this happens today... not good [17:12:21] (03PS1) 10Andrew Bogott: Install libmysqlclient-dev on Trusty but not on Precise. 
[puppet] - 10https://gerrit.wikimedia.org/r/170089 [17:12:44] <^d> ottomata: 1003 is empty for HT. [17:12:57] ^d. will be with you in a bit, in meeting [17:13:02] (03CR) 10jenkins-bot: [V: 04-1] Install libmysqlclient-dev on Trusty but not on Precise. [puppet] - 10https://gerrit.wikimedia.org/r/170089 (owner: 10Andrew Bogott) [17:13:15] <^d> ottomata: no worries :) [17:14:18] greg-g: bd808: I don't think that can be related to our deployments... [17:14:22] (03PS2) 10Andrew Bogott: Install libmysqlclient-dev on Trusty but not on Precise. [puppet] - 10https://gerrit.wikimedia.org/r/170089 [17:14:35] awight: I'm just reflexive with "who touched things last" :) [17:14:42] If it matters for anything, I did see the MysqlDB version toggling between 5 and 10 in Special:Version [17:14:47] greg-g: of course, no worries! [17:15:19] greg-g: what I touched: CentralNotice front and backend things; Special:Upload parsing thing [17:15:35] greg-g: exact patches are still available on the wikitech Deployments page, if u want [17:15:50] greg-g: should I stick around for debugging, or is this a decent time to BART for 30 min? [17:15:54] s'ok, waiting for joejoe [17:15:54] db1040 looks sick (server that's not responding) -- https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1040.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1414689286&v=0.73&m=load_fifteen&vl=%20&ti=Fifteen%20Minute%20Load%20Average&z=large [17:16:09] looks like you're in the clear [17:16:10] ok biab [17:16:11] awight: [17:16:15] :D [17:16:23] (03CR) 10coren: [C: 031] "Oh, horrors." [puppet] - 10https://gerrit.wikimedia.org/r/170089 (owner: 10Andrew Bogott) [17:16:23] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 2 failures [17:16:24] !log Upgrading Zuul to have the status page emit a Cache-Control header {{bug|72766}} wmf-deploy-20141030-1..wmf-deploy-20141030-2 [17:16:32] Logged the message, Master [17:16:39] * awight looks appreciatively at still beating, stolen heart of the db machine and cackles into the dawn [17:18:39] !log "Connection error: Unknown error (10.64.16.29)" 1052 in last 5m; 2877 in last 15m [17:18:45] Logged the message, Master [17:19:20] (03CR) 10Andrew Bogott: [C: 032] Install libmysqlclient-dev on Trusty but not on Precise. [puppet] - 10https://gerrit.wikimedia.org/r/170089 (owner: 10Andrew Bogott) [17:21:07] !log 10.64.16.29 is db1040 in the s4 pool [17:21:12] Logged the message, Master [17:22:36] which opsen is taking a look at db1040? [17:24:38] _joe_: you may have missed, db1040 is sick [17:26:00] bleugh s4 master too [17:26:37] hm [17:26:55] (03PS1) 10Ottomata: Fix comment about row of analytics1012 [puppet] - 10https://gerrit.wikimedia.org/r/170092 [17:27:01] yay commons [17:27:36] doesn't look to be doing a great deal [17:29:54] <^d> manybubbles: Should we start draining 9-12? [17:30:07] <^d> Or wait until 2 of the 4 we have down come up? [17:31:43] PROBLEM - HHVM rendering on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:03] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:16] uhhhh [17:33:36] ^d: meh - we've got plenty of space. I think we can drain [17:34:43] <^d> I'll start [17:35:03] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.392 second response time [17:35:18] <_joe_> and I found the problem [17:35:26] <_joe_> not that this is very comforting [17:35:49] hhvm problem? [17:35:51] or s4?
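(A quick triage sketch for tying an error flood like the one !logged above back to a host; the dberror log path is an assumption:)

    host 10.64.16.29    # reverse-resolves the IP from the error message (db1040)
    # rough error volume from the MediaWiki DB error log:
    grep -c 'Unknown error (10.64.16.29)' /a/mw-log/dberror.log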
[17:35:55] <_joe_> hhvm [17:35:59] <_joe_> sorry I'm focused on that [17:36:07] that's fine :) [17:36:07] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:10] _joe_: good to stay focused :) [17:36:13] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:13] <_joe_> I figured I'm not the only ops here [17:36:14] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:23] Reedy: how about now? more apaches [17:36:23] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:29] <_joe_> mmmh very bad [17:36:34] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:36] (03CR) 10Hashar: [C: 031 V: 032] "I have manually edited on gallium /etc/apache/zuul_proxy and reloaded apache. That unleashed the feature. So this has been tested :-D" [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) (owner: 10Hashar) [17:36:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [17:36:46] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:53] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:54] PROBLEM - HHVM rendering on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.980 second response time [17:36:54] PROBLEM - Apache HTTP on mw1053 is CRITICAL: Connection timed out [17:36:54] PROBLEM - Apache HTTP on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.023 second response time [17:37:02] <_joe_> wat? [17:37:13] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:16] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:17] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:18] Are they just api apaches? [17:37:23] <_joe_> Reedy: no [17:37:24] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.507 second response time [17:37:33] <_joe_> hhvm apaches [17:37:34] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 1.134 second response time [17:38:03] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.441 second response time [17:38:08] !log Zuul seems to be happy. Reverted my lame patch to send Cache-Control headers since we have a cache breaker it is not needed. 
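(The status.json caching problem mentioned above can be confirmed from the response headers; a sketch, with the URL as an illustrative assumption:)

    curl -sI 'https://integration.wikimedia.org/zuul/status.json' | grep -iE '^(cache-control|age):'
    # The reverted patch amounted to having the backend emit something like:
    #   Cache-Control: no-cache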
[17:38:09] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.214 second response time [17:38:12] Logged the message, Master [17:38:24] PROBLEM - HHVM rendering on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.068 second response time [17:38:33] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 7.463 second response time [17:38:33] PROBLEM - HHVM rendering on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:37] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:51] <_joe_> shit [17:39:14] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:14] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 5.387 second response time [17:39:24] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.524 second response time [17:39:27] _joe_: want ori? [17:39:36] <_joe_> greg-g: not at the moment [17:39:43] PROBLEM - HHVM rendering on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.323 second response time [17:39:44] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.114 second response time [17:39:44] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 66193 bytes in 2.936 second response time [17:40:14] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:14] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:25] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: Connection timed out [17:40:53] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.520 second response time [17:41:14] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.344 second response time [17:41:15] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:34] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:45] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.417 second response time [17:41:45] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:22] <_joe_> !log rolling restarted hhvm appservers [17:42:24] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.423 second response time [17:42:25] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.287 second response time [17:42:25] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 8.851 second response time [17:42:25] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 9.776 second response time [17:42:26] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.903 second response time [17:42:27] Logged the message, Master [17:42:56] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.269 second response time [17:42:56] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.362 second response time [17:42:57] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is 
OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.197 second response time [17:43:17] PROBLEM - Apache HTTP on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [17:43:18] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [17:43:37] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.085 second response time [17:43:48] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.234 second response time [17:43:57] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.175 second response time [17:43:58] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [17:43:58] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.171 second response time [17:44:17] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [17:44:38] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [17:45:52] (03PS1) 10Alexandros Kosiaris: servermon: run make_update 6 times a day [puppet] - 10https://gerrit.wikimedia.org/r/170096 [17:46:14] (03PS1) 10Reedy: Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 [17:46:18] (03CR) 10jenkins-bot: [V: 04-1] Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 (owner: 10Reedy) [17:46:43] <_joe_> Reedy: just write 5 there :) [17:46:45] <_joe_> Reedy: on it [17:46:47] PROBLEM - Apache HTTP on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [17:46:58] PROBLEM - HHVM rendering on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.048 second response time [17:47:29] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.700 second response time [17:47:29] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:31] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:31] PROBLEM - HHVM rendering on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:48] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.311 second response time [17:47:59] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:09] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:10] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:17] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.499 second response time [17:48:29] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time [17:48:58] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.221 second response time [17:49:00] (03PS2) 10Reedy: Revert "HHVM: get 10% of anonymous traffic." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 [17:49:04] (03CR) 10jenkins-bot: [V: 04-1] Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 (owner: 10Reedy) [17:49:07] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.096 second response time [17:49:18] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.093 second response time [17:49:28] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.190 second response time [17:49:32] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 1.013 second response time [17:49:35] (03Abandoned) 10Reedy: Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 (owner: 10Reedy) [17:49:48] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:51] (03PS1) 10Giuseppe Lavagetto: revert to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170098 [17:49:58] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.099 second response time [17:50:18] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.352 second response time [17:51:01] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.361 second response time [17:51:47] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [17:52:57] (03CR) 10Giuseppe Lavagetto: [C: 032] revert to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170098 (owner: 10Giuseppe Lavagetto) [17:53:57] !log oblivian Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 06s) [17:54:07] Logged the message, Master [17:54:11] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:11] PROBLEM - HHVM rendering on mw1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [17:54:18] <_joe_> !log synchronized downsizing to 5% [17:54:23] Logged the message, Master [17:54:27] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:07] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:20] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.304 second response time [17:55:24] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:29] getting 500s on mw.org [17:55:49] yep [17:55:51] known :) [17:56:13] kk [17:58:07] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:17] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:51] ottomata: did you get elastic1015/16 back [17:59:02] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: run make_update 6 times a day [puppet] - 10https://gerrit.wikimedia.org/r/170096 (owner: 10Alexandros Kosiaris) [17:59:16] (03CR) 10Alexandros Kosiaris: [C: 032] Actually use view_name on ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170037 (owner: 10Alexandros Kosiaris) [17:59:35] ^d lmk when it's safe to upgrade 1009-1012 [17:59:49] <^d> Will do, still draining. [18:00:10] <^d> 15-16 are not up tho? or still waiting on otto for puppet?
[18:01:19] <_joe_> greg-g: ori can be helpful, yes [18:03:28] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [18:03:37] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.165 second response time [18:03:46] _joe_: texted [18:03:57] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.066 second response time [18:04:17] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.173 second response time [18:04:26] ori just walked in [18:04:45] RoanKattouw: voice to him to get on IRC plz :) [18:04:56] ^d adding to puppet now [18:05:14] hey [18:05:16] i'm here [18:05:16] <^d> coolio [18:05:18] what's up? [18:05:44] segfaults all over the place [18:05:46] 19:52 -!- Irssi: Pasting 5 lines to #mediawiki_security. Press Ctrl-K if you wish to do this or Ctrl-C to cancel. [18:05:50] 19:52 < paravoid> #9 ifree (ptr=0x7f452e357700) at src/jemalloc.c:1233 [18:05:51] okay, looking [18:05:53] 19:52 < paravoid> #10 free (ptr=0x7f452e357700) at src/jemalloc.c:1308 [18:05:56] 19:52 < paravoid> #11 0x00007f457d733688 in xmlFreeNodeList__internal_alias (cur=) at ../../tree.c:3683 [18:05:59] 19:52 < paravoid> #12 0x00007f457d733840 in xmlFreeProp__internal_alias (cur=0x7f4506ede2c0) at ../../tree.c:2081 [18:06:00] multiple hosts or only one? [18:06:02] 19:52 < paravoid> #13 0x00007f457d73390c in xmlFreePropList__internal_alias (cur=0x7f45d16c5318) at ../../tree.c:2056 [18:06:05] 19:52 < paravoid> [...] [18:06:07] 19:52 < paravoid> #19 0x00007f457d733621 in xmlFreeNodeList__internal_alias (cur=) at ../../tree.c:3654 [18:06:08] probably a double free [18:06:10] 19:52 < paravoid> #20 0x00007f457d733a3c in xmlFreeNode__internal_alias (cur=0x7f452e34a980) at ../../tree.c:3728 [18:06:13] 19:52 < paravoid> #21 0x0000000000eb8372 in HPHP::c_DOMDocument::sweep() () [18:06:16] 19:52 < paravoid> #22 0x000000000091a396 in HPHP::Sweepable::SweepAll() () [18:06:19] multiple [18:06:22] yup [18:06:45] <_joe_> ori: hell broke loose [18:06:47] (03PS1) 10Milimetric: Update wikimetrics submodule pointer [puppet] - 10https://gerrit.wikimedia.org/r/170103 [18:06:49] The Valve owned wiki of Team Fortress is having issues as well [18:06:52] "(Cannot contact the database server)" [18:06:53] okay, is the operational side of things under control? can I look at traces and fix this? [18:06:54] weird [18:07:03] ottomata: can you merge https://gerrit.wikimedia.org/r/170103 [18:07:07] just a submodule pointer [18:07:24] <_joe_> ori: ygpm [18:07:24] <_joe_> ori: the whole appservers cluster [18:07:28] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:47] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:03] the whole appservers cluster what? [18:08:14] <_joe_> ori: it kinda is now [18:08:22] <_joe_> but it's flapping [18:08:22] is what? [18:08:26] <_joe_> like ^^ [18:08:38] are you handling it? can i look away from irc and look at the source code and find a fix? 
[18:09:38] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [18:09:46] <_joe_> ori: the whole hhvm appserver cluster had problems [18:09:47] <_joe_> ori: sure [18:09:47] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.220 second response time [18:10:02] okay, tabbing out of irc for a few to look at this [18:10:47] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:58] (03CR) 10Alexandros Kosiaris: "Sigh... anything that can be salvaged ?" [puppet] - 10https://gerrit.wikimedia.org/r/170007 (owner: 10Alexandros Kosiaris) [18:10:59] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:59] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:10] paravoid: how did you get that trace? [18:11:17] dumped core [18:11:22] mw1053 [18:11:31] /var/tmp/hhvm/core [18:11:47] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:20] here's one from mw1024: https://dpaste.de/Q96F/raw [18:12:29] why was this not happening until now? what is the timeline? [18:12:40] this is a guess, but could it be that the XML patch is freeing the node, and a memory sweep is attempting to free that again? [18:12:49] yes, it looks exactly like that [18:12:58] right [18:13:00] <_joe_> what the hell has happened all of a sudden? [18:13:05] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 1.027 second response time [18:13:08] how are memory sweeps triggered in HHVM? [18:13:25] is it memory pressure, is it some memory or time threshold? [18:13:28] PROBLEM - HHVM rendering on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:37] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:46] I'd guess (again) that sweeps weren't initiated before, but now the US is waking up and traffic is increasing [18:13:48] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [18:13:48] end of request [18:13:55] ^d, what's wrong with 1006? [18:14:00] did you dial back down to 5%? [18:14:07] PROBLEM - Host elastic1015 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:11] <_joe_> ori: one server froze this morning [18:14:20] <_joe_> I took a trace and wanted to show you this evening [18:14:26] (03CR) 10Ottomata: [C: 032] Update wikimetrics submodule pointer [puppet] - 10https://gerrit.wikimedia.org/r/170103 (owner: 10Milimetric) [18:14:31] Didn't this happen in August?: https://github.com/facebook/hhvm/issues/3438 [18:14:32] <_joe_> paravoid: none of the above [18:14:37] <_joe_> paravoid: api appservers with hhvm handle 10 times the load [18:14:46] different load [18:14:49] Bsadowski1: different double free [18:14:51] <^d> ottomata: manybubbles had said that 1006 still needed noatime, so he redrained it for that. [18:14:55] Oh [18:14:57] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:58] Okay [18:15:01] i see noatime [18:15:09] <_joe_> ori: yes [18:15:10] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:17] <^d> ottomata: Ok, then I guess 1006 is ok. [18:15:20] ottomata: and 1003 was hyperthreading [18:15:21] <^d> 1003 still needs HT tho. [18:15:41] <^d> Unbanning 1006 since it has noatime. 
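(A minimal sketch of pulling backtraces like the one pasted above out of the dumped core; the hhvm binary path is an assumption:)

    gdb /usr/bin/hhvm /var/tmp/hhvm/core
    (gdb) bt                    # backtrace of the faulting thread
    (gdb) thread apply all bt   # all threads, to spot the sweep/free collision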
[18:15:48] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.088 second response time [18:15:48] ottomata: /dev/md2 on /var/lib/elasticsearch type ext4 (rw) [18:16:06] hmm [18:16:11] from mount [18:16:17] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.203 second response time [18:16:17] ^d: wait a bit [18:16:18] it is in fstab and i remounted... [18:16:21] <^d> k. [18:16:24] hm, strange [18:16:25] ok [18:16:30] i can stop es there now? [18:16:32] on 1006? [18:16:38] <_joe_> paravoid: we're back at 5%, and the load is non-existent. Still, it's crashing [18:16:48] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:48] PROBLEM - HHVM rendering on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:49] <_joe_> one thing we could do is re-upload the old package [18:16:51] OK, current theory: DOMDocument fix causes a double free, causing a segfault, causing a core dump and service restart, during which time HHVM is slow or unresponsive [18:16:53] ottomata: checking [18:16:54] <^d> ottomata: yeah [18:17:03] _joe_: we could, but to the app servers, not api [18:17:07] PROBLEM - Host elastic1016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [18:17:12] ori: the paste you gave https://dpaste.de/Q96F/raw has nothing XML-related [18:17:13] ottomata: yeah - its cool [18:17:22] ori: is this an actually crashed server? [18:17:35] or is it just a working one at some random point in time? [18:17:49] <_joe_> ori: that was my idea [18:17:52] <_joe_> paravoid: that is a server in the locked-down state [18:18:02] paravoid: it was working but had just had an alert issued for it [18:18:21] <_joe_> where it's still up but not responding [18:18:22] i don't think it's locked up, i think it's just initializing under load [18:18:34] what was its uptime? [18:18:42] a few minutes [18:18:47] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:54] well okay, we can ignore that then [18:19:07] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:10] PROBLEM - HHVM rendering on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:17] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:20] <_joe_> ok, re-uploading the old packages [18:19:28] RECOVERY - Host elastic1015 is UP: PING OK - Packet loss = 0%, RTA = 4.57 ms [18:19:51] _joe_: thanks [18:20:48] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:00] <_joe_> ok [18:21:07] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.131 second response time [18:21:21] <_joe_> can someone keep an eye on the cluster in the meantime? 
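(The noatime check above boils down to comparing fstab against the live mount; a sketch, with the fstab options line shown purely as an illustration:)

    grep /var/lib/elasticsearch /etc/fstab
    # e.g. /dev/md2  /var/lib/elasticsearch  ext4  defaults,noatime  0  2
    mount -o remount,noatime /var/lib/elasticsearch
    mount | grep elasticsearch   # should now show (rw,noatime)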
[18:21:25] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:25] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:42] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:51] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:14] PROBLEM - SSH on elastic1015 is CRITICAL: Connection refused [18:22:15] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:16] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:31] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.172 second response time [18:22:32] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [18:22:32] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 1.804 second response time [18:22:41] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.002 second response time [18:22:44] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:52] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [18:23:41] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:41] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.798 second response time [18:23:51] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:53] _joe_: and by keep an eye on you mean? telling you more are having problems? 
[18:25:33] RECOVERY - SSH on elastic1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:25:46] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:52] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.302 second response time [18:25:52] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.323 second response time [18:25:52] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [18:25:52] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [18:26:01] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.298 second response time [18:26:02] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.171 second response time [18:26:02] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [18:26:04] _joe_: heh [18:26:05] er [18:26:07] greg-g: heh :) [18:26:51] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.072 second response time [18:27:41] <_joe_> ok, mw1018 has the old packages [18:27:54] push it to all the app servers (not the api) [18:27:56] it will fix the issue [18:28:05] it's a double-free resulting from the patch for the leak [18:28:12] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:21] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:07] ^ _joe_ [18:30:52] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [18:30:59] um, sorry, not following along here, ori: I was reimaging mw1032 for hhvm [18:31:01] haven't repooled it [18:31:02] should I not? [18:31:11] (03CR) 10JanZerebecki: [C: 031] "After talking about this: It is good to apply this minimal version to fix the security issue. Then change to ssl_ciphersuite() in a new co" [puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [18:31:20] (03CR) 10Dzahn: [C: 032] "Jan, yes, let's make another change though. let me apply the quick fix and confirm really quick it is applied on the instance tools-webpro" [puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [18:31:24] ottomata: hang on for a moment [18:32:01] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [18:32:18] k [18:32:22] _joe_: ack? [18:32:32] can we get the wmf3 package on all the non-api hhvm servers? [18:32:33] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:44] <_joe_> ori: ok, working on it [18:32:51] ok, ^d, manybubbles, 1006 has noatime now [18:32:52] RECOVERY - Disk space on elastic1015 is OK: DISK OK [18:32:52] RECOVERY - DPKG on elastic1015 is OK: All packages OK [18:32:53] sorry about that [18:33:11] RECOVERY - check if dhclient is running on elastic1015 is OK: PROCS OK: 0 processes with command name dhclient [18:33:11] RECOVERY - check if salt-minion is running on elastic1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:33:14] sorry, doing a lot of multitasking today! [18:33:22] ^ is that someone else doing puppet on the remaining nodes? [18:33:25] cmjohnson: ? 
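(A sketch of the rollback being asked for above, pushing the previous package to the non-API HHVM app servers; the salt target and the version string are illustrative assumptions:)

    salt 'mw10*' cmd.run 'apt-get update && apt-get -y --force-yes install hhvm=3.3.0+dfsg1-1+wmf3'
    salt 'mw10*' cmd.run 'dpkg -l hhvm | tail -1'   # confirm the downgrade took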
[18:33:42] RECOVERY - RAID on elastic1015 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:33:42] RECOVERY - check configured eth on elastic1015 is OK: NRPE: Unable to read output [18:33:49] ottomata yeah that's me [18:33:57] PROBLEM - puppet last run on elastic1015 is CRITICAL: CRITICAL: Puppet has 1 failures [18:34:20] <_joe_> ori: I am deploying the package now [18:34:21] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [18:34:21] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.190 second response time [18:34:32] k, thanks cmjohnson [18:34:36] ottomata: elastic1006 looks good [18:34:51] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.432 second response time [18:35:11] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.236 second response time [18:35:13] !log unbanning elastic1006 now that it is properly configured [18:35:18] !log restarting nginx on toollabs webproxy [18:35:18] Logged the message, Master [18:35:23] Logged the message, Master [18:35:43] ottomata: just 1003 left from the last batch [18:35:46] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [18:35:52] RECOVERY - Disk space on elastic1016 is OK: DISK OK [18:36:02] RECOVERY - RAID on elastic1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:36:05] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.098 second response time [18:36:19] recovering, right? [18:36:41] RECOVERY - check if salt-minion is running on elastic1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:36:41] RECOVERY - check if dhclient is running on elastic1016 is OK: PROCS OK: 0 processes with command name dhclient [18:36:41] RECOVERY - check configured eth on elastic1016 is OK: NRPE: Unable to read output [18:36:52] RECOVERY - DPKG on elastic1016 is OK: All packages OK [18:37:11] PROBLEM - DPKG on mw1163 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:11] PROBLEM - DPKG on mw1025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:11] PROBLEM - HHVM processes on mw1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:37:19] PROBLEM - HHVM processes on mw1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:37:26] PROBLEM - DPKG on mw1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:26] PROBLEM - HHVM rendering on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [18:37:26] PROBLEM - HHVM rendering on mw1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [18:37:26] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:37:26] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [18:37:27] PROBLEM - DPKG on mw1024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:27] PROBLEM - DPKG on mw1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:38] PROBLEM - HHVM processes on mw1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:38:02] PROBLEM - HHVM processes on mw1029 is CRITICAL: PROCS CRITICAL: 0
processes with command name hhvm [18:38:04] <_joe_> ori: we're safe and secure for now [18:38:10] PROBLEM - DPKG on mw1023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:38:10] PROBLEM - HHVM processes on mw1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:38:26] blah [18:38:50] mw1030.eqiad.wmnet, mw1031, and mw1032 still don't have wmf3 [18:38:54] PROBLEM - DPKG on mw1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:04] PROBLEM - DPKG on mw1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:05] PROBLEM - HHVM processes on mw1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:39:41] PROBLEM - DPKG on mw1053 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:41] PROBLEM - DPKG on mw1022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:42] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [18:39:51] RECOVERY - ElasticSearch health check for shards on elastic1016 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 30, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 18, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 30 [18:39:54] RECOVERY - DPKG on mw1025 is OK: All packages OK [18:40:00] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.191 second response time [18:40:00] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 18: initializing_shards: 0: unassigned_shards: 0 [18:40:03] what is going on [18:40:15] RECOVERY - HHVM processes on mw1027 is OK: PROCS OK: 1 process with command name hhvm [18:40:21] <_joe_> ori: yeah I'm arriving there [18:40:26] RECOVERY - DPKG on mw1026 is OK: All packages OK [18:40:27] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.159 second response time [18:40:27] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.185 second response time [18:40:27] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [18:40:27] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.195 second response time [18:40:28] RECOVERY - DPKG on mw1027 is OK: All packages OK [18:40:37] can you guys please schedule downtime so that we stop getting paged [18:40:40] if these are known issues [18:40:41] RECOVERY - HHVM processes on mw1025 is OK: PROCS OK: 1 process with command name hhvm [18:40:47] jgage: who are you addressing? [18:40:57] is this re elastic or hhvm?
[18:40:57] anybody who is causing us to get paged repeatedly :) [18:41:01] hhvm [18:41:13] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [18:41:15] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.743333333333 [18:41:32] RECOVERY - DPKG on mw1028 is OK: All packages OK [18:41:36] i just got 10 [18:41:42] RECOVERY - DPKG on mw1029 is OK: All packages OK [18:41:43] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [18:41:44] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.198 second response time [18:41:54] is your SMS queue really the most important thing right now? [18:41:59] could we talk about that in, say, five minutes? [18:42:03] RECOVERY - HHVM processes on mw1028 is OK: PROCS OK: 1 process with command name hhvm [18:42:08] <_joe_> mmmh whoever tried to reinstall hhvm on a few servers, that went wrong [18:42:11] <_joe_> jgage: this is an outage [18:42:16] ok [18:42:27] <_joe_> . [18:42:52] RECOVERY - HHVM processes on mw1029 is OK: PROCS OK: 1 process with command name hhvm [18:43:13] I imagine elasticsearch is also causing you to get paged. ottomata can you schedule the downtime before you do the things? [18:43:17] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.170 second response time [18:43:22] rather, I guess it'd be cmjohnson too [18:43:27] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [18:43:31] aye, not a bad idea. :) [18:43:31] Ugh [18:43:36] um, are there any more to do after this though? [18:43:37] Could https://gerrit.wikimedia.org/r/#/c/170011/ be deployed please? [18:43:46] all pages have been for hhvm [18:44:12] jgage: https://gerrit.wikimedia.org/r/#/c/170011/ is supposed to stop those pages, something had critical=>true that shouldn't have had it [18:44:19] But it seems to either not have been deployed or not be working [18:44:20] cool ok [18:44:32] _joe_: Did you never end up deploying https://gerrit.wikimedia.org/r/#/c/170011/ ? Or does it just not work? [18:44:41] ms-fe paged too [18:45:33] huh, i haven't gotten that one. perhaps it's queued somewhere. [18:45:50] I doubt anyone's even noticed it till I mentioned it, it's in the midst of paging spam [18:46:01] <_joe_> RoanKattouw: not now please [18:46:01] <_joe_> 1 min [18:46:03] I didn't get that one either [18:46:05] Sure [18:46:08] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [18:48:25] bblack: I did not get ms-fe either ... [18:48:30] only HHVM [18:48:47] RECOVERY - DPKG on mw1024 is OK: All packages OK [18:48:59] RECOVERY - DPKG on mw1023 is OK: All packages OK [18:49:16] my ms-fe page also shows a datestamp of Wed Sept 10 09:06:34 UTC, and says it's CRITICAL, but I don't see ms-fe wrong in icinga [18:49:18] RECOVERY - DPKG on mw1022 is OK: All packages OK [18:49:18] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 66193 bytes in 1.745 second response time [18:49:19] wtf? [18:49:30] something's a week in the future [18:49:40] Sep 10 ?
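(jgage's ask above — schedule downtime so planned work stops paging — maps onto Icinga's external command pipe. A sketch, assuming the stock Debian pipe location and Icinga 1.x command syntax; the host, author, and comment strings are illustrative:

    # Sketch: two hours of fixed downtime for a host and all its services.
    CMD=/var/lib/icinga/rw/icinga.cmd   # assumed default pipe path
    HOST=elastic1009
    NOW=$(date +%s); END=$((NOW + 7200))
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;cmjohnson;swapping SSDs\n' \
      "$NOW" "$HOST" "$NOW" "$END" > "$CMD"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s;%s;1;0;7200;cmjohnson;swapping SSDs\n' \
      "$NOW" "$HOST" "$NOW" "$END" > "$CMD"

The same thing can be done per host from the Icinga web UI, which is usually quicker for one-offs.)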
[18:49:50] RECOVERY - puppet last run on elastic1015 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:49:58] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.602 second response time [18:50:15] smells of something being stuck in the queue since the problems we had back then [18:50:17] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [18:50:27] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.226 second response time [18:50:27] RECOVERY - HHVM processes on mw1053 is OK: PROCS OK: 1 process with command name hhvm [18:50:28] not to mention Sept 10 is a monday this year [18:50:33] <_joe_> RoanKattouw: I have my theory [18:50:42] _joe_: should be fine everywhere now [18:50:44] <_joe_> RoanKattouw: icinga config is broken since forever [18:50:59] <_joe_> so my change didn't go through [18:51:02] bblack: my cal says Wed [18:51:02] RECOVERY - DPKG on mw1053 is OK: All packages OK [18:51:03] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:51:10] bblack: Sept .. not Nov :P [18:51:15] oh right [18:51:22] _joe_: mw1063 has the wrong package [18:51:22] RECOVERY - DPKG on mw1163 is OK: All packages OK [18:51:27] RECOVERY - NTP on elastic1015 is OK: NTP OK: Offset -0.02422082424 secs [18:51:37] I guess I should figure out how months work :p [18:51:42] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 31, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 31 [18:51:43] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [18:51:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:51:46] so I'm getting pages from several weeks ago [18:51:47] :-) [18:51:52] RECOVERY - HHVM processes on mw1163 is OK: PROCS OK: 1 process with command name hhvm [18:52:05] yeah I think it was around the time globalsms was having issues [18:52:16] oh, even dumber. I'm not, it's just a combination of the stupid message-id thing with our t-mobile pages and how my android SMS client groups them [18:52:30] they're showing up paired with re-used sender IDs from totally unrelated messages in the past [18:52:36] <_joe_> ori: I had to fix a few apt nuances around [18:52:38] <_joe_> yes now it is [18:52:53] heh [18:52:57] <_joe_> ori: mw1063? [18:53:06] oh, it alerted earlier [18:53:08] i guess unrelated [18:53:19] <_joe_> ori: 1163 has just been reinstalled [18:53:23] so the 9xxx "from" numbers on the t-mobile pages apparently have started being reused now, which is even worse than when they were all unique for every message [18:53:41] _joe_: can we go back to 10%?
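(A quick way to find hosts like mw1063 that alert with the wrong package build: since every app server runs a salt-minion — the PROCS checks above watch for it — a version audit can be run from the salt master. A sketch; the target glob, and the assumption that minion IDs match these hostnames, are mine:

    # Sketch: report the installed hhvm version on every mw10xx minion.
    salt 'mw10*' pkg.version hhvm

    # Compare installed vs. candidate version from the apt repo.
    salt 'mw10*' cmd.run 'apt-cache policy hhvm | grep -E "Installed|Candidate"'

Any host whose installed version differs from the candidate is one the reinstall missed.)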
[18:53:46] <_joe_> ori: mw1021 still is on the wrong version though [18:53:55] RECOVERY - NTP on elastic1016 is OK: NTP OK: Offset -0.0165220499 secs [18:54:29] (03PS2) 10Alexandros Kosiaris: Introduce postgres role for labsdb1004, labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/170077 [18:54:53] <_joe_> ori: without a patch? [18:55:10] <_joe_> well, if we have one by tomorrow... [18:55:19] <_joe_> I don't see the point though [18:55:27] to relieve pressure on the zend [18:55:30] app servers [18:56:18] <_joe_> but really up to you [18:56:30] <_joe_> I am bailing out now [18:56:34] OK [18:56:36] thanks for the help [18:56:37] (03PS1) 10Dzahn: domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 [18:56:40] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce postgres role for labsdb1004, labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/170077 (owner: 10Alexandros Kosiaris) [18:56:41] <_joe_> ori: they are at 67% load [18:56:52] <_joe_> which is not optimal but not so bad [18:56:55] <_joe_> but, please let me go to dinner [18:57:01] they = zend or hhvm? [18:57:02] ok [18:57:08] <_joe_> since this is temporarily solved [18:57:14] yes, go [18:57:36] <_joe_> we can discuss this later [18:58:51] (03CR) 10JanZerebecki: [C: 031] domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 (owner: 10Dzahn) [18:58:52] you actually have to get off the computer if you want to get off the computer :P [18:59:10] <_joe_> ori: zend, hhvm is at 10% [18:59:32] Zen Master Ori [18:59:54] <_joe_> I've been substituted with a lord of the rings movie [19:00:00] <^d> Gah, bouncer died. [19:00:05] Zend Master [19:00:16] mutante: Thanks for taking it to the next level :) [19:01:32] <^d> cmjohnson: 9-12 are empty and ready for shooting (dunno if manybubbles mentioned it, lost all scrollback and so forth) [19:01:53] nope...cool..then I will get them now [19:01:55] thx [19:02:08] ottomata: 3 still need hyperthreading. poke. poke. :) [19:02:09] !log powering off elastic1009-1012 to replace ssds [19:02:13] Logged the message, Master [19:02:43] manybubbles: which ones? [19:02:55] ottomata: sorry - elastic1003 [19:03:00] hm. ok [19:03:02] it is empty? [19:03:11] <^d> yes. [19:03:11] <^d> and 1015 and 1016 need noatime, also empty. [19:03:34] oh, cmjohnson probably doesn't know that part [19:03:47] cmjohnson: [19:03:51] if you are doing elasticsearch installs [19:03:59] you need to do this before you run puppet (and elasticsearch gets started) [19:04:00] https://wikitech.wikimedia.org/wiki/Search#Adding_new_nodes [19:04:18] okay...yeah didn't know that part [19:05:15] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:25] PROBLEM - Host elastic1009 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:05:48] PROBLEM - Host elastic1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:25] cmjohnson: if you want, you can do those steps on the 1015 and 1016 [19:06:26] or I can [19:06:27] lemme know [19:06:38] (03PS1) 10Ori.livneh: webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 [19:06:42] oh, yeah, if you do that before puppet runs, you won't have to stop/start elasticsearch, obviously [19:06:54] PROBLEM - Host elastic1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:01] (03PS2) 10Ori.livneh: webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 [19:07:03] greg-g: MaxSem: woot, thanks!
my deployment keys thing is resolved now [19:08:52] (03PS3) 10Ori.livneh: webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 [19:09:09] ottomata: go ahead and do it..gonna take me a few to swap out the disks [19:10:15] k [19:12:35] PROBLEM - Host elastic1016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:13:04] thats me [19:13:07] didn't do downtime :/ [19:13:17] (03PS2) 10Dzahn: domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 [19:13:26] PROBLEM - Host elastic1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:39] <^d> yay, feels better to be on the bouncer. [19:14:37] (03CR) 10John F. Lewis: [C: 031] domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 (owner: 10Dzahn) [19:18:27] RECOVERY - Host elastic1015 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [19:18:28] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [19:19:49] ok, sorry bout that ^d, 15 and 16 should be good now [19:19:55] and 03 as well [19:20:12] <_joe_> jgage: I removed pages for hhvm this morning [19:20:27] <_joe_> jgage: but osmium with puppet failing made neon fail [19:20:32] (03CR) 10Dzahn: [C: 032] domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 (owner: 10Dzahn) [19:20:39] <_joe_> so, just fixed it in the db [19:20:40] <_joe_> and rerunning puppet on neon [19:22:24] (03CR) 10Ori.livneh: [C: 032] webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 (owner: 10Ori.livneh) [19:22:37] (03PS1) 10Spage: Typo in labs wgContentHandlerUseDB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) [19:23:46] _joe_, mw1032 is ready for repooling, lemme know if I should [19:23:49] that's the state it is in [19:25:21] ^d: any chance for https://bugzilla.wikimedia.org/show_bug.cgi?id=68452 being implemented ? [19:26:55] <^d> I honestly hadn't thought about it recently. [19:27:07] <_joe_> ottomata: not now thanks! [19:27:12] <_joe_> and don't go on with more [19:27:48] _joe_: why not repool it? [19:28:25] ^d: too busy with cirrus to big wikis ? it can wait, just spare some brain resources to it in near future please :) [19:28:48] <_joe_> ori: because I don't think I installed the wmf3 packages there [19:28:57] _joe_: oh, right. all right. [19:29:05] ok, _joe_, i will update etherpad, lemme know when I should continue (tomorrow or whenever, or just finish it up when you are ready) [19:29:55] <_joe_> ottomata: I hope tomorrow we will [19:30:13] (03CR) 10Dzahn: [C: 031] "+1 , pending approval" [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [19:30:41] k [19:32:05] (03CR) 10Dzahn: [C: 031] "+1 , pending wait period (i guess)" [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [19:33:54] <^d> ottomata: Everything looks good, it's just 1009-12 to go :) [19:34:09] <^d> All other 27 nodes pooled. [19:34:38] coool, i think cmjohnson is putting the new drives now, ja? [19:34:43] on those [19:35:00] PROBLEM - NTP on elastic1015 is CRITICAL: NTP CRITICAL: Offset unknown [19:35:40] (03PS1) 10Spage: $wgContentHandlerUseDB true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) [19:37:48] (03CR) 10Spage: [C: 04-2] "I'm putting this up for review, but blockers of bug 49193 need to be addressed before deploying."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [19:38:18] (03CR) 10Reedy: [C: 031] "lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [19:38:22] RECOVERY - NTP on elastic1015 is OK: NTP OK: Offset -0.0147947073 secs [19:39:38] spagewmf: I'm guessing you're meaning the security bug as a blocker? [19:41:15] (03CR) 10Reedy: "I'm guessing you mean the security bug 70901 ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [19:41:41] Reedy yes. Flow and some other projects want ContentHandlerUseDB, but csteipp needs to feel good about it. I'm unclear if there are other blockers, like permissions adjustments or a new right. [19:42:13] (03PS1) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [19:42:31] spagewmf: Daniel Kinzler is probably the best person to comment on that with him writing it :) [19:42:35] Added him as a reviewer [19:42:50] (03CR) 10jenkins-bot: [V: 04-1] Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [19:43:22] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [19:43:32] RECOVERY - Host elastic1009 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [19:43:37] RECOVERY - Host elastic1011 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [19:43:41] (03PS2) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [19:44:24] Reedy: thanks. Also I don't know if some wiki doesn't want this, e.g. it uses some extension with a dangerous content format [19:44:52] RECOVERY - Host elastic1012 is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms [19:45:13] PROBLEM - DPKG on elastic1010 is CRITICAL: Connection refused by host [19:45:22] PROBLEM - check if dhclient is running on elastic1010 is CRITICAL: Connection refused by host [19:45:24] PROBLEM - check configured eth on elastic1010 is CRITICAL: Connection refused by host [19:45:25] PROBLEM - ElasticSearch health check for shards on elastic1010 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.142:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:45:33] PROBLEM - check if salt-minion is running on elastic1010 is CRITICAL: Connection refused by host [19:45:33] PROBLEM - RAID on elastic1010 is CRITICAL: Connection refused by host [19:45:53] PROBLEM - puppet last run on elastic1010 is CRITICAL: Connection refused by host [19:46:02] PROBLEM - check if dhclient is running on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:03] PROBLEM - RAID on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - check if salt-minion is running on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - check configured eth on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - puppet last run on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - Disk space on elastic1010 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - SSH on elastic1010 is CRITICAL: Connection refused [19:46:04] PROBLEM - DPKG on elastic1011 is CRITICAL: Connection refused by host [19:46:06] spagewmf: i can't think of a reason for it not to have $wgContentHandlerUseDB = true [19:46:12] PROBLEM - ElasticSearch health check on 
elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [19:46:12] PROBLEM - Disk space on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:13] PROBLEM - check if salt-minion is running on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:22] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [19:46:22] PROBLEM - ElasticSearch health check for shards on elastic1009 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.141:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:46:23] PROBLEM - SSH on elastic1009 is CRITICAL: Connection timed out [19:46:27] it was there in the beginning to allow schema changes [19:46:33] PROBLEM - check configured eth on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:33] PROBLEM - SSH on elastic1011 is CRITICAL: Connection timed out [19:46:34] PROBLEM - ElasticSearch health check for shards on elastic1011 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.143:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:46:34] PROBLEM - puppet last run on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:42] PROBLEM - DPKG on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:42] PROBLEM - check if dhclient is running on elastic1011 is CRITICAL: Timeout while attempting connection [19:46:43] and as a new feature in mw1.21, to be tried in a more limited manner [19:46:52] PROBLEM - RAID on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:55] but we are not in 1.25 :) [19:47:01] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.141 [19:47:02] PROBLEM - Disk space on elastic1011 is CRITICAL: Timeout while attempting connection [19:47:07] now* [19:47:22] PROBLEM - check configured eth on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:32] PROBLEM - Disk space on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:38] PROBLEM - RAID on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:42] PROBLEM - SSH on elastic1012 is CRITICAL: Connection timed out [19:47:44] PROBLEM - check if dhclient is running on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:44] PROBLEM - puppet last run on elastic1012 is CRITICAL: Timeout while attempting connection [19:48:13] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [19:48:13] PROBLEM - check if salt-minion is running on elastic1012 is CRITICAL: Timeout while attempting connection [19:48:13] PROBLEM - DPKG on elastic1012 is CRITICAL: Timeout while attempting connection [19:48:13] PROBLEM - ElasticSearch health check for shards on elastic1012 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.144:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:48:43] RECOVERY - SSH on elastic1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:48:48] RECOVERY - SSH on elastic1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:48:48] RECOVERY - SSH on elastic1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:49:26] (03CR) 10Dzahn: [C: 032] ""up to a billion change and 999 patchsets" sounds enough indeed, also already tested by hashar :)" [puppet] - 
10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) (owner: 10Hashar) [19:49:33] RECOVERY - SSH on elastic1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:50:13] aude: I agree it should be fine, but what if some wiki+obscure extension permits setContentFormat( 'Main_Page', 'sql-query') 8-) [19:51:24] * aude can't imagine that ;) [19:51:43] ottomata: elastics are finished [19:51:53] after all, content handler actually works fine without the setting [19:52:02] but you are stuck with one content type per namespace in that case [19:52:44] or "file extension" (e.g. the .js pages in MediaWiki ns) [19:53:39] and have scribunto using it etc. already [19:54:29] aude: yes, without this making a talk page into a Flow board is a hack. FYI Flow will have a Special page to convert from wiki page to Flow board. [19:54:51] * aude nods [19:55:13] anyway, suppose chris can review the config change [19:58:43] PROBLEM - NTP on elastic1010 is CRITICAL: NTP CRITICAL: No response from NTP server [20:00:02] PROBLEM - NTP on elastic1009 is CRITICAL: NTP CRITICAL: No response from NTP server [20:00:02] PROBLEM - NTP on elastic1011 is CRITICAL: NTP CRITICAL: No response from NTP server [20:00:52] PROBLEM - NTP on elastic1012 is CRITICAL: NTP CRITICAL: No response from NTP server [20:10:16] (03PS1) 10Dzahn: remove en2.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/170138 [20:11:35] (03CR) 10Dzahn: "hrmm "inurl:en2.wikipedia.org" About 636 results" [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [20:12:49] (03CR) 10Dzahn: "https://www.google.com/search?q=en2.wikipedia.org#q=inurl:en2.wikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [20:14:29] (03PS3) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [20:14:31] (03PS1) 10Cscott: De-lint modules/ocg/manifests/decommission.pp. [puppet] - 10https://gerrit.wikimedia.org/r/170140 [20:18:45] (03PS1) 10Dzahn: wikipedia.org: sort other services alphabetically [dns] - 10https://gerrit.wikimedia.org/r/170145 [20:20:35] (03CR) 10Dzahn: [C: 032] wikipedia.org: sort other services alphabetically [dns] - 10https://gerrit.wikimedia.org/r/170145 (owner: 10Dzahn) [20:21:45] Reedy: ^ and that also gave us "mai.wp" [20:21:54] Reedy: you can go ahead with installing the wiki if you like [20:21:57] lol [20:21:58] (03CR) 10John F. Lewis: [C: 031] "Interesting. The link Daniel gave is mostly just tracers or archives from what I see." [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [20:21:58] whee [20:22:08] yea:) needed a reason that actually touches the zonefile [20:22:17] if we touch only lang template.. doesn't update [20:22:20] what, a new wiki? [20:22:26] indeed [20:22:27] aude: yes, "mai" [20:22:30] mutante: gonna update wikitech? [20:22:37] Reedy: ok, yea [20:22:39] ooooh, populate sited ;) [20:22:41] site* [20:22:46] s [20:23:22] (03CR) 10Dzahn: "was trying to find a reason to touch wikipedia.org to make the zones get regenerated to get "mai.wikipedia.org" active :)" [dns] - 10https://gerrit.wikimedia.org/r/170145 (owner: 10Dzahn) [20:34:07] <^d> Hmm, wonder why the elastics are all complaining about ntp.
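(On ^d's closing question: the elastic NTP criticals above ("No response from NTP server") are the usual post-reinstall pattern — ntpd needs a few minutes after first boot before it answers queries, and the earlier elastic1015/1016 NTP checks did recover on their own. A sketch for confirming by hand; the plugin path is the Debian default and an assumption about this setup:

    # Sketch: check NTP state on a freshly reinstalled node, e.g. elastic1010.
    ntpq -pn    # a '*' in the first column marks a selected peer

    # Same style of offset check the monitoring runs:
    /usr/lib/nagios/plugins/check_ntp_time -H localhost -w 0.5 -c 1

If no peer is selected yet, waiting for ntpd to sync is usually all that is needed.)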
[20:49:34] (03PS2) 10Krinkle: contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:49:44] (03PS3) 10Krinkle: contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:49:49] (03CR) 10Krinkle: [C: 031] contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:51:25] (03CR) 10Hashar: "Danke mutante :-)" [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) (owner: 10Hashar) [20:56:04] (03CR) 10Dzahn: [C: 031] "lgtm, we already use graphviz in several places (hhvm,toollabs,bugzilla,librenms,..). mini typo in the comment: Graphiz vs. Graphviz" [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:58:13] (03CR) 10Dzahn: [C: 031] "13:02 < Reedy> it was "load balancing"" [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [21:05:35] (03CR) 10Gage: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/169728 (owner: 10BryanDavis) [21:06:39] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [21:08:51] (03PS1) 10QChris: For ganglia views, use description as name only if it is set [puppet] - 10https://gerrit.wikimedia.org/r/170160 [21:15:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:19:05] (03CR) 10Krinkle: "Hm.. from what I heard this was no longer needed?" [puppet] - 10https://gerrit.wikimedia.org/r/145997 (https://bugzilla.wikimedia.org/67957) (owner: 10Ori.livneh) [21:19:36] https://wikitech.wikimedia.org/wiki/Special:NewFiles [21:20:00] (03CR) 10Hashar: "Style suggestion inline https://gerrit.wikimedia.org/r/#/c/170140/1/modules/ocg/manifests/decommission.pp :D" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [21:21:12] (03PS1) 10Ori.livneh: webperf/navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 [21:21:45] mutante: nice graph there :p [21:22:05] hashar: mediawiki-core-bundle-rubocop? [21:22:19] (03CR) 10jenkins-bot: [V: 04-1] webperf/navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 (owner: 10Ori.livneh) [21:22:28] JohnLewis: https://wikitech.wikimedia.org/wiki/File:Weegees_Army%27s.PNG [21:22:34] (03CR) 10QChris: "It seems at least the varnishkafka, and kafkatee ganglia" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170003 (owner: 10Alexandros Kosiaris) [21:22:34] ori: that is a linter for ruby. Should be non voting [21:22:49] but mediawiki-core tests are so slow already, why add something of such small value? [21:22:52] ori: it is a work in progress by Zeljkof and Dan [21:23:09] mutante: I prefer the graph. (seriously - wtf is that?) [21:23:18] jenkins is really, really, really slow -- we should be scrupulous about what we add [21:23:38] JohnLewis: i would like to have a gallery of just the contribs of one user [21:23:52] ^d are elastic1009-12 good? [21:24:30] JohnLewis: https://wikitech.wikimedia.org/wiki/Special:Contributions/WikitechAnio243 [21:24:37] ori: Jenkins is not slow. The mediawiki/core PHPUnit test suite is and it needs to be given some love. The Rubocop one is quite fast and all jobs are run in parallel so that is not really slowing anything. 
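(Since the Rubocop job is meant to be a fast, non-voting lint, the cheapest way to keep it green is to run the same check locally before pushing. A sketch, assuming — as the job name mediawiki-core-bundle-rubocop suggests — that the checkout's Gemfile pins rubocop:

    # Sketch: run the CI Ruby lint locally in a mediawiki/core clone.
    cd core                              # hypothetical local checkout path
    bundle install --path vendor/bundle  # install the pinned gems locally
    bundle exec rubocop                  # lint with the repo's configuration)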
[21:25:38] mutante: they're... definitely... creative? [21:26:00] (03PS2) 10Ori.livneh: webperf: navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 [21:26:54] sorry but https://wikitech.wikimedia.org/wiki/File:Wikipedia_block.PNG is top work [21:27:06] JohnLewis: bonus for trying .exe ?:P [21:27:07] jorm ^^ you have competition ;) [21:27:40] mutante: .exe2! It's twice as good as .exe [21:27:47] lol, right [21:27:56] (03CR) 10Hashar: "OCG on beta seems to have two instances for now: deployment-pdf01 and deployment-pdf02 do you plan to convert them both to Jenkins slaves?" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [21:28:16] .com all the way! [21:28:23] double rot13 encoding [21:28:29] don't delete them mutante - keep them to show even Wikitech can be an awesome wiki :D [21:29:59] ok :) so what i really wanted was the warning icon thing on wikitech :) [21:30:00] <^d> cmjohnson: Good question, lemme check. [21:30:02] <^d> (was in meeting) [21:30:15] that made me look at the Special page [21:30:52] <^d> cmjohnson: Still need initial puppet run, I'm getting password prompt. [21:30:59] okay [21:31:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:31:22] JohnLewis: this one bugs me though https://wikitech.wikimedia.org/wiki/File:788029901.jpg :o [21:31:53] mutante: Comment: 'oops 404' [21:33:32] ooh, Internal Error on delete action [21:34:22] (03CR) 10Ori.livneh: [C: 032] webperf: navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 (owner: 10Ori.livneh) [21:35:54] (03CR) 10Gage: [C: 031] "In general: puppetmaster's puppet.log includes diffs, so it seems that for example updating a password hash from the private puppet repo w" [puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: 10BryanDavis) [21:35:59] mutante: coincidence? I think not! [21:37:08] what the hell is that a logo for? [21:37:36] JohnLewis: i don't know.. [21:37:44] JohnLewis: i dont think i have deleted images on wikitech before [21:37:55] just uploaded them [21:39:08] jorm: it is the next Wikipedia logo of course [21:39:43] Clearly. [21:41:56] (03PS1) 10Chad: Adjust number of content shards for largest wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170173 [21:43:09] mutante: https://github.com/wikimedia/operations-mediawiki-config/commit/d4d53e82730f142057caa4f952d2bbb667bf7ab9 [21:43:38] Does Wikitech log to files or anything useful at all? [21:44:53] Reedy: shouldn't it inherit the cluster defaults for logging? [21:45:09] JohnLewis: Well, yeah [21:45:23] so why ask? :p [21:45:31] Many reasons [21:46:07] good point [21:46:16] As we found when moving it to the same code/config base, virt1000 can't access many things on the cluster [21:51:05] (03PS1) 10Ori.livneh: Update my (=ori's) deployment script [puppet] - 10https://gerrit.wikimedia.org/r/170177 [21:55:23] <^d> Reedy: Would https://phabricator.wikimedia.org/P50 work how I expect it to? [21:55:39] <^d> (it's been awhile since I've played with +'s in init settings) [21:56:13] For commons it would [21:56:20] as there's no key conflict [21:56:23] for en/de... [21:57:11] I think it should override.. should be default, group, dbname for precedence I think [21:57:32] live hack on tin, sync-common, eval.php? [21:59:28] <^d> Yeah i'm gonna have to. [22:04:26] <^d> Reedy: I want + on all 3. 
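(The '+' being discussed is the wmf-config convention where a '+'-prefixed key in a settings array merges with the inherited value instead of replacing it. The "live hack on tin, sync-common, eval.php" loop mentioned above is how to see what a given wiki actually resolves; a sketch, with the global name taken from the shard-count change under discussion but not verified here:

    # Sketch: inspect the merged value of a '+'-prefixed setting on one host.
    sync-common                            # pull the live-hacked config
    mwscript eval.php --wiki=commonswiki
    # at the eval.php prompt:
    #   > var_dump( $wgCirrusSearchShardCount );)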
[22:04:38] <^d> Like I thought [22:05:57] (03PS1) 10Chad: Adjust number of replicas for de/enwiki content and commons file index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170182 [22:09:44] (03CR) 10BryanDavis: "> However from modules/puppetmaster/lib/puppet/reports/logstash.rb:34 it seems that this patch would only send the message "Puppet run on " [puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: 10BryanDavis) [22:16:09] (03CR) 10Ori.livneh: [C: 032] Update my (=ori's) deployment script [puppet] - 10https://gerrit.wikimedia.org/r/170177 (owner: 10Ori.livneh) [22:19:22] ^d: I knew that much. Just whether it actually had the intended result ;) [22:20:42] (03CR) 10Daniel Kinzler: "If you turn this on, the page and revision tables have to have the xxx_model and xxx_format columns. Do all our wikis have these, now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [22:21:58] (03CR) 10Reedy: "Yup, see https://bugzilla.wikimedia.org/show_bug.cgi?id=49193#c14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [22:23:13] (03PS1) 10Ori.livneh: hhvm: remove 'hhvm-watch-mem' script [puppet] - 10https://gerrit.wikimedia.org/r/170188 [22:36:14] (03PS2) 10Ori.livneh: hhvm: remove 'hhvm-watch-mem' script [puppet] - 10https://gerrit.wikimedia.org/r/170188 [22:38:10] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: remove 'hhvm-watch-mem' script [puppet] - 10https://gerrit.wikimedia.org/r/170188 (owner: 10Ori.livneh) [22:45:53] (03PS1) 10Rush: phab rename ext_ref as Reference [puppet] - 10https://gerrit.wikimedia.org/r/170237 [22:50:03] <^d> Only swat so far is mine, I can do it. [22:58:51] <^d> ^d: ping for swat [22:59:00] <^d> yo ^d, I'm here. [22:59:10] <^d> ok cool, i'm going to start merging your patches. [22:59:18] <^d> sounds good ^d, thx [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, ^d: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T2300). [23:00:11] <^d> jouncebot: ^d and I are doing swat :D [23:00:45] is there a brad here? :) [23:00:57] <^d> tab-complete says no [23:03:00] <^d> you know, swat is less fun when you're doing it for yourself. [23:04:14] (03PS2) 10Gage: logstash: reformat gelf filter config [puppet] - 10https://gerrit.wikimedia.org/r/169728 (owner: 10BryanDavis) [23:04:16] (03PS2) 10Gage: logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [23:04:18] (03PS1) 10Gage: Logstash: Hadoop: drop INFO-level messages to save space [puppet] - 10https://gerrit.wikimedia.org/r/170244 [23:04:35] hmm that's not what i expected to happen [23:04:49] trying to submit a patch which depends on those [23:05:22] jgage: you probably rebased [23:05:34] ach, yeah [23:05:45] just merge them all :) [23:05:48] hehe ok [23:05:56] i figured you'd be doing the others soon anyway [23:06:04] <^d> rebase is like giving a shotgun to a kid who's trying to fight off bears. [23:06:10] <^d> yeah, it may help them keep the bears away [23:06:11] I won't since I'm not a root ;) [23:06:17] <^d> but they're probably also going to shoot themselves. [23:06:57] wikitech says "don't merge, rebase!" 
:P [23:07:11] but maybe that doesn't apply to this scenario [23:07:28] gerrit has made me a huge fan of interactive rebasing to maintain fake feature branches [23:07:32] (03CR) 10Gage: [C: 032] logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [23:07:46] (03CR) 10Gage: [C: 032] logstash: reformat gelf filter config [puppet] - 10https://gerrit.wikimedia.org/r/169728 (owner: 10BryanDavis) [23:07:57] (03CR) 10Gage: [C: 032] Logstash: Hadoop: drop INFO-level messages to save space [puppet] - 10https://gerrit.wikimedia.org/r/170244 (owner: 10Gage) [23:08:08] !log demon Synchronized php-1.25wmf5/extensions/CirrusSearch: (no message) (duration: 00m 05s) [23:08:15] Logged the message, Master [23:08:18] !log demon Synchronized php-1.25wmf6/extensions/CirrusSearch: (no message) (duration: 00m 04s) [23:08:23] Logged the message, Master [23:08:48] ok, all sorted [23:09:02] * ^d declares swat a resounding success [23:09:04] <^d> go me [23:11:06] (03CR) 10Dzahn: [C: 032] contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [23:14:08] (03CR) 10Dzahn: "gallium: ii graphviz 2.26.3-10ubuntu1.1" [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [23:16:33] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=73%): [23:17:25] (03CR) 10Cscott: "I thought I'd switch deployment-pdf02 to the role::ocg::beta first, and get it working first, before switching deployment-pdf01 (which is " [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [23:18:37] (03CR) 10Dzahn: [C: 031] "DNS has been added & this looks reasonable (though i don't speak 'mai')" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [23:18:44] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=73%): [23:24:37] (03PS1) 10Spage: Enable Flow on mw.org usability research test page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170250 [23:43:33] (03PS1) 10Gage: logstash: hadoop: reenable gelf output [puppet] - 10https://gerrit.wikimedia.org/r/170252 [23:44:28] (03CR) 10Gage: [C: 032] logstash: hadoop: reenable gelf output [puppet] - 10https://gerrit.wikimedia.org/r/170252 (owner: 10Gage) [23:50:45] (03PS1) 10Dzahn: puppetception - lint [puppet] - 10https://gerrit.wikimedia.org/r/170256 [23:51:01] (03PS2) 10Dzahn: puppetception - lint [puppet] - 10https://gerrit.wikimedia.org/r/170256
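(On the merge-versus-rebase exchange above: for a stack of dependent Gerrit changes like the logstash series, interactive rebase keeps the chain linear so Gerrit tracks each commit as a single change. A sketch; branch names are illustrative:

    # Sketch: refresh a local stack of dependent changes against the
    # puppet repo's production branch without merge commits.
    git fetch origin
    git checkout logstash-stack          # hypothetical local branch
    git rebase -i origin/production      # re-anchor and edit the stack
    git push origin HEAD:refs/for/production   # resubmit to Gerrit

Rebasing someone else's open change in the Gerrit UI, as happened above, creates a new patchset and drops the existing approvals, which is why merging the whole series at once was the simpler exit here.)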