[00:10:31] AaronSchulz: yt? [00:16:00] springle: yup [00:16:08] I already figured out the issue from earlier [00:16:13] those unknown db errors [00:16:14] oh [00:16:16] :) [00:16:20] cool, jobrunners? [00:17:58] springle: https://gerrit.wikimedia.org/r/#/c/169964/ [00:18:30] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [00:18:30] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [00:19:00] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [00:19:31] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 47, cluster_name: production-logstash-eqiad, relocating_shards: 1, active_shards: 94, initializing_shards: 0, number_of_data_nodes: 3 [00:20:19] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 42, down: 0, shutdown: 0 [00:20:30] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 47: active_shards: 94: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [00:22:05] AaronSchulz: excellent :) [00:30:58] hi, https://office.wikimedia.org/wiki/Data_access says "Request access to stat1 from ops if you're working with EventLogging data." I used to have access to stat1 when I worked on E3, now I think I need access to stat1001 for Flow EL data [00:31:14] spagewmf: Access requests need an RT ticket :) [00:32:11] Reedy thx. Is RT superseded by Phabricator yet? [00:32:17] nope [00:38:27] I try to create an RT ticket, get "You have no permission to create tickets in that queue. [00:38:30] No details " [00:40:12] Requests: ops-requests@rt.wikimedia.org [00:42:18] spagewmf: just use ops-request queue, doesn't matter via web ui or mail [00:43:47] mutante: Tickets > New ticket goes to https://rt.wikimedia.org/SelfService/Create.html?Queue=1 , no choice of a queue. (I'm logged in to RT) [00:44:41] spagewmf: the problem is the "SelfService" part in the URL, https://rt.wikimedia.org/Ticket/Create.html?Queue=15 [00:45:13] that probably means you are logged in as an email address [00:45:26] (03PS8) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [00:46:12] mutante: yes I logged in as spage@wikimedia.org with a password from 2012. I didn't choose that queue, I just followed a link. [00:46:38] spagewmf: that's not the right user. which link did you follow [00:46:58] your user should just be a nickname, not an email address [00:47:11] otherwise this is your autocreated user without permissions [00:47:21] that you got from mailing RT in the past [00:48:21] mutante: that sounds right, I think it let me view tickets that "the machinery" created for me.
I'll e-mail ops-requests, thanks [00:49:27] "the machinery" = email, yea [00:51:04] spagewmf: this is why "Autocreated when added as a watcher" [00:51:20] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 327 seconds [00:51:21] spagewmf: you probably never had an actual RT user [00:51:56] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 353 seconds [00:52:09] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: puppet fail [00:52:16] if it weren't for phabricator, i would say to request one [00:52:37] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:06] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [01:11:10] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [01:25:23] (03PS1) 10Chmarkine: lists - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169978 [01:28:21] (03PS2) 10Chmarkine: lists - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169978 (https://bugzilla.wikimedia.org/72072) [01:33:32] (03PS1) 10Tim Landscheidt: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/169981 [01:37:20] (03PS1) 10Ori.livneh: Periodically restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/169982 [01:44:16] (03Abandoned) 10Chmarkine: lists - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169978 (https://bugzilla.wikimedia.org/72072) (owner: 10Chmarkine) [02:21:56] (03CR) 10Tim Starling: [C: 031] Periodically restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/169982 (owner: 10Ori.livneh) [02:22:35] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%): [02:38:53] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 333 MB (3% inode=73%): [03:07:02] (03CR) 10Ori.livneh: [C: 032] Periodically restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/169982 (owner: 10Ori.livneh) [03:48:12] periodically restart it? [03:48:34] what about in-flight requests? [03:55:27] TimStarling: ^ [03:55:57] does it not have a graceful restart mechanism? [03:56:07] I don't know [03:56:23] we plan on restarting it after every scap, so it will need one even without that cron job [03:56:34] I doubt that this exists now [03:57:45] doesn't look like it [03:58:47] # When `hhvm.server.graceful_shutdown_wait` is set to a positive [03:58:48] # integer, HHVM will perform a graceful shutdown on SIGHUP. [03:58:48] kill signal HUP [03:59:09] /etc/hhvm/fcgi.ini:hhvm.server.graceful_shutdown_wait [04:01:16] 5 seconds doesn't seem all that long [04:03:10] also, there is a window between shutdown & start during which apache would throw a 50x (503 most likely) [04:03:35] because apache is in the middle, pybal won't have its socket closed and so it will have to resort to polling to depool [04:03:38] true [04:04:09] finally, doesn't this mean the JIT would have to warm up again? [04:04:22] yes [04:04:26] that's the point of restarting it after scap [04:04:40] otherwise the JIT "cache" grows forever [04:05:25] paravoid: are you saying that it's desirable for pybal to depool the server during this time? [04:05:57] I'm still trying to figure out the effects of this tbh [04:06:26] so I'm not saying anything yet [04:06:43] [fluorine:/a/mw-log] $ grep -c 'Segmentation fault' apache2.log [04:06:43] 1999 [04:06:56] (that's zend) [04:07:10] and?
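For reference, a minimal sketch of how the graceful-restart pieces quoted above fit together. The fcgi.ini path and the upstart "kill signal HUP" stanza come from the log; the restart behaviour is inferred from those two facts, not from a documented procedure:

    # "kill signal HUP" in the upstart job means a plain restart delivers
    # SIGHUP, which triggers HHVM's graceful shutdown: it drains in-flight
    # FastCGI requests for up to the configured wait before exiting.
    grep graceful_shutdown_wait /etc/hhvm/fcgi.ini   # 5 (seconds) at this point in the log
    sudo service hhvm restart

The 503 window discussed above is the gap between that shutdown completing and the new process accepting connections again.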
[04:07:32] so it's not going to be the end of the world to serve a few 503s when a server restarts, though it's obviously something we should fix [04:07:52] yes, 503s happen right now [04:08:05] they're indicative of a problem though, they're not just there by design [04:08:18] there's also [04:08:33] also I'm pretty sure there are connections that live for > 5s [04:08:40] so at least that part is wrong [04:08:42] ideally it would depool in advance, right? [04:08:52] before it stops accepting connections [04:08:57] yes [04:09:09] but for that, pybal needs an API [04:09:21] in our traditional stack [04:09:30] pybal keeps a long-running TCP connection open [04:09:59] when you kill apache, the connection tears down gracefully and pybal immediately detects that and depools instantly [04:10:16] instantly as in milliseconds [04:10:21] well yes :) [04:10:44] we could apache2ctl graceful-stop ; restart hhvm ; start apache [04:11:17] I've said in the past that we (HHVM team) should just write pybal code [04:11:49] yeah, i'm up for that [04:11:53] e.g. pybal could have a TCP server and we could send a depool message to it [04:12:18] we've had several ideas on how to add some level of instrumentation to pybal [04:12:29] they've all been on the back burner though [04:12:32] etcd was popular, iirc [04:12:47] yeah, that was my idea from a while back [04:13:11] i don't love that it uses http long polling to watch keys [04:14:04] it would be quite nice to have an init script that waits for pybal to acknowledge that it has depooled before shutting down the server [04:14:25] yes, we could do that [04:15:02] and then you'd still have to wait for all inflight requests to be over before actually killing the server [04:15:08] I'd suggest a much higher timeout than 5s [04:15:27] sure, make it 30 [04:15:40] fortunately, most downloads/large files (e.g. a video) are being served from swift [04:16:03] (typically, although MW stays in the path for private wikis IIRC) [04:17:38] /etc/init/hhvm-graceful.conf: task; start on stopping hhvm RESULT=ok [04:17:45] http://manpages.ubuntu.com/manpages/utopic/en/man7/stopping.7.html [04:18:41] varnish hopefully buffers the result from MediaWiki [04:18:55] how do you mean? [04:19:36] well, squid certainly used to download from MediaWiki as fast as MediaWiki could supply data, even for private requests [04:19:58] then it would close the MW connection and continue to send data to the end user [04:20:28] I thought varnish did the same [04:20:36] it does, yes [04:20:44] unless streaming is enabled [04:20:49] which for MW is not [04:21:06] or return (pipe) [04:21:18] yes, but not in this case [04:21:24] so you won't have connections open for 30s due to a user streaming a meeting video from office.wikimedia.org over dialup [04:21:51] nod [04:23:20] hm, varnish probably does keep the connection open and pipelines though [04:23:47] one connection != one request [04:25:40] TimStarling: why did we set retry=0 for apache's mod_proxy again? [04:25:49] that's correct [04:26:00] well, for the general case, not sure about HHVM [04:26:02] otherwise if you restart hhvm then it will serve 503s for 30 seconds [04:26:06] yeah, that [04:26:33] there should probably not be keepalive between varnish and MW [04:27:15] why are we going through all this trouble again? [04:27:18] are we giving up on leaks?
:) [04:27:29] see the bug [04:27:33] we fixed one leak, it was fun, took a week or so [04:27:48] heh, ori said "it was not fun" [04:27:53] different perspectives ;) [04:28:01] fun for some [04:28:12] well, i will know what i'm doing the next time around [04:28:20] fixing more leaks would also be fun but would take more weeks, and we probably aren't leaking much memory anymore [04:28:27] there was a lot of hysterical running around for me this time [04:28:59] so not only should the app server depool itself before hhvm is stopped, it should also wait for the warmup requests regiment to complete before repooling itself [04:29:01] that would be nice [04:29:23] yes that's why I asked early in the conversation about warmup [04:30:35] it'll be fine [04:30:45] while we're still in this deployment phase, maybe we should let it run for a while and see if there are actually any leaks? [04:30:56] meanwhile in two hours, plus or minus 15 minutes, mod_passenger on palladium will crash and this channel will get a screenful of puppet failure alerts [04:31:25] I don't see how this is relevant [04:31:39] unless you want to point out we're incompetent or something :) [04:31:59] no, but we tend to gravitate to interesting problems rather than urgent ones [04:32:25] we could leave it and see if there are more leaks [04:32:36] if that was the case we would have fixed palladium [04:32:44] but if the answer is yes then the solution is probably going to be to restart hhvm periodically ;) [04:32:48] i think we're still leaking [04:32:56] a lot less than before, but still [04:33:53] I don't find the palladium issues more urgent than what happens in our production stack affecting real users fwiw [04:34:13] well, see the 2k segfaults from zend then [04:34:41] again, I don't see the point of this [04:35:05] ori: I don't think that is a productive line of argument [04:35:42] if you were proposing to introduce a zend change that was known to segfault and throw 503s periodically, I don't think we'd be discussing this at all [04:35:43] all right. i didn't mean to be jerky. [04:36:19] i don't think we should live with periodic restarts as the status quo [04:36:47] but even with a restart every 24-48hrs, that's a big enough window for diffing heap profiles and getting down to the bottom of leaks [04:36:59] I'm not necessarily saying we shouldn't restart [04:37:02] paravoid: you know for many years we had a daily restart for apache because of zend memory leaks [04:37:18] at one time it was even hourly IIRC [04:37:29] I just want to understand the implications of that and figure out how to make our stack deal with that [04:37:33] I didn't know that [04:37:43] was it then that mark added the idle tcp connection check to pybal? :) [04:38:22] the idle connection check was a great idea, I don't think it was motivated by any particular problem [04:38:31] :) [04:38:34] tcp canary [04:39:27] or https://en.wikipedia.org/wiki/Dead_Hand_(nuclear_war) [04:39:44] I think it was introduced not long after pybal itself, so mark was still thinking generally about ways to improve it [04:40:04] before pybal we had a sad little php script written by me, and it wasn't very easy to do event-driven things in it [04:40:35] pybal needs some love [04:40:52] it's amazing how little engineering time we're giving it compared to how much we're relying on it [04:40:54] i'm looking at the code, it's pretty nice actually [04:41:09] paravoid: what does it need?
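What such a depool-aware restart could look like, per the init-script idea above. This is purely hypothetical shell: pybal had no depool API at the time (that is the whole point of the discussion), so pybal-depool/pybal-pool are placeholder commands and the warmup URL is an assumption:

    host=$(hostname -f)
    pybal-depool "$host"     # hypothetical: tell the LB to stop sending traffic
    sleep 30                 # let in-flight requests drain (the 30s suggested above)
    sudo service hhvm restart
    for i in $(seq 5); do    # crude JIT warmup before taking traffic again
        curl -s -o /dev/null http://localhost/wiki/Main_Page
    done
    pybal-pool "$host"       # hypothetical: repool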
[04:41:17] other than what was already identified above [04:41:32] dunno, we've had various ideas over the years, I don't think they were documented anywhere [04:41:51] so I guess the answer is... "a bug tracker"? [04:42:38] but anyway, the implications of this change are not /just/ related to pybal [04:43:29] e.g. depending on how large that 503-window is, we may also start getting icinga alerts [04:43:47] how long does it take for HHVM to boot up anyway? [04:43:52] it's very short [04:44:17] if it's short then we can just set the retry count appropriately [04:45:01] a single appserver going down is not a thing that needs urgent attention, it's only really urgent if the service address goes down [04:45:22] true [04:45:41] for icinga, that's the difference between "alert on IRC" and page [04:46:12] so a single appserver wouldn't page us but it would print spam here [04:46:22] only CRITICALs get printed here though [04:46:25] i don't think so [04:46:39] icinga has a multiple attempts mechanism as well [04:46:39] we have icinga alerts configured for it and normal service restarts haven't triggered them [04:46:55] you can say "if the check fails 3 consecutive times, then alert" [04:47:00] usually we do that [04:47:48] !log enabled heap profiling on mw1189 [04:47:55] Logged the message, Master [04:48:43] so aiui from this conversation, hhvm doesn't take much time to boot (so the stop/start 503 window is small), but it takes a longer while to be good performance-wise because of the JIT warmup [04:49:15] so that "being slow window" is longer and ideally our loadbalancing should also account for it and not send requests that way [04:49:53] it's still shorter than i expected, i'd say ~5 seconds though i haven't measured it precisely [04:50:44] but yeah, having the loadbalancer give the machine a break would be nice [04:50:59] it shouldn't be that fast [04:51:22] and it would come with other benefits, like the ability to do staggered deployments that abort mid-way and roll back if some error threshold is crossed [04:51:30] it can't be warming up the JIT in 5s [04:52:04] the apache graphs in ganglia might tell us [04:55:14] i can't find a clear indication [04:56:21] well, we can tail -f api.log | grep mw114 and look at the response time as we restart it [04:56:53] * ori tries [05:21:38] (03PS8) 10Faidon Liambotis: mail: remove secondary MX role from sodium [puppet] - 10https://gerrit.wikimedia.org/r/143887 [05:21:50] (03CR) 10Faidon Liambotis: [C: 032] mail: remove secondary MX role from sodium [puppet] - 10https://gerrit.wikimedia.org/r/143887 (owner: 10Faidon Liambotis) [05:27:52] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:41] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:44] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66275 bytes in 0.176 second response time [05:29:02] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:32:15] <_joe_> wat? [05:32:22] PROBLEM - Exim SMTP on sodium is CRITICAL: Connection refused [05:32:44] grumble [05:33:22] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.015 sec. response time [05:34:43] i didn't touch mw1023 or mw1027 [05:34:53] was that the auto-restart?
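The measurement ori proposes above, roughly as it would be run on the log host (the /a/mw-log path appears earlier in the discussion; which api.log field carries the response time is not shown, so this just isolates the restarted server's lines):

    tail -f /a/mw-log/api.log | grep mw1114
    # in another terminal: restart hhvm on mw1114, then watch how long
    # response times stay elevated while the JIT re-warms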
[05:34:54] <_joe_> mw1027 is down in ganglia [05:34:59] <_joe_> but the machine is ok [05:35:07] meh that change's broken [05:35:32] (the sodium one I mean) [05:37:00] <_joe_> paravoid: I guess you're still awake [05:37:10] I woke up two hours ago [05:37:17] <_joe_> oh [05:37:22] <_joe_> very early :( [05:37:41] PROBLEM - Exim SMTP on sodium is CRITICAL: Connection refused [05:37:46] damn it, gnuplot [05:37:50] i want a vertical line [05:38:10] <_joe_> ori: mw1023 was the restart [05:38:16] <_joe_> not sure about mw1027 [05:39:09] <_joe_> ori: I will revert temporarily that periodic restart thing - for now I'd like to see when/if performance degrades [05:40:13] okay [05:40:22] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time [05:40:42] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 66275 bytes in 0.163 second response time [05:41:40] <_joe_> mmmh quite strange indeed [05:43:55] TimStarling: you're handy with gnuplot, right? any idea why the following doesn't give me a vertical line for the restart? script: https://dpaste.de/ngpJ/raw graph: http://i.imgur.com/U0SieyI.png [05:44:41] (03PS1) 10Faidon Liambotis: Revert "mail: remove secondary MX role from sodium" [puppet] - 10https://gerrit.wikimedia.org/r/169986 [05:44:46] <_joe_> ori: I can help you with gnuplot maybe [05:45:02] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "mail: remove secondary MX role from sodium" [puppet] - 10https://gerrit.wikimedia.org/r/169986 (owner: 10Faidon Liambotis) [05:45:04] not really, I used it a bit in my PhD, a decade ago, not much since then [05:45:44] <_joe_> I stopped using it a few years later [05:46:02] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.013 sec. response time [05:46:24] <_joe_> ori: and I have no idea [05:46:36] <_joe_> why your set arrow doesn't work [05:47:20] data's at http://people.wikimedia.org/~ori/mw1114-restart-1414645218.tsv.gz (no private data, just timestampms) [05:47:25] bbiab [05:47:38] <_joe_> I rarely worked in gnuplot with time data where I converted unix epochs - maybe that's the problem? [05:47:49] <_joe_> (arrow not understanding the timeformat) [05:48:52] <_joe_> I'll bbl, I have a family to wake up and feed :) [06:02:44] _joe_: remember that if you revert that patch, you have to ensure => absent it, otherwise the cronjob will stay [06:03:55] heheh: http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/ [06:04:02] Most popular words for PHP developers: [06:04:13] entries 5 & 6: [06:04:18] localisation, translatewiki [06:04:25] Nikerabbit: ^ [06:04:42] what is "wp" there? [06:04:48] wordpress [06:04:50] ah [06:06:45] localisation and translatewiki are up there because of translation updater bot (e.g. https://github.com/wikimedia/mediawiki/commit/605e69ffedee38b77bfc14b0bdda199f30819b2d) [06:07:04] yeah I figured that [06:07:55] under "Most popular words for C developers" there's "free" [06:11:12] what the hell [06:11:31] Coren: around? [06:11:40] so labstore1001 is https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Labs+NFS+cluster+eqiad [06:11:44] saturated a bit [06:11:46] I investigate it [06:11:48] end up in toollabs [06:12:03] then I see a process called "vi" writing with 80-100M/s [06:12:06] 51716 733 65.3 68.8 3209664 2789432 ? D Oct29 527:08 vi languas.txt [06:12:15] someone is a fast typist [06:12:16] and then.. 
[06:12:17] root@tools-dev:/data/project/icelab# ls -lah /data/project/icelab/.final_mg.txt.swp [06:12:21] -rw-r--r-- 1 tools.icelab tools.icelab 3.5T Oct 30 06:11 /data/project/icelab/.final_mg.txt.swp [06:12:31] ...and has a lot to say [06:13:00] !?! [06:13:07] seriously [06:13:09] just kidding, i have no idea what that is [06:13:19] no I know [06:13:32] I'm just in awe [06:25:07] <_joe_> paravoid: ROTFL [06:25:52] RECOVERY - Disk space on ocg1001 is OK: DISK OK [06:26:33] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:26:54] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:28:23] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail [06:30:54] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:54] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:01] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:13] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:35] <_joe_> it's weird. 
mod_passenger time happening one hour before [06:34:59] <_joe_> btw, the most awesome thing is vi handling a 3.5 TB file [06:35:08] <_joe_> I never thought it could [06:41:33] RECOVERY - check if salt-minion is running on ocg1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:41:42] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:45:32] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:45:42] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:03] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:42] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:46:45] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:47:03] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:13] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:47:19] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:47:33] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:47:52] i was off by two minutes [06:48:03] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:48:03] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:48:03] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:48:33] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 3 failures [06:49:42] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:37] lol ori, I'd say l10n-bot gamed the system ;) 
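A quick way to catch runaway editor swap files like the 3.5 TB one above before they saturate the NFS server (a sketch; the mount point is the toollabs path from the log):

    find /data/project -name '.*.swp' -size +1G -exec ls -lah {} +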
[06:51:38] _joe_: it's always at this time [06:56:02] DST? :) [06:58:17] that's cron.daily isn't it? [06:58:24] 06:26 [06:58:34] Debian's cron.daily default [07:00:02] (03PS1) 10Matanya: access: give spage access to stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/169990 [07:00:10] [Thu Oct 30 06:25:58 2014] [notice] Graceful restart requested, doing restart [07:00:13] well duh [07:00:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:00:54] that's logrotate isn't it [07:01:20] # Rotate puppetmaster passenger (apache) logs [07:01:20] /var/log/apache2/*.log { [07:01:23] postrotate [07:01:23] /usr/sbin/service apache2 graceful [07:01:53] uhm... [07:04:55] so what is this mod_passenger issue you've talked about before? [07:05:07] I vaguely recall someone saying something about a crash? [07:05:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:05:34] _joe_: ? [07:06:34] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:06:42] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:07:30] <_joe_> paravoid: sorry, just back [07:08:01] <_joe_> so, whenever apache is gracefully restarted, precise's version of mod passenger will not be happy [07:08:06] <_joe_> and crash some requests [07:08:33] <_joe_> so any hosts compiling in that moment [07:08:41] <_joe_> will have some catalog failures [07:10:11] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:10:15] <_joe_> I think they offered the version that gracefully restarted correctly as a paid option [07:15:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:15:54] (03PS1) 10Faidon Liambotis: puppetmaster: restart apache2 once on logrotate [puppet] - 10https://gerrit.wikimedia.org/r/169992 [07:16:00] this should help, perhaps even fix it [07:16:32] root@palladium:/var/log# lastcomm | grep apache2ctl |grep 06:25 |wc -l [07:16:35] 14 [07:16:46] (03CR) 10Faidon Liambotis: [C: 032] puppetmaster: restart apache2 once on logrotate [puppet] - 10https://gerrit.wikimedia.org/r/169992 (owner: 10Faidon Liambotis) [07:17:31] <_joe_> meh [07:18:12] <_joe_> that's why I favor using file_line and not messing with distro scripts whenever not needed [07:18:52] debian's logrotate has this [07:19:04] <_joe_> I know [07:19:08] <_joe_> I never thought we didn't [07:19:23] it's not the distro's fault if someone had a NIH moment and wrote their own logrotate instead of copying Debian's and changing weekly->daily [07:19:30] <_joe_> yes [07:19:44] <_joe_> as I said, file_line in puppet fixes that [07:20:01] <_joe_> (the NIH syndrome) [07:20:08] oh right, I misread you, sorry [07:20:18] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:20:43] so what might have happened is [07:21:01] apache kept getting gracefully restarted very quickly and while passenger was initializing [07:21:07] <_joe_> yes [07:21:19] <_joe_> so that's why we had so many failures [07:21:24] yeah possibly [07:21:26] <_joe_> instead of 1-2 at max [07:21:44] at the same time, there is another orthogonal issue here [07:21:54] which is that we shouldn't get an alert for a single puppet failure [07:22:06] <_joe_> well, it depends [07:22:14] <_joe_> for misc servers, maybe [07:22:28] I mean, it can happen [07:22:32] <_joe_> because of the
horrible puppet antipattern we so often experience on icinga [07:22:37] we also do apt-get update during the puppet run, that can fail too [07:22:44] <_joe_> yes [07:22:52] <_joe_> I've seen that causing a shower of alerts [07:23:18] I hadn't, but I would guess as much [07:25:38] <_joe_> !log raising the weight of mw1114 in the api pool to test the throughput it can withstand [07:25:43] Logged the message, Master [07:26:46] (03PS1) 10KartikMistry: Beta: Enable ca-pt and pt-ca language pairs in CX [puppet] - 10https://gerrit.wikimedia.org/r/169993 [07:31:33] (03CR) 10Santhosh: [C: 031] Beta: Enable ca-pt and pt-ca language pairs in CX [puppet] - 10https://gerrit.wikimedia.org/r/169993 (owner: 10KartikMistry) [07:49:48] (03PS9) 10Qgil: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [07:56:07] (03CR) 10Faidon Liambotis: "Note that this is unconditional, which may present some security issues. On our production stack, we use an explicit whitelist of servers/sub" [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [08:04:44] <_joe_> !log doing the same with mw1189, to see how different appserver generations respond [08:04:52] Logged the message, Master [08:17:06] <_joe_> !log powercycling mw1189, enabling hyperthreading [08:17:15] Logged the message, Master [08:21:44] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Enable ca-pt and pt-ca language pairs in CX [puppet] - 10https://gerrit.wikimedia.org/r/169993 (owner: 10KartikMistry) [08:37:04] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 1 failures [08:38:53] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [08:42:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [08:45:31] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:55:12] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:57:45] PROBLEM - Host 208.80.154.50 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:58:02] <_joe_> Error: /Stage[main]/Puppetmaster::Passenger/Service[puppetmaster]: Could not evaluate: Could not find init script or upstart conf file for 'puppetmaster' [08:58:04] RECOVERY - Host 208.80.154.50 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [08:58:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [09:03:01] _joe_: already fixes [09:03:03] fixed* [09:03:22] more precisely...
already caused the problem and fixed it as well [09:31:44] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM but we need to verify it will cause no problems on hosts having this class before merging" [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [09:32:11] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [09:35:35] (03CR) 10Alexandros Kosiaris: [C: 032] Fix bacula rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/169679 (owner: 10Alexandros Kosiaris) [09:35:53] (03PS2) 10Alexandros Kosiaris: Modularize backups.pp [puppet] - 10https://gerrit.wikimedia.org/r/169680 [09:36:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Ran in puppet compiler, it was a noop" [puppet] - 10https://gerrit.wikimedia.org/r/169680 (owner: 10Alexandros Kosiaris) [09:36:31] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:38:10] (03Abandoned) 10Alexandros Kosiaris: puppetmaster's pid ensured absent [puppet] - 10https://gerrit.wikimedia.org/r/166516 (owner: 10Alexandros Kosiaris) [09:48:33] <_joe_> !log load testing the hhvm appserver pool as well [09:48:41] Logged the message, Master [10:10:58] (03PS1) 10Filippo Giunchedi: syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 [10:11:15] I missed 170000 by two reviews :( [10:12:34] <_joe_> he [10:12:40] on for 180000 ! [10:13:09] * godog busy while preparing 10k puppet commits [10:13:11] <_joe_> let's automagically create 10000 PS's [10:13:31] <_joe_> godog: "retab" everything [10:13:40] hahha introduce tabs! [10:13:45] * godog grabs the transmogrifier [10:15:01] <_joe_> !log load test ended [10:15:09] Logged the message, Master [10:39:27] (03CR) 10Alexandros Kosiaris: [C: 032] syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 (owner: 10Filippo Giunchedi) [10:50:28] (03PS1) 10Alexandros Kosiaris: Add a description parameter to ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170003 [10:55:51] (03PS1) 10Alexandros Kosiaris: Add descriptions to DNS ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/170004 [11:05:51] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:14] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:07:00] <_joe_> mmmh something restarted hhvm here [11:11:12] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 66295 bytes in 0.288 second response time [11:11:36] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [11:13:52] _joe_: getting db errors [11:13:58] is it related ^ [11:14:09] <_joe_> matanya: are you on hhvm? [11:14:14] yes [11:14:15] <_joe_> no, I don't think so [11:14:32] function: WikiPage::updateRevisionOn [11:14:39] 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.22) [11:15:59] <_joe_> matanya: I found you in the logs :P [11:16:51] if you will query me in the logs, the query will time out :P [11:18:31] (03CR) 10Filippo Giunchedi: hiera: mediawiki-based backend for labs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/168984 (owner: 10Giuseppe Lavagetto) [11:23:55] <_joe_> thanks godog [11:24:25] np _joe_, I'm not a ruby expert by any stretch of imagination though [11:24:45] <_joe_> and you know about me :P [11:25:53] <_joe_> combining ruby with the mw api is... 
fun [11:26:17] yeah, I 've been meaning to take a look at that patch as well [11:26:33] my first reaction was to run away when I saw all the ruby [11:27:10] <_joe_> akosiaris: well, any hiera backend should be a ruby script. Unless we build a python daemon and just make ruby send queries to it :P [11:27:20] hmmm [11:27:26] that sounds intriguing :-) [11:27:27] <_joe_> (I considered the option when writing nuyaml) [11:28:29] (03CR) 10JanZerebecki: [C: 04-1] "Use ssl_ciphersuite function instead?" [puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [11:34:46] (03PS1) 10KartikMistry: Beta: Regression fix: Enable ca-pt and pt-ca [puppet] - 10https://gerrit.wikimedia.org/r/170006 [11:38:40] <_joe_> !log depooling mw1030 and mw1031 for reimaging as hhvm appservers [11:38:46] Logged the message, Master [11:46:36] (03PS1) 10Alexandros Kosiaris: Add unpuppetized ganglia swift views [puppet] - 10https://gerrit.wikimedia.org/r/170007 [11:50:53] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Regression fix: Enable ca-pt and pt-ca [puppet] - 10https://gerrit.wikimedia.org/r/170006 (owner: 10KartikMistry) [11:51:19] akosiaris: most of those swift views depend on logtailer which isn't there anymore, I'll take a closer look after lunch! [11:51:41] :-( [11:51:48] you are breaking my heart.... [11:51:56] anyway thanks godog [11:52:14] I know, sorry about that akosiaris didn't want to rain on your parade :) [12:05:57] Oh, nice, the virtualization cluster is no longer over 100 % load, makes the graphs prettier https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Virtualization+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=load_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [12:08:27] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [12:18:12] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not restart hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/170009 [12:22:15] (03PS1) 10Alexandros Kosiaris: ganglia web: avoid using tmpfs on trusty [puppet] - 10https://gerrit.wikimedia.org/r/170010 [12:24:13] PROBLEM - DPKG on mw1031 is CRITICAL: Connection refused by host [12:24:23] PROBLEM - Disk space on mw1031 is CRITICAL: Connection refused by host [12:24:23] PROBLEM - nutcracker port on mw1031 is CRITICAL: Connection refused by host [12:24:43] PROBLEM - RAID on mw1030 is CRITICAL: Connection refused by host [12:24:45] PROBLEM - nutcracker process on mw1031 is CRITICAL: Connection refused by host [12:24:53] PROBLEM - puppet last run on mw1031 is CRITICAL: Connection refused by host [12:24:53] PROBLEM - HHVM processes on mw1031 is CRITICAL: Connection refused by host [12:25:00] PROBLEM - check configured eth on mw1030 is CRITICAL: Connection refused by host [12:25:03] PROBLEM - check if dhclient is running on mw1030 is CRITICAL: Connection refused by host [12:25:23] PROBLEM - check if salt-minion is running on mw1030 is CRITICAL: Connection refused by host [12:25:24] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:25:45] PROBLEM - DPKG on mw1030 is CRITICAL: Connection refused by host [12:25:45] PROBLEM - RAID on mw1031 is CRITICAL: Connection refused by host [12:25:56] PROBLEM - Disk space on mw1030 is CRITICAL: Connection refused by host [12:25:56] PROBLEM - nutcracker port on mw1030 is CRITICAL: Connection refused by host [12:26:03] PROBLEM - nutcracker process on mw1030 is CRITICAL: Connection refused by host [12:26:04] PROBLEM - check configured eth on mw1031 is CRITICAL: 
Connection refused by host [12:26:16] PROBLEM - puppet last run on mw1030 is CRITICAL: Connection refused by host [12:26:17] PROBLEM - HHVM processes on mw1030 is CRITICAL: Connection refused by host [12:26:25] PROBLEM - check if dhclient is running on mw1031 is CRITICAL: Connection refused by host [12:26:37] PROBLEM - check if salt-minion is running on mw1031 is CRITICAL: Connection refused by host [12:32:43] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not page ops with hhvm processes errors. [puppet] - 10https://gerrit.wikimedia.org/r/170011 [12:32:47] RECOVERY - HHVM processes on mw1030 is OK: PROCS OK: 1 process with command name hhvm [12:33:25] RECOVERY - check if dhclient is running on mw1030 is OK: PROCS OK: 0 processes with command name dhclient [12:33:35] RECOVERY - check if salt-minion is running on mw1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:34:05] RECOVERY - DPKG on mw1030 is OK: All packages OK [12:34:05] RECOVERY - RAID on mw1031 is OK: OK: no RAID installed [12:34:06] RECOVERY - RAID on mw1030 is OK: OK: no RAID installed [12:34:06] RECOVERY - nutcracker process on mw1031 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:34:06] RECOVERY - Disk space on mw1030 is OK: DISK OK [12:34:06] RECOVERY - nutcracker port on mw1030 is OK: TCP OK - 0.000 second response time on port 11212 [12:34:15] RECOVERY - HHVM processes on mw1031 is OK: PROCS OK: 1 process with command name hhvm [12:34:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: do not page ops with hhvm processes errors. [puppet] - 10https://gerrit.wikimedia.org/r/170011 (owner: 10Giuseppe Lavagetto) [12:34:48] RECOVERY - nutcracker process on mw1030 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:34:49] RECOVERY - check configured eth on mw1031 is OK: NRPE: Unable to read output [12:34:49] RECOVERY - check configured eth on mw1030 is OK: NRPE: Unable to read output [12:34:58] RECOVERY - check if dhclient is running on mw1031 is OK: PROCS OK: 0 processes with command name dhclient [12:34:58] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [12:35:08] RECOVERY - DPKG on mw1031 is OK: All packages OK [12:35:11] RECOVERY - check if salt-minion is running on mw1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:35:38] RECOVERY - Disk space on mw1031 is OK: DISK OK [12:35:39] RECOVERY - nutcracker port on mw1031 is OK: TCP OK - 0.000 second response time on port 11212 [12:35:58] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures [12:36:39] (03PS1) 10Hashar: contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) [12:38:48] PROBLEM - NTP on mw1030 is CRITICAL: NTP CRITICAL: Offset unknown [12:40:50] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on contint puppetmaster in labs." [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [12:41:38] HASHAR [12:41:46] he's gone ;( [12:45:19] RECOVERY - NTP on mw1030 is OK: NTP OK: Offset -0.009510040283 secs [12:45:48] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:45:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
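What the "unmerged changes" check above is complaining about: commits fetched onto a puppetmaster but not yet merged into the live checkout. A rough manual equivalent, using the directory from the alert text (the actual check script isn't shown in the log):

    cd /var/lib/git/operations/puppet
    git fetch
    git log --oneline HEAD..origin/production   # anything listed is unmerged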
[12:46:50] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:46:59] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:47:04] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia web: avoid using tmpfs on trusty [puppet] - 10https://gerrit.wikimedia.org/r/170010 (owner: 10Alexandros Kosiaris) [12:47:48] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:47:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:47:58] _joe_: I merged your change as well [12:48:10] <_joe_> akosiaris: he sorry [12:48:22] no worries, just letting you know [12:48:23] <_joe_> I am doing too many things at once [12:48:54] !log enabled puppet on uranium [12:48:59] Logged the message, Master [12:49:05] let's see if we will still have ganglia in 5 mins :-) [12:49:05] * mark enables Hyperthreading on _joe_ [12:51:32] <_joe_> !log rebooting mw1030 and mw1031 to use the updated kernel [12:51:37] Logged the message, Master [12:52:43] hmmm so uranium could use rrdcached me thinks [12:53:38] PROBLEM - Host mw1031 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:48] PROBLEM - Host mw1030 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:54:09] RECOVERY - Host mw1031 is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms [12:54:10] RECOVERY - Host mw1030 is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [12:57:01] (03PS1) 10Alexandros Kosiaris: Enable ganglia diskstat plugin on ganglia::web [puppet] - 10https://gerrit.wikimedia.org/r/170017 [12:57:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Enable ganglia diskstat plugin on ganglia::web [puppet] - 10https://gerrit.wikimedia.org/r/170017 (owner: 10Alexandros Kosiaris) [12:59:21] (03PS2) 10Alexandros Kosiaris: Add a description parameter to ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170003 [13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T1300). [13:06:06] K4 can't really be awake right now [13:09:22] (03PS1) 10Giuseppe Lavagetto: HAT: move mw1030 and mw1031 [puppet] - 10https://gerrit.wikimedia.org/r/170021 [13:10:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] HAT: move mw1030 and mw1031 [puppet] - 10https://gerrit.wikimedia.org/r/170021 (owner: 10Giuseppe Lavagetto) [13:39:46] (03PS1) 10Ottomata: Fix for stat1002 host in rsync job for aggregate datasets [puppet] - 10https://gerrit.wikimedia.org/r/170024 [13:40:27] (03CR) 10Ottomata: [C: 032 V: 032] Fix for stat1002 host in rsync job for aggregate datasets [puppet] - 10https://gerrit.wikimedia.org/r/170024 (owner: 10Ottomata) [13:50:44] (03CR) 10Alexandros Kosiaris: [C: 032] Add a description parameter to ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170003 (owner: 10Alexandros Kosiaris) [13:51:23] (03PS2) 10Alexandros Kosiaris: Add descriptions to DNS ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/170004 [13:51:31] (03CR) 10Alexandros Kosiaris: [C: 032] Add descriptions to DNS ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/170004 (owner: 10Alexandros Kosiaris) [14:02:11] (03CR) 10Ottomata: [C: 031] access: give spage access to stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [14:13:43] ottomata: can you explain to me the logic of access request tickets ? [14:14:12] ? 
[14:14:26] i can see them, prepare a patch, and then no longer have permission once they are moved from ops-request to "somethingelse" ? [14:14:31] oh [14:14:35] nope, i cannot explain that to you at all [14:14:40] i have no idea why it is that way [14:14:44] oh no, it's easy to explain [14:14:46] 'RT is shit'? [14:14:49] there are permissions on other queues [14:14:51] i do not know about them [14:15:03] it is the third time in a week or so [14:15:14] bugzilla security issues, for example. Once they get moved to the hidden security area, the ones who filed 'em can still see 'em [14:15:36] but not those who reply ? [14:15:45] that makes zero sense to me [14:16:13] RT is for email [14:17:18] You should consider the web interface as an edge-case tool [14:17:29] That's my understanding [14:18:01] Nemo_bis: I don't think you get emails after they are queue-moved [14:18:15] so i'll just add admin-cc to all tickets [14:21:12] naw, you can reply to emails you get from any queue [14:26:19] matanya: late reply but from what Daniel told me - it is so managers or ops can discuss candidates in private [14:26:53] makes sense, and i know that, but still confusing [14:27:32] Well how would you like to be told you can't have shell because x and y of several people in a public ticket? :p [14:27:40] *off [14:27:59] Public in the sense of rt users [14:28:24] i wouldn't care, tbh. but whatever [14:28:35] Yeah [14:31:48] (03PS1) 10Matanya: access: grant reedy access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170035 [14:33:04] <_joe_> !log pooling mw1031/2 in the hhvm appservers pool [14:33:14] Logged the message, Master [14:37:17] !log powering down elastic1003-1006 to replace ssds [14:37:23] Logged the message, Master [14:40:01] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:05] (03PS1) 10Alexandros Kosiaris: Actually use view_name on ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170037 [14:40:52] PROBLEM - Host elastic1005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:41:02] PROBLEM - Host elastic1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:31] PROBLEM - Host elastic1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:41] <^d> cmjohnson: Heh, was just wondering if we'd started yet this morning :) [14:42:58] (03PS1) 10Giuseppe Lavagetto: HHVM: get 10% of anonymous traffic. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 [14:43:00] (03PS1) 10Giuseppe Lavagetto: HHVM: get 15% of anonymous traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 [14:43:02] (03PS1) 10Giuseppe Lavagetto: HHVM: get 25% of anonymous traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170041 [14:43:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "not for now."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 (owner: 10Giuseppe Lavagetto) [14:43:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "not for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170041 (owner: 10Giuseppe Lavagetto) [14:44:11] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=72%): [14:49:42] Well, sports fans, who's up for SWAT today [14:50:28] I see aude has a patch in, nobody else though [14:50:41] Aw, who added UTC-5 to the deployment calendar, that's cute [14:50:54] it's a small patch [14:51:46] (03PS2) 10Giuseppe Lavagetto: mediawiki: do not restart hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/170009 [14:53:05] OH, whoa, it's added dynamically, crazypants [14:53:13] Anyway, I'll do SWAT [14:53:20] <^d> I am actually awake. [14:53:24] I'm staring down a nasty rebase conflict and I could use the break [14:53:59] yeah, I get UTC+5.5 [14:54:04] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: do not restart hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/170009 (owner: 10Giuseppe Lavagetto) [14:54:12] rebase conflict best conflict [14:55:34] YuviPanda: Not when it's because of file-wide style changes [14:55:43] I've stopped even looking at the conflict [14:55:51] well, still better than a conflict over Oil, for example... [14:56:04] I just delete <<<<< HEAD and =====, then try to figure out where the new stuff is supposed to be in reality [14:56:04] and people who make file wide style changes should be asked to rebase open active patches too [14:56:13] YEAH, JAMES_f [14:56:22] But no, I don't mind [14:56:47] I should open a connection to tin [14:57:02] (03PS1) 10Alexandros Kosiaris: Fix typo with backup::set [puppet] - 10https://gerrit.wikimedia.org/r/170045 [14:57:24] aude: Only wmf6? [14:57:28] Not that I'm complaining [14:57:29] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo with backup::set [puppet] - 10https://gerrit.wikimedia.org/r/170045 (owner: 10Alexandros Kosiaris) [14:57:34] yes, only affects test.wikidata [14:58:05] OK then [14:58:37] I'll get a head start on Jenkins [15:00:04] manybubbles, anomie, ^d, marktraceur, aude: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T1500). Please do the needful. [15:00:13] * marktraceur is on it already, damn jouncebot [15:00:52] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:01:23] I wish we could put mediawiki/core patches that are in wmf/* branches into this channel [15:01:39] * marktraceur looks at YuviPanda expectantly [15:01:50] ah, hmm. that's fairly trivial to do... [15:02:03] but I'm off this week! Someone else do it! it's just config, we already do that for betalabs patches [15:02:06] (03CR) 10Filippo Giunchedi: [C: 031] HHVM: get 10% of anonymous traffic. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 (owner: 10Giuseppe Lavagetto) [15:02:12] I don't think it's that useful in -dev anyway [15:02:18] I might do it in between stupid rebases [15:02:44] yay :) [15:03:03] YuviPanda: I'd also like grrrit-wm to ping the +2er when it all gets merged finally, but that's harder. :) [15:03:24] YuviPanda: If you're off this week, why are you hanging around here? [15:03:29] We're terrible [15:03:36] marktraceur: that already happens, no? when jenkins-bot merges.. [15:03:40] marktraceur: owner is highlighted [15:03:44] marktraceur: I'm in the himalayas! 
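marktraceur's conflict routine above, spelled out as a sketch:

    git rebase origin/master   # hits the file-wide style-change conflict
    # in each conflicted file, delete the <<<<<<< HEAD / ======= / >>>>>>>
    # markers and re-place the new hunk where it belongs in the restyled file
    git add -A
    git rebase --continue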
[15:03:45] Owner, but not reviewer [15:03:49] Neato [15:04:07] aaah, yeah... [15:04:15] marktraceur: it's cold out at night, and i'm in a village. [15:04:23] marktraceur: so sitting around on a stone table with a bunch of other geeks [15:04:27] I hope you brought enough t-shirts [15:04:50] marktraceur: as usual I didn't [15:04:54] Oh, you'd love my new locale. It's 5 celsius outside and I have the patio door open [15:05:07] (only a little bit though) [15:05:13] Oh, merged [15:05:17] yay [15:05:39] marktraceur: heh, I plan on visiting [15:05:42] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:05:58] YuviPanda: You can wait until summer, which is much better [15:07:21] Syncing [15:07:27] ok [15:07:35] !log marktraceur Synchronized php-1.25wmf6/extensions/Wikidata/: [SWAT] [wmf6] Fix edit link for aliases (duration: 00m 12s) [15:07:36] Aaaaand test, aude [15:07:42] Logged the message, Master [15:07:46] looks good [15:07:47] thanks! [15:07:51] !log moving shards off of elastic1015 and elastic1016 so we can replace their hard drives/turn on hyper threading [15:07:56] Logged the message, Master [15:08:21] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:08:29] Sweet [15:08:34] I declare SWAT closed [15:08:50] ohi [15:08:52] manybubbles: [15:08:53] :) [15:09:07] hi! [15:09:17] (03PS2) 10Filippo Giunchedi: syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 [15:09:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] syslog-ng: keep 30 days worth of logs [puppet] - 10https://gerrit.wikimedia.org/r/170002 (owner: 10Filippo Giunchedi) [15:09:29] I'm in and out a bit this morning because I'm making up for the day off I was only half able to take yesterday :) [15:11:02] hah, aye [15:11:16] what's the servers status/what should I work on today? [15:13:24] !log upgrading PHP on mw1113 to php5_5.3.10-1ubuntu3.15+wmf1 [15:13:30] Logged the message, Master [15:14:03] (03CR) 10Mark Bergsma: [C: 032] HHVM: get 10% of anonymous traffic. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 (owner: 10Giuseppe Lavagetto) [15:14:11] (03Merged) 10jenkins-bot: HHVM: get 10% of anonymous traffic. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170039 (owner: 10Giuseppe Lavagetto) [15:14:20] oops [15:15:51] <^d> ottomata: Well cmjohnson took 3-6 down already for ssd replacement. Nik was saying we could drain 15-16 next. [15:15:54] ^d ottomata manybubbles elastic1003-6 is in installer now...a few more mins [15:16:00] <^d> Sweet, thx [15:16:02] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:16:18] <^d> I'll go ahead and drain shards from those, actually. [15:16:38] i didn't think of the +2 effects here ;) [15:17:08] <^d> Oh he beat me to it. [15:17:33] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [15:17:43] <^d> cmjohnson, ottomata: 1015 and 1016 are next, Nik's already draining their shards now. 
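One standard way to "drain" Elasticsearch nodes as described above is allocation filtering; the log doesn't show the exact command, so this is a sketch using the node names from it (ES 1.x API of the era):

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._name": "elastic1015,elastic1016"
      }
    }'
    # shards migrate off the excluded nodes; clear the setting to repool them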
[15:18:22] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [15:18:24] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:41] RECOVERY - Host elastic1004 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [15:18:44] (03PS2) 10Reedy: Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 [15:18:49] (03CR) 10Reedy: [C: 032] Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 (owner: 10Reedy) [15:18:51] PROBLEM - DPKG on mw1113 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:18:52] RECOVERY - Host elastic1006 is UP: PING OK - Packet loss = 0%, RTA = 4.39 ms [15:19:04] (03Merged) 10jenkins-bot: Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 (owner: 10Reedy) [15:21:58] ok cool, ja lemme know when 1003-1006 are ready for puppetization [15:22:20] <_joe_> Reedy: if you're going to merge on tin, please let me know [15:22:31] ottomata: I just caught the tail end of that (stupid me not having a bouncer) - I imagine you are waiting on cmjohnson telling you that? [15:22:32] !log reedy Synchronized docroot and w: Fix dbtree caching (duration: 00m 15s) [15:22:33] <_joe_> as we have the 10% hhvm ramp-up waiting [15:22:37] Logged the message, Master [15:22:55] yup [15:23:07] _joe_ i can help do some app servers today [15:23:13] <_joe_> ottomata: cool [15:23:14] can you send me that link to that etherpad with instructions? [15:23:26] <_joe_> http://etherpad.wikimedia.org/p/app-server-upgrade [15:24:32] PROBLEM - Host elastic1003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:26:17] ottomata: all yours...all the old stuff on palladium has been removed but plz check [15:26:22] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [15:26:32] <^d> manybubbles: I've gotta get you on my bouncer. [15:26:33] <^d> :) [15:26:55] _joe_: wmf-reimage? [15:27:07] <_joe_> ottomata: on palladium, yes [15:27:16] ^d: if you've got spare [15:27:18] <_joe_> it's a script that does a few things in your place [15:27:26] _joe_ it isn't on my path? [15:27:27] where is it? [15:27:39] can you guys put that on wikitech instead of etherpad btw? :) [15:29:16] _joe_, should I do 1032-1034 now? [15:29:25] !log oblivian Synchronized wmf-config/CommonSettings.php: Serving 10% of anons with HHVM (duration: 00m 06s) [15:29:31] Logged the message, Master [15:29:41] <_joe_> ottomata: 2 should be enough for today [15:29:55] 1032,1033? [15:30:11] should I do them one at a time? [15:30:40] <_joe_> I do them in parallel, but the first time, do one at a time if you're not confident with the procedure [15:30:53] <_joe_> I did 5 in parallel the other week :P [15:31:46] ha, ok [15:32:23] " October 27: All logged-in users served by HHVM" https://wikitech.wikimedia.org/wiki/Deployments [15:32:36] This didn't happen, did it? I see edits without HHVM tag [15:32:36] _joe_, where is wmf-reimage? [15:32:45] Nemo_bis: No, it didn't [15:33:07] <_joe_> Nemo_bis: no it did not [15:33:23] cmjohnson: wait, which elastic servers am I doing right now? [15:33:30] just double checking so I don't do something dumb [15:33:32] <_joe_> Nemo_bis: we hit a big roadblock and we restarted deploying recently. We should've updated that [15:33:32] 1003-1006 [15:33:38] <_joe_> ottomata: palladium [15:33:57] 4 servers cmjohnson? [15:34:01] _joe_, where on palladium?
[15:34:03] it isn't on my path [15:34:09] ok tanks [15:34:11] thanks [15:34:13] <_joe_> ottomata: on the root's path for sure [15:34:21] (tanks are no good) [15:34:24] <_joe_> /usr/local/bin/ [15:34:25] ah in /usr/local/bin, hm, dunno why that's not on my path [15:34:35] wait, it is... [15:34:39] <_joe_> ottomata: as super user [15:34:41] ah [15:34:44] not executable as me [15:34:44] ok [15:34:51] <_joe_> on purpose [15:34:52] <_joe_> :) [15:34:54] put it in sbin then [15:34:58] ottomata: yep 4 servers...do you want me to do it? [15:35:02] naw i can do it [15:35:07] already there, just double checking [15:35:16] <_joe_> well, not properly a system script [15:36:12] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:39:23] oh, wmf-reimage just does all the salt/puppet key stuff automatically [15:39:23] cool [15:39:54] !log starting to reimage mw1032 [15:40:02] Logged the message, Master [15:42:00] _joe_, i'm editing the pybal file [15:42:08] vim says it is open by someone else too :) [15:42:15] s'ok? [15:42:18] <_joe_> ottomata: which is me right now, sorry [15:42:21] <_joe_> 1 sec [15:42:32] k [15:42:32] <_joe_> go on [15:43:08] !log Going to upgrade Zuul and monitor the result over the next hour. [15:43:10] hello ;) [15:43:12] _joe_, what is the process with pybal configs now? [15:43:13] Logged the message, Master [15:43:16] <_joe_> hashar! [15:43:17] do we git commit every change? [15:43:42] there's no !log in jabber :) [15:43:46] <_joe_> ottomata: you can do a commit of all our changes at the end of your day [15:43:47] * greg-g waves to hashar [15:43:52] ook! [15:44:18] greg-g: hey I can use dologmsg on tin.eqiad.wmnet though :] [15:44:34] * greg-g thought bubbles "maybe we should have a non-IRC non-bastion !log alternative" [15:44:52] i was tempted to greet hashar in gerrit today [15:45:29] <^d> greg-g: You can edit [[SAL]] by hand :) [15:45:45] (03PS1) 10Ottomata: Reimage mw1032 as appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/170052 [15:45:47] ^d: less wiki editing the better :P [15:46:03] * greg-g just committed heresy [15:46:46] * mark puts a real SWAT team on the deployment calendar, to greg's location [15:46:54] haha [15:47:04] (03CR) 10Ottomata: [C: 032] Reimage mw1032 as appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/170052 (owner: 10Ottomata) [15:49:33] ^d, cmjohnson, 1003-1006 are coming up...i think. 1003 is taking a while to do some elasticsearch stuff [15:49:34] HMMM [15:49:39] [elastic1003] failed to send join request to master [15:49:45] <^d> ruh roh [15:49:46] [[elastic1001][ [15:50:16] <^d> I see it joined tho. [15:50:19] hm, i see it now [15:50:26] <^d> I think we're ok [15:50:28] reason [org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task. [15:50:31] that is the last message though [15:50:34] in production...log [15:51:15] <^d> Ah, [2014-10-30 15:49:49,938][INFO ][discovery.zen ] [elastic1001] received a join request for an existing node [[elastic1003][OjlwXz3DQNuxWkhpN9YWAA][elastic1003][inet[/10.64.0.110:9300]]{rack=A3, row=A, master=false}] [15:51:24] (03CR) 10Anomie: Only add the "oauthadmin" group on the central wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [15:51:30] hm. [15:51:38] <^d> no, it's ok. [15:51:40] <^d> all of them did that. [15:51:43] <^d> i think. [15:52:18] <^d> manybubbles: we ok? [15:52:31] (03CR) 10Hoo man: "Good catch... doh.
Will follow-up in a second" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [15:54:02] (03PS1) 10Hoo man: Declare $wgGroupPermissions global in anon function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170056 [15:54:46] anomie: https://gerrit.wikimedia.org/r/170056 [15:54:48] feel free to deploy whenever you want [15:54:52] I can also do it now, if urgent [15:55:03] <^d> ottomata: Yeah, I think we're cool. [15:55:06] man... that this happens twice to me in two weeks (not declaring globals) [15:55:07] <^d> Going to unban 3-6. [15:55:10] cool. [15:55:15] yeah they look normal now [15:55:34] hoo: Probably not that urgent, as long as no one wants their OAuth consumers approved. [15:55:46] heh... who would want that anyway :D [15:58:57] I think I'll just push it now, it's trivial [16:00:04] awight, AndyRussG: Dear anthropoid, the time has come. Please deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T1600). [16:00:05] (03CR) 10Hoo man: [C: 032] "Trivial fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170056 (owner: 10Hoo man) [16:00:12] (03Merged) 10jenkins-bot: Declare $wgGroupPermissions global in anon function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170056 (owner: 10Hoo man) [16:00:24] (03PS1) 10Giuseppe Lavagetto: HHVM: fix monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/170060 [16:00:52] !log hoo Synchronized wmf-config/CommonSettings.php: Fix oauthadmin (duration: 00m 09s) [16:00:58] Logged the message, Master [16:02:56] * hoo away again [16:03:08] !log Stopping zuul [16:03:12] Logged the message, Master [16:04:31] !log restarted Zuul with upgraded version ( wmf-deploy-20140924-1..wmf-deploy-20141030-1 ) [16:04:36] Logged the message, Master [16:05:32] (03PS1) 10Ottomata: Disable test kafkatee webstatscollector instance to troubleshoot missing lines [puppet] - 10https://gerrit.wikimedia.org/r/170061 [16:05:39] lets see what happens [16:06:28] (03CR) 10Ottomata: [C: 032] Disable test kafkatee webstatscollector instance to troubleshoot missing lines [puppet] - 10https://gerrit.wikimedia.org/r/170061 (owner: 10Ottomata) [16:07:33] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100% [16:11:22] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [16:16:30] ^d: back - everything look ok? [16:16:37] <^d> Yeah [16:16:54] <^d> cmjohnson: 1015 and 1016 are empty and ready whenever. [16:17:08] (03PS3) 10Reedy: Add robots.txt rewrite rule where wiki is public [puppet] - 10https://gerrit.wikimedia.org/r/147487 [16:17:18] ^d thanks...getting them now [16:17:32] !log powering off elastic1015-16 to replace ssds [16:17:39] Logged the message, Master [16:19:00] ^d: cool. [16:19:05] ottomata: elastic1003 doesn't have ht [16:20:02] PROBLEM - Host elastic1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:18] manybubbles...are you sure? 
i changed the setting [16:20:41] PROBLEM - Host elastic1016 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:46] ottomata: elastic1006 doesn't have noatime [16:21:09] cmjohnson: I'm sure - yeah - half the number of cores in top that I expect [16:21:16] (03PS1) 10Hashar: contint: Zuul status learned to query a single change [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) [16:22:01] (03PS2) 10Hashar: contint: Zuul status learned to query a single change [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) [16:22:16] !log moving shards off of elastic1003 and elastic1006 so they can be restarted. elastic1003 need hyperthreading and elastic1006 needs noatime. [16:22:22] Logged the message, Master [16:22:44] (03PS4) 10Filippo Giunchedi: import debian/ directory [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 [16:24:01] (03PS2) 10Reedy: Make apple-touch-icon.png configurable via touch.php [puppet] - 10https://gerrit.wikimedia.org/r/147488 [16:25:32] oh why does misc varnish suddenly cache Zuul status page :/ [16:26:54] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [16:28:31] !log awight Synchronized php-1.25wmf5/extensions/CentralNotice: push CentralNotice updates (duration: 00m 11s) [16:28:37] Logged the message, Master [16:28:49] (03CR) 10Filippo Giunchedi: "let's shoot for monday (nov 3rd)" [puppet] - 10https://gerrit.wikimedia.org/r/153764 (owner: 10Hashar) [16:31:55] !log awight Synchronized php-1.25wmf6/extensions/CentralNotice: push CentralNotice updates (duration: 00m 09s) [16:32:00] Logged the message, Master [16:32:52] (03CR) 10Filippo Giunchedi: [C: 04-1] "unfortunately I think most (all?) of those metrics were based off swift logtailer which is no more (swift native statsd has taken its plac" [puppet] - 10https://gerrit.wikimedia.org/r/170007 (owner: 10Alexandros Kosiaris) [16:33:10] (03PS1) 10Jgreen: add codfw fundraising hosts [dns] - 10https://gerrit.wikimedia.org/r/170072 [16:34:39] (03CR) 10Filippo Giunchedi: [C: 04-1] Add robots.txt rewrite rule where wiki is public (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/147487 (owner: 10Reedy) [16:35:14] (03CR) 10Reedy: [C: 04-1] ". -> \. too" [puppet] - 10https://gerrit.wikimedia.org/r/147488 (owner: 10Reedy) [16:36:55] <_joe_> Reedy: please use an Include for robots.txt, favicon, etc [16:37:05] <_joe_> didn't I do that already, btw? [16:37:13] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:37:27] (03PS1) 10Andrew Bogott: Replace libmariadbclient-dev with libmysqlclient-dev. [puppet] - 10https://gerrit.wikimedia.org/r/170073 [16:37:48] _joe_: Really, most of the configs just want refactoring further so they all use the common incl [16:38:10] !log Zuul status page is freezing because the status.json is being cached :-/ [16:38:16] Logged the message, Master [16:38:17] (03CR) 10Jgreen: [C: 032 V: 031] add codfw fundraising hosts [dns] - 10https://gerrit.wikimedia.org/r/170072 (owner: 10Jgreen) [16:38:27] <_joe_> Reedy: I'll take a look tomorrow [16:38:36] !log Upgraded kibana to v3.1.1 via Trebuchet [16:38:42] Logged the message, Master [16:39:53] (03CR) 10coren: Replace libmariadbclient-dev with libmysqlclient-dev. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:40:09] (03PS2) 10Giuseppe Lavagetto: HHVM: get 15% of anonymous traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 [16:40:20] well that's not good :/ kibana 3.1.1 works fine in beta but not so much in prod [16:40:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "Let's go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170040 (owner: 10Giuseppe Lavagetto) [16:41:07] (03CR) 10Yuvipanda: Replace libmariadbclient-dev with libmysqlclient-dev. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:41:28] (03PS3) 10Reedy: Make apple-touch-icon.png configurable via touch.php [puppet] - 10https://gerrit.wikimedia.org/r/147488 [16:41:30] (03PS4) 10Reedy: Add robots.txt rewrite rule where wiki is public [puppet] - 10https://gerrit.wikimedia.org/r/147487 [16:41:45] RECOVERY - Host elastic1015 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [16:42:01] bd808: ruby issue? (kibana uses ruby, right?) [16:42:14] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [16:42:36] No, it's all browser js. Maybe something with the apache reverse proxy. Looking now [16:44:03] PROBLEM - RAID on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:14] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [16:44:14] PROBLEM - puppet last run on elastic1016 is CRITICAL: Connection refused by host [16:44:14] PROBLEM - puppet last run on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:23] PROBLEM - check if dhclient is running on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:23] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.13:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:44:24] PROBLEM - DPKG on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:24] PROBLEM - Disk space on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:34] PROBLEM - SSH on elastic1015 is CRITICAL: Connection timed out [16:44:34] PROBLEM - check if salt-minion is running on elastic1015 is CRITICAL: Timeout while attempting connection [16:44:35] PROBLEM - check configured eth on elastic1016 is CRITICAL: Connection refused by host [16:44:35] PROBLEM - check if dhclient is running on elastic1016 is CRITICAL: Connection refused by host [16:44:38] !log oblivian Synchronized wmf-config/CommonSettings.php: Serving 15% of anons with HHVM (ludicrous speed!) 
(duration: 00m 16s) [16:44:45] Logged the message, Master [16:44:47] PROBLEM - check if salt-minion is running on elastic1016 is CRITICAL: Connection refused by host [16:44:53] PROBLEM - SSH on elastic1016 is CRITICAL: Connection refused [16:44:53] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.12:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:44:54] PROBLEM - DPKG on mw1016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:44:54] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [16:45:03] PROBLEM - check configured eth on elastic1015 is CRITICAL: Timeout while attempting connection [16:45:04] PROBLEM - Disk space on elastic1016 is CRITICAL: Connection refused by host [16:45:05] sorry i didn't ack these problems b4 [16:45:23] PROBLEM - RAID on elastic1016 is CRITICAL: Connection refused by host [16:45:34] RECOVERY - SSH on elastic1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:45:42] <_joe_> Reedy: kibana is php + js [16:46:02] https://github.com/elasticsearch/kibana [16:46:04] <_joe_> ottomata: are you reimaging mw1032 right now? [16:46:07] JavaScript 93.2% CSS 5.2% Ruby 1.5% Shell 0.1% [16:46:20] I'm not going completely mad at least [16:46:35] <_joe_> Reedy: uhm I was sure it was part php [16:46:52] !log Reverted kibana to e317bc6 [16:46:57] Logged the message, Master [16:47:20] _joe_: v1 had php in it, v2 switched to ruby, v3 is all browser js [16:47:25] _joe_ yes [16:47:33] v4 (in beta now) brings ruby back :( [16:47:34] ah puppet finished [16:47:45] running puppet again with salt key signed [16:47:54] RECOVERY - SSH on elastic1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:48:00] <_joe_> bd808: v4 will be a swift ios-only app [16:48:06] ottomata 1015-6 are yours [16:48:13] ^d ^^^ [16:49:01] bd808: oh it does? :( [16:49:02] heh. They are adding a sinatra ruby app back in v4 to do the reverse proxy to elasticsearch. Apparently setting up apache/nginx as a proxy is too hard or something. [16:49:11] LOL [16:49:23] computers are hard [16:49:23] I did not read that .... [16:49:41] paravoid: https://github.com/elasticsearch/kibana/tree/master is the 4.0 beta code now. [16:50:29] can this shit just be worked around? [16:50:32] greg-g: hey, I'm going over my deploy window a hair in order to kick some SWAT stuff for marktraceur [16:50:41] (03CR) 10Mark Bergsma: [C: 032] Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 (owner: 10Ottomata) [16:50:46] This seems to be most of what the sinatra app does for now -- https://github.com/elasticsearch/kibana/blob/master/src/server/routes/proxy.rb [16:50:57] (03PS1) 10Alexandros Kosiaris: Introduce postgres role for labsdb1004, labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/170077 [16:51:25] But there is a /plugins/ route too that maybe will do something fancy in the future [16:52:32] ooo. maybe the new version isn't broken. Looks like we have 0 events logged since 00:00Z :( [16:52:51] * bd808 shakes fist at logstash [16:53:11] (03PS4) 10Ottomata: Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 [16:53:43] (03CR) 10Ottomata: [C: 032] Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 (owner: 10Ottomata) [16:54:05] (03PS2) 10Andrew Bogott: Replace libmariadbclient-dev with libmysqlclient-dev.
[puppet] - 10https://gerrit.wikimedia.org/r/170073 [16:54:58] !log uploaded php5_5.3.10-1ubuntu3.15+wmf1 on apt.wikimedia.org [16:55:05] Logged the message, Master [16:56:14] marxarelli: ok yr patches are merged, deploying now... [16:56:40] ori: ping [16:56:44] sorry, marktraceur ^^ [16:57:01] marktraceur: can I just sync-file for that? [16:57:08] awight: Should be able to [16:57:08] PROBLEM - NTP on elastic1015 is CRITICAL: NTP CRITICAL: No response from NTP server [16:57:15] It's just the one file, IIRC [16:57:19] k [16:58:03] PROBLEM - NTP on elastic1016 is CRITICAL: NTP CRITICAL: No response from NTP server [16:58:15] !log restarted logstash on logstash1001. No events logged since 00:00Z [16:58:20] Logged the message, Master [16:58:21] !log restarted logstash on logstash1002. No events logged since 00:00Z [16:58:21] (03CR) 10coren: [C: 031] "That should work." [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:58:27] Logged the message, Master [16:58:28] ^d, we have 2 more nodes to do, right? [16:58:52] !log awight Synchronized php-1.25wmf5/includes/specials/SpecialUpload.php: Parse 'upload_source_url' message on SpecialUpload (duration: 00m 11s) [16:58:57] Logged the message, Master [16:59:20] <^d> ottomata: 9-12. [16:59:22] I'm asking a commonist to test, awight [16:59:35] !log restarted logstash on logstash1003. No events logged since 00:00Z [16:59:42] Logged the message, Master [16:59:52] (03CR) 10Andrew Bogott: [C: 032] Replace libmariadbclient-dev with libmysqlclient-dev. [puppet] - 10https://gerrit.wikimedia.org/r/170073 (owner: 10Andrew Bogott) [16:59:55] marktraceur: not quite... deployed... [17:00:24] Oh [17:00:26] !log awight Synchronized php-1.25wmf6/includes/specials/SpecialUpload.php: Parse 'upload_source_url' message on SpecialUpload (duration: 00m 10s) [17:00:29] marktraceur: ok, now it's deployed for 1.25wmf5 and 6 [17:00:31] Logged the message, Master [17:00:32] Sweet [17:00:37] It only needed wmf5 to be testable [17:00:41] Commons is on wmf5 still [17:00:55] awight: s'ok, thanks for the heads up (cc marktraceur ) [17:01:20] !log Logs on logstash1003 showed "Failed to flush outgoing items " on shutdown. Maybe something not quite right about elasticsearch_http plugin? [17:01:25] Logged the message, Master [17:04:33] awight, greg-g, s\ says it works [17:04:40] Oh, it's Steinsplitter, OK [17:04:58] <^d> ottomata: Is 1006 noatime'd now? [17:06:03] marktraceur: greg-g: okay, doffing my deploy hat! [17:06:06] thanks [17:09:59] yes [17:10:03] shoudl be [17:10:20] Lots and lots of db failures in logstash. "Connection error: Unknown error (10.64.16.29)" [17:10:41] Like 2500 in the last 15 minutes [17:11:18] awight: ^ [17:11:27] marktraceur: ^ [17:11:34] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:38] !log Upgraded kibana to v3.1.1 again. Better testing now that logstash is working. [17:11:44] Logged the message, Master [17:11:54] <_joe_> mmmmh [17:11:55] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:57] Couldn't have been us...could it? We just changed a message to be parsed [17:12:04] <_joe_> lemme see what's happening on mw1023 [17:12:14] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 2 failures [17:12:16] <_joe_> this is the second time this happens today... not good [17:12:21] (03PS1) 10Andrew Bogott: Install libmysqlclient-dev on Trusty but not on Precise. 
[puppet] - 10https://gerrit.wikimedia.org/r/170089 [17:12:44] <^d> ottomata: 1003 is empty for HT. [17:12:57] ^d. will be with you in a bit, in meeting [17:13:02] (03CR) 10jenkins-bot: [V: 04-1] Install libmysqlclient-dev on Trusty but not on Precise. [puppet] - 10https://gerrit.wikimedia.org/r/170089 (owner: 10Andrew Bogott) [17:13:15] <^d> ottomata: no worries :) [17:14:18] greg-g: bd808: I don't think that can be related to our deployments... [17:14:22] (03PS2) 10Andrew Bogott: Install libmysqlclient-dev on Trusty but not on Precise. [puppet] - 10https://gerrit.wikimedia.org/r/170089 [17:14:35] awight: I'm just reflexive with "who touched things last" :) [17:14:42] If it matters for anything, I did see the MysqlDB version toggling between 5 and 10 in Special:Version [17:14:47] greg-g: of course, no worries! [17:15:19] greg-g: what I touched: CentralNotice front and backend things; Special:Upload parsing thing [17:15:35] greg-g: exact patches are still available on the wikitech Deployments page, if u want [17:15:50] greg-g: should I stick around for debugging, or is this a decent time to BART for 30 min? [17:15:54] s'ok, waiting for joejoe [17:15:54] db1040 looks sick (server that's not responding) -- https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1040.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1414689286&v=0.73&m=load_fifteen&vl=%20&ti=Fifteen%20Minute%20Load%20Average&z=large [17:16:09] looks like you're in the clear [17:16:10] ok biab [17:16:11] awight: [17:16:15] :D [17:16:23] (03CR) 10coren: [C: 031] "Oh, horrors." [puppet] - 10https://gerrit.wikimedia.org/r/170089 (owner: 10Andrew Bogott) [17:16:23] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 2 failures [17:16:24] !log Upgrading Zuul to have the status page emit a Cache-Control header {{bug|72766}} wmf-deploy-20141030-1..wmf-deploy-20141030-2 [17:16:32] Logged the message, Master [17:16:39] * awight looks appreciatively at still beating, stolen heart of the db machine and cackles into the dawn [17:18:39] !log "Connection error: Unknown error (10.64.16.29)" 1052 in last 5m; 2877 in last 15m [17:18:45] Logged the message, Master [17:19:20] (03CR) 10Andrew Bogott: [C: 032] Install libmysqlclient-dev on Trusty but not on Precise. [puppet] - 10https://gerrit.wikimedia.org/r/170089 (owner: 10Andrew Bogott) [17:21:07] !log 10.64.16.29 is db1040 in the s4 pool [17:21:12] Logged the message, Master [17:22:36] which opsen is taking a look at db1040? [17:24:38] _joe_: you may have missed, db1040 is sick [17:26:00] bleugh s4 master too [17:26:37] hm [17:26:55] (03PS1) 10Ottomata: Fix comment about row of analytics1012 [puppet] - 10https://gerrit.wikimedia.org/r/170092 [17:27:01] yay commons [17:27:36] doesn't look to be doing a great deal [17:29:54] <^d> manybubbles: Should we start draining 9-12? [17:30:07] <^d> Or wait until 2 of the 4 we have down come up? [17:31:43] PROBLEM - HHVM rendering on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:03] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:16] uhhhh [17:33:36] ^d: meh - we've got plenty of space. I think we can drain [17:34:43] <^d> I'll start [17:35:03] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.392 second response time [17:35:18] <_joe_> and I found the problem [17:35:26] <_joe_> not that this is very comforting [17:35:49] hhvm problem? [17:35:51] or s4?
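(A quick triage sketch for tying an error flood like the one !logged above back to a host; the dberror log path is an assumption:)

    host 10.64.16.29    # reverse-resolves the IP from the error message (db1040)
    # rough error volume from the MediaWiki DB error log:
    grep -c 'Unknown error (10.64.16.29)' /a/mw-log/dberror.log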
[17:35:55] <_joe_> hhvm [17:35:59] <_joe_> sorry I'm focused on that [17:36:07] that's fine :) [17:36:07] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:10] _joe_: good to stay focused :) [17:36:13] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:13] <_joe_> I figured I'm not the only ops here [17:36:14] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:23] Reedy: how about now? more apaches [17:36:23] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:29] <_joe_> mmmh very bad [17:36:34] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:36] (03CR) 10Hashar: [C: 031 V: 032] "I have manually edited on gallium /etc/apache/zuul_proxy and reloaded apache. That unleashed the feature. So this has been tested :-D" [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) (owner: 10Hashar) [17:36:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [17:36:46] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:53] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:54] PROBLEM - HHVM rendering on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.980 second response time [17:36:54] PROBLEM - Apache HTTP on mw1053 is CRITICAL: Connection timed out [17:36:54] PROBLEM - Apache HTTP on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.023 second response time [17:37:02] <_joe_> wat? [17:37:13] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:16] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:17] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:18] Are they just api apaches? [17:37:23] <_joe_> Reedy: no [17:37:24] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.507 second response time [17:37:33] <_joe_> hhvm apaches [17:37:34] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 1.134 second response time [17:38:03] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.441 second response time [17:38:08] !log Zuul seems to be happy. Reverted my lame patch to send Cache-Control headers since we have a cache breaker it is not needed. 
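(The status.json caching problem mentioned above can be confirmed from the response headers; a sketch, with the URL as an illustrative assumption:)

    curl -sI 'https://integration.wikimedia.org/zuul/status.json' | grep -iE '^(cache-control|age):'
    # The reverted patch amounted to having the backend emit something like:
    #   Cache-Control: no-cache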
[17:38:09] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.214 second response time [17:38:12] Logged the message, Master [17:38:24] PROBLEM - HHVM rendering on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.068 second response time [17:38:33] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 7.463 second response time [17:38:33] PROBLEM - HHVM rendering on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:37] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:51] <_joe_> shit [17:39:14] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:14] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 5.387 second response time [17:39:24] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.524 second response time [17:39:27] _joe_: want ori? [17:39:36] <_joe_> greg-g: not at the moment [17:39:43] PROBLEM - HHVM rendering on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.323 second response time [17:39:44] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.114 second response time [17:39:44] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 66193 bytes in 2.936 second response time [17:40:14] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:14] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:25] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: Connection timed out [17:40:53] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.520 second response time [17:41:14] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.344 second response time [17:41:15] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:34] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:45] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.417 second response time [17:41:45] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:22] <_joe_> !log rolling restarted hhvm appservers [17:42:24] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.423 second response time [17:42:25] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.287 second response time [17:42:25] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 8.851 second response time [17:42:25] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 9.776 second response time [17:42:26] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.903 second response time [17:42:27] Logged the message, Master [17:42:56] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.269 second response time [17:42:56] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.362 second response time [17:42:57] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is 
OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.197 second response time [17:43:17] PROBLEM - Apache HTTP on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [17:43:18] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [17:43:37] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.085 second response time [17:43:48] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.234 second response time [17:43:57] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.175 second response time [17:43:58] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [17:43:58] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.171 second response time [17:44:17] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [17:44:38] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [17:45:52] (03PS1) 10Alexandros Kosiaris: servermon: run make_update 6 times a day [puppet] - 10https://gerrit.wikimedia.org/r/170096 [17:46:14] (03PS1) 10Reedy: Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 [17:46:18] (03CR) 10jenkins-bot: [V: 04-1] Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 (owner: 10Reedy) [17:46:43] <_joe_> Reedy: just write 5 there :) [17:46:45] <_joe_> Reedy: on it [17:46:47] PROBLEM - Apache HTTP on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [17:46:58] PROBLEM - HHVM rendering on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.048 second response time [17:47:29] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.700 second response time [17:47:29] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:31] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:31] PROBLEM - HHVM rendering on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:48] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.311 second response time [17:47:59] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:09] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:10] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:17] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.499 second response time [17:48:29] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time [17:48:58] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.221 second response time [17:49:00] (03PS2) 10Reedy: Revert "HHVM: get 10% of anonymous traffic." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 [17:49:04] (03CR) 10jenkins-bot: [V: 04-1] Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 (owner: 10Reedy) [17:49:07] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.096 second response time [17:49:18] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.093 second response time [17:49:28] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.190 second response time [17:49:32] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 1.013 second response time [17:49:35] (03Abandoned) 10Reedy: Revert "HHVM: get 10% of anonymous traffic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170097 (owner: 10Reedy) [17:49:48] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:51] (03PS1) 10Giuseppe Lavagetto: revert to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170098 [17:49:58] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.099 second response time [17:50:18] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.352 second response time [17:51:01] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.361 second response time [17:51:47] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [17:52:57] (03CR) 10Giuseppe Lavagetto: [C: 032] revert to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170098 (owner: 10Giuseppe Lavagetto) [17:53:57] !log oblivian Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 06s) [17:54:07] Logged the message, Master [17:54:11] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:11] PROBLEM - HHVM rendering on mw1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [17:54:18] <_joe_> !log synchronized downsizing to 5% [17:54:23] Logged the message, Master [17:54:27] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:07] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:20] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.304 second response time [17:55:24] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:29] getting 500s on mw.org [17:55:49] yep [17:55:51] known :) [17:56:13] kk [17:58:07] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:17] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:51] ottomata: did you get elastic1015/16 back [17:59:02] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: run make_update 6 times a day [puppet] - 10https://gerrit.wikimedia.org/r/170096 (owner: 10Alexandros Kosiaris) [17:59:16] (03CR) 10Alexandros Kosiaris: [C: 032] Actually use view_name on ganglia::view [puppet] - 10https://gerrit.wikimedia.org/r/170037 (owner: 10Alexandros Kosiaris) [17:59:35] ^d lmk when it's safe to upgrade 1009-1012 [17:59:49] <^d> Will do, still draining. [18:00:10] <^d> 15-16 are not up tho? or still waiting on otto for puppet?
[18:01:19] <_joe_> greg-g: ori can be helpful, yes [18:03:28] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [18:03:37] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.165 second response time [18:03:46] _joe_: texted [18:03:57] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.066 second response time [18:04:17] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.173 second response time [18:04:26] ori just walked in [18:04:45] RoanKattouw: voice to him to get on IRC plz :) [18:04:56] ^d adding to puppet now [18:05:14] hey [18:05:16] i'm here [18:05:16] <^d> coolio [18:05:18] what's up? [18:05:44] segfaults all over the place [18:05:46] 19:52 -!- Irssi: Pasting 5 lines to #mediawiki_security. Press Ctrl-K if you wish to do this or Ctrl-C to cancel. [18:05:50] 19:52 < paravoid> #9 ifree (ptr=0x7f452e357700) at src/jemalloc.c:1233 [18:05:51] okay, looking [18:05:53] 19:52 < paravoid> #10 free (ptr=0x7f452e357700) at src/jemalloc.c:1308 [18:05:56] 19:52 < paravoid> #11 0x00007f457d733688 in xmlFreeNodeList__internal_alias (cur=) at ../../tree.c:3683 [18:05:59] 19:52 < paravoid> #12 0x00007f457d733840 in xmlFreeProp__internal_alias (cur=0x7f4506ede2c0) at ../../tree.c:2081 [18:06:00] multiple hosts or only one? [18:06:02] 19:52 < paravoid> #13 0x00007f457d73390c in xmlFreePropList__internal_alias (cur=0x7f45d16c5318) at ../../tree.c:2056 [18:06:05] 19:52 < paravoid> [...] [18:06:07] 19:52 < paravoid> #19 0x00007f457d733621 in xmlFreeNodeList__internal_alias (cur=) at ../../tree.c:3654 [18:06:08] probably a double free [18:06:10] 19:52 < paravoid> #20 0x00007f457d733a3c in xmlFreeNode__internal_alias (cur=0x7f452e34a980) at ../../tree.c:3728 [18:06:13] 19:52 < paravoid> #21 0x0000000000eb8372 in HPHP::c_DOMDocument::sweep() () [18:06:16] 19:52 < paravoid> #22 0x000000000091a396 in HPHP::Sweepable::SweepAll() () [18:06:19] multiple [18:06:22] yup [18:06:45] <_joe_> ori: hell broke loose [18:06:47] (03PS1) 10Milimetric: Update wikimetrics submodule pointer [puppet] - 10https://gerrit.wikimedia.org/r/170103 [18:06:49] The Valve owned wiki of Team Fortress is having issues as well [18:06:52] "(Cannot contact the database server)" [18:06:53] okay, is the operational side of things under control? can I look at traces and fix this? [18:06:54] weird [18:07:03] ottomata: can you merge https://gerrit.wikimedia.org/r/170103 [18:07:07] just a submodule pointer [18:07:24] <_joe_> ori: ygpm [18:07:24] <_joe_> ori: the whole appservers cluster [18:07:28] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:47] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:03] the whole appservers cluster what? [18:08:14] <_joe_> ori: it kinda is now [18:08:22] <_joe_> but it's flapping [18:08:22] is what? [18:08:26] <_joe_> like ^^ [18:08:38] are you handling it? can i look away from irc and look at the source code and find a fix? 
[18:09:38] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [18:09:46] <_joe_> ori: the whole hhvm appserver cluster had problems [18:09:47] <_joe_> ori: sure [18:09:47] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.220 second response time [18:10:02] okay, tabbing out of irc for a few to look at this [18:10:47] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:58] (03CR) 10Alexandros Kosiaris: "Sigh... anything that can be salvaged ?" [puppet] - 10https://gerrit.wikimedia.org/r/170007 (owner: 10Alexandros Kosiaris) [18:10:59] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:59] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:10] paravoid: how did you get that trace? [18:11:17] dumped core [18:11:22] mw1053 [18:11:31] /var/tmp/hhvm/core [18:11:47] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:20] here's one from mw1024: https://dpaste.de/Q96F/raw [18:12:29] why was this not happening until now? what is the timeline? [18:12:40] this is a guess, but could it be that the XML patch is freeing the node, and a memory sweep is attempting to free that again? [18:12:49] yes, it looks exactly like that [18:12:58] right [18:13:00] <_joe_> what the hell has happened all of a sudden? [18:13:05] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 1.027 second response time [18:13:08] how are memory sweeps triggered in HHVM? [18:13:25] is it memory pressure, is it some memory or time threshold? [18:13:28] PROBLEM - HHVM rendering on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:37] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:46] I'd guess (again) that sweeps weren't initiated before, but now the US is waking up and traffic is increasing [18:13:48] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [18:13:48] end of request [18:13:55] ^d, what's wrong with 1006? [18:14:00] did you dial back down to 5%? [18:14:07] PROBLEM - Host elastic1015 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:11] <_joe_> ori: one server froze this morning [18:14:20] <_joe_> I took a trace and wanted to show you this evening [18:14:26] (03CR) 10Ottomata: [C: 032] Update wikimetrics submodule pointer [puppet] - 10https://gerrit.wikimedia.org/r/170103 (owner: 10Milimetric) [18:14:31] Didn't this happen in August?: https://github.com/facebook/hhvm/issues/3438 [18:14:32] <_joe_> paravoid: none of the above [18:14:37] <_joe_> paravoid: api appservers with hhvm handle 10 times the load [18:14:46] different load [18:14:49] Bsadowski1: different double free [18:14:51] <^d> ottomata: manybubbles had said that 1006 still needed noatime, so he redrained it for that. [18:14:55] Oh [18:14:57] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:58] Okay [18:15:01] i see noatime [18:15:09] <_joe_> ori: yes [18:15:10] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:17] <^d> ottomata: Ok, then I guess 1006 is ok. [18:15:20] ottomata: and 1003 was hyperthreading [18:15:21] <^d> 1003 still needs HT tho. [18:15:41] <^d> Unbanning 1006 since it has noatime. 
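(A minimal sketch of pulling backtraces like the one pasted above out of the dumped core; the hhvm binary path is an assumption:)

    gdb /usr/bin/hhvm /var/tmp/hhvm/core
    (gdb) bt                    # backtrace of the faulting thread
    (gdb) thread apply all bt   # all threads, to spot the sweep/free collision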
[18:15:48] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.088 second response time [18:15:48] ottomata: /dev/md2 on /var/lib/elasticsearch type ext4 (rw) [18:16:06] hmm [18:16:11] from mount [18:16:17] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.203 second response time [18:16:17] ^d: wait a bit [18:16:18] it is in fstab and i remounted... [18:16:21] <^d> k. [18:16:24] hm, strange [18:16:25] ok [18:16:30] i can stop es there now? [18:16:32] on 1006? [18:16:38] <_joe_> paravoid: we're back at 5%, and the load is non-existent. Still, it's crashing [18:16:48] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:48] PROBLEM - HHVM rendering on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:49] <_joe_> one thing we could do is re-upload the old package [18:16:51] OK, current theory: DOMDocument fix causes a double free, causing a segfault, causing a core dump and service restart, during which time HHVM is slow or unresponsive [18:16:53] ottomata: checking [18:16:54] <^d> ottomata: yeah [18:17:03] _joe_: we could, but to the app servers, not api [18:17:07] PROBLEM - Host elastic1016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [18:17:12] ori: the paste you gave https://dpaste.de/Q96F/raw has nothing XML-related [18:17:13] ottomata: yeah - its cool [18:17:22] ori: is this an actually crashed server? [18:17:35] or is it just a working one at some random point in time? [18:17:49] <_joe_> ori: that was my idea [18:17:52] <_joe_> paravoid: that is a server in the locked-down state [18:18:02] paravoid: it was working but had just had an alert issued for it [18:18:21] <_joe_> where it's still up but not responding [18:18:22] i don't think it's locked up, i think it's just initializing under load [18:18:34] what was its uptime? [18:18:42] a few minutes [18:18:47] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:54] well okay, we can ignore that then [18:19:07] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:10] PROBLEM - HHVM rendering on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:17] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:20] <_joe_> ok, re-uploading the old packages [18:19:28] RECOVERY - Host elastic1015 is UP: PING OK - Packet loss = 0%, RTA = 4.57 ms [18:19:51] _joe_: thanks [18:20:48] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:00] <_joe_> ok [18:21:07] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.131 second response time [18:21:21] <_joe_> can someone keep an eye on the cluster in the meantime? 
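(The noatime check above boils down to comparing fstab against the live mount; a sketch, with the fstab options line shown purely as an illustration:)

    grep /var/lib/elasticsearch /etc/fstab
    # e.g. /dev/md2  /var/lib/elasticsearch  ext4  defaults,noatime  0  2
    mount -o remount,noatime /var/lib/elasticsearch
    mount | grep elasticsearch   # should now show (rw,noatime)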
[18:21:25] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:25] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:42] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:51] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:14] PROBLEM - SSH on elastic1015 is CRITICAL: Connection refused [18:22:15] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:16] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:31] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.172 second response time [18:22:32] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [18:22:32] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 1.804 second response time [18:22:41] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.002 second response time [18:22:44] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:52] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [18:23:41] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:41] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.798 second response time [18:23:51] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:53] _joe_: and by keep an eye on you mean? telling you more are having problems? 
[18:25:33] RECOVERY - SSH on elastic1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:25:46] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:52] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.302 second response time [18:25:52] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.323 second response time [18:25:52] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [18:25:52] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [18:26:01] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.298 second response time [18:26:02] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.171 second response time [18:26:02] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [18:26:04] _joe_: heh [18:26:05] er [18:26:07] greg-g: heh :) [18:26:51] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.072 second response time [18:27:41] <_joe_> ok, mw1018 has the old packages [18:27:54] push it to all the app servers (not the api) [18:27:56] it will fix the issue [18:28:05] it's a double-free resulting from the patch for the leak [18:28:12] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:21] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:07] ^ _joe_ [18:30:52] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [18:30:59] um, sorry, not following along here, ori: I was reimaging mw1032 for hhvm [18:31:01] haven't repooled it [18:31:02] should I not? [18:31:11] (03CR) 10JanZerebecki: [C: 031] "After talking about this: It is good to apply this minimal version to fix the security issue. Then change to ssl_ciphersuite() in a new co" [puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [18:31:20] (03CR) 10Dzahn: [C: 032] "Jan, yes, let's make another change though. let me apply the quick fix and confirm really quick it is applied on the instance tools-webpro" [puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [18:31:24] ottomata: hang on for a moment [18:32:01] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [18:32:18] k [18:32:22] _joe_: ack? [18:32:32] can we get the wmf3 package on all the non-api hhvm servers? [18:32:33] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:44] <_joe_> ori: ok, working on it [18:32:51] ok, ^d, manybubbles, 1006 has noatime now [18:32:52] RECOVERY - Disk space on elastic1015 is OK: DISK OK [18:32:52] RECOVERY - DPKG on elastic1015 is OK: All packages OK [18:32:53] sorry about that [18:33:11] RECOVERY - check if dhclient is running on elastic1015 is OK: PROCS OK: 0 processes with command name dhclient [18:33:11] RECOVERY - check if salt-minion is running on elastic1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:33:14] sorry, doing a lot of multitasking today! [18:33:22] ^ is that someone else doing puppet on the remaining nodes? [18:33:25] cmjohnson: ? 
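(A sketch of the rollback being asked for above, pushing the previous package to the non-API HHVM app servers; the salt target and the version string are illustrative assumptions:)

    salt 'mw10*' cmd.run 'apt-get update && apt-get -y --force-yes install hhvm=3.3.0+dfsg1-1+wmf3'
    salt 'mw10*' cmd.run 'dpkg -l hhvm | tail -1'   # confirm the downgrade took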
[18:33:42] RECOVERY - RAID on elastic1015 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:33:42] RECOVERY - check configured eth on elastic1015 is OK: NRPE: Unable to read output [18:33:49] ottomata yeah that's me [18:33:57] PROBLEM - puppet last run on elastic1015 is CRITICAL: CRITICAL: Puppet has 1 failures [18:34:20] <_joe_> ori: I am deploying the package now [18:34:21] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [18:34:21] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 66200 bytes in 0.190 second response time [18:34:32] k, thanks cmjohnson [18:34:36] ottomata: elastic1006 looks good [18:34:51] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.432 second response time [18:35:11] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.236 second response time [18:35:13] !log unbanning elastic1006 now that it is properly configured [18:35:18] !log restarting nginx on toollabs webproxy [18:35:18] Logged the message, Master [18:35:23] Logged the message, Master [18:35:43] ottomata: just 1003 left from the last batch [18:35:46] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [18:35:52] RECOVERY - Disk space on elastic1016 is OK: DISK OK [18:36:02] RECOVERY - RAID on elastic1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:36:05] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.098 second response time [18:36:19] recovering, right? [18:36:41] RECOVERY - check if salt-minion is running on elastic1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:36:41] RECOVERY - check if dhclient is running on elastic1016 is OK: PROCS OK: 0 processes with command name dhclient [18:36:41] RECOVERY - check configured eth on elastic1016 is OK: NRPE: Unable to read output [18:36:52] RECOVERY - DPKG on elastic1016 is OK: All packages OK [18:37:11] PROBLEM - DPKG on mw1163 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:11] PROBLEM - DPKG on mw1025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:11] PROBLEM - HHVM processes on mw1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:37:19] PROBLEM - HHVM processes on mw1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:37:26] PROBLEM - DPKG on mw1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:26] PROBLEM - HHVM rendering on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [18:37:26] PROBLEM - HHVM rendering on mw1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [18:37:26] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:37:26] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [18:37:27] PROBLEM - DPKG on mw1024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:27] PROBLEM - DPKG on mw1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:37:38] PROBLEM - HHVM processes on mw1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:38:02] PROBLEM - HHVM processes on mw1029 is CRITICAL: PROCS CRITICAL: 0
processes with command name hhvm [18:38:04] <_joe_> ori: we're safe and secure for now [18:38:10] PROBLEM - DPKG on mw1023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:38:10] PROBLEM - HHVM processes on mw1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:38:26] blah [18:38:50] mw1030.eqiad.wmnet, mw1031, and mw1032 still don't have wmf3 [18:38:54] PROBLEM - DPKG on mw1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:04] PROBLEM - DPKG on mw1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:05] PROBLEM - HHVM processes on mw1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:39:41] PROBLEM - DPKG on mw1053 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:41] PROBLEM - DPKG on mw1022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:39:42] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [18:39:51] RECOVERY - ElasticSearch health check for shards on elastic1016 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 30, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 18, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 30 [18:39:54] RECOVERY - DPKG on mw1025 is OK: All packages OK [18:40:00] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.191 second response time [18:40:00] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 18: initializing_shards: 0: unassigned_shards: 0 [18:40:03] what is going on [18:40:15] RECOVERY - HHVM processes on mw1027 is OK: PROCS OK: 1 process with command name hhvm [18:40:21] <_joe_> ori: yeah I'm arriving there [18:40:26] RECOVERY - DPKG on mw1026 is OK: All packages OK [18:40:27] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.159 second response time [18:40:27] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.185 second response time [18:40:27] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [18:40:27] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.195 second response time [18:40:28] RECOVERY - DPKG on mw1027 is OK: All packages OK [18:40:37] can you guys please schedule downtime so that we stop getting paged [18:40:40] if these are known issues [18:40:41] RECOVERY - HHVM processes on mw1025 is OK: PROCS OK: 1 process with command name hhvm [18:40:47] jgage: who are you addressing? [18:40:57] is this re elastic or hhvm?
[18:40:57] anybody who is causing us to get paged repeatedly :) [18:41:01] hhvm [18:41:13] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [18:41:15] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.743333333333 [18:41:32] RECOVERY - DPKG on mw1028 is OK: All packages OK [18:41:36] i just got 10 [18:41:42] RECOVERY - DPKG on mw1029 is OK: All packages OK [18:41:43] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [18:41:44] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.198 second response time [18:41:54] is your SMS queue really the most important thing right now? [18:41:59] could we talk about that in, say, five minutes? [18:42:03] RECOVERY - HHVM processes on mw1028 is OK: PROCS OK: 1 process with command name hhvm [18:42:08] <_joe_> mmmh whoever tried to reinstall hhvm on a few servers, that went wrong [18:42:11] <_joe_> jgage: this is an outage [18:42:16] ok [18:42:27] <_joe_> . [18:42:52] RECOVERY - HHVM processes on mw1029 is OK: PROCS OK: 1 process with command name hhvm [18:43:13] I imagine elasticsearch is also causing you to get paged. ottomata can you schedule the downtime before you do the things? [18:43:17] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.170 second response time [18:43:22] rather, I guess it'd be cmjohnson too [18:43:27] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [18:43:31] aye, not a bad idea. :) [18:43:31] Ugh [18:43:36] um, are there any more to do after this though? [18:43:37] Could https://gerrit.wikimedia.org/r/#/c/170011/ be deployed please? [18:43:46] all pages have been for hhvm [18:44:12] jgage: https://gerrit.wikimedia.org/r/#/c/170011/ is supposed to stop those pages, something had critical=>true that shouldn't have had it [18:44:19] But it seems to either not have been deployed or not be working [18:44:20] cool ok [18:44:32] _joe_: Did you never end up deploying https://gerrit.wikimedia.org/r/#/c/170011/ ? Or does it just not work? [18:44:41] ms-fe paged too [18:45:33] huh, i haven't gotten that one. perhaps it's queued somewhere. [18:45:50] I doubt anyone's even noticed it till I mentioned it, it's in the midst of paging spam [18:46:01] <_joe_> RoanKattouw: not now please [18:46:01] <_joe_> 1 min [18:46:03] I didn't get that one either [18:46:05] Sure [18:46:08] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [18:48:25] bblack: I did not get ms-fe either ... [18:48:30] only HHVM [18:48:47] RECOVERY - DPKG on mw1024 is OK: All packages OK [18:48:59] RECOVERY - DPKG on mw1023 is OK: All packages OK [18:49:16] my ms-fe page also shows a datestamp of Wed Sept 10 09:06:34 UTC, and says it's CRITICAL, but I don't see ms-fe wrong in icinga [18:49:18] RECOVERY - DPKG on mw1022 is OK: All packages OK [18:49:18] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 66193 bytes in 1.745 second response time [18:49:19] wtf? [18:49:30] something's a week in the future [18:49:40] Sep 10 ?
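(jgage's ask above — schedule downtime so planned work stops paging — maps onto Icinga's external command pipe. A sketch, assuming the stock Debian pipe location and Icinga 1.x command syntax; the host, author, and comment strings are illustrative:

    # Sketch: two hours of fixed downtime for a host and all its services.
    CMD=/var/lib/icinga/rw/icinga.cmd   # assumed default pipe path
    HOST=elastic1009
    NOW=$(date +%s); END=$((NOW + 7200))
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;cmjohnson;swapping SSDs\n' \
      "$NOW" "$HOST" "$NOW" "$END" > "$CMD"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s;%s;1;0;7200;cmjohnson;swapping SSDs\n' \
      "$NOW" "$HOST" "$NOW" "$END" > "$CMD"

The same thing can be done per host from the Icinga web UI, which is usually quicker for one-offs.)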
[18:49:50] RECOVERY - puppet last run on elastic1015 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:49:58] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.602 second response time [18:50:15] smells of something being stuck in the queue since the problems we had back then [18:50:17] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [18:50:27] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 66192 bytes in 0.226 second response time [18:50:27] RECOVERY - HHVM processes on mw1053 is OK: PROCS OK: 1 process with command name hhvm [18:50:28] not to mention Sept 10 is a monday this year [18:50:33] <_joe_> RoanKattouw: I have my theory [18:50:42] _joe_: should be fine everywhere now [18:50:44] <_joe_> RoanKattouw: icinga config is broken since forever [18:50:59] <_joe_> so my change didn't go through [18:51:02] bblack: my cal says Wed [18:51:02] RECOVERY - DPKG on mw1053 is OK: All packages OK [18:51:03] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:51:10] bblack: Sept .. not Nov :P [18:51:15] oh right [18:51:22] _joe_: mw1063 has the wrong package [18:51:22] RECOVERY - DPKG on mw1163 is OK: All packages OK [18:51:27] RECOVERY - NTP on elastic1015 is OK: NTP OK: Offset -0.02422082424 secs [18:51:37] I guess I should figure out how months work :p [18:51:42] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 31, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 31 [18:51:43] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [18:51:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:51:46] so I'm getting pages from several weeks ago [18:51:47] :-) [18:51:52] RECOVERY - HHVM processes on mw1163 is OK: PROCS OK: 1 process with command name hhvm [18:52:05] yeah I think it was around the time globalsms was having issues [18:52:16] oh, even dumber. I'm not, it's just a combination of the stupid message-id thing with our t-mobile pages and how my android SMS client groups them [18:52:30] they're showing up paired with re-used sender IDs from totally unrelated messages in the past [18:52:36] <_joe_> ori: I had to fix a few apt nuances around [18:52:38] <_joe_> yes now it is [18:52:53] heh [18:52:57] <_joe_> ori: mw1063? [18:53:06] oh, it alerted earlier [18:53:08] i guess unrelated [18:53:19] <_joe_> ori: 1163 has just been reinstalled [18:53:23] so the 9xxx "from" numbers on the t-mobile pages apparently have started being reused now, which is even worse than when they were all unique for every message [18:53:41] _joe_: can we go back to 10%?
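(A quick way to find hosts like mw1063 that alert with the wrong package build: since every app server runs a salt-minion — the PROCS checks above watch for it — a version audit can be run from the salt master. A sketch; the target glob, and the assumption that minion IDs match these hostnames, are mine:

    # Sketch: report the installed hhvm version on every mw10xx minion.
    salt 'mw10*' pkg.version hhvm

    # Compare installed vs. candidate version from the apt repo.
    salt 'mw10*' cmd.run 'apt-cache policy hhvm | grep -E "Installed|Candidate"'

Any host whose installed version differs from the candidate is one the reinstall missed.)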
[18:53:46] <_joe_> ori: mw1021 still is on the wrong version though [18:53:55] RECOVERY - NTP on elastic1016 is OK: NTP OK: Offset -0.0165220499 secs [18:54:29] (03PS2) 10Alexandros Kosiaris: Introduce postgres role for labsdb1004, labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/170077 [18:54:53] <_joe_> ori: without a patch? [18:55:10] <_joe_> well, if we have one by tomorrow... [18:55:19] <_joe_> I don't see the point though [18:55:27] to relieve pressure on the zend [18:55:30] app servers [18:56:18] <_joe_> but really up to you [18:56:30] <_joe_> I am bailing out now [18:56:34] OK [18:56:36] thanks for the help [18:56:37] (03PS1) 10Dzahn: domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 [18:56:40] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce postgres role for labsdb1004, labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/170077 (owner: 10Alexandros Kosiaris) [18:56:41] <_joe_> ori: they are at 67% load [18:56:52] <_joe_> which is not optimal but not so bad [18:56:55] <_joe_> but, please let me go to dinner [18:57:01] they = zend or hhvm? [18:57:02] ok [18:57:08] <_joe_> since this is temporarily solved [18:57:14] yes, go [18:57:36] <_joe_> we can discuss this later [18:58:51] (03CR) 10JanZerebecki: [C: 031] domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 (owner: 10Dzahn) [18:58:52] you actually have to get off the computer if you want to get off the computer :P [18:59:10] <_joe_> ori: zend, hhvm is at 10% [18:59:32] Zen Master Ori [18:59:54] <_joe_> I've been substituted with a lord of the rings movie [19:00:00] <^d> Gah, bouncer died. [19:00:05] Zend Master [19:00:16] mutante: Thanks for taking it to the next level :) [19:01:32] <^d> cmjohnson: 9-12 are empty and ready for shooting (dunno if manybubbles mentioned it, lost all scrollback and so forth) [19:01:53] nope...cool..then I will get them now [19:01:55] thx [19:02:08] ottomata: 3 still need hyperthreading. poke. poke. :) [19:02:09] !log powering off elastic1009-1012 to replace ssds [19:02:13] Logged the message, Master [19:02:43] manybubbles: which ones? [19:02:55] ottomata: sorry - elastic1003 [19:03:00] hm. ok [19:03:02] it is empty? [19:03:11] <^d> yes. [19:03:11] <^d> and 1015 and 1016 need noatime, also empty. [19:03:34] oh, cmjohnson probably doesn't know that part [19:03:47] cmjohnson: [19:03:51] if you are doing elasticsearch installs [19:03:59] you need to do this before you run puppet (and elasticsearch gets started) [19:04:00] https://wikitech.wikimedia.org/wiki/Search#Adding_new_nodes [19:04:18] okay...yeah didn't know that part [19:05:15] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:25] PROBLEM - Host elastic1009 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:05:48] PROBLEM - Host elastic1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:25] cmjohnson: if you want, you can do those steps on the 1015 and 1016 [19:06:26] or I can [19:06:27] lemme know [19:06:38] (03PS1) 10Ori.livneh: webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 [19:06:42] oh, yeah, if you do that before puppet runs, you won't have to stop/start elasticsearch, obviously [19:06:54] PROBLEM - Host elastic1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:01] (03PS2) 10Ori.livneh: webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 [19:07:03] greg-g: MaxSem: woot, thanks!
my deployment keys thing is resolved now [19:08:52] (03PS3) 10Ori.livneh: webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 [19:09:09] ottomata: go ahead and do it..gonna take me a few to swap out the disks [19:10:15] k [19:12:35] PROBLEM - Host elastic1016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:13:04] thats me [19:13:07] didn't do downtime :/ [19:13:17] (03PS2) 10Dzahn: domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 [19:13:26] PROBLEM - Host elastic1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:39] <^d> yay, feels better to be on the bouncer. [19:14:37] (03CR) 10John F. Lewis: [C: 031] domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 (owner: 10Dzahn) [19:18:27] RECOVERY - Host elastic1015 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [19:18:28] RECOVERY - Host elastic1016 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [19:19:49] ok, sorry bout that ^d, 15 and 16 should be good now [19:19:55] and 03 as well [19:20:12] <_joe_> jgage: I removed pages for hhvm this morning [19:20:27] <_joe_> jgage: but osmium with puppet failing made neon fail [19:20:32] (03CR) 10Dzahn: [C: 032] domainproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/170117 (owner: 10Dzahn) [19:20:39] <_joe_> so, just fixed it in the db [19:20:40] <_joe_> and rerunning puppet on neon [19:22:24] (03CR) 10Ori.livneh: [C: 032] webperf: annotate navtiming data in graphite with 'https'/'http' [puppet] - 10https://gerrit.wikimedia.org/r/170121 (owner: 10Ori.livneh) [19:22:37] (03PS1) 10Spage: Typo in labs wgContentHandlerUseDB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) [19:23:46] _joe_, mw1032 is ready for repooling, lemme know if I should [19:23:49] that's the state it is in [19:25:21] ^d: any chance for https://bugzilla.wikimedia.org/show_bug.cgi?id=68452 being implemented ? [19:26:55] <^d> I honestly hadn't thought about it recently. [19:27:07] <_joe_> ottomata: not now thanks! [19:27:12] <_joe_> and don't go on with more [19:27:48] _joe_: why not repool it? [19:28:25] ^d: too busy with cirrus to big wikis ? it can wait, just spare some brain resources to it in near future please :) [19:28:48] <_joe_> ori: because I don't think I installed the wmf3 packages there [19:28:57] _joe_: oh, right. all right. [19:29:05] ok, _joe_, i will update etherpad, lemme know when I should continue (tomorrow or whenever, or just finish it up when you are ready) [19:29:55] <_joe_> ottomata: I hope tomorrow we will [19:30:13] (03CR) 10Dzahn: [C: 031] "+1 , pending approval" [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [19:30:41] k [19:32:05] (03CR) 10Dzahn: [C: 031] "+1 , pending wait period (i guess)" [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [19:33:54] <^d> ottomata: Everything looks good, it's just 1009-12 to go :) [19:34:09] <^d> All other 27 nodes pooled. [19:34:38] coool, i think cmjohnson is putting the new drives now, ja? [19:34:43] on those [19:35:00] PROBLEM - NTP on elastic1015 is CRITICAL: NTP CRITICAL: Offset unknown [19:35:40] (03PS1) 10Spage: $wgContentHandlerUseDB true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) [19:37:48] (03CR) 10Spage: [C: 04-2] "I'm putting this up for review, but blockers of bug 49193 need to be addressed before deploying."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [19:38:18] (03CR) 10Reedy: [C: 031] "lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [19:38:22] RECOVERY - NTP on elastic1015 is OK: NTP OK: Offset -0.0147947073 secs [19:39:38] spagewmf: I'm guessing you're meaning the security bug as a blocker? [19:41:15] (03CR) 10Reedy: "I'm guessing you mean the security bug 70901 ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [19:41:41] Reedy yes. Flow and some other projects want ContentHandlerUseDB, but csteipp needs to feel good about it. I'm unclear if there are other blockers, like permissions adjustments or a new right. [19:42:13] (03PS1) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [19:42:31] spagewmf: Daniel Kinzler is probably the best person to comment on that with him writing it :) [19:42:35] Added him as a reviewer [19:42:50] (03CR) 10jenkins-bot: [V: 04-1] Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [19:43:22] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [19:43:32] RECOVERY - Host elastic1009 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [19:43:37] RECOVERY - Host elastic1011 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [19:43:41] (03PS2) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [19:44:24] Reedy: thanks. Also I don't know if some wiki doesn't want this, e.g. it uses some extension with a dangerous content format [19:44:52] RECOVERY - Host elastic1012 is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms [19:45:13] PROBLEM - DPKG on elastic1010 is CRITICAL: Connection refused by host [19:45:22] PROBLEM - check if dhclient is running on elastic1010 is CRITICAL: Connection refused by host [19:45:24] PROBLEM - check configured eth on elastic1010 is CRITICAL: Connection refused by host [19:45:25] PROBLEM - ElasticSearch health check for shards on elastic1010 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.142:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:45:33] PROBLEM - check if salt-minion is running on elastic1010 is CRITICAL: Connection refused by host [19:45:33] PROBLEM - RAID on elastic1010 is CRITICAL: Connection refused by host [19:45:53] PROBLEM - puppet last run on elastic1010 is CRITICAL: Connection refused by host [19:46:02] PROBLEM - check if dhclient is running on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:03] PROBLEM - RAID on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - check if salt-minion is running on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - check configured eth on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - puppet last run on elastic1011 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - Disk space on elastic1010 is CRITICAL: Connection refused by host [19:46:03] PROBLEM - SSH on elastic1010 is CRITICAL: Connection refused [19:46:04] PROBLEM - DPKG on elastic1011 is CRITICAL: Connection refused by host [19:46:06] spagewmf: i can't think of a reason for it not to have $wgContentHandlerUseDB = true [19:46:12] PROBLEM - ElasticSearch health check on 
elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [19:46:12] PROBLEM - Disk space on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:13] PROBLEM - check if salt-minion is running on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:22] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [19:46:22] PROBLEM - ElasticSearch health check for shards on elastic1009 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.141:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:46:23] PROBLEM - SSH on elastic1009 is CRITICAL: Connection timed out [19:46:27] it was there in the beginning to allow schema changes [19:46:33] PROBLEM - check configured eth on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:33] PROBLEM - SSH on elastic1011 is CRITICAL: Connection timed out [19:46:34] PROBLEM - ElasticSearch health check for shards on elastic1011 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.143:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:46:34] PROBLEM - puppet last run on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:42] PROBLEM - DPKG on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:42] PROBLEM - check if dhclient is running on elastic1011 is CRITICAL: Timeout while attempting connection [19:46:43] and as a new feature in mw1.21, to be tried in a more limited manner [19:46:52] PROBLEM - RAID on elastic1009 is CRITICAL: Timeout while attempting connection [19:46:55] but we are not in 1.25 :) [19:47:01] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.141 [19:47:02] PROBLEM - Disk space on elastic1011 is CRITICAL: Timeout while attempting connection [19:47:07] now* [19:47:22] PROBLEM - check configured eth on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:32] PROBLEM - Disk space on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:38] PROBLEM - RAID on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:42] PROBLEM - SSH on elastic1012 is CRITICAL: Connection timed out [19:47:44] PROBLEM - check if dhclient is running on elastic1012 is CRITICAL: Timeout while attempting connection [19:47:44] PROBLEM - puppet last run on elastic1012 is CRITICAL: Timeout while attempting connection [19:48:13] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [19:48:13] PROBLEM - check if salt-minion is running on elastic1012 is CRITICAL: Timeout while attempting connection [19:48:13] PROBLEM - DPKG on elastic1012 is CRITICAL: Timeout while attempting connection [19:48:13] PROBLEM - ElasticSearch health check for shards on elastic1012 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.144:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:48:43] RECOVERY - SSH on elastic1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:48:48] RECOVERY - SSH on elastic1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:48:48] RECOVERY - SSH on elastic1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:49:26] (03CR) 10Dzahn: [C: 032] ""up to a billion change and 999 patchsets" sounds enough indeed, also already tested by hashar :)" [puppet] - 
10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) (owner: 10Hashar) [19:49:33] RECOVERY - SSH on elastic1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:50:13] aude: I agree it should be fine, but what if some wiki+obscure extension permits setContentFormat( 'Main_Page', 'sql-query') 8-) [19:51:24] * aude can't imagine that ;) [19:51:43] ottomata: elastics are finished [19:51:53] after all, content handler actually works fine without the setting [19:52:02] but you are stuck with one content type per namespace in that case [19:52:44] or "file extension" (e.g. the .js pages in MediaWiki ns) [19:53:39] and have scribunto using it etc. already [19:54:29] aude: yes, without this making a talk page into a Flow board is a hack. FYI Flow will have a Special page to convert from wiki page to Flow board. [19:54:51] * aude nods [19:55:13] anyway, suppose chris can review the config change [19:58:43] PROBLEM - NTP on elastic1010 is CRITICAL: NTP CRITICAL: No response from NTP server [20:00:02] PROBLEM - NTP on elastic1009 is CRITICAL: NTP CRITICAL: No response from NTP server [20:00:02] PROBLEM - NTP on elastic1011 is CRITICAL: NTP CRITICAL: No response from NTP server [20:00:52] PROBLEM - NTP on elastic1012 is CRITICAL: NTP CRITICAL: No response from NTP server [20:10:16] (03PS1) 10Dzahn: remove en2.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/170138 [20:11:35] (03CR) 10Dzahn: "hrmm "inurl:en2.wikipedia.org" About 636 results" [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [20:12:49] (03CR) 10Dzahn: "https://www.google.com/search?q=en2.wikipedia.org#q=inurl:en2.wikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [20:14:29] (03PS3) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [20:14:31] (03PS1) 10Cscott: De-lint modules/ocg/manifests/decommission.pp. [puppet] - 10https://gerrit.wikimedia.org/r/170140 [20:18:45] (03PS1) 10Dzahn: wikipedia.org: sort other services alphabetically [dns] - 10https://gerrit.wikimedia.org/r/170145 [20:20:35] (03CR) 10Dzahn: [C: 032] wikipedia.org: sort other services alphabetically [dns] - 10https://gerrit.wikimedia.org/r/170145 (owner: 10Dzahn) [20:21:45] Reedy: ^ and that also gave us "mai.wp" [20:21:54] Reedy: you can go ahead with installing the wiki if you like [20:21:57] lol [20:21:58] (03CR) 10John F. Lewis: [C: 031] "Interesting. The link Daniel gave is mostly just tracers or archives from what I see." [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [20:21:58] whee [20:22:08] yea:) needed a reason that actually touches the zonefile [20:22:17] if we touch only lang template.. doesn't update [20:22:20] what, a new wiki? [20:22:26] indeed [20:22:27] aude: yes, "mai" [20:22:30] mutante: gonna update wikitech? [20:22:37] Reedy: ok, yea [20:22:39] ooooh, populate sited ;) [20:22:41] site* [20:22:46] s [20:23:22] (03CR) 10Dzahn: "was trying to find a reason to touch wikipedia.org to make the zones get regenerated to get "mai.wikipedia.org" active :)" [dns] - 10https://gerrit.wikimedia.org/r/170145 (owner: 10Dzahn) [20:34:07] <^d> Hmm, wonder why the elastics are all complaining about ntp.
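(On ^d's closing question: the elastic NTP criticals above ("No response from NTP server") are the usual post-reinstall pattern — ntpd needs a few minutes after first boot before it answers queries, and the earlier elastic1015/1016 NTP checks did recover on their own. A sketch for confirming by hand; the plugin path is the Debian default and an assumption about this setup:

    # Sketch: check NTP state on a freshly reinstalled node, e.g. elastic1010.
    ntpq -pn    # a '*' in the first column marks a selected peer

    # Same style of offset check the monitoring runs:
    /usr/lib/nagios/plugins/check_ntp_time -H localhost -w 0.5 -c 1

If no peer is selected yet, waiting for ntpd to sync is usually all that is needed.)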
[20:49:34] (03PS2) 10Krinkle: contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:49:44] (03PS3) 10Krinkle: contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:49:49] (03CR) 10Krinkle: [C: 031] contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:51:25] (03CR) 10Hashar: "Danke mutante :-)" [puppet] - 10https://gerrit.wikimedia.org/r/170070 (https://bugzilla.wikimedia.org/47609) (owner: 10Hashar) [20:56:04] (03CR) 10Dzahn: [C: 031] "lgtm, we already use graphviz in several places (hhvm,toollabs,bugzilla,librenms,..). mini typo in the comment: Graphiz vs. Graphviz" [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [20:58:13] (03CR) 10Dzahn: [C: 031] "13:02 < Reedy> it was "load balancing"" [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [21:05:35] (03CR) 10Gage: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/169728 (owner: 10BryanDavis) [21:06:39] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [21:08:51] (03PS1) 10QChris: For ganglia views, use description as name only if it is set [puppet] - 10https://gerrit.wikimedia.org/r/170160 [21:15:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:19:05] (03CR) 10Krinkle: "Hm.. from what I heard this was no longer needed?" [puppet] - 10https://gerrit.wikimedia.org/r/145997 (https://bugzilla.wikimedia.org/67957) (owner: 10Ori.livneh) [21:19:36] https://wikitech.wikimedia.org/wiki/Special:NewFiles [21:20:00] (03CR) 10Hashar: "Style suggestion inline https://gerrit.wikimedia.org/r/#/c/170140/1/modules/ocg/manifests/decommission.pp :D" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [21:21:12] (03PS1) 10Ori.livneh: webperf/navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 [21:21:45] mutante: nice graph there :p [21:22:05] hashar: mediawiki-core-bundle-rubocop? [21:22:19] (03CR) 10jenkins-bot: [V: 04-1] webperf/navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 (owner: 10Ori.livneh) [21:22:28] JohnLewis: https://wikitech.wikimedia.org/wiki/File:Weegees_Army%27s.PNG [21:22:34] (03CR) 10QChris: "It seems at least the varnishkafka, and kafkatee ganglia" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170003 (owner: 10Alexandros Kosiaris) [21:22:34] ori: that is a linter for ruby. Should be non voting [21:22:49] but mediawiki-core tests are so slow already, why add something of such small value? [21:22:52] ori: it is a work in progress by Zeljkof and Dan [21:23:09] mutante: I prefer the graph. (seriously - wtf is that?) [21:23:18] jenkins is really, really, really slow -- we should be scrupulous about what we add [21:23:38] JohnLewis: i would like to have a gallery of just the contribs of one user [21:23:52] ^d are elastic1009-12 good? [21:24:30] JohnLewis: https://wikitech.wikimedia.org/wiki/Special:Contributions/WikitechAnio243 [21:24:37] ori: Jenkins is not slow. The mediawiki/core PHPUnit test suite is and it needs to be given some love. The Rubocop one is quite fast and all jobs are run in parallel so that is not really slowing anything. 
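(Since the Rubocop job is meant to be a fast, non-voting lint, the cheapest way to keep it green is to run the same check locally before pushing. A sketch, assuming — as the job name mediawiki-core-bundle-rubocop suggests — that the checkout's Gemfile pins rubocop:

    # Sketch: run the CI Ruby lint locally in a mediawiki/core clone.
    cd core                              # hypothetical local checkout path
    bundle install --path vendor/bundle  # install the pinned gems locally
    bundle exec rubocop                  # lint with the repo's configuration)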
[21:25:38] mutante: they're... definitely... creative? [21:26:00] (03PS2) 10Ori.livneh: webperf: navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 [21:26:54] sorry but https://wikitech.wikimedia.org/wiki/File:Wikipedia_block.PNG is top work [21:27:06] JohnLewis: bonus for trying .exe ?:P [21:27:07] jorm ^^ you have competition ;) [21:27:40] mutante: .exe2! It's twice as good as .exe [21:27:47] lol, right [21:27:56] (03CR) 10Hashar: "OCG on beta seems to have two instances for now: deployment-pdf01 and deployment-pdf02 do you plan to convert them both to Jenkins slaves?" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [21:28:16] .com all the way! [21:28:23] double rot13 encoding [21:28:29] don't delete them mutante - keep them to show even Wikitech can be an awesome wiki :D [21:29:59] ok :) so what i really wanted was the warning icon thing on wikitech :) [21:30:00] <^d> cmjohnson: Good question, lemme check. [21:30:02] <^d> (was in meeting) [21:30:15] that made me look at the Special page [21:30:52] <^d> cmjohnson: Still need initial puppet run, I'm getting password prompt. [21:30:59] okay [21:31:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:31:22] JohnLewis: this one bugs me though https://wikitech.wikimedia.org/wiki/File:788029901.jpg :o [21:31:53] mutante: Comment: 'oops 404' [21:33:32] ooh, Internal Error on delete action [21:34:22] (03CR) 10Ori.livneh: [C: 032] webperf: navtiming: add ssl negotiation time [puppet] - 10https://gerrit.wikimedia.org/r/170167 (owner: 10Ori.livneh) [21:35:54] (03CR) 10Gage: [C: 031] "In general: puppetmaster's puppet.log includes diffs, so it seems that for example updating a password hash from the private puppet repo w" [puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: 10BryanDavis) [21:35:59] mutante: coincidence? I think not! [21:37:08] what the hell is that a logo for? [21:37:36] JohnLewis: i don't know.. [21:37:44] JohnLewis: i dont think i have deleted images on wikitech before [21:37:55] just uploaded them [21:39:08] jorm: it is the next Wikipedia logo of course [21:39:43] Clearly. [21:41:56] (03PS1) 10Chad: Adjust number of content shards for largest wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170173 [21:43:09] mutante: https://github.com/wikimedia/operations-mediawiki-config/commit/d4d53e82730f142057caa4f952d2bbb667bf7ab9 [21:43:38] Does Wikitech log to files or anything useful at all? [21:44:53] Reedy: shouldn't it inherit the cluster defaults for logging? [21:45:09] JohnLewis: Well, yeah [21:45:23] so why ask? :p [21:45:31] Many reasons [21:46:07] good point [21:46:16] As we found when moving it to the same code/config base, virt1000 can't access many things on the cluster [21:51:05] (03PS1) 10Ori.livneh: Update my (=ori's) deployment script [puppet] - 10https://gerrit.wikimedia.org/r/170177 [21:55:23] <^d> Reedy: Would https://phabricator.wikimedia.org/P50 work how I expect it to? [21:55:39] <^d> (it's been awhile since I've played with +'s in init settings) [21:56:13] For commons it would [21:56:20] as there's no key conflict [21:56:23] for en/de... [21:57:11] I think it should override.. should be default, group, dbname for precedence I think [21:57:32] live hack on tin, sync-common, eval.php? [21:59:28] <^d> Yeah i'm gonna have to. [22:04:26] <^d> Reedy: I want + on all 3. 
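(The '+' being discussed is the wmf-config convention where a '+'-prefixed key in a settings array merges with the inherited value instead of replacing it. The "live hack on tin, sync-common, eval.php" loop mentioned above is how to see what a given wiki actually resolves; a sketch, with the global name taken from the shard-count change under discussion but not verified here:

    # Sketch: inspect the merged value of a '+'-prefixed setting on one host.
    sync-common                            # pull the live-hacked config
    mwscript eval.php --wiki=commonswiki
    # at the eval.php prompt:
    #   > var_dump( $wgCirrusSearchShardCount );)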
[22:04:38] <^d> Like I thought [22:05:57] (03PS1) 10Chad: Adjust number of replicas for de/enwiki content and commons file index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170182 [22:09:44] (03CR) 10BryanDavis: "> However from modules/puppetmaster/lib/puppet/reports/logstash.rb:34 it seems that this patch would only send the message "Puppet run on " [puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: 10BryanDavis) [22:16:09] (03CR) 10Ori.livneh: [C: 032] Update my (=ori's) deployment script [puppet] - 10https://gerrit.wikimedia.org/r/170177 (owner: 10Ori.livneh) [22:19:22] ^d: I knew that much. Just whether it actually had the intended result ;) [22:20:42] (03CR) 10Daniel Kinzler: "If you turn this on, the page and revision tables have to have the xxx_model and xxx_format columns. Do all our wikis have these, now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [22:21:58] (03CR) 10Reedy: "Yup, see https://bugzilla.wikimedia.org/show_bug.cgi?id=49193#c14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [22:23:13] (03PS1) 10Ori.livneh: hhvm: remove 'hhvm-watch-mem' script [puppet] - 10https://gerrit.wikimedia.org/r/170188 [22:36:14] (03PS2) 10Ori.livneh: hhvm: remove 'hhvm-watch-mem' script [puppet] - 10https://gerrit.wikimedia.org/r/170188 [22:38:10] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: remove 'hhvm-watch-mem' script [puppet] - 10https://gerrit.wikimedia.org/r/170188 (owner: 10Ori.livneh) [22:45:53] (03PS1) 10Rush: phab rename ext_ref as Reference [puppet] - 10https://gerrit.wikimedia.org/r/170237 [22:50:03] <^d> Only swat so far is mine, I can do it. [22:58:51] <^d> ^d: ping for swat [22:59:00] <^d> yo ^d, I'm here. [22:59:10] <^d> ok cool, i'm going to start merging your patches. [22:59:18] <^d> sounds good ^d, thx [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, ^d: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141030T2300). [23:00:11] <^d> jouncebot: ^d and I are doing swat :D [23:00:45] is there a brad here? :) [23:00:57] <^d> tab-complete says no [23:03:00] <^d> you know, swat is less fun when you're doing it for yourself. [23:04:14] (03PS2) 10Gage: logstash: reformat gelf filter config [puppet] - 10https://gerrit.wikimedia.org/r/169728 (owner: 10BryanDavis) [23:04:16] (03PS2) 10Gage: logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [23:04:18] (03PS1) 10Gage: Logstash: Hadoop: drop INFO-level messages to save space [puppet] - 10https://gerrit.wikimedia.org/r/170244 [23:04:35] hmm that's not what i expected to happen [23:04:49] trying to submit a patch which depends on those [23:05:22] jgage: you probably rebased [23:05:34] ach, yeah [23:05:45] just merge them all :) [23:05:48] hehe ok [23:05:56] i figured you'd be doing the others soon anyway [23:06:04] <^d> rebase is like giving a shotgun to a kid who's trying to fight off bears. [23:06:10] <^d> yeah, it may help them keep the bears away [23:06:11] I won't since I'm not a root ;) [23:06:17] <^d> but they're probably also going to shoot themselves. [23:06:57] wikitech says "don't merge, rebase!" 
:P [23:07:11] but maybe that doesn't apply to this scenario [23:07:28] gerrit has made me a huge fan of interactive rebasing to maintain fake feature branches [23:07:32] (03CR) 10Gage: [C: 032] logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [23:07:46] (03CR) 10Gage: [C: 032] logstash: reformat gelf filter config [puppet] - 10https://gerrit.wikimedia.org/r/169728 (owner: 10BryanDavis) [23:07:57] (03CR) 10Gage: [C: 032] Logstash: Hadoop: drop INFO-level messages to save space [puppet] - 10https://gerrit.wikimedia.org/r/170244 (owner: 10Gage) [23:08:08] !log demon Synchronized php-1.25wmf5/extensions/CirrusSearch: (no message) (duration: 00m 05s) [23:08:15] Logged the message, Master [23:08:18] !log demon Synchronized php-1.25wmf6/extensions/CirrusSearch: (no message) (duration: 00m 04s) [23:08:23] Logged the message, Master [23:08:48] ok, all sorted [23:09:02] * ^d declares swat a resounding success [23:09:04] <^d> go me [23:11:06] (03CR) 10Dzahn: [C: 032] contint: Graphviz on Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [23:14:08] (03CR) 10Dzahn: "gallium: ii graphviz 2.26.3-10ubuntu1.1" [puppet] - 10https://gerrit.wikimedia.org/r/170012 (https://bugzilla.wikimedia.org/72454) (owner: 10Hashar) [23:16:33] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=73%): [23:17:25] (03CR) 10Cscott: "I thought I'd switch deployment-pdf02 to the role::ocg::beta first, and get it working first, before switching deployment-pdf01 (which is " [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [23:18:37] (03CR) 10Dzahn: [C: 031] "DNS has been added & this looks reasonable (though i don't speak 'mai')" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [23:18:44] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=73%): [23:24:37] (03PS1) 10Spage: Enable Flow on mw.org usability research test page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170250 [23:43:33] (03PS1) 10Gage: logstash: hadoop: reenable gelf output [puppet] - 10https://gerrit.wikimedia.org/r/170252 [23:44:28] (03CR) 10Gage: [C: 032] logstash: hadoop: reenable gelf output [puppet] - 10https://gerrit.wikimedia.org/r/170252 (owner: 10Gage) [23:50:45] (03PS1) 10Dzahn: puppetception - lint [puppet] - 10https://gerrit.wikimedia.org/r/170256 [23:51:01] (03PS2) 10Dzahn: puppetception - lint [puppet] - 10https://gerrit.wikimedia.org/r/170256
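(On the merge-versus-rebase exchange above: for a stack of dependent Gerrit changes like the logstash series, interactive rebase keeps the chain linear so Gerrit tracks each commit as a single change. A sketch; branch names are illustrative:

    # Sketch: refresh a local stack of dependent changes against the
    # puppet repo's production branch without merge commits.
    git fetch origin
    git checkout logstash-stack          # hypothetical local branch
    git rebase -i origin/production      # re-anchor and edit the stack
    git push origin HEAD:refs/for/production   # resubmit to Gerrit

Rebasing someone else's open change in the Gerrit UI, as happened above, creates a new patchset and drops the existing approvals, which is why merging the whole series at once was the simpler exit here.)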