[16:21:22] (03CR) 10Ori.livneh: [C: 032 V: 032] "kept the .py file in place" [puppet] - 10https://gerrit.wikimedia.org/r/219102 (owner: 10Ori.livneh) [16:21:33] RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0] [16:22:03] PROBLEM - Varnishkafka Delivery Errors per minute on cp3031 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:22:15] (03CR) 10Ottomata: [C: 032] Update jmxtrans module for jmxtrans release v250 [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/219391 (owner: 10Ottomata) [16:24:31] (03PS20) 10Paladox: Add json, erb and less highlight support to gitblit [puppet] - 10https://gerrit.wikimedia.org/r/216421 [16:25:01] (03CR) 10Paladox: "@Dzahn all tested and works." [puppet] - 10https://gerrit.wikimedia.org/r/216421 (owner: 10Paladox) [16:25:44] RECOVERY - Varnishkafka Delivery Errors per minute on cp3031 is OK Less than 1.00% above the threshold [0.0] [16:25:58] 6operations, 10Deployment-Systems, 7HHVM, 15User-Bd808-Test: Scap should restart HHVM - https://phabricator.wikimedia.org/T103008#1382863 (10Joe) [16:26:34] PROBLEM - Varnishkafka Delivery Errors per minute on cp1073 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:27:17] <_joe_> btw,/win 25 [16:27:20] <_joe_> argh [16:27:40] heheh [16:30:23] RECOVERY - Varnishkafka Delivery Errors per minute on cp1073 is OK Less than 1.00% above the threshold [0.0] [16:30:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:32:44] RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0] [16:33:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3031 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:01] (03PS4) 10Alexandros Kosiaris: role::cache: Move inclusion of lvs::configuration from base [puppet] - 10https://gerrit.wikimedia.org/r/217544 
[16:34:03] (03PS6) 10Alexandros Kosiaris: lvs::configuration: Kill realm case checks [puppet] - 10https://gerrit.wikimedia.org/r/217289 [16:35:05] RECOVERY - Varnishkafka Delivery Errors per minute on cp3031 is OK Less than 1.00% above the threshold [0.0] [16:35:32] arghhhhh somebody do something about these varnishkafka alerts pleaaaaaaaaase [16:35:44] what ori said. [16:35:46] ottomata: ^ [16:36:46] I could mass-ack them with "this shit is broken" [16:37:10] well downtime not ack I guess, since they flap [16:38:13] it may be that i'm much more distraction-prone than other people, but it still baffles me sometimes that others don't seem to mind this cognitive litter [16:38:36] no we already brought it up once today [16:38:47] it makes it hard to notice the real CRITICALs that we actually care about [16:39:00] the numbers are especially annoying -- a number in an alert commands attention [16:39:15] but the numbers in the graphite anomaly alerts are almost always goofy -- useless and weirdly specific [16:39:22] marktraceur: tgr|away gilles I'm disabling NFS for the multimedia project and bringing it back up at the moment. Let me know if you want to recover some files from there [16:39:44] PROBLEM - Translation cache space on mw1099 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:04] that alert is going away once puppet runs on neon [16:40:23] RECOVERY - Translation cache space on mw1074 is OK: HHVM_TC_SPACE OK TC sizes are OK [16:40:23] PROBLEM - Translation cache space on mw1149 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [16:41:04] speaking of annoying alerts... [16:41:08] 73 Matching Service Entries Displayed [16:41:26] grumble grumble [16:41:58] should be 73.33333% Matching Service Entries [16:42:58] YuviPanda: We aren't using it at the moment anyway, but thanks [16:43:02] YuviPanda: thanks! 
AFAIK none of those machines need NFS [16:43:09] sweet [16:43:14] RECOVERY - Translation cache space on mw1099 is OK: HHVM_TC_SPACE OK TC sizes are OK [16:43:38] tgr: marktraceur sweet. do delete unused instances as well :) [16:45:22] 6operations, 10ops-codfw: cp2024 console + disk issues - https://phabricator.wikimedia.org/T103090#1382930 (10Papaul) I Called Dell an i was transfer like 4 times just for a replacement disk. And no one told me why I was transfer for. first person say how may I am help you and than when i tell them what i need... [16:49:04] so, am i going to install a kafka cluster in esams then? :p [16:51:06] but ja, yall are right. [16:51:07] hm. [16:51:55] the problem isn't a lack of kafka cluster at esams, it's that kafka sucks at using a link with real latency on it [16:52:00] or something like that [16:52:42] (03CR) 10Faidon Liambotis: "Thcipriani, how does that fit with RelEng's short/mid/long term deployment system plans & strategy? We haven't heard anything about this s" [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [16:52:44] documented, but that isn't kafka's problem, that is varnishkafka/librdkafka's problem. too much buffering [16:52:48] bad for caches [16:53:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp3030 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:53:19] or too little [16:53:26] or just bad protocol design? [16:53:29] who knows really [16:53:39] yah, it may be tunable. have tried and failed (a long tiem ago) [16:54:00] fwiw, I've been seeing intermittent cp10xx varnishkafka alerts too [16:54:00] but i have also been told that: A. esams latency shoudl be better. and B. Kafka should not be done cross DC [16:54:03] this week for example [16:54:04] yeah i see that too [16:54:10] !log updated/rebooted nescio/maerlant to 3.19 [16:54:12] all I know is we have tons of live prod-affecting traffic traversing those links just fine. 
the network is A-OK, but varnishkafka keeps bitching like the network is broke [16:54:12] which makes me less confident in my attitude :p [16:54:14] Logged the message, Master [16:54:22] clearly, it's not the network that is broke :) [16:54:40] e.g. now, cp1049 & cp1051 [16:54:51] WARNING: 11.11% of data above the warning threshold [0.0] [16:54:58] (whatever that means) [16:55:13] RECOVERY - Varnishkafka Delivery Errors per minute on cp3030 is OK Less than 1.00% above the threshold [0.0] [16:55:23] A. esams latency has a floor that we can't be too far from, speed of light through glass and all that [16:55:47] ori: what's with the TC cache alerts? [16:56:03] I don't know about B - is it true that kafka was implicitly designed to not be used across a high-latency link at all? [16:56:30] paravoid: https://gerrit.wikimedia.org/r/#/c/219102/ [16:56:48] they should go away once the change propagates to neon [16:56:48] hm, i think something is wrong with the alerting [16:56:50] ori: I thought you were restarting HHVM everytime these reached 100%? [16:56:55] paravoid: no [16:56:59] the drerr count for cp1049 has not increased in a long time [16:57:06] which means 0 drerr rate [16:57:12] it is using check graphite threshold [16:57:13] hm [16:57:34] been 0 drerrs on cp1049 since june 5 [16:57:37] paravoid: i did a couple of times, but i shouldn't have; it was a reaction to the alert spamming the channel [16:57:47] (03CR) 10Faidon Liambotis: "Don't forget to salt rm /usr/local/bin/check_tc_space if you haven't already." [puppet] - 10https://gerrit.wikimedia.org/r/219102 (owner: 10Ori.livneh) [16:58:10] ah will do [16:58:15] ori: so what happens once it fills? [16:58:20] garbage collects by itself? [16:58:24] HHVM SIGABRTs and restarts [16:58:27] lol how nice [16:58:41] it's gross but we'll fix it with https://phabricator.wikimedia.org/T103008 [16:58:51] "fix" it [16:59:08] has this been raised upstream? 
[16:59:18] yes [16:59:52] I mean, this plan could work for us, but certainly won't for their broader "open source" / "make it for everyone" strategy [16:59:53] (03PS5) 10Alexandros Kosiaris: lvs: Move the role manifests into the role module [puppet] - 10https://gerrit.wikimedia.org/r/217288 [16:59:55] but their take on it is to work hard to make repoauth easier for everyone [17:00:22] e.g. https://github.com/facebook/hhvm/commit/4bbae3bab9aae9647588637af5518f37f4091fc4 [17:00:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] lvs: Move the role manifests into the role module [puppet] - 10https://gerrit.wikimedia.org/r/217288 (owner: 10Alexandros Kosiaris) [17:00:27] how would repoauth fix this? [17:00:48] if that's their plan, it's a pretty bad plan [17:00:50] with RepoAuth there is no translation at run time [17:01:21] it is all done ahead of time [17:01:25] (if they really want to sell this to people running their own wordpress or mediawiki or something) [17:01:45] i think they'd like that but they're focusing on the big sites [17:01:53] (03PS6) 10Dzahn: WIP: switch misc cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/217214 [17:01:53] dailymotion.com is migrating now [17:01:57] 6operations, 10Wikimedia-Site-requests, 7Community-consensus-needed: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1382975 (10Glaisher) Why is this marked as #community-consensus-needed? As long as the existing preferences for users are not change... [17:02:13] I know, I sort of helped to plan the idea :P [17:02:16] server on the left is with hhvm: http://cl.ly/image/0b0O0U1f2z2u [17:02:21] paravoid: you did? 
[17:02:45] (03PS2) 10Dzahn: icinga: give 20after4 permissions in cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830) [17:03:49] hm 1073 actually had drerrs [17:05:09] (03PS4) 10Alexandros Kosiaris: Lint lvs::monitor [puppet] - 10https://gerrit.wikimedia.org/r/217545 [17:05:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Lint lvs::monitor [puppet] - 10https://gerrit.wikimedia.org/r/217545 (owner: 10Alexandros Kosiaris) [17:09:13] PROBLEM - Varnishkafka Delivery Errors per minute on cp1072 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:09:32] (03PS7) 10Dzahn: switch misc cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/217214 [17:09:39] (03PS8) 10Dzahn: switch misc cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/217214 [17:09:43] PROBLEM - Varnishkafka Delivery Errors per minute on cp1068 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:11:21] (03CR) 10Alexandros Kosiaris: [C: 032] lvs::balancer: Remove old absent system::role resource [puppet] - 10https://gerrit.wikimedia.org/r/217546 (owner: 10Alexandros Kosiaris) [17:11:25] (03PS4) 10Alexandros Kosiaris: lvs::balancer: Remove old absent system::role resource [puppet] - 10https://gerrit.wikimedia.org/r/217546 [17:11:30] (03CR) 10Alexandros Kosiaris: [V: 032] lvs::balancer: Remove old absent system::role resource [puppet] - 10https://gerrit.wikimedia.org/r/217546 (owner: 10Alexandros Kosiaris) [17:11:58] !log salt -t30 -G 'php:hhvm' cmd.run 'rm -f /usr/local/bin/check_tc_space' (https://gerrit.wikimedia.org/r/#/c/219102/) [17:12:03] Logged the message, Master [17:12:31] (03PS1) 10Ottomata: Increase critical threshold of varnishkafka drerr alert [puppet] - 10https://gerrit.wikimedia.org/r/219399 [17:12:45] ottomata: WARNINGs are also quite annoying [17:12:48] (03PS2) 10Ottomata: Increase critical threshold of varnishkafka drerr alert [puppet] - 10https://gerrit.wikimedia.org/r/219399 
[17:12:54] do warnings ping here? [17:12:55] paravoid? [17:12:56] they're not echoed here but I regularly watch icinga's page [17:13:03] RECOVERY - Varnishkafka Delivery Errors per minute on cp1072 is OK Less than 1.00% above the threshold [0.0] [17:13:16] if they are actually a problem to warn about, then we should fix that problem, right? [17:13:32] RECOVERY - Varnishkafka Delivery Errors per minute on cp1068 is OK Less than 1.00% above the threshold [0.0] [17:13:34] if it's actually a warning, it should be actionable [17:13:37] yes. but to do so would require very much time and possibly new hardware, and it is not a high priority. so they should not happen. [17:13:39] but they do. [17:13:56] new hardware why? [17:14:11] kafka brokers in esams [17:14:21] aye, and other DCs too, if necessary [17:14:27] (03CR) 10Alexandros Kosiaris: [C: 032] Lint lvs::balancer [puppet] - 10https://gerrit.wikimedia.org/r/217547 (owner: 10Alexandros Kosiaris) [17:14:32] (03PS4) 10Alexandros Kosiaris: Lint lvs::balancer [puppet] - 10https://gerrit.wikimedia.org/r/217547 [17:14:32] no, we already established there are similar warnings for eqiad caches as well [17:14:44] (03CR) 10Alexandros Kosiaris: [V: 032] Lint lvs::balancer [puppet] - 10https://gerrit.wikimedia.org/r/217547 (owner: 10Alexandros Kosiaris) [17:14:52] right now I see cp1056, cp1068, cp3009, cp3040 [17:15:04] two warnings, two crits [17:15:18] yes, but that doesn't mean we don't need remote kafka clusters. i think there may be multiple issues here, not sure. [17:15:25] that's why i said 'possibly' :) [17:15:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp3040 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:16:29] hm, paravoid lemme poke around, i might be able to make a smarter alert, one that alerts on produced data rather than delivery errors [17:16:38] ok [17:16:42] well, that isn't really smarter, but a work around. 
we want to know if there is a serious problem at the moment [17:17:01] and short bursts of dropped messages isn't a huge issue. it is not good and i should solve it, but there are other things [17:17:17] so, if i can alert on say, produce rate dropping to 0, then that would be good enough for now [17:17:34] yes, a hundred alerts alerting us that we need "very much time and possibly new hardware that is not a high priority" isn't great :P [17:18:02] PROBLEM - Varnishkafka Delivery Errors per minute on cp1073 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:18:07] and honestly, I don't buy this this whole "kafka over WAN is not supported", I don't see why kafka has to be latency sensitive [17:18:12] and we have evidence to the contrary, ^^^ [17:18:18] 6operations, 10Wikimedia-Site-requests, 7Community-consensus-needed: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1383024 (10tomasz) Most likely a relict of the past since this request was never rejected due to insufficient community consensus bu... 
[17:18:40] 6operations, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1383026 (10tomasz) [17:19:10] paravoid: i'm just going on what the creators and likely largest maintainers of kafka recommend there [17:19:30] the producers should not be the buffer [17:19:35] the brokers are meant to be the buffer of messages [17:19:43] RECOVERY - Varnishkafka Delivery Errors per minute on cp3040 is OK Less than 1.00% above the threshold [0.0] [17:19:44] so, high latency production means producers have to buffer [17:20:02] RECOVERY - Varnishkafka Delivery Errors per minute on cp1073 is OK Less than 1.00% above the threshold [0.0] [17:20:36] (03PS10) 10Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) [17:21:41] (03PS9) 10Dzahn: switch misc cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/217214 [17:21:43] well for 70ms or so, sure [17:23:14] (03CR) 10Dzahn: [C: 032] switch misc cluster to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/217214 (owner: 10Dzahn) [17:24:20] !log install linux 3.19 on restbase100[789] [17:24:24] Logged the message, Master [17:25:36] kart_: I'm moving the language project off NFS. Let me know if / when you want it back on. Thanks [17:25:41] mutante: wikistats should be fine now [17:25:49] YuviPanda: cool, thanks! [17:26:16] YuviPanda: confirmed, logged in:) [17:26:41] mutante: no NFS in mount output? [17:26:52] PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 22.22% of data above the critical threshold [20000.0] [17:27:16] YuviPanda: nope [17:27:21] mutante: wonderful. [17:27:33] mutante: I can get you your files next week if that's ok? [17:27:39] YuviPanda: yea, it is [17:27:45] mutante: do remind me [17:27:46] thanks [17:27:53] alright, will do [17:27:55] can't handle the cables? [17:28:38] ? 
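Note on the producer-buffering point above ("high latency production means producers have to buffer", "well for 70ms or so, sure"): a rough way to size that buffer is Little's law, where the number of messages waiting for an acknowledgement is about the send rate times the round-trip delay. The sketch below is only an illustration. The 70 ms figure comes from the conversation, while the message rate and the 1 ms local round trip are hypothetical values, not measurements from these hosts.

```python
def in_flight_messages(rate_per_sec: float, rtt_seconds: float) -> float:
    """Little's law: average messages awaiting acknowledgement = rate * delay."""
    return rate_per_sec * rtt_seconds


# Hypothetical per-host produce rate, chosen only for illustration.
HYPOTHETICAL_RATE = 10_000  # messages per second

# 1 ms stands in for a same-datacenter round trip (assumed);
# 70 ms is the figure mentioned in the discussion.
for rtt_ms in (1, 70):
    pending = in_flight_messages(HYPOTHETICAL_RATE, rtt_ms / 1000)
    print(f"rtt={rtt_ms} ms -> about {pending:.0f} messages buffered on the producer")
```

Under these assumptions the steady-state buffer grows linearly with latency, so the producer queue has to be sized for the slowest link, and bursts on top of the steady rate need extra headroom.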
[17:33:14] (03CR) 10GWicke: Remove trebuchet setup from restbase config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [17:34:12] (03PS3) 10Dzahn: icinga: give 20after4 permissions in cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830) [17:35:07] (03CR) 10coren: [C: 031] "We prefer Jaime alive." [puppet] - 10https://gerrit.wikimedia.org/r/218870 (owner: 10Jcrespo) [17:36:28] (03PS2) 10Jcrespo: Add jcrespo to the dba nagios contact list [puppet] - 10https://gerrit.wikimedia.org/r/218870 [17:36:44] (03CR) 10Jcrespo: [C: 032] Add jcrespo to the dba nagios contact list [puppet] - 10https://gerrit.wikimedia.org/r/218870 (owner: 10Jcrespo) [17:37:53] RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0] [17:38:10] (03CR) 10Dzahn: [C: 032] icinga: give 20after4 permissions in cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830) (owner: 10Dzahn) [17:38:34] (03PS4) 10Dzahn: icinga: give 20after4 permissions in cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830) [17:38:50] jynus: i merged your change on the master [17:39:23] oh, thank you, mutante [17:39:45] I will check neon [17:40:03] jynus: cool, it should also do another thing and give 20after4 permissions [17:41:37] should I test? [17:41:42] mutante, change applied correctly it seems (both9 [17:42:12] YuviPanda: looks good! Thank you! 
[17:42:21] jynus: :) [17:42:34] twentyafterfour: wanna try that icinga thing again when you got a minute [17:44:33] ottomata: port 9690 file ./aggregators/1041.conf [17:45:07] mutante: I'm off work today but I'll test it real quick before I leave ;) [17:45:09] andrewbogott: cool [17:45:57] !log krenair Synchronized private/PrivateSettings.php: sync 4a30446e for wikitech cleanup - T102361 (duration: 00m 12s) [17:46:00] danke mutante [17:46:01] Logged the message, Master [17:46:10] twentyafterfour: i hope it likes usernames starting with numbers:) and have a nice day off [17:46:19] ottomata: yw [17:46:47] mutante: what IP? [17:47:04] carbon's? [17:47:13] ottomata: yes, carbon [17:47:14] ok [17:47:15] thanks [17:47:43] 6operations, 10RESTBase-Cassandra: don't start cassandra at boot - https://phabricator.wikimedia.org/T103134#1383182 (10fgiunchedi) 3NEW a:3fgiunchedi [17:48:04] mutante: works [17:48:35] twentyafterfour: :) yay, have a nice weekend then [17:48:42] thanks [17:49:08] I went ahead and scheduled next week's maintenance window in icinga. I take it there isn't a way to do recurring ones? 
[17:49:12] (03PS1) 10Alex Monk: Get rid of unnecessary WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) [17:49:20] (03CR) 10jenkins-bot: [V: 04-1] Get rid of unnecessary WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [17:49:20] 6operations, 7Icinga, 5Patch-For-Review: create icinga user for Mukunda - https://phabricator.wikimedia.org/T102830#1383195 (10Dzahn) a:5mmodell>3Dzahn [17:49:34] 6operations, 7Icinga, 5Patch-For-Review: create icinga user for Mukunda - https://phabricator.wikimedia.org/T102830#1383196 (10Dzahn) 5Open>3Resolved 10:49 < twentyafterfour> mutante: works [17:49:35] (03PS2) 10Alex Monk: Get rid of unnecessary WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) [17:49:36] 6operations, 10RESTBase-Cassandra: don't start cassandra at boot - https://phabricator.wikimedia.org/T103134#1383198 (10fgiunchedi) ``` 18:39 + it's potentially data loss dangerous 18:40 gwicke: even on a fully bootstrapped node? 18:40 + yes, especially on a fully bootstrapped node 18:... [17:49:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp3030 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:50:07] 6operations, 7Icinga: create icinga user for Mukunda - https://phabricator.wikimedia.org/T102830#1383199 (10Dzahn) [17:51:23] ottomata: conf1001 already shows up [17:51:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp3030 is OK Less than 1.00% above the threshold [0.0] [17:51:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:51:39] in analytics cluster [17:52:19] in analytics?! 
[17:52:48] 6operations: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#1383206 (10fgiunchedi) +1, thanks @valhallasw. I think the only problematic bot might be logmsgbot if it is parsing `!log` only from privmsg and not notice as well (or some solution of course) [17:55:10] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1383222 (10fgiunchedi) we've seen more `sdb` errors on `restbase1008` too, but nothing on `sda` so far. To rule out further things like... [17:55:21] ottomata: yea, for some reason it's in the analytics cluster [17:55:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0] [17:58:36] 6operations, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1383245 (10Whatamidoing-WMF) > was never rejected due to insufficient community consensus but due to very specific technical reasons. "Very specific technical rea... [18:01:44] (03PS1) 10Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 [18:01:49] paravoid: ^ ? 
[18:01:56] oops [18:02:03] didn't mean for jmxtrans to be in there, one sec [18:02:27] (03CR) 10jenkins-bot: [V: 04-1] Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 (owner: 10Ottomata) [18:02:46] (03PS2) 10Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 [18:03:06] (03PS3) 10Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 [18:03:13] (03Abandoned) 10Andrew Bogott: Exclude labs private IPs from dmz_cidr. [puppet] - 10https://gerrit.wikimedia.org/r/210720 (owner: 10Andrew Bogott) [18:04:03] (03CR) 10Filippo Giunchedi: "I don't think we plan to run racktables on HHVM, in this case I don't see a reason to start depending on mod_php specifically either. Also" [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) (owner: 10Filippo Giunchedi) [18:04:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: install openjdk-7-jdk [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219296 (https://phabricator.wikimedia.org/T102996) (owner: 10Filippo Giunchedi) [18:06:34] (03CR) 10Andrew Bogott: [C: 031] "This needs to be preceded by a mw-config patch, right? So we don't refer to this no-longer-puppetized file?" [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [18:08:53] marktraceur: https://phabricator.wikimedia.org/T103137 [18:09:31] (03PS4) 10Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 [18:09:33] YuviPanda: orgcharts is dead anyway [18:09:41] marktraceur: shall I delete the project? 
[18:09:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp1074 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:09:51] marktraceur: do you have the code elsewhere? [18:10:14] :( [18:10:25] (03PS1) 10Filippo Giunchedi: cassandra: update submodule [puppet] - 10https://gerrit.wikimedia.org/r/219410 [18:11:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3031 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:12:02] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: update submodule [puppet] - 10https://gerrit.wikimedia.org/r/219410 (owner: 10Filippo Giunchedi) [18:12:30] (03PS1) 10Ottomata: Use class {} instead of include to include classes in eventlogging role [puppet] - 10https://gerrit.wikimedia.org/r/219412 [18:12:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp1073 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:13:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp3031 is OK Less than 1.00% above the threshold [0.0] [18:13:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp1074 is OK Less than 1.00% above the threshold [0.0] [18:13:28] (03PS2) 10Ottomata: Use class {} instead of include to include classes in eventlogging role [puppet] - 10https://gerrit.wikimedia.org/r/219412 [18:14:17] (03CR) 10Ottomata: [C: 032] Use class {} instead of include to include classes in eventlogging role [puppet] - 10https://gerrit.wikimedia.org/r/219412 (owner: 10Ottomata) [18:14:41] * YuviPanda pokes sad marktraceur [18:15:37] (03CR) 10Filippo Giunchedi: "thoughts on naming? d-i-console might work just fine or install-console or sth like that" [puppet] - 10https://gerrit.wikimedia.org/r/217016 (owner: 10Filippo Giunchedi) [18:15:51] (03CR) 10Alex Monk: "Nope, it was included from PrivateSettings.php (Why? I have no idea.), which I already removed and synced earlier. 
WikitechPrivateLDAPSett" [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [18:16:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp1073 is OK Less than 1.00% above the threshold [0.0] [18:17:10] (03CR) 10Filippo Giunchedi: [C: 031] configure additional Cassandra metric alerts [puppet] - 10https://gerrit.wikimedia.org/r/218408 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans) [18:17:36] Jamesofur|cloud: hi! do you want the sugarcrm project recovered? [18:18:54] (03CR) 10Filippo Giunchedi: "@Aaron, you reckon restarting jobchron for log rotation at the same time might cause thundering herds or issues like that?" [puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [18:18:58] Error: /Stage[main]/Mediawiki::Scap/Package[scap]/ensure ... to latest failed: Could not get latest version: 403 Forbidden [18:19:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:19:33] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: install openjdk-7-jdk on Cassandra nodes - https://phabricator.wikimedia.org/T102996#1383326 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi merged [18:22:06] 6operations, 6Labs, 6Release-Engineering, 10wikitech.wikimedia.org: silver / scap - Could not get latest version: 403 Forbidden - https://phabricator.wikimedia.org/T103138#1383336 (10Dzahn) [18:22:07] (03PS5) 10Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 [18:22:14] (03CR) 10Ottomata: [C: 032 V: 032] Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/219408 (owner: 10Ottomata) [18:22:38] (03CR) 10Aaron Schulz: "I'm not worried. 
The last patch vastly reduced the CPU of this module for a number of reasons and the deploy/restart went without problem " [puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [18:23:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0] [18:23:46] (03CR) 10GWicke: Remove trebuchet setup from restbase config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [18:25:23] (03PS1) 10Ottomata: Fix typo in graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/219416 [18:25:27] ACKNOWLEDGEMENT - salt-minion processes on conf1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn @ottomata - setup [18:25:27] ACKNOWLEDGEMENT - puppet last run on conf1002 is CRITICAL Puppet has 1 failures daniel_zahn @ottomata - setup [18:25:27] ACKNOWLEDGEMENT - puppet last run on conf1003 is CRITICAL Puppet has 1 failures daniel_zahn @ottomata - setup [18:25:37] (03CR) 10Ottomata: [C: 032 V: 032] Fix typo in graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/219416 (owner: 10Ottomata) [18:26:08] PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:26:58] PROBLEM - puppet last run on cp1047 is CRITICAL puppet fail [18:27:07] PROBLEM - puppet last run on cp3012 is CRITICAL puppet fail [18:27:25] my fault ^ [18:27:25] fixing. 
[18:27:52] (03CR) 10Aaron Schulz: [C: 031] jobchron: log rotate [puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [18:27:58] RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING [18:28:28] PROBLEM - puppet last run on cp1055 is CRITICAL puppet fail [18:28:47] PROBLEM - puppet last run on cp2005 is CRITICAL puppet fail [18:28:57] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail [18:28:58] PROBLEM - puppet last run on cp1056 is CRITICAL puppet fail [18:29:08] PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail [18:29:08] ACKNOWLEDGEMENT - puppet last run on silver is CRITICAL Puppet has 1 failures daniel_zahn T103138 [18:29:08] PROBLEM - puppet last run on cp3016 is CRITICAL puppet fail [18:29:08] PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail [18:29:17] PROBLEM - puppet last run on cp1061 is CRITICAL puppet fail [18:29:18] PROBLEM - puppet last run on cp2001 is CRITICAL puppet fail [18:30:07] PROBLEM - puppet last run on cp4008 is CRITICAL puppet fail [18:30:08] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail [18:30:08] PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail [18:30:29] PROBLEM - puppet last run on cp3037 is CRITICAL puppet fail [18:31:08] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [18:31:36] (03CR) 10Filippo Giunchedi: [C: 04-1] "thanks Aaron!" 
[puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [18:31:54] PROBLEM - puppet last run on cp2013 is CRITICAL puppet fail [18:32:05] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail [18:32:34] PROBLEM - puppet last run on cp3041 is CRITICAL puppet fail [18:32:42] !log stop cassandra on restbase1008 [18:32:46] Logged the message, Master [18:32:48] (03Abandoned) 10Ori.livneh: carbon-cache: enable manhole interface [puppet] - 10https://gerrit.wikimedia.org/r/219226 (owner: 10Ori.livneh) [18:32:54] PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail [18:33:04] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [18:33:14] PROBLEM - puppet last run on cp1071 is CRITICAL puppet fail [18:33:17] my manhole was not attractive enough for godog [18:33:35] (03PS1) 10Dzahn: labmon1001: move to correct ganglia cluster "virt" [puppet] - 10https://gerrit.wikimedia.org/r/219418 [18:33:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp1072 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:33:44] PROBLEM - puppet last run on cp2014 is CRITICAL puppet fail [18:34:26] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4001 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:34:32] (03PS2) 10Dzahn: labmon1001: move to correct ganglia cluster "virt" [puppet] - 10https://gerrit.wikimedia.org/r/219418 [18:34:34] ori: was good in theory, in practice I didn't need it [18:34:34] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4019 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:35:05] RECOVERY - puppet last run on cp2005 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:35:05] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:35:25] RECOVERY - Varnishkafka Delivery Errors per minute on cp1072 is OK Less than 1.00% above the threshold 
[0.0] [18:35:25] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1050 is CRITICAL Anomaly detected: 52 data above and 30 below the confidence bounds [18:35:26] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:35:35] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:35:44] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:35:44] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:35:44] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2020 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:35:55] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [18:35:55] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4018 is CRITICAL Anomaly detected: 75 data above and 0 below the confidence bounds [18:36:05] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:05] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:36:05] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3005 is CRITICAL Anomaly detected: 46 data above and 34 below the confidence bounds [18:36:15] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:36:15] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:36:25] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:25] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:36:25] 
RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:25] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [18:36:35] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:43] uhm.. and now the master died on top of that ? sigh [18:36:46] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:54] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3040 is CRITICAL Anomaly detected: 64 data above and 34 below the confidence bounds [18:37:05] RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:37:12] ottomata: is the puppetmaster thing because salt on * ? [18:37:15] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:37:15] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:37:24] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:45] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:45] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:37:52] hmm, as long as it recovers.. [18:37:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp1068 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:38:14] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 5.411 second response time [18:38:15] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:38:15] haha, clearly the anomoaly thing isn't helping? 
[18:38:16] geez [18:38:16] haha [18:38:19] needs adjusting [18:38:32] or, maybe it is adjusting? [18:38:32] 6operations, 10RESTBase-Cassandra: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1383467 (10fgiunchedi) [18:38:33] oof, i dunno [18:38:39] godog: are you familiar with tuning that? [18:38:50] 6operations, 10RESTBase-Cassandra: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1383182 (10fgiunchedi) for the same reasons, puppet shouldn't `ensure => 'running'` [18:38:56] godog: oh i was just making a lewd joke [18:39:19] ottomata: nope but we can take a look! [18:39:39] ori: hehe yeah, I wasn't sure how to reply without another lewd joke [18:39:42] TIL: lewd [18:40:11] godog: http://grafana.wikimedia.org/#/dashboard/db/kafkatest?panelId=5&fullscreen&edit [18:40:54] PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:40:55] PROBLEM - Cassandra database on restbase1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [18:41:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp1068 is OK Less than 1.00% above the threshold [0.0] [18:42:30] so, godog, i'm looking at hot winters bands for this metric [18:42:45] and it looks to me like the real number is always under both bands during a spike [18:43:08] so i'm not sure why this alert would fire [18:43:11] i'm looking at cp3040 [18:43:55] (03CR) 10Dzahn: [C: 032] labmon1001: move to correct ganglia cluster "virt" [puppet] - 10https://gerrit.wikimedia.org/r/219418 (owner: 10Dzahn) [18:45:19] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2017 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:19] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2011 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:19] 
PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1070 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:19] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2023 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:19] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3020 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:20] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3038 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:31] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1059 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:39] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3003 is CRITICAL Anomaly detected: 51 data above and 20 below the confidence bounds [18:45:39] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4004 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:39] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3036 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:46] yeah [18:45:47] hmph [18:45:57] heh hot winters [18:45:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1055 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3008 is CRITICAL Anomaly detected: 55 data above and 19 below the confidence bounds [18:45:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2004 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3014 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:45:59] PROBLEM - Varnishkafka Delivery Errors per 
minute anomaly on cp2010 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:00] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1054 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3004 is CRITICAL Anomaly detected: 45 data above and 29 below the confidence bounds [18:46:10] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3019 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:10] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2009 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:11] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2016 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:19] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3037 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:22] sigh. [18:46:26] what is this? 
[18:46:30] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1052 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:30] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1053 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:30] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1069 is CRITICAL Anomaly detected: 66 data above and 14 below the confidence bounds [18:46:30] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2002 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:31] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2003 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:31] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2015 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:31] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3013 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:32] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3022 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:32] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3035 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:46:38] OH [18:46:40] i'm surprised to see cp1xxx in there, meaning it's not a WAN issue [18:47:07] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1049 is CRITICAL Anomaly detected: 57 data above and 17 below the confidence bounds [18:47:07] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3021 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:07] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3018 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:07] 
PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3006 is CRITICAL Anomaly detected: 68 data above and 0 below the confidence bounds [18:47:07] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3034 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:07] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4020 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1061 is CRITICAL Anomaly detected: 11 data above and 47 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2019 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1057 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4002 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1099 is CRITICAL Anomaly detected: 55 data above and 18 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1074 is CRITICAL Anomaly detected: 64 data above and 34 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1066 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2007 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:23] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2006 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:23] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2025 is CRITICAL Anomaly detected: 100 
data above and 0 below the confidence bounds [18:47:23] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3017 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:23] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3033 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:23] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2013 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:23] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2018 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:23] (03PS1) 10Ottomata: Fix over => true for vk drerr anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/219423 [18:47:29] (03PS2) 10Ottomata: Fix over => true for vk drerr anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/219423 [18:47:42] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1065 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:42] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1047 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:51] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1063 is CRITICAL Anomaly detected: 53 data above and 18 below the confidence bounds [18:47:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1071 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1073 is CRITICAL Anomaly detected: 26 data above and 69 below the confidence bounds [18:47:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2012 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:47:59] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1046 is CRITICAL Anomaly detected: 100 
data above and 0 below the confidence bounds [18:48:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1064 is CRITICAL Anomaly detected: 48 data above and 21 below the confidence bounds [18:48:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1043 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1060 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3032 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3016 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:10] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4011 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:10] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4012 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:17] (03PS3) 10Ottomata: Fix over => true for vk drerr anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/219423 [18:48:20] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3005 is CRITICAL Anomaly detected: 45 data above and 35 below the confidence bounds [18:48:20] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3015 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:20] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4017 is CRITICAL Anomaly detected: 73 data above and 0 below the confidence bounds [18:48:20] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3044 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:20] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1056 is 
CRITICAL Anomaly detected: 4 data above and 82 below the confidence bounds [18:48:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2005 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:21] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3031 is CRITICAL Anomaly detected: 62 data above and 35 below the confidence bounds [18:48:22] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3039 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds [18:48:51] (03CR) 10Ottomata: [C: 032 V: 032] Fix over => true for vk drerr anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/219423 (owner: 10Ottomata) [18:49:30] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1048 is CRITICAL Anomaly detected: 44 data above and 45 below the confidence bounds [18:49:51] RECOVERY - Cassandra database on restbase1008 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [18:51:09] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3030 is CRITICAL Anomaly detected: 55 data above and 43 below the confidence bounds [18:51:11] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review, 7Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1383539 (10fgiunchedi) FWIW, relevant patch upstream http://archive.linuxvirtualserver.org/html/lvs-devel/2013-06/msg00055.html to support choos... [18:53:06] lotta varnishkafka errors lately, ottomata is that you? 
[18:54:00] PROBLEM - puppet last run on cp2026 is CRITICAL puppet fail [18:54:00] PROBLEM - puppet last run on cp3036 is CRITICAL puppet fail [18:54:29] PROBLEM - puppet last run on cp2002 is CRITICAL puppet fail [18:54:40] PROBLEM - puppet last run on cp2020 is CRITICAL puppet fail [18:54:41] chasemp: yes, the most recent flood was me trying to quiet them a bit [18:54:49] PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail [18:55:00] ok sounds good, best of luck then [18:55:02] haha [18:55:03] thanks :/ [18:55:11] PROBLEM - puppet last run on cp2012 is CRITICAL puppet fail [18:55:20] PROBLEM - puppet last run on cp1072 is CRITICAL puppet fail [18:55:20] PROBLEM - puppet last run on cp1057 is CRITICAL puppet fail [18:55:20] PROBLEM - puppet last run on cp2025 is CRITICAL puppet fail [18:55:29] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3009 is CRITICAL Anomaly detected: 19 data above and 46 below the confidence bounds [18:55:44] bblack, re mobile redirects: SHIT HIT FAN!!1 :P https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Wikipedia_on_a_mobile_browser_is_not_showing_the_mobile_version_of_the_page [18:55:51] jdlrobson, Krinkle ^ [18:56:10] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1050 is CRITICAL Anomaly detected: 32 data above and 46 below the confidence bounds [18:56:20] RECOVERY - puppet last run on cp2002 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:57:00] PROBLEM - Cassandra database on restbase1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [18:57:45] i wish that anomaly monitor's error text was less vague. like if it had units and values instead of just counts. 
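Editor's sketch for the request just above (values and units in the anomaly message, not only counts). This is a minimal illustration, not the code of the actual check: the function name `summarize_anomaly`, its parameters, and the `errors/min` unit label are all assumptions made for the example. It assumes the per-datapoint lower and upper confidence bounds are already available as lists.

```python
def summarize_anomaly(values, lower, upper, unit="errors/min"):
    """Build an alert message with counts plus the extreme offending values.

    Hypothetical helper: `values`, `lower` and `upper` are equal-length
    lists holding the metric and its confidence bounds per datapoint.
    """
    # Datapoints that escaped the band on either side.
    above = [v for v, hi in zip(values, upper) if v > hi]
    below = [v for v, lo in zip(values, lower) if v < lo]

    # Keep the familiar count-based wording first.
    parts = ["%d above and %d below the confidence bounds"
             % (len(above), len(below))]
    # Then add the actual values, with a unit, so the number means something.
    if above:
        parts.append("max %g %s" % (max(above), unit))
    if below:
        parts.append("min %g %s" % (min(below), unit))
    return "; ".join(parts)
```

With a message of this shape a reader can tell at a glance whether "100 data above" means a metric slightly over the band or one far outside it.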
[18:58:35] <_joe_> jgage: If I hadn't written it in athens in 3 days, probably, it would be better [18:58:38] <_joe_> :P [18:58:47] :) [18:58:51] i remain hopeful for the future [18:58:56] which i suppose means i should open a ticket [18:58:58] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1383556 (10Eevans) Regarding the two non-hardware blockers, (the bootstrap/streaming failures, and metrics reporting): https://issues.a... [18:59:00] RECOVERY - puppet last run on cp2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:09] RECOVERY - puppet last run on cp1057 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:59:40] RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:52] ottomata: I think the previous checks were ok, but we should be using percentage => 80 or sth like that to make sure most datapoints in the last 10m are over threshold [19:00:00] labsdb1001-1003 = MySQL eqiad labsdb1004 = Misc eqiad [19:00:09] * mutante finds all these little inconsistencies [19:00:27] mutante, there will be a mysql there soon [19:00:27] ? [19:00:35] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1056 is OK No anomaly detected [19:00:52] jynus: ah, 1004 is not installed yet? 
that would explain i guess, and "misc" is just default [19:00:56] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp3041 is OK No anomaly detected [19:00:59] the only reason why it is not there yet is that I need coordination with labs, and they are a bit busy [19:01:05] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1073 is OK No anomaly detected [19:01:07] jynus: got it:) thx [19:01:16] oh percentage, hum [19:01:22] it will be [19:01:24] RECOVERY - puppet last run on cp1072 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [19:01:30] tools slave or something like that [19:01:31] godog: is that instead of from => '10m'? [19:01:45] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1061 is OK No anomaly detected [19:02:03] ottomata: nope, in addition to, I'm looking at graphite_threshold [19:02:07] right [19:02:13] forgetting anomaly for a second [19:02:14] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1050 is OK No anomaly detected [19:02:18] maybe these things just need better documentation and examples [19:02:21] it is pretty confusing [19:02:40] from is total num of datapoints to get [19:02:41] or [19:02:46] get all datapoints in last timeperiod [19:03:14] RECOVERY - puppet last run on cp2025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:03:25] would series make more sense then? 
godog, i just want alert on consistent errors [19:03:27] not spikes [19:03:40] like, if the last 10 minutes all had a large number of errors, THEN alert [19:04:01] hm, or series would be more spikey i guess [19:04:02] hm [19:04:06] so, percentage [19:04:18] since logster sends metrics every minute [19:04:21] if i'm looking at 10 minutes [19:04:24] RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:04:30] and default percentage is 1% of datapoints [19:04:37] then any spike would cause an alert at all [19:05:13] yep [19:05:27] andrewbogott: virt1005-1007 are still to be installed i assume, right? [19:05:42] eh, re-installed or something [19:05:45] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3031 is CRITICAL Anomaly detected: 44 data above and 53 below the confidence bounds [19:05:47] hm, ok. so 80% would mean that at least 8 of last 10 minutes would have to have had # of errors > threshold [19:06:55] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1072 is CRITICAL Anomaly detected: 54 data above and 44 below the confidence bounds [19:06:59] Krinkle, the part you asked about is https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-frontend.inc.vcl.erb#L48 [19:07:20] sure, 8 minutes is fine. [19:07:21] 80% [19:07:22] (03PS1) 10Dzahn: ganglia: set "virt" cluster for all in regex [puppet] - 10https://gerrit.wikimedia.org/r/219429 [19:07:22] hm [19:07:28] mutante: probably ripped out and sent back to cisco. 
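Editor's sketch of the percentage rule discussed above: alert only when a given share of the datapoints in the window exceed the threshold. It is a rough illustration under stated assumptions, not the graphite_threshold implementation; the function name `should_alert`, the treatment of `None` as a missing datapoint, and the `>=` comparison against the percentage are all choices made for the example.

```python
def should_alert(datapoints, threshold, percentage=80.0):
    """Return True when at least `percentage` percent of the datapoints
    in the window are above `threshold`.

    Hypothetical helper: `datapoints` is the list of values fetched for
    the window (for example one value per minute over the last 10 minutes).
    """
    # Ignore gaps in the series; None stands for a missing datapoint here.
    values = [v for v in datapoints if v is not None]
    if not values:
        return False  # no data at all: this sketch chooses not to alert
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values) >= percentage


# One spike in a nine-point window is 11.11% of the data: enough to fire
# at a 1% setting, nowhere near enough at 80%.
spike = [0] * 8 + [25000]
print(should_alert(spike, 20000, percentage=1))    # True
print(should_alert(spike, 20000, percentage=80))   # False
```

The "11.11% of data above the critical threshold" alerts earlier in the log are consistent with exactly this situation, a single datapoint out of nine crossing the line.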
[19:07:35] Certainly don’t need monitoring [19:07:49] 7Puppet, 6Mobile-Web: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1383579 (10Jdlrobson) 3NEW [19:08:06] MaxSem: interesting [19:08:30] 7Puppet, 6Mobile-Web: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1383588 (10Jdlrobson) Note the redirect for mobile is done in: https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-frontend.inc.vcl.erb#L48 and I'd rather avoid editing i... [19:08:49] (03PS1) 10Ottomata: Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/219430 [19:08:51] 6operations: graphite1002 - RAID degraded - https://phabricator.wikimedia.org/T103159#1383589 (10Dzahn) 3NEW [19:08:58] godog: https://gerrit.wikimedia.org/r/#/c/219430/ [19:08:59] ? [19:09:24] RECOVERY - puppet last run on cp2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:09:25] andrewbogott: oh, it seems the pattern with the "odd" hosts i find is all "they are Cisco", not just in labs [19:09:40] (03PS2) 10Ottomata: Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/219430 [19:09:44] PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1068 is CRITICAL Anomaly detected: 54 data above and 43 below the confidence bounds [19:09:45] https://www.google.co.uk/search?q=http+666 -> https://en.wikipedia.org/?title=666_(number) [19:09:49] it's more and more common [19:09:53] where do these come from ... [19:10:17] mutante: curious! 
[19:10:45] (03CR) 10Filippo Giunchedi: [C: 031] Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/219430 (owner: 10Ottomata) [19:11:14] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:11:22] (03CR) 10Ottomata: [C: 032] Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - 10https://gerrit.wikimedia.org/r/219430 (owner: 10Ottomata) [19:11:34] and why do we have an index.php in webserver root in the first place??? [19:11:46] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp3030 is OK No anomaly detected [19:12:11] api.php [19:12:17] err, wrng window [19:13:33] MaxSem: We don't, it's just rewrite handling [19:13:43] yup, just found out [19:13:49] MaxSem: OK. I've got an idea for a fix [19:13:51] too many ways to screw up [19:14:15] Nemo_bis: hi? [19:14:24] Nemo_bis: I'm disabling NFS on the ttmserver project and bringing it back up [19:14:41] Nemo_bis: you are the only person with files in their homedir [19:15:35] PROBLEM - Host labcontrol1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:55] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:16:54] heh, new icinga issues show up faster than we can ack them [19:17:00] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1048 is OK No anomaly detected [19:17:03] the labs stuff is maintenance? 
[19:17:19] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [19:17:19] RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp3031 is OK No anomaly detected [19:17:28] RECOVERY - Host labcontrol1001 is UPING OK - Packet loss = 0%, RTA = 0.43 ms [19:17:44] 6operations, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, 6Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1383626 (10Tgr) > Just tell people to trust their OS' CA store, anything else is just insecure. Telling people to trust t... [19:17:44] etherpad too .. wow [19:17:56] mutante: see _security for soem of it [19:17:59] ACKNOWLEDGEMENT - RAID on graphite1002 is CRITICAL 1 failed LD(s) (Degraded) daniel_zahn https://phabricator.wikimedia.org/T103159 [19:18:08] RECOVERY - Host labs-ns0.wikimedia.org is UPING OK - Packet loss = 0%, RTA = 1.31 ms [19:18:28] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7924 bytes in 0.062 second response time [19:18:40] chasemp: ok :) thx [19:22:47] (03PS2) 10Dzahn: labcontrol2001,nembus,neptunium: set cluster in hiera [puppet] - 10https://gerrit.wikimedia.org/r/219079 [19:23:37] (03PS3) 10Dzahn: labcontrol2001,nembus,neptunium: set cluster in hiera [puppet] - 10https://gerrit.wikimedia.org/r/219079 [19:23:52] (03CR) 10Dzahn: "FWIW - these were NOT in ganglia before this change" [puppet] - 10https://gerrit.wikimedia.org/r/219079 (owner: 10Dzahn) [19:24:11] (03CR) 10Dzahn: [C: 032] "FWIW - these were NOT in ganglia before this change" [puppet] - 10https://gerrit.wikimedia.org/r/219079 (owner: 10Dzahn) [19:24:55] (03Abandoned) 10Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831) (owner: 10Dzahn) [19:25:24] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - 
https://phabricator.wikimedia.org/T73156#1383664 (10Dzahn) so all is left here is OTRS it seems [19:26:12] have any of you folks been to poppetconfs? [19:26:13] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [19:26:26] byron is interested in going, and wants to ask what people thought [19:26:32] this year is in Oregon [19:26:37] Thoughts? [19:28:02] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7936 bytes in 7.988 second response time [19:28:55] (03PS2) 10Dzahn: ganglia: set "virt" cluster for all in regex [puppet] - 10https://gerrit.wikimedia.org/r/219429 [19:30:13] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [19:31:46] (03PS1) 10Ottomata: Update jmxmodule for jessie package, change ganglia host and port for zookeeper on conf100[123] [puppet] - 10https://gerrit.wikimedia.org/r/219433 [19:31:53] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.412 second response time [19:34:32] PROBLEM - puppet last run on zirconium is CRITICAL puppet fail [19:35:53] PROBLEM - puppet last run on mw2210 is CRITICAL puppet fail [19:35:55] (03CR) 10Ottomata: [C: 032 V: 032] Update jmxmodule for jessie package, change ganglia host and port for zookeeper on conf100[123] [puppet] - 10https://gerrit.wikimedia.org/r/219433 (owner: 10Ottomata) [19:39:52] RECOVERY - puppet last run on conf1002 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:41:00] RECOVERY - puppet last run on conf1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:41:20] RECOVERY - puppet last run on zirconium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:41:32] phew, thanks godog, icinga cleaner at the moment, we will see if the intermittent alerts stop happening now [19:42:00] ottomata: dr_err all the things! 
[19:42:22] where did the "zirconium enabled" come from? [19:43:31] (03CR) 10Dzahn: [C: 032] ganglia: set "virt" cluster for all in regex [puppet] - 10https://gerrit.wikimedia.org/r/219429 (owner: 10Dzahn) [19:43:51] (03PS3) 10Dzahn: ganglia: set "virt" cluster for all in regex [puppet] - 10https://gerrit.wikimedia.org/r/219429 [19:44:03] (03PS4) 10Dzahn: ganglia: set "virt" cluster for all labs hosts in regex [puppet] - 10https://gerrit.wikimedia.org/r/219429 [19:47:26] ottomata: did you see legitimate dr_err in the past for actual problems btw? I'm wondering if it isn't bound to fire at all now [19:50:04] godog: they are real drerrs, but [19:50:14] i am not currently interested in short spikes of them [19:50:27] i would like to solve that problem, but it is low priority and kind of hard. [19:50:41] so the intermittent alerts were just annoying and ignored by all [19:50:44] which is not a useful alert [19:51:10] agreed, did you see persistent drerr too in the past? [19:51:30] oh, i see, um, i can't remember ever seeing any, but they would happen if there were serious network problems, or if kafka brokers had problems [19:52:35] hm, i think i remember occasional periods where all of esams would get a big latency hit for some reason and cause all the VKs there to drerr, but those were still shortish periods, only a few minutes long max [19:52:42] but, that was a while ago [19:52:47] hasn't happened recently [19:53:00] PROBLEM - jmxtrans on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar [19:53:01] kk, well that should alarm only on longer timespans now [19:53:17] (03PS4) 10Dzahn: mysql: set ganglia cluster in hiera, not site.pp [puppet] - 10https://gerrit.wikimedia.org/r/219074 [19:53:19] hopefully, yes. 
[19:53:30] RECOVERY - puppet last run on mw2210 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:53:58] (CR) Dzahn: [C: 2] mysql: set ganglia cluster in hiera, not site.pp [puppet] - https://gerrit.wikimedia.org/r/219074 (owner: Dzahn) [19:55:01] RECOVERY - Cassandra database on restbase1007 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [19:57:50] RECOVERY - Cassanda CQL query interface on restbase1007 is OK: TCP OK - 0.003 second response time on port 9042 [19:59:00] PROBLEM - jmxtrans on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar [20:01:01] PROBLEM - jmxtrans on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar [20:02:06] OO [20:02:07] interesting. [20:03:51] PROBLEM - jmxtrans on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar [20:04:39] (PS2) Dzahn: analytics_kafka: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/219075 [20:05:21] (CR) Dzahn: "status of ganglia in fundraising is unclear" [puppet] - https://gerrit.wikimedia.org/r/219077 (owner: Dzahn) [20:07:15] oh come on [20:09:28] did someone comment out all my crontabs on tools-labs?
[20:09:34] PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:44] PROBLEM - DPKG on analytics1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:44] PROBLEM - DPKG on analytics1034 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:55] PROBLEM - DPKG on analytics1031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:55] PROBLEM - DPKG on analytics1037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:55] PROBLEM - DPKG on analytics1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:55] PROBLEM - DPKG on analytics1040 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:55] PROBLEM - DPKG on analytics1041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:56] PROBLEM - DPKG on analytics1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:57] Mjbmr: yes, all crontabs were commented out. [20:09:58] ^ me again, jmxtrans package [20:10:15] PROBLEM - DPKG on analytics1019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:19] YuviPanda: for everyone? [20:10:24] PROBLEM - DPKG on analytics1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:24] PROBLEM - DPKG on analytics1036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:34] PROBLEM - DPKG on analytics1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:41] Mjbmr: yes. 
see https://wikitech.wikimedia.org/wiki/Incident_documentation/20150617-LabsNFSOutage and the linked labs mailing list post [20:10:45] PROBLEM - DPKG on analytics1010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:55] PROBLEM - DPKG on analytics1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:55] PROBLEM - DPKG on analytics1033 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:11:05] PROBLEM - DPKG on analytics1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:11:07] YuviPanda: how do I know which one were already commented out? [20:11:14] PROBLEM - DPKG on analytics1030 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:11:15] PROBLEM - DPKG on analytics1035 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:11:15] RECOVERY - DPKG on analytics1003 is OK: All packages OK [20:11:25] Mjbmr: ah, I have a backup I can provide you [20:11:40] Mjbmr: can you open a bug with the list of tools you want the crontabs for and I'll link you to them? [20:11:42] what the heck is this about [20:11:44] RECOVERY - DPKG on analytics1015 is OK: All packages OK [20:11:44] RECOVERY - DPKG on analytics1037 is OK: All packages OK [20:12:05] RECOVERY - DPKG on analytics1019 is OK: All packages OK [20:12:14] RECOVERY - DPKG on analytics1020 is OK: All packages OK [20:12:24] RECOVERY - DPKG on analytics1028 is OK: All packages OK [20:12:27] YuviPanda: that's the best you could do? I think I'm gonna find out my self. thanks. 
[20:12:36] RECOVERY - DPKG on analytics1010 is OK: All packages OK [20:12:36] you are welcome [20:12:45] RECOVERY - DPKG on analytics1029 is OK: All packages OK [20:12:45] RECOVERY - DPKG on analytics1033 is OK: All packages OK [20:12:54] RECOVERY - DPKG on analytics1026 is OK: All packages OK [20:12:55] RECOVERY - DPKG on analytics1030 is OK: All packages OK [20:13:04] RECOVERY - DPKG on analytics1035 is OK: All packages OK [20:13:14] RECOVERY - DPKG on analytics1027 is OK: All packages OK [20:13:14] RECOVERY - DPKG on analytics1034 is OK: All packages OK [20:13:25] RECOVERY - DPKG on analytics1031 is OK: All packages OK [20:13:25] RECOVERY - DPKG on analytics1040 is OK: All packages OK [20:13:25] RECOVERY - DPKG on analytics1041 is OK: All packages OK [20:13:25] RECOVERY - DPKG on analytics1039 is OK: All packages OK [20:13:55] RECOVERY - DPKG on analytics1036 is OK: All packages OK [20:15:29] YuviPanda: newly commented have extra space, that was easy. [20:16:27] ah, good to know [20:19:55] (03PS1) 10Ottomata: Fix jmxtrans process check command for newer version of jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/219447 [20:20:04] (03PS2) 10Ottomata: Fix jmxtrans process check command for newer version of jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/219447 [20:20:50] (03CR) 10Ottomata: [C: 032] Fix jmxtrans process check command for newer version of jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/219447 (owner: 10Ottomata) [20:22:17] RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [20:22:46] RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [20:22:56] RECOVERY - jmxtrans on analytics1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [20:23:26] RECOVERY - jmxtrans on analytics1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar 
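Editor's sketch of why the jmxtrans process check needed the fix above. The old check looked for the literal args "-jar jmxtrans-all.jar", while the recovered check reports "regex args -jar.+jmxtrans-all.jar". The example assumes the newer package starts java with something between "-jar" and the jar name; the path used here is hypothetical, and treating the old check as a plain substring match is a simplification.

```python
import re

OLD_ARGS = "-jar jmxtrans-all.jar"                 # literal args from the failing check
NEW_ARGS_RE = re.compile(r"-jar.+jmxtrans-all.jar")  # regex from the recovered check


def old_check(cmdline):
    """Approximation of the old check: literal substring match on the args."""
    return OLD_ARGS in cmdline


def new_check(cmdline):
    """Approximation of the new check: regex search on the args."""
    return NEW_ARGS_RE.search(cmdline) is not None


old_style = "java -Xmx512m -jar jmxtrans-all.jar"
# Hypothetical command line for the newer package (path is an assumption).
new_style = "java -Xmx512m -jar /usr/share/jmxtrans/lib/jmxtrans-all.jar"

for cmdline in (old_style, new_style):
    print(old_check(cmdline), new_check(cmdline))
```

The regex form accepts both command lines, so the check keeps working whether or not the jar is given with a path.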
[20:25:59] ottomata: i'd try switching analytics_kafka to ganglia_new too.. worst case they disappear from ganglia for a while.. i did have issues with the regular analytics cluster though [20:27:08] mutante: i think its fine, the kafka alerts are based on graphite checks anyway [20:29:18] ottomata: ok, good [20:34:05] (03PS3) 10Dzahn: analytics_kafka: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/219075 [20:35:03] (03CR) 10Dzahn: [C: 032] analytics_kafka: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/219075 (owner: 10Dzahn) [20:35:26] PROBLEM - puppet last run on es2003 is CRITICAL puppet fail [20:36:37] PROBLEM - puppet last run on es2002 is CRITICAL puppet fail [20:37:24] (03PS1) 10Jdlrobson: Enable browse prototype on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) [20:40:46] (03CR) 10Bmansurov: [C: 04-1] Enable browse prototype on English Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [20:41:54] (03PS2) 10Jdlrobson: Enable browse prototype on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) [20:43:38] (03CR) 10Bmansurov: [C: 031] Enable browse prototype on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [20:44:12] w00t [20:47:47] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. 
- https://phabricator.wikimedia.org/T98161#1383923 (Ottomata) [20:47:50] operations, Analytics-Cluster: Create jmxtrans Jessie package - https://phabricator.wikimedia.org/T103106#1383921 (Ottomata) Open>Resolved Imported v250 from: http://central.maven.org/maven2/org/jmxtrans/jmxtrans/250/ [20:48:15] operations, Patch-For-Review, discovery-system: Install etcd in multiple rows/racks - https://phabricator.wikimedia.org/T101713#1383924 (Ottomata) jmxtrans package updated and installed. [20:48:45] DOH, I lied mutante [20:48:56] i thought i had done away with monitoring ganglia...i guess not! [20:49:21] ottomata: where does that monitoring live? [20:50:03] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/kafka.pp#L213 [20:50:43] it is in graphite though, i'll move them over now [20:50:50] my head is already in this stuff anyway [20:51:05] ottomata: ah, i see it in Icinga now, i didn't at first because they are just WARN, not CRIT [20:51:21] well, it should recover after puppet ran once on the 4 hosts [20:51:23] but .. [20:51:48] it doesn't work, unlike with all the other clusters and like the regular analytics [20:52:06] yeah [20:52:13] hm, it doesn't work? [20:52:33] nope, same issue i had with analytics [20:52:52] the hosts don't show up in ganglia-web after being switched to use ganglia_new [20:53:10] firewalling maybe? [20:53:30] oh likely, the ganglia holes were probably specific [20:53:34] you will need those ports open [20:53:34] eh, no, i had already checked iptables [20:53:38] to carbon?
[20:53:38] no [20:53:47] it has network VLAN level ACLs [20:53:48] yes, now to carbon [20:53:51] unlike before [20:54:03] that's probably it [20:54:17] ja, anything you want the Analytics VLAN to talk to you have to explicitly ask a network admin to open [20:54:37] (PS1) Dzahn: Revert "analytics_kafka: switch to ganglia_new" [puppet] - https://gerrit.wikimedia.org/r/219464 [20:55:06] (CR) Dzahn: [C: 2] Revert "analytics_kafka: switch to ganglia_new" [puppet] - https://gerrit.wikimedia.org/r/219464 (owner: Dzahn) [20:56:07] hold on, the icinga checks should recover after that [20:56:42] mutante: ok, but i do want to change these anyway, we don't want to use monitoring::ganglia anymore [20:57:12] ottomata: ok, but you still want them in ganglia, besides the monitoring [20:57:26] otherwise that would just be an empty cluster [20:57:47] yes, we want the values in ganglia [20:57:50] just not the alerts based on ganglia [20:58:05] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520Kafka%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false [20:58:07] there it is again ^ [20:58:14] see the gap but it continues now [20:58:58] operations, Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1383943 (JAufrecht) [20:59:03] so i need to know the correct port _before_ applying a change [20:59:15] but the change picks the right port for me ..hmm [21:00:42] (PS1) Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - https://gerrit.wikimedia.org/r/219465 [21:01:01] mutante: haha, is the assigned port not static? [21:01:03] how does it get assigned?
[21:01:24] (03CR) 10jenkins-bot: [V: 04-1] Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/219465 (owner: 10Ottomata) [21:01:45] 6operations, 7Wikimedia-log-errors: Memcached error for key "WANCache:v:enwiki:image_redirect:254363f3d14af58bbe12c644ee69ccf7" on server "/var/run/nutcracker/nutcracker.sock:0": A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T102916#1383954 (10aaron) Note that since nutcracker is just a proxy, it coul... [21:02:09] (03PS2) 10Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/219465 [21:02:15] ottomata: something like $base_port + prefix for datacenter + X .. [21:02:18] (03PS3) 10Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/219465 [21:02:29] ah but at least you can calculate it? [21:02:47] yea, trying to find out [21:03:01] (03CR) 10jenkins-bot: [V: 04-1] Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/219465 (owner: 10Ottomata) [21:03:19] $gmond_port = $ganglia_new::configuration::base_port + $id [21:03:23] mutante: hm, i think i'm going to wait to merge that alert change til next week, almost time for me to peace out :) [21:03:38] ottomata: the existing monitoring all recovered [21:03:42] cool [21:03:43] ok [21:03:44] thank you [21:03:44] so yea [21:04:09] same here, i will have to ask for the network gear changes next week [21:04:15] ok ja, i'm out then, have a good weekeeeend. aye :) [21:04:24] you too, cya ! 
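Editor's sketch of the port calculation quoted above ($gmond_port = base_port + $id), for working out which port to request in a VLAN ACL before applying the change. The base port and the cluster ids below are made-up placeholders; the real values live in ganglia_new::configuration and are not shown in this log.

```python
# Hypothetical values for illustration only; the real base port and ids
# come from ganglia_new::configuration in the puppet repository.
BASE_PORT = 9000
CLUSTER_IDS = {
    "example_cluster_a": 20,
    "example_cluster_b": 21,
}


def gmond_port(cluster_id, base_port=BASE_PORT):
    """Mirror of the quoted puppet expression: base_port + id."""
    return base_port + cluster_id


for name, cluster_id in sorted(CLUSTER_IDS.items()):
    print(name, gmond_port(cluster_id))
```

Because the port is a pure function of the id, it can be computed ahead of time and handed to whoever maintains the network ACLs.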
[21:17:56] 6operations, 7Mail: add kfrancis to legal-tm-vio mail alias - https://phabricator.wikimedia.org/T103029#1383982 (10Dzahn) [21:20:06] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.B9l7SsRk/mnt is not accessible: Permission denied [21:21:01] (03PS1) 10BBlack: Mobile redirects for non-canonical article URLs [puppet] - 10https://gerrit.wikimedia.org/r/219471 [21:21:56] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [21:22:08] RECOVERY - puppet last run on es2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:22:45] 6operations, 7Mail: add kfrancis to legal-tm-vio mail alias - https://phabricator.wikimedia.org/T103029#1383995 (10Dzahn) Hi, done, based on: https://wikimediafoundation.org/wiki/User:KFrancis_%28WMF%29 and the Staff and contractors page. before: -legal-tm-vio: slaporte, ywelinder, rstallman, mbrar, jroger... [21:23:01] 6operations, 7Mail: add kfrancis to legal-tm-vio mail alias - https://phabricator.wikimedia.org/T103029#1383996 (10Dzahn) 5Open>3Resolved [21:23:26] RECOVERY - puppet last run on es2002 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [21:27:29] (03CR) 10MaxSem: [C: 031] Mobile redirects for non-canonical article URLs [puppet] - 10https://gerrit.wikimedia.org/r/219471 (owner: 10BBlack) [21:29:32] (03PS2) 10BBlack: Mobile redirects for non-canonical article URLs [puppet] - 10https://gerrit.wikimedia.org/r/219471 (https://phabricator.wikimedia.org/T103158) [21:29:43] ^ just fixed commitmsg for bug ref [21:30:40] (03CR) 10BBlack: [C: 032] Mobile redirects for non-canonical article URLs [puppet] - 10https://gerrit.wikimedia.org/r/219471 (https://phabricator.wikimedia.org/T103158) (owner: 10BBlack) [21:36:37] PROBLEM - puppet last run on cp1052 is CRITICAL Puppet has 1 failures [21:37:01] !log upgraded cassandra on 1003 to 2.1.7 (pre-release, likely going out on Monday) [21:37:05] Logged the message, Master [21:38:47] PROBLEM - puppet last 
run on cp1067 is CRITICAL Puppet has 1 failures [21:39:36] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 1 failures [21:39:47] PROBLEM - puppet last run on cp2016 is CRITICAL Puppet has 1 failures [21:39:56] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures [21:39:56] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 1 failures [21:40:08] PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 1 failures [21:40:17] PROBLEM - puppet last run on cp1054 is CRITICAL Puppet has 1 failures [21:40:56] PROBLEM - puppet last run on cp3041 is CRITICAL Puppet has 1 failures [21:40:56] PROBLEM - puppet last run on cp3013 is CRITICAL Puppet has 1 failures [21:41:07] PROBLEM - puppet last run on cp4009 is CRITICAL Puppet has 1 failures [21:41:08] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 1 failures [21:41:17] blag [21:41:26] PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 1 failures [21:41:37] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [21:41:47] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures [21:41:47] bblag [21:41:50] :P [21:41:54] that was a typo'd blarg [21:42:06] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 1 failures [21:42:07] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 1 failures [21:42:27] PROBLEM - puppet last run on cp2023 is CRITICAL Puppet has 1 failures [21:42:38] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures [21:42:57] PROBLEM - puppet last run on cp2010 is CRITICAL Puppet has 1 failures [21:42:57] PROBLEM - puppet last run on cp3006 is CRITICAL Puppet has 1 failures [21:42:57] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [21:43:06] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures [21:43:06] PROBLEM - puppet last run on cp3012 is CRITICAL Puppet has 1 failures [21:43:36] PROBLEM - puppet last run on cp4017 is 
CRITICAL Puppet has 1 failures [21:43:37] PROBLEM - puppet last run on cp2019 is CRITICAL Puppet has 1 failures [21:43:54] (03PS1) 10BBlack: Bugfix for d22e8bc6 [puppet] - 10https://gerrit.wikimedia.org/r/219478 [21:43:56] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 1 failures [21:43:57] PROBLEM - puppet last run on cp3003 is CRITICAL Puppet has 1 failures [21:44:06] PROBLEM - puppet last run on cp4018 is CRITICAL Puppet has 1 failures [21:44:07] PROBLEM - puppet last run on cp3004 is CRITICAL Puppet has 1 failures [21:44:11] (03CR) 10BBlack: [C: 032 V: 032] Bugfix for d22e8bc6 [puppet] - 10https://gerrit.wikimedia.org/r/219478 (owner: 10BBlack) [21:44:17] PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 1 failures [21:44:38] PROBLEM - puppet last run on cp2007 is CRITICAL Puppet has 1 failures [21:44:46] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7926 bytes in 5.122 second response time [21:44:47] PROBLEM - puppet last run on cp3009 is CRITICAL Puppet has 1 failures [21:44:47] PROBLEM - puppet last run on cp3031 is CRITICAL Puppet has 1 failures [21:44:57] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 1 failures [21:45:07] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures [21:45:27] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures [21:45:37] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures [21:47:06] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:47:16] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:47:26] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:48:17] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:48:17] RECOVERY - puppet last run on cp3006 is 
OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [21:48:18] isn't it cool how the set of puppetfails covers cp[1234] now instead of just cp[134] though? [21:48:26] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:48:37] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:48:46] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:07] RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:16] RECOVERY - puppet last run on cp1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:16] RECOVERY - puppet last run on cp1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:16] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [21:49:17] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [21:49:27] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:27] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:38] RECOVERY - puppet last run on cp1067 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:49:38] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:49:48] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:48] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:50:16] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds 
[21:50:16] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [21:50:38] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:50:38] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:50:46] RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:50:57] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:50:58] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:51:37] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:51:49] 6operations, 6Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1384106 (10csteipp) Yeah, should be fine. 
[21:51:56] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7928 bytes in 2.853 second response time [21:51:56] RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [21:51:56] RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [21:51:56] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:51:57] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:52:06] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [21:52:07] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:52:08] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:52:27] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:53:07] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:53:25] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1384112 (10fgiunchedi) [21:53:38] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1384114 (10fgiunchedi) a:5fgiunchedi>3Cmjohnson [21:54:07] RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:54:16] RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [22:02:57] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [22:03:34] 6operations, 10Wikimedia-Bugzilla: redirect old-bugzilla to 
static-bugzilla - https://phabricator.wikimedia.org/T103190#1384131 (10Dzahn) 3NEW [22:04:21] 6operations: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384141 (10Niedzielski) 3NEW [22:04:22] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1384138 (10Dzahn) [22:04:41] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1182562 (10Dzahn) [22:05:12] (03PS1) 10Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [22:05:23] 6operations, 10Wikimedia-Bugzilla: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1384131 (10Dzahn) [22:05:43] 6operations: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384153 (10hashar) [22:05:54] (03PS3) 10Dzahn: switch old-bugzilla to apache cluster [dns] - 10https://gerrit.wikimedia.org/r/216736 (https://phabricator.wikimedia.org/T103190) (owner: 10John F. Lewis) [22:05:56] (03CR) 10jenkins-bot: [V: 04-1] WIP: Setup a node pool from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 (owner: 10Rush) [22:06:02] 6operations, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, 6Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1384154 (10faidon) Our CA for production is GlobalSign. It is one of the big (in terms of websites using it) and oldest CA... 
[22:06:29] (03PS1) 10Rush: WIP: lvs 'text' and 'text-https' for etcd [puppet] - 10https://gerrit.wikimedia.org/r/219482 [22:06:36] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7926 bytes in 8.135 second response time [22:06:56] (03PS6) 10Dzahn: redirect old- to static-bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/216734 (https://phabricator.wikimedia.org/T103190) (owner: 10John F. Lewis) [22:07:58] 10Ops-Access-Requests, 6operations: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384159 (10hashar) That will let @Niedzielski get access to the Jenkins configuration which rely on users being in the `wmf` LDAP group :-} [22:08:03] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1384161 (10fgiunchedi) @cmjohnson I'd like to test a theory re: sdb, can you swap sda and sdb? I'd like to see if the error moves too ``` [14908.351693] ata2.00: exception Emask 0x0 SAct 0x... [22:08:26] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.02USaYFV/mnt/tmp/ccache is not accessible: Permission denied [22:09:21] 6operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1384162 (10ori) [22:10:01] hashar: ^ (nodepool disk) [22:10:05] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384164 (10Krenair) [22:10:08] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [22:10:49] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384141 (10Krenair) I don't think you need the 'nda' group, just the 'wmf' one? I think 'nda' is only really for people who signed the volunteer NDA... 
[22:12:31] (03PS2) 10Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [22:13:53] (03PS3) 10Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [22:14:36] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384172 (10Niedzielski) Ok, let's try wmf then. Thanks! //The first rule of NDA is: you do not talk about NDA.// [22:17:23] 6operations, 10Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384181 (10Dzahn) 3NEW [22:17:56] 6operations, 10Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384191 (10Dzahn) a:3Dzahn [22:18:18] 6operations, 10Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384181 (10Dzahn) [22:18:21] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1384194 (10Dzahn) [22:20:41] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384199 (10hashar) [22:21:56] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384200 (10Krenair) [22:22:49] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384203 (10Krenair) [22:23:01] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - 
https://phabricator.wikimedia.org/T103192#1384204 (10hashar) p:5Triage>3Normal Thanks for the task! Lets wait for {T103191}. Once confirmed, you can be added via https:/... [22:24:17] 6operations, 10Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384222 (10Dzahn) [22:24:47] PROBLEM - Cassanda CQL query interface on restbase1009 is CRITICAL: Connection refused [22:26:34] Anyone available to help with a simple file permissions thing on lutetium (in the fundraising cluster)? [22:27:02] I need chmod -R g+w(s) in /srv/org.wikimedia.civicrm ... [22:28:36] PROBLEM - Cassandra database on restbase1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [22:29:53] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1384229 (10GWicke) > These pertain to MobileWeb, not MobileApps from what I can tell. My understanding is that they are relevant to both... [22:30:19] the 1009 alert is expected, no need to worry [22:34:30] YuviPanda: ok, thanks; IIRC they're not important things but I don't remember. [22:34:39] Nemo_bis: :) [22:34:43] not important at all then [22:40:49] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1384252 (10Niedzielski) 5Resolved>3Open [22:41:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1253680 (10Niedzielski) Reopened issue as I still can't access stat1002 / stat1003. 
[22:41:18] 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1384256 (10GWicke) p:5Low>3Normal [22:42:22] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384258 (10Dzahn) [22:43:18] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384141 (10Dzahn) refs: https://wikimediafoundation.org/wiki/User:SNiedzielski_%28WMF%29 https://wikimediafoundation.org/w/index.php?title=Staff_and_con... [22:44:15] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: configure less aggressive cassandra log rotation / send cassandra logs to logstash - https://phabricator.wikimedia.org/T100970#1384267 (10GWicke) > The only reason that I've leaned toward TCP here, is that our Cassandra nodes are quite a bit chattier than... [22:44:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: configure less aggressive cassandra log rotation / send cassandra logs to logstash - https://phabricator.wikimedia.org/T100970#1384270 (10GWicke) [22:50:33] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384280 (10Dzahn) caveat here: Sniedzielski is the WMF account: mail: sniedzielski@wikimedia cn: Sniedzielski uidNumber: 12119 Niedzielski is the priv... 
[22:51:37] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384283 (Dzahn)
[22:51:41] Ops-Access-Requests, operations, LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384281 (Dzahn) Open>Resolved a:Dzahn
[23:11:43] Ops-Access-Requests, operations, Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1384322 (Dzahn) @Niedzielski Ohh, hi! Sorry, i didn't see this comment until the ticket was reopnened, so you did right. First let me confirm that: - on stat...
[23:18:25] who was going to disable my account?
[23:18:56] operations, Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384345 (Dzahn) checked access logs on silver. yes, wikitech-static tries getting the files: ``` 16:06 wikitech-stat...
[23:20:08] (PS1) Filippo Giunchedi: cassandra: add restbase1009 to cassandra::seeds [puppet] - https://gerrit.wikimedia.org/r/219496 (https://phabricator.wikimedia.org/T102015)
[23:21:15] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1009 to cassandra::seeds [puppet] - https://gerrit.wikimedia.org/r/219496 (https://phabricator.wikimedia.org/T102015) (owner: Filippo Giunchedi)
[23:21:53] operations, Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384349 (Dzahn) on the wikitech-static side: in /srv/imports ``` 0 -rw-r--r-- 1 root root 0 Jun 13 16:43 labswiki-20150...
[23:22:22] operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1384350 (Tgr)
[23:23:27] PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:24:24] operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1367794 (Tgr) @faidon: fair enough (although https://www.ssllabs.com/ssltest/analyze.html?d=en.wikipedia.com claims IE6...
[23:25:07] RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING
[23:30:30] !log starting cassandra bootstrap on restbase1009
[23:30:35] Logged the message, Master
[23:30:45] godog: ^^
[23:30:56] RECOVERY - Cassandra database on restbase1009 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[23:31:14] gwicke: ack
[23:31:39] operations, Deployment-Systems: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1384356 (Mattflaschen) NEW
[23:31:46] failed quickly while streaming from 1006
[23:31:56] I noticed that tin doesn't have access to the same memcached as other servers: https://phabricator.wikimedia.org/T103198
[23:32:43] operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1384366 (BBlack) @Tgr that's because we don't speak the SSLv3 protocol anymore, because of the [[ https://en.wikipedia.o...
[23:32:46] !log upgraded restbase1006 to cassandra 2.1.7
[23:32:50] Logged the message, Master
[23:33:06] operations, Deployment-Systems: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1384376 (EBernhardson) tin config: ``` memcached: auto_eject_hosts: true distribution: ketama hash: md5...
[23:42:10] operations, Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384404 (Dzahn) the import script was running several times: ``` root@wikitech-static:/srv/imports# ps aux | grep import-wikit...
[23:47:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[23:50:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[23:59:27] PROBLEM - puppet last run on mw2060 is CRITICAL puppet fail