[16:21:22] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] "kept the .py file in place" [puppet] - https://gerrit.wikimedia.org/r/219102 (owner: Ori.livneh)
[16:21:33] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0]
[16:22:03] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3031 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[16:22:15] <grrrit-wm>	 (CR) Ottomata: [C: 2] Update jmxtrans module for jmxtrans release v250 [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/219391 (owner: Ottomata)
[16:24:31] <grrrit-wm>	 (PS20) Paladox: Add json, erb and less highlight support to gitblit [puppet] - https://gerrit.wikimedia.org/r/216421
[16:25:01] <grrrit-wm>	 (CR) Paladox: "@Dzahn all tested and works." [puppet] - https://gerrit.wikimedia.org/r/216421 (owner: Paladox)
[16:25:44] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3031 is OK Less than 1.00% above the threshold [0.0]
[16:25:58] <wikibugs>	 operations, Deployment-Systems, HHVM, User-Bd808-Test: Scap should restart HHVM - https://phabricator.wikimedia.org/T103008#1382863 (Joe)
[16:26:34] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1073 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[16:27:17] <_joe_>	 btw,/win 25
[16:27:20] <_joe_>	 argh
[16:27:40] <ori>	 heheh
[16:30:23] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1073 is OK Less than 1.00% above the threshold [0.0]
[16:30:53] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[16:32:44] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0]
[16:33:14] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3031 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[16:34:01] <grrrit-wm>	 (PS4) Alexandros Kosiaris: role::cache: Move inclusion of lvs::configuration from base [puppet] - https://gerrit.wikimedia.org/r/217544
[16:34:03] <grrrit-wm>	 (PS6) Alexandros Kosiaris: lvs::configuration: Kill realm case checks [puppet] - https://gerrit.wikimedia.org/r/217289
[16:35:05] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3031 is OK Less than 1.00% above the threshold [0.0]
[16:35:32] <ori>	 arghhhhh somebody do something about these varnishkafka alerts pleaaaaaaaaase
[16:35:44] <paravoid>	 what ori said.
[16:35:46] <paravoid>	 ottomata: ^
[16:36:46] <bblack>	 I could mass-ack them with "this shit is broken"
[16:37:10] <bblack>	 well downtime not ack I guess, since they flap
[16:38:13] <ori>	 it may be that i'm much more distraction-prone than other people, but it still baffles me sometimes that others don't seem to mind this cognitive litter
[16:38:36] <bblack>	 no we already brought it up once today
[16:38:47] <bblack>	 it makes it hard to notice the real CRITICALs that we actually care about
[16:39:00] <ori>	 the numbers are especially annoying -- a number in an alert commands attention
[16:39:15] <ori>	 but the numbers in the graphite anomaly alerts are almost always goofy -- useless and weirdly specific
[16:39:22] <YuviPanda>	 marktraceur: tgr|away gilles I'm disabling NFS for the multimedia project and bringing it back up at the moment. Let me know if you want to recover some files from there
[16:39:44] <icinga-wm>	 PROBLEM - Translation cache space on mw1099 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:40:04] <ori>	 that alert is going away once puppet runs on neon
[16:40:23] <icinga-wm>	 RECOVERY - Translation cache space on mw1074 is OK: HHVM_TC_SPACE OK TC sizes are OK
[16:40:23] <icinga-wm>	 PROBLEM - Translation cache space on mw1149 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[16:41:04] <paravoid>	 speaking of annoying alerts...
[16:41:08] <paravoid>	 73 Matching Service Entries Displayed
[16:41:26] <paravoid>	 grumble grumble
[16:41:58] <ori>	 should be 73.33333% Matching Service Entries
[16:42:58] <marktraceur>	 YuviPanda: We aren't using it at the moment anyway, but thanks
[16:43:02] <tgr>	 YuviPanda: thanks! AFAIK none of those machines need NFS
[16:43:09] <YuviPanda>	 sweet
[16:43:14] <icinga-wm>	 RECOVERY - Translation cache space on mw1099 is OK: HHVM_TC_SPACE OK TC sizes are OK
[16:43:38] <YuviPanda>	 tgr: marktraceur sweet. do delete unused instances as well :)
[16:45:22] <wikibugs>	 operations, ops-codfw: cp2024 console + disk issues - https://phabricator.wikimedia.org/T103090#1382930 (Papaul) I Called Dell an i was transfer like 4 times just for a replacement disk. And no one told me why I was transfer for. first person say how may I am help you and than when i tell them what i need...
[16:49:04] <ottomata>	 so, am i going to install a kafka cluster in esams then? :p
[16:51:06] <ottomata>	 but ja, yall are right.
[16:51:07] <ottomata>	 hm.
[16:51:55] <bblack>	 the problem isn't a lack of kafka cluster at esams, it's that kafka sucks at using a link with real latency on it
[16:52:00] <bblack>	 or something like that
[16:52:42] <grrrit-wm>	 (CR) Faidon Liambotis: "Thcipriani, how does that fit with RelEng's short/mid/long term deployment system plans & strategy? We haven't heard anything about this s" [puppet] - https://gerrit.wikimedia.org/r/219253 (owner: GWicke)
[16:52:44] <ottomata>	 documented, but that isn't kafka's problem, that is varnishkafka/librdkafka's problem.  too much buffering
[16:52:48] <ottomata>	 bad for caches
[16:53:15] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3030 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[16:53:19] <bblack>	 or too little
[16:53:26] <bblack>	 or just bad protocol design?
[16:53:29] <bblack>	 who knows really
[16:53:39] <ottomata>	 yah, it may be tunable.  have tried and failed (a long time ago)
[16:54:00] <paravoid>	 fwiw, I've been seeing intermittent cp10xx varnishkafka alerts too
[16:54:00] <ottomata>	 but i have also been told that:  A.  esams latency should be better.  and B.  Kafka should not be done cross DC
[16:54:03] <paravoid>	 this week for example
[16:54:04] <ottomata>	 yeah i see that too
[16:54:10] <moritzm>	 !log updated/rebooted nescio/maerlant to 3.19
[16:54:12] <bblack>	 all I know is we have tons of live prod-affecting traffic traversing those links just fine.  the network is A-OK, but varnishkafka keeps bitching like the network is broke
[16:54:12] <ottomata>	 which makes me less confident in my attitude :p
[16:54:14] <morebots>	 Logged the message, Master
[16:54:22] <bblack>	 clearly, it's not the network that is broke :)
[16:54:40] <paravoid>	 e.g. now, cp1049 & cp1051
[16:54:51] <paravoid>	 WARNING: 11.11% of data above the warning threshold [0.0] 
[16:54:58] <paravoid>	 (whatever that means)
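A possible reading of the "11.11% of data above the threshold" wording, sketched under the assumption that the check counts how many datapoints in its window exceed the threshold and that the window holds nine datapoints. The function name `percent_above` and the sample values are hypothetical, not taken from the actual check.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    # Graphite can return nulls for missing samples; ignore them here (assumption).
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


# Hypothetical window of nine samples with a single spike above 20000.
window = [0, 0, 0, 0, 25000, 0, 0, 0, 0]
print(round(percent_above(window, 20000.0), 2))
```

Under these assumptions one spike in nine samples gives 11.11% and two give 22.22%, which would match the values the bot reports. With a warning threshold of 0.0, any single non-zero sample would be enough to warn.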
[16:55:13] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3030 is OK Less than 1.00% above the threshold [0.0]
[16:55:23] <bblack>	 A. esams latency has a floor that we can't be too far from, speed of light through glass and all that
[16:55:47] <paravoid>	 ori: what's with the TC cache alerts?
[16:56:03] <bblack>	 I don't know about B - is it true that kafka was implicitly designed to not be used across a high-latency link at all?
[16:56:30] <ori>	 paravoid: https://gerrit.wikimedia.org/r/#/c/219102/
[16:56:48] <ori>	 they should go away once the change propagates to neon
[16:56:48] <ottomata>	 hm, i think something is wrong with the alerting
[16:56:50] <paravoid>	 ori: I thought you were restarting HHVM everytime these reached 100%?
[16:56:55] <ori>	 paravoid: no
[16:56:59] <ottomata>	 the drerr count for cp1049 has not increased in a long time
[16:57:06] <ottomata>	 which means 0 drerr rate
[16:57:12] <ottomata>	 it is using check graphite threshold
[16:57:13] <ottomata>	 hm
[16:57:34] <ottomata>	 been 0 drerrs on cp1049 since june 5
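The point that an unchanged drerr count means a zero drerr rate can be sketched as follows, assuming drerr is a cumulative counter sampled at a fixed interval. The helper `per_interval_rate` and the sample values are hypothetical; clamping negative deltas to zero is one assumed way to handle a counter reset.

```python
def per_interval_rate(counter_samples):
    """Turn cumulative counter samples into per-interval increases."""
    # A negative delta would mean the counter was reset; clamp it to 0 (assumption).
    return [max(later - earlier, 0)
            for earlier, later in zip(counter_samples, counter_samples[1:])]


# A counter that has not moved yields a rate of zero in every interval.
print(per_interval_rate([5, 5, 5, 5]))
# A counter that moves yields the size of each increase.
print(per_interval_rate([5, 7, 7, 12]))
```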
[16:57:37] <ori>	 paravoid: i did a couple of times, but i shouldn't have; it was a reaction to the alert spamming the channel
[16:57:47] <grrrit-wm>	 (CR) Faidon Liambotis: "Don't forget to salt rm /usr/local/bin/check_tc_space if you haven't already." [puppet] - https://gerrit.wikimedia.org/r/219102 (owner: Ori.livneh)
[16:58:10] <ori>	 ah will do
[16:58:15] <paravoid>	 ori: so what happens once it fills?
[16:58:20] <paravoid>	 garbage collects by itself?
[16:58:24] <ori>	 HHVM SIGABRTs and restarts
[16:58:27] <paravoid>	 lol how nice
[16:58:41] <ori>	 it's gross but we'll fix it with https://phabricator.wikimedia.org/T103008
[16:58:51] <paravoid>	 "fix" it
[16:59:08] <paravoid>	 has this been raised upstream?
[16:59:18] <ori>	 yes
[16:59:52] <paravoid>	 I mean, this plan could work for us, but certainly won't for their broader "open source" / "make it for everyone" strategy
[16:59:53] <grrrit-wm>	 (PS5) Alexandros Kosiaris: lvs: Move the role manifests into the role module [puppet] - https://gerrit.wikimedia.org/r/217288
[16:59:55] <ori>	 but their take on it is to work hard to make repoauth easier for everyone
[17:00:22] <ori>	 e.g. https://github.com/facebook/hhvm/commit/4bbae3bab9aae9647588637af5518f37f4091fc4
[17:00:24] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2 V: 2] lvs: Move the role manifests into the role module [puppet] - https://gerrit.wikimedia.org/r/217288 (owner: Alexandros Kosiaris)
[17:00:27] <paravoid>	 how would repoauth fix this?
[17:00:48] <paravoid>	 if that's their plan, it's a pretty bad plan
[17:00:50] <ori>	 with RepoAuth there is no translation at run time
[17:01:21] <ori>	 it is all done ahead of time
[17:01:25] <paravoid>	 (if they really want to sell this to people running their own wordpress or mediawiki or something)
[17:01:45] <ori>	 i think they'd like that but they're focusing on the big sites
[17:01:53] <grrrit-wm>	 (PS6) Dzahn: WIP: switch misc cluster to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/217214
[17:01:53] <ori>	 dailymotion.com is migrating now
[17:01:57] <wikibugs>	 operations, Wikimedia-Site-requests, Community-consensus-needed: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1382975 (Glaisher) Why is this marked as #community-consensus-needed? As long as the existing preferences for users are not change...
[17:02:13] <paravoid>	 I know, I sort of helped to plan the idea :P
[17:02:16] <ori>	 server on the left is with hhvm: http://cl.ly/image/0b0O0U1f2z2u
[17:02:21] <ori>	 paravoid: you did?
[17:02:45] <grrrit-wm>	 (PS2) Dzahn: icinga: give 20after4 permissions in cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830)
[17:03:49] <ottomata>	 hm 1073 actually had drerrs
[17:05:09] <grrrit-wm>	 (PS4) Alexandros Kosiaris: Lint lvs::monitor [puppet] - https://gerrit.wikimedia.org/r/217545
[17:05:14] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2 V: 2] Lint lvs::monitor [puppet] - https://gerrit.wikimedia.org/r/217545 (owner: Alexandros Kosiaris)
[17:09:13] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1072 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:09:32] <grrrit-wm>	 (PS7) Dzahn: switch misc cluster to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/217214
[17:09:39] <grrrit-wm>	 (PS8) Dzahn: switch misc cluster to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/217214
[17:09:43] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1068 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:11:21] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] lvs::balancer: Remove old absent system::role resource [puppet] - https://gerrit.wikimedia.org/r/217546 (owner: Alexandros Kosiaris)
[17:11:25] <grrrit-wm>	 (PS4) Alexandros Kosiaris: lvs::balancer: Remove old absent system::role resource [puppet] - https://gerrit.wikimedia.org/r/217546
[17:11:30] <grrrit-wm>	 (CR) Alexandros Kosiaris: [V: 2] lvs::balancer: Remove old absent system::role resource [puppet] - https://gerrit.wikimedia.org/r/217546 (owner: Alexandros Kosiaris)
[17:11:58] <ori>	 !log salt -t30 -G 'php:hhvm' cmd.run 'rm -f /usr/local/bin/check_tc_space' (https://gerrit.wikimedia.org/r/#/c/219102/)
[17:12:03] <morebots>	 Logged the message, Master
[17:12:31] <grrrit-wm>	 (PS1) Ottomata: Increase critical threshold of varnishkafka drerr alert [puppet] - https://gerrit.wikimedia.org/r/219399
[17:12:45] <paravoid>	 ottomata: WARNINGs are also quite annoying
[17:12:48] <grrrit-wm>	 (PS2) Ottomata: Increase critical threshold of varnishkafka drerr alert [puppet] - https://gerrit.wikimedia.org/r/219399
[17:12:54] <ottomata>	 do warnings ping here?
[17:12:55] <ottomata>	 paravoid?
[17:12:56] <paravoid>	 they're not echoed here but I regularly watch icinga's page
[17:13:03] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1072 is OK Less than 1.00% above the threshold [0.0]
[17:13:16] <paravoid>	 if they are actually a problem to warn about, then we should fix that problem, right?
[17:13:32] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1068 is OK Less than 1.00% above the threshold [0.0]
[17:13:34] <paravoid>	 if it's actually a warning, it should be actionable
[17:13:37] <ottomata>	 yes. but to do so would require very much time and possibly new hardware, and it is not a high priority. so they should not happen.
[17:13:39] <ottomata>	 but they do.
[17:13:56] <paravoid>	 new hardware why?
[17:14:11] <ori>	 kafka brokers in esams
[17:14:21] <ottomata>	 aye, and other DCs too, if necessary
[17:14:27] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Lint lvs::balancer [puppet] - https://gerrit.wikimedia.org/r/217547 (owner: Alexandros Kosiaris)
[17:14:32] <grrrit-wm>	 (PS4) Alexandros Kosiaris: Lint lvs::balancer [puppet] - https://gerrit.wikimedia.org/r/217547
[17:14:32] <paravoid>	 no, we already established there are similar warnings for eqiad caches as well
[17:14:44] <grrrit-wm>	 (CR) Alexandros Kosiaris: [V: 2] Lint lvs::balancer [puppet] - https://gerrit.wikimedia.org/r/217547 (owner: Alexandros Kosiaris)
[17:14:52] <paravoid>	 right now I see cp1056, cp1068, cp3009, cp3040
[17:15:04] <paravoid>	 two warnings, two crits
[17:15:18] <ottomata>	 yes, but that doesn't mean we don't need remote kafka clusters.  i think there may be multiple issues here, not sure.  
[17:15:25] <ottomata>	 that's why i said 'possibly' :)
[17:15:53] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3040 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:16:29] <ottomata>	 hm, paravoid lemme poke around, i might be able to make a smarter alert, one that alerts on produced data rather than delivery errors
[17:16:38] <paravoid>	 ok
[17:16:42] <ottomata>	 well, that isn't really smarter, but a work around.  we want to know if there is a serious problem at the moment
[17:17:01] <ottomata>	 and short bursts of dropped messages isn't a huge issue.  it is not good and i should solve it, but there are other things
[17:17:17] <ottomata>	 so, if i can alert on say, produce rate dropping to 0, then that would be good enough for now
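The workaround described here, alerting when the produce rate drops to zero rather than on bursts of delivery errors, could look roughly like the sketch below. It assumes the produce rate is available as a list of recent per-interval samples; the name `produce_stalled` and the `min_consecutive` parameter are hypothetical and not part of any existing check.

```python
def produce_stalled(rates, min_consecutive=3):
    """Return True if the most recent min_consecutive samples are all zero."""
    # Too little data to decide: do not alert.
    if len(rates) < min_consecutive:
        return False
    return all(rate == 0 for rate in rates[-min_consecutive:])


# A sustained drop to zero would alert; a short dip that recovers would not.
print(produce_stalled([120, 98, 0, 0, 0]))
print(produce_stalled([120, 0, 0, 97]))
```

Requiring several consecutive zero samples is one way to ignore the short bursts that are described above as tolerable while still catching a producer that has stopped entirely.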
[17:17:34] <paravoid>	 yes, a hundred alerts alerting us that we need "very much time and possibly new hardware that is not a high priority" isn't great :P
[17:18:02] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1073 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:18:07] <paravoid>	 and honestly, I don't buy this whole "kafka over WAN is not supported", I don't see why kafka has to be latency sensitive
[17:18:12] <paravoid>	 and we have evidence to the contrary, ^^^
[17:18:18] <wikibugs>	 operations, Wikimedia-Site-requests, Community-consensus-needed: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1383024 (tomasz) Most likely a relict of the past since this request was never rejected due to insufficient community consensus bu...
[17:18:40] <wikibugs>	 operations, Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1383026 (tomasz)
[17:19:10] <ottomata>	 paravoid: i'm just going on what the creators and likely largest maintainers of kafka recommend there
[17:19:30] <ottomata>	 the producers should not be the buffer
[17:19:35] <ottomata>	 the brokers are meant to be the buffer of messages
[17:19:43] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3040 is OK Less than 1.00% above the threshold [0.0]
[17:19:44] <ottomata>	 so, high latency production means producers have to buffer
[17:20:02] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1073 is OK Less than 1.00% above the threshold [0.0]
[17:20:36] <grrrit-wm>	 (PS10) Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975)
[17:21:41] <grrrit-wm>	 (PS9) Dzahn: switch misc cluster to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/217214
[17:21:43] <paravoid>	 well for 70ms or so, sure
[17:23:14] <grrrit-wm>	 (CR) Dzahn: [C: 2] switch misc cluster to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/217214 (owner: Dzahn)
[17:24:20] <godog>	 !log install linux 3.19 on restbase100[789]
[17:24:24] <morebots>	 Logged the message, Master
[17:25:36] <YuviPanda>	 kart_: I'm moving the language project off NFS. Let me know if / when you want it back on. Thanks
[17:25:41] <YuviPanda>	 mutante: wikistats should be fine now
[17:25:49] <mutante>	 YuviPanda: cool, thanks!
[17:26:16] <mutante>	 YuviPanda: confirmed, logged in:)
[17:26:41] <YuviPanda>	 mutante: no NFS in mount output?
[17:26:52] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 22.22% of data above the critical threshold [20000.0]
[17:27:16] <mutante>	 YuviPanda: nope
[17:27:21] <YuviPanda>	 mutante: wonderful.
[17:27:33] <YuviPanda>	 mutante: I can get you your files next week if that's ok?
[17:27:39] <mutante>	 YuviPanda: yea, it is
[17:27:45] <YuviPanda>	 mutante: do remind me
[17:27:46] <YuviPanda>	 thanks
[17:27:53] <mutante>	 alright, will do
[17:27:55] <Mjbmr>	 can't handle the cables?
[17:28:38] <mutante>	 ?
[17:33:14] <grrrit-wm>	 (CR) GWicke: Remove trebuchet setup from restbase config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/219253 (owner: GWicke)
[17:34:12] <grrrit-wm>	 (PS3) Dzahn: icinga: give 20after4 permissions in cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830)
[17:35:07] <grrrit-wm>	 (CR) coren: [C: 1] "We prefer Jaime alive." [puppet] - https://gerrit.wikimedia.org/r/218870 (owner: Jcrespo)
[17:36:28] <grrrit-wm>	 (PS2) Jcrespo: Add jcrespo to the dba nagios contact list [puppet] - https://gerrit.wikimedia.org/r/218870
[17:36:44] <grrrit-wm>	 (CR) Jcrespo: [C: 2] Add jcrespo to the dba nagios contact list [puppet] - https://gerrit.wikimedia.org/r/218870 (owner: Jcrespo)
[17:37:53] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0]
[17:38:10] <grrrit-wm>	 (CR) Dzahn: [C: 2] icinga: give 20after4 permissions in cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830) (owner: Dzahn)
[17:38:34] <grrrit-wm>	 (PS4) Dzahn: icinga: give 20after4 permissions in cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/219299 (https://phabricator.wikimedia.org/T102830)
[17:38:50] <mutante>	 jynus: i merged your change on the master
[17:39:23] <jynus>	 oh, thank you, mutante 
[17:39:45] <jynus>	 I will check neon
[17:40:03] <mutante>	 jynus: cool, it should also do another thing and give 20after4 permissions
[17:41:37] <twentyafterfour>	 should I test?
[17:41:42] <jynus>	 mutante, change applied correctly it seems (both)
[17:42:12] <andrewbogott>	 YuviPanda: looks good!  Thank you!
[17:42:21] <mutante>	 jynus: :)
[17:42:34] <mutante>	 twentyafterfour: wanna try that icinga thing again when you got a minute
[17:44:33] <mutante>	 ottomata: port 9690  file ./aggregators/1041.conf
[17:45:07] <twentyafterfour>	 mutante: I'm off work today but I'll test it real quick before I leave ;)
[17:45:09] <YuviPanda>	 andrewbogott: cool
[17:45:57] <logmsgbot>	 !log krenair Synchronized private/PrivateSettings.php: sync 4a30446e for wikitech cleanup - T102361 (duration: 00m 12s)
[17:46:00] <ottomata>	 danke mutante
[17:46:01] <morebots>	 Logged the message, Master
[17:46:10] <mutante>	 twentyafterfour: i hope it likes usernames starting with numbers:) and have a nice day off
[17:46:19] <mutante>	 ottomata: yw
[17:46:47] <ottomata>	 mutante: what IP?
[17:47:04] <ottomata>	 carbon's?
[17:47:13] <mutante>	 ottomata: yes, carbon
[17:47:14] <ottomata>	 ok
[17:47:15] <ottomata>	 thanks
[17:47:43] <wikibugs>	 operations, RESTBase-Cassandra: don't start cassandra at boot - https://phabricator.wikimedia.org/T103134#1383182 (fgiunchedi) NEW a:fgiunchedi
[17:48:04] <twentyafterfour>	 mutante: works
[17:48:35] <mutante>	 twentyafterfour: :) yay, have a nice weekend then
[17:48:42] <twentyafterfour>	 thanks
[17:49:08] <twentyafterfour>	 I went ahead and scheduled next week's maintenance window in icinga. I take it there isn't a way to do recurring ones?
[17:49:12] <grrrit-wm>	 (PS1) Alex Monk: Get rid of unnecessary WikitechPrivateSettings.php [puppet] - https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361)
[17:49:20] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Get rid of unnecessary WikitechPrivateSettings.php [puppet] - https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: Alex Monk)
[17:49:20] <wikibugs>	 operations, Icinga, Patch-For-Review: create icinga user for Mukunda - https://phabricator.wikimedia.org/T102830#1383195 (Dzahn) a:mmodell>Dzahn
[17:49:34] <wikibugs>	 operations, Icinga, Patch-For-Review: create icinga user for Mukunda - https://phabricator.wikimedia.org/T102830#1383196 (Dzahn) Open>Resolved 10:49 < twentyafterfour> mutante: works
[17:49:35] <grrrit-wm>	 (PS2) Alex Monk: Get rid of unnecessary WikitechPrivateSettings.php [puppet] - https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361)
[17:49:36] <wikibugs>	 operations, RESTBase-Cassandra: don't start cassandra at boot - https://phabricator.wikimedia.org/T103134#1383198 (fgiunchedi) ``` 18:39 +<gwicke> it's potentially data loss dangerous 18:40  <godog> gwicke: even on a fully bootstrapped node? 18:40 +<gwicke> yes, especially on a fully bootstrapped node 18:...
[17:49:48] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3030 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:50:07] <wikibugs>	 operations, Icinga: create icinga user for Mukunda - https://phabricator.wikimedia.org/T102830#1383199 (Dzahn)
[17:51:23] <mutante>	 ottomata: conf1001 already shows up
[17:51:38] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3030 is OK Less than 1.00% above the threshold [0.0]
[17:51:38] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:51:39] <mutante>	 in analytics cluster
[17:52:19] <ottomata>	 in analytics?!
[17:52:48] <wikibugs>	 operations: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#1383206 (fgiunchedi) +1, thanks @valhallasw. I think the only problematic bot might be logmsgbot if it is parsing `!log` only from privmsg and not notice as well (or some solution of course)
[17:55:10] <wikibugs>	 operations, RESTBase-Cassandra, Services, Patch-For-Review, RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1383222 (fgiunchedi) we've seen more `sdb` errors on `restbase1008` too, but nothing on `sda` so far. To rule out further things like...
[17:55:21] <mutante>	 ottomata: yea, for some reason it's in the analytics cluster
[17:55:27] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0]
[17:58:36] <wikibugs>	 operations, Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1383245 (Whatamidoing-WMF) > was never rejected due to insufficient community consensus but due to very specific technical reasons.  "Very specific technical rea...
[18:01:44] <grrrit-wm>	 (PS1) Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408
[18:01:49] <ottomata>	 paravoid: ^ ?
[18:01:56] <ottomata>	 oops
[18:02:03] <ottomata>	 didn't mean for jmxtrans to be in there, one sec
[18:02:27] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408 (owner: Ottomata)
[18:02:46] <grrrit-wm>	 (PS2) Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408
[18:03:06] <grrrit-wm>	 (PS3) Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408
[18:03:13] <grrrit-wm>	 (Abandoned) Andrew Bogott: Exclude labs private IPs from dmz_cidr. [puppet] - https://gerrit.wikimedia.org/r/210720 (owner: Andrew Bogott)
[18:04:03] <grrrit-wm>	 (CR) Filippo Giunchedi: "I don't think we plan to run racktables on HHVM, in this case I don't see a reason to start depending on mod_php specifically either. Also" [puppet] - https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) (owner: Filippo Giunchedi)
[18:04:22] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: install openjdk-7-jdk [puppet/cassandra] - https://gerrit.wikimedia.org/r/219296 (https://phabricator.wikimedia.org/T102996) (owner: Filippo Giunchedi)
[18:06:34] <grrrit-wm>	 (CR) Andrew Bogott: [C: 1] "This needs to be preceded by a mw-config patch, right? So we don't refer to this no-longer-puppetized file?" [puppet] - https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: Alex Monk)
[18:08:53] <YuviPanda>	 marktraceur: https://phabricator.wikimedia.org/T103137
[18:09:31] <grrrit-wm>	 (PS4) Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408
[18:09:33] <marktraceur>	 YuviPanda: orgcharts is dead anyway
[18:09:41] <YuviPanda>	 marktraceur: shall I delete the project?
[18:09:48] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1074 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:09:51] <YuviPanda>	 marktraceur: do you have the code elsewhere?
[18:10:14] <greg-g>	 :(
[18:10:25] <grrrit-wm>	 (PS1) Filippo Giunchedi: cassandra: update submodule [puppet] - https://gerrit.wikimedia.org/r/219410
[18:11:38] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3031 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:12:02] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: update submodule [puppet] - https://gerrit.wikimedia.org/r/219410 (owner: Filippo Giunchedi)
[18:12:30] <grrrit-wm>	 (PS1) Ottomata: Use class {} instead of include to include classes in eventlogging role [puppet] - https://gerrit.wikimedia.org/r/219412
[18:12:37] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1073 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:13:28] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3031 is OK Less than 1.00% above the threshold [0.0]
[18:13:28] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1074 is OK Less than 1.00% above the threshold [0.0]
[18:13:28] <grrrit-wm>	 (PS2) Ottomata: Use class {} instead of include to include classes in eventlogging role [puppet] - https://gerrit.wikimedia.org/r/219412
[18:14:17] <grrrit-wm>	 (CR) Ottomata: [C: 2] Use class {} instead of include to include classes in eventlogging role [puppet] - https://gerrit.wikimedia.org/r/219412 (owner: Ottomata)
[18:14:41] * YuviPanda pokes sad marktraceur
[18:15:37] <grrrit-wm>	 (CR) Filippo Giunchedi: "thoughts on naming? d-i-console might work just fine or install-console or sth like that" [puppet] - https://gerrit.wikimedia.org/r/217016 (owner: Filippo Giunchedi)
[18:15:51] <grrrit-wm>	 (CR) Alex Monk: "Nope, it was included from PrivateSettings.php (Why? I have no idea.), which I already removed and synced earlier. WikitechPrivateLDAPSett" [puppet] - https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: Alex Monk)
[18:16:08] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1073 is OK Less than 1.00% above the threshold [0.0]
[18:17:10] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] configure additional Cassandra metric alerts [puppet] - https://gerrit.wikimedia.org/r/218408 (https://phabricator.wikimedia.org/T101764) (owner: Eevans)
[18:17:36] <YuviPanda>	 Jamesofur|cloud: hi! do you want the sugarcrm project recovered?
[18:18:54] <grrrit-wm>	 (CR) Filippo Giunchedi: "@Aaron, you reckon restarting jobchron for log rotation at the same time might cause thundering herds or issues like that?" [puppet] - https://gerrit.wikimedia.org/r/218905 (owner: Matanya)
[18:18:58] <mutante>	 Error: /Stage[main]/Mediawiki::Scap/Package[scap]/ensure ... to latest failed: Could not get latest version: 403 Forbidden
[18:19:27] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:19:33] <wikibugs>	 operations, RESTBase-Cassandra, Patch-For-Review: install openjdk-7-jdk on Cassandra nodes - https://phabricator.wikimedia.org/T102996#1383326 (fgiunchedi) Open>Resolved a:fgiunchedi merged
[18:22:06] <wikibugs>	 operations, Labs, Release-Engineering, wikitech.wikimedia.org: silver / scap -  Could not get latest version: 403 Forbidden - https://phabricator.wikimedia.org/T103138#1383336 (Dzahn)
[18:22:07] <grrrit-wm>	 (PS5) Ottomata: Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408
[18:22:14] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Use graphite_anomoly to alert on varnishkafka drrers rather than graphite_threshold [puppet] - https://gerrit.wikimedia.org/r/219408 (owner: Ottomata)
[18:22:38] <grrrit-wm>	 (CR) Aaron Schulz: "I'm not worried. The last patch vastly reduced the CPU of this module for a number of reasons and the deploy/restart went without problem " [puppet] - https://gerrit.wikimedia.org/r/218905 (owner: Matanya)
[18:23:08] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp3041 is OK Less than 1.00% above the threshold [0.0]
[18:23:46] <grrrit-wm>	 (CR) GWicke: Remove trebuchet setup from restbase config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/219253 (owner: GWicke)
[18:25:23] <grrrit-wm>	 (PS1) Ottomata: Fix typo in graphite_anomaly [puppet] - https://gerrit.wikimedia.org/r/219416
[18:25:27] <icinga-wm>	 ACKNOWLEDGEMENT - salt-minion processes on conf1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn @ottomata - setup
[18:25:27] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on conf1002 is CRITICAL Puppet has 1 failures daniel_zahn @ottomata - setup
[18:25:27] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on conf1003 is CRITICAL Puppet has 1 failures daniel_zahn @ottomata - setup
[18:25:37] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Fix typo in graphite_anomaly [puppet] - https://gerrit.wikimedia.org/r/219416 (owner: Ottomata)
[18:26:08] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:26:58] <icinga-wm>	 PROBLEM - puppet last run on cp1047 is CRITICAL puppet fail
[18:27:07] <icinga-wm>	 PROBLEM - puppet last run on cp3012 is CRITICAL puppet fail
[18:27:25] <ottomata>	 my fault ^
[18:27:25] <ottomata>	 fixing.
[18:27:52] <grrrit-wm>	 (CR) Aaron Schulz: [C: 1] jobchron: log rotate [puppet] - https://gerrit.wikimedia.org/r/218905 (owner: Matanya)
[18:27:58] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING
[18:28:28] <icinga-wm>	 PROBLEM - puppet last run on cp1055 is CRITICAL puppet fail
[18:28:47] <icinga-wm>	 PROBLEM - puppet last run on cp2005 is CRITICAL puppet fail
[18:28:57] <icinga-wm>	 PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[18:28:58] <icinga-wm>	 PROBLEM - puppet last run on cp1056 is CRITICAL puppet fail
[18:29:08] <icinga-wm>	 PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail
[18:29:08] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on silver is CRITICAL Puppet has 1 failures daniel_zahn T103138
[18:29:08] <icinga-wm>	 PROBLEM - puppet last run on cp3016 is CRITICAL puppet fail
[18:29:08] <icinga-wm>	 PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail
[18:29:17] <icinga-wm>	 PROBLEM - puppet last run on cp1061 is CRITICAL puppet fail
[18:29:18] <icinga-wm>	 PROBLEM - puppet last run on cp2001 is CRITICAL puppet fail
[18:30:07] <icinga-wm>	 PROBLEM - puppet last run on cp4008 is CRITICAL puppet fail
[18:30:08] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail
[18:30:08] <icinga-wm>	 PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail
[18:30:29] <icinga-wm>	 PROBLEM - puppet last run on cp3037 is CRITICAL puppet fail
[18:31:08] <icinga-wm>	 PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail
[18:31:36] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: -1] "thanks Aaron!" [puppet] - https://gerrit.wikimedia.org/r/218905 (owner: Matanya)
[18:31:54] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL puppet fail
[18:32:05] <icinga-wm>	 PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail
[18:32:34] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL puppet fail
[18:32:42] <godog>	 !log stop cassandra on restbase1008
[18:32:46] <morebots>	 Logged the message, Master
[18:32:48] <grrrit-wm>	 (Abandoned) Ori.livneh: carbon-cache: enable manhole interface [puppet] - https://gerrit.wikimedia.org/r/219226 (owner: Ori.livneh)
[18:32:54] <icinga-wm>	 PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail
[18:33:04] <icinga-wm>	 PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail
[18:33:14] <icinga-wm>	 PROBLEM - puppet last run on cp1071 is CRITICAL puppet fail
[18:33:17] <ori>	 my manhole was not attractive enough for godog
[18:33:35] <grrrit-wm>	 (PS1) Dzahn: labmon1001: move to correct ganglia cluster "virt" [puppet] - https://gerrit.wikimedia.org/r/219418
[18:33:36] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1072 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:33:44] <icinga-wm>	 PROBLEM - puppet last run on cp2014 is CRITICAL puppet fail
[18:34:26] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4001 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:34:32] <grrrit-wm>	 (PS2) Dzahn: labmon1001: move to correct ganglia cluster "virt" [puppet] - https://gerrit.wikimedia.org/r/219418
[18:34:34] <godog>	 ori: was good in theory, in practice I didn't need it
[18:34:34] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4019 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:35:05] <icinga-wm>	 RECOVERY - puppet last run on cp2005 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:35:05] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[18:35:25] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1072 is OK Less than 1.00% above the threshold [0.0]
[18:35:25] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1050 is CRITICAL Anomaly detected: 52 data above and 30 below the confidence bounds
[18:35:26] <icinga-wm>	 RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[18:35:35] <icinga-wm>	 RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[18:35:44] <icinga-wm>	 RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[18:35:44] <icinga-wm>	 RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[18:35:44] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2020 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:35:55] <icinga-wm>	 RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[18:35:55] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4018 is CRITICAL Anomaly detected: 75 data above and 0 below the confidence bounds
[18:36:05] <icinga-wm>	 RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:36:05] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[18:36:05] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3005 is CRITICAL Anomaly detected: 46 data above and 34 below the confidence bounds
[18:36:15] <icinga-wm>	 RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[18:36:15] <icinga-wm>	 RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[18:36:25] <icinga-wm>	 RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:36:25] <icinga-wm>	 RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[18:36:25] <icinga-wm>	 RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:36:25] <icinga-wm>	 PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds
[18:36:35] <icinga-wm>	 RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:36:43] <mutante>	 uhm.. and now the master died on top of that ? sigh
[18:36:46] <icinga-wm>	 RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:36:54] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3040 is CRITICAL Anomaly detected: 64 data above and 34 below the confidence bounds
[18:37:05] <icinga-wm>	 RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[18:37:12] <mutante>	 ottomata: is the puppetmaster thing because salt on * ?
[18:37:15] <icinga-wm>	 RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[18:37:15] <icinga-wm>	 RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:37:24] <icinga-wm>	 RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:37:45] <icinga-wm>	 RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:37:45] <icinga-wm>	 RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[18:37:52] <mutante>	 hmm, as long as it recovers..
[18:37:56] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp1068 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:38:14] <icinga-wm>	 RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 5.411 second response time
[18:38:15] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:38:15] <ottomata>	 haha, clearly the anomaly thing isn't helping?
[18:38:16] <ottomata>	 geez
[18:38:16] <ottomata>	 haha
[18:38:19] <ottomata>	 needs adjusting
[18:38:32] <ottomata>	 or, maybe it is adjusting?
[18:38:32] <wikibugs>	 operations, RESTBase-Cassandra: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1383467 (fgiunchedi)
[18:38:33] <ottomata>	 oof, i dunno
[18:38:39] <ottomata>	 godog: are you familiar with tuning that?
[18:38:50] <wikibugs>	 operations, RESTBase-Cassandra: don't start cassandra at boot or puppet - https://phabricator.wikimedia.org/T103134#1383182 (fgiunchedi) for the same reasons, puppet shouldn't `ensure => 'running'`
[18:38:56] <ori>	 godog: oh i was just making a lewd joke
[18:39:19] <godog>	 ottomata: nope but we can take a look!
[18:39:39] <godog>	 ori: hehe yeah, I wasn't sure how to reply without another lewd joke
[18:39:42] <godog>	 TIL: lewd
[18:40:11] <ottomata>	 godog: http://grafana.wikimedia.org/#/dashboard/db/kafkatest?panelId=5&fullscreen&edit
[18:40:54] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp3041 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[18:40:55] <icinga-wm>	 PROBLEM - Cassandra database on restbase1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[18:41:45] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp1068 is OK Less than 1.00% above the threshold [0.0]
[18:42:30] <ottomata>	 so, godog, i'm looking at hot winters bands for this metric
[18:42:45] <ottomata>	 and it looks to me like the real number is always under both bands during a spike
[18:43:08] <ottomata>	 so i'm not sure why this alert would fire
[18:43:11] <ottomata>	 i'm looking at cp3040
[18:43:55] <grrrit-wm>	 (CR) Dzahn: [C: 2] labmon1001: move to correct ganglia cluster "virt" [puppet] - https://gerrit.wikimedia.org/r/219418 (owner: Dzahn)
[18:45:19] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2017 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:19] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2011 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:19] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1070 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:19] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2023 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:19] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3020 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:20] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3038 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:31] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1059 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:39] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3003 is CRITICAL Anomaly detected: 51 data above and 20 below the confidence bounds
[18:45:39] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4004 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:39] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3036 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:46] <ottomata>	 yeah
[18:45:47] <ottomata>	 hmph
[18:45:57] <jgage>	 heh hot winters
[18:45:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1055 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3008 is CRITICAL Anomaly detected: 55 data above and 19 below the confidence bounds
[18:45:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2004 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3014 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:45:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2010 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:00] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1054 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3004 is CRITICAL Anomaly detected: 45 data above and 29 below the confidence bounds
[18:46:10] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3019 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:10] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2009 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:11] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2016 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:19] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3037 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:22] <ottomata>	 sigh.
[18:46:26] <jgage>	 what is this?
[18:46:30] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1052 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:30] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1053 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:30] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1069 is CRITICAL Anomaly detected: 66 data above and 14 below the confidence bounds
[18:46:30] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2002 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:31] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2003 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:31] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2015 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:31] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3013 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:32] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3022 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:32] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3035 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:46:38] <ottomata>	 OH
[18:46:40] <jgage>	 i'm surprised to see cp1xxx in there, meaning it's not a WAN issue
[18:47:07] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1049 is CRITICAL Anomaly detected: 57 data above and 17 below the confidence bounds
[18:47:07] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3021 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:07] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3018 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:07] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3006 is CRITICAL Anomaly detected: 68 data above and 0 below the confidence bounds
[18:47:07] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3034 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:07] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4020 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1061 is CRITICAL Anomaly detected: 11 data above and 47 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2019 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1057 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4002 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1099 is CRITICAL Anomaly detected: 55 data above and 18 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1074 is CRITICAL Anomaly detected: 64 data above and 34 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1066 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2007 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2006 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2025 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3017 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3033 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2013 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2018 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:23] <grrrit-wm>	 (PS1) Ottomata: Fix over => true for vk drerr anomaly detection [puppet] - https://gerrit.wikimedia.org/r/219423
[18:47:29] <grrrit-wm>	 (PS2) Ottomata: Fix over => true for vk drerr anomaly detection [puppet] - https://gerrit.wikimedia.org/r/219423
[18:47:42] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1065 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:42] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1047 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:51] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1063 is CRITICAL Anomaly detected: 53 data above and 18 below the confidence bounds
[18:47:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1071 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1073 is CRITICAL Anomaly detected: 26 data above and 69 below the confidence bounds
[18:47:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2012 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:47:59] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1046 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1064 is CRITICAL Anomaly detected: 48 data above and 21 below the confidence bounds
[18:48:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1043 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1060 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3032 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3016 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:10] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4011 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:10] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4012 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:17] <grrrit-wm>	 (PS3) Ottomata: Fix over => true for vk drerr anomaly detection [puppet] - https://gerrit.wikimedia.org/r/219423
[18:48:20] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3005 is CRITICAL Anomaly detected: 45 data above and 35 below the confidence bounds
[18:48:20] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3015 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:20] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp4017 is CRITICAL Anomaly detected: 73 data above and 0 below the confidence bounds
[18:48:20] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3044 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:20] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1056 is CRITICAL Anomaly detected: 4 data above and 82 below the confidence bounds
[18:48:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp2005 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:21] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3031 is CRITICAL Anomaly detected: 62 data above and 35 below the confidence bounds
[18:48:22] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3039 is CRITICAL Anomaly detected: 100 data above and 0 below the confidence bounds
[18:48:51] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Fix over => true for vk drerr anomaly detection [puppet] - https://gerrit.wikimedia.org/r/219423 (owner: Ottomata)
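The anomaly alerts above report how many datapoints fell above and below the confidence bounds, and the patch just merged sets `over => true`. A minimal sketch of how such a count could work, assuming the flag means "only points above the upper band count as anomalous" (an inference from the patch title, not confirmed by the log). The function names, the `max_outliers` limit, and the sample data are all hypothetical and are not taken from the real check.

```python
def count_outliers(values, lower, upper):
    """Count datapoints above the upper band and below the lower band."""
    above = sum(1 for v, hi in zip(values, upper) if v > hi)
    below = sum(1 for v, lo in zip(values, lower) if v < lo)
    return above, below


def is_anomalous(values, lower, upper, over=False, max_outliers=5):
    """Hypothetical decision rule.

    With over=True only points above the upper band are considered, which
    suits an error counter where unusually low values are good news.
    """
    above, below = count_outliers(values, lower, upper)
    relevant = above if over else above + below
    return relevant > max_outliers


# Hypothetical data: ten minutes with zero errors, bands sitting at 1..5.
values = [0] * 10
lower = [1] * 10
upper = [5] * 10

above, below = count_outliers(values, lower, upper)
two_sided = is_anomalous(values, lower, upper, over=False)
one_sided = is_anomalous(values, lower, upper, over=True)
```

In this made-up case every point sits below the lower band, so a two-sided rule fires while the one-sided rule stays quiet.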
[18:49:30] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1048 is CRITICAL Anomaly detected: 44 data above and 45 below the confidence bounds
[18:49:51] <icinga-wm>	 RECOVERY - Cassandra database on restbase1008 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[18:51:09] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3030 is CRITICAL Anomaly detected: 55 data above and 43 below the confidence bounds
[18:51:11] <wikibugs>	 operations, Traffic, HTTPS-by-default, Patch-For-Review, Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1383539 (fgiunchedi) FWIW, relevant patch upstream http://archive.linuxvirtualserver.org/html/lvs-devel/2013-06/msg00055.html to support choos...
[18:53:06] <chasemp>	 lotta varnishkafka errors lately, ottomata is that you?
[18:54:00] <icinga-wm>	 PROBLEM - puppet last run on cp2026 is CRITICAL puppet fail
[18:54:00] <icinga-wm>	 PROBLEM - puppet last run on cp3036 is CRITICAL puppet fail
[18:54:29] <icinga-wm>	 PROBLEM - puppet last run on cp2002 is CRITICAL puppet fail
[18:54:40] <icinga-wm>	 PROBLEM - puppet last run on cp2020 is CRITICAL puppet fail
[18:54:41] <ottomata>	 chasemp: yes, the most recent flood was me trying to quiet them a bit
[18:54:49] <icinga-wm>	 PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail
[18:55:00] <chasemp>	 ok sounds good, best of luck then
[18:55:02] <ottomata>	 haha
[18:55:03] <ottomata>	 thanks :/
[18:55:11] <icinga-wm>	 PROBLEM - puppet last run on cp2012 is CRITICAL puppet fail
[18:55:20] <icinga-wm>	 PROBLEM - puppet last run on cp1072 is CRITICAL puppet fail
[18:55:20] <icinga-wm>	 PROBLEM - puppet last run on cp1057 is CRITICAL puppet fail
[18:55:20] <icinga-wm>	 PROBLEM - puppet last run on cp2025 is CRITICAL puppet fail
[18:55:29] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3009 is CRITICAL Anomaly detected: 19 data above and 46 below the confidence bounds
[18:55:44] <MaxSem>	 bblack, re mobile redirects: SHIT HIT FAN!!1 :P https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Wikipedia_on_a_mobile_browser_is_not_showing_the_mobile_version_of_the_page
[18:55:51] <MaxSem>	 jdlrobson, Krinkle ^
[18:56:10] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1050 is CRITICAL Anomaly detected: 32 data above and 46 below the confidence bounds
[18:56:20] <icinga-wm>	 RECOVERY - puppet last run on cp2002 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[18:57:00] <icinga-wm>	 PROBLEM - Cassandra database on restbase1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[18:57:45] <jgage>	 i wish that anomaly monitor's error text was less vague. like if it had units and values instead of just counts.
[18:58:35] <_joe_>	 jgage: If I hadn't written it in athens in 3 days, probably, it would be better
[18:58:38] <_joe_>	 :P
[18:58:47] <jgage>	 :)
[18:58:51] <jgage>	 i remain hopeful for the future
[18:58:56] <jgage>	 which i suppose means i should open a ticket
[18:58:58] <wikibugs>	 operations, RESTBase-Cassandra, Services, Patch-For-Review, RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1383556 (Eevans) Regarding the two non-hardware blockers, (the bootstrap/streaming failures, and metrics reporting):  https://issues.a...
[18:59:00] <icinga-wm>	 RECOVERY - puppet last run on cp2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:59:09] <icinga-wm>	 RECOVERY - puppet last run on cp1057 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures
[18:59:40] <icinga-wm>	 RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:59:52] <godog>	 ottomata: I think the previous checks were ok, but we should be using percentage => 80 or sth like that to make sure most datapoints in the last 10m are over threshold
[19:00:00] <mutante>	 labsdb1001-1003 = MySQL eqiad   labsdb1004 = Misc eqiad
[19:00:09] * mutante finds all these little inconsistencies 
[19:00:27] <jynus>	 mutante, there will be a mysql there soon
[19:00:27] <ottomata>	 ?
[19:00:35] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1056 is OK No anomaly detected
[19:00:52] <mutante>	 jynus: ah, 1004 is not installed yet? that would explain i guess, and "misc" is just default
[19:00:56] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp3041 is OK No anomaly detected
[19:00:59] <jynus>	 the only reason why it is not there yet is that I need coordination with labs, and they are a bit busy
[19:01:05] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1073 is OK No anomaly detected
[19:01:07] <mutante>	 jynus: got it:) thx
[19:01:16] <ottomata>	 oh percentage, hum
[19:01:22] <jynus>	 it will be
[19:01:24] <icinga-wm>	 RECOVERY - puppet last run on cp1072 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[19:01:30] <jynus>	 tools slave or something like that
[19:01:31] <ottomata>	 godog: is that instead of from => '10m'?
[19:01:45] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1061 is OK No anomaly detected
[19:02:03] <godog>	 ottomata: nope, in addition to, I'm looking at graphite_threshold
[19:02:07] <ottomata>	 right
[19:02:13] <godog>	 forgetting anomaly for a second
[19:02:14] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1050 is OK No anomaly detected
[19:02:18] <ottomata>	 maybe these things just need better documentation and examples
[19:02:21] <ottomata>	 it is pretty confusing
[19:02:40] <ottomata>	 from is total num of datapoints to get
[19:02:41] <ottomata>	 or
[19:02:46] <ottomata>	 get all datapoints in last timeperiod
[19:03:14] <icinga-wm>	 RECOVERY - puppet last run on cp2025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:03:25] <ottomata>	 would series make more sense then?  godog, i just want alert on consistent errors
[19:03:27] <ottomata>	 not spikes
[19:03:40] <ottomata>	 like, if the last 10 minutes all had a large number of errors, THEN alert
[19:04:01] <ottomata>	 hm, or series would be more spikey i guess
[19:04:02] <ottomata>	 hm
[19:04:06] <ottomata>	 so, percentage
[19:04:18] <ottomata>	 since logster sends metrics every minute
[19:04:21] <ottomata>	 if i'm looking at 10 minutes
[19:04:24] <icinga-wm>	 RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[19:04:30] <ottomata>	 and default percentage is 1% of datapoints
[19:04:37] <ottomata>	 then any spike at all would cause an alert
[19:05:13] <godog>	 yep
[19:05:27] <mutante>	 andrewbogott: virt1005-1007 are still to be installed i assume, right?
[19:05:42] <mutante>	 eh, re-installed or something
[19:05:45] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp3031 is CRITICAL Anomaly detected: 44 data above and 53 below the confidence bounds
[19:05:47] <ottomata>	 hm, ok.  so 80% would mean that at least 8 of last 10 minutes would have to have had # of errors > threshold
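The percentage reasoning above can be sketched in a few lines, assuming one datapoint per minute over a 10-minute window. The function name `should_alert` and the sample values are hypothetical, and the real check may compare with `>` rather than `>=` or treat missing datapoints differently.

```python
def should_alert(datapoints, threshold, percentage):
    """Return True when at least `percentage` percent of the non-null
    datapoints are above `threshold` (hypothetical rule)."""
    valid = [v for v in datapoints if v is not None]
    if not valid:
        return False
    over = sum(1 for v in valid if v > threshold)
    return 100.0 * over / len(valid) >= percentage


THRESHOLD = 20000.0  # critical threshold shown in the icinga messages

# One bad minute out of ten (a spike) versus eight bad minutes out of ten.
spike = [0, 0, 0, 0, 25000, 0, 0, 0, 0, 0]
sustained = [25000] * 8 + [0, 0]

spike_at_1_percent = should_alert(spike, THRESHOLD, 1)
spike_at_80_percent = should_alert(spike, THRESHOLD, 80)
sustained_at_80_percent = should_alert(sustained, THRESHOLD, 80)
```

With a 1% setting a single spike is enough to alert, while at 80% only the sustained case does.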
[19:06:55] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1072 is CRITICAL Anomaly detected: 54 data above and 44 below the confidence bounds
[19:06:59] <MaxSem>	 Krinkle, the part you asked about is https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-frontend.inc.vcl.erb#L48
[19:07:20] <ottomata>	 sure, 8 minutes is fine.
[19:07:21] <ottomata>	 80%
[19:07:22] <grrrit-wm>	 (PS1) Dzahn: ganglia: set "virt" cluster for all in regex [puppet] - https://gerrit.wikimedia.org/r/219429
[19:07:22] <ottomata>	 hm
[19:07:28] <andrewbogott>	 mutante: probably ripped out and sent back to cisco.
[19:07:35] <andrewbogott>	 Certainly don’t need monitoring
[19:07:49] <wikibugs>	 Puppet, Mobile-Web: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1383579 (Jdlrobson) NEW
[19:08:06] <Krinkle>	 MaxSem: interesting
[19:08:30] <wikibugs>	 Puppet, Mobile-Web: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1383588 (Jdlrobson) Note the redirect for mobile is done in:  https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-frontend.inc.vcl.erb#L48  and I'd rather avoid editing i...
[19:08:49] <grrrit-wm>	 (PS1) Ottomata: Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - https://gerrit.wikimedia.org/r/219430
[19:08:51] <wikibugs>	 operations: graphite1002 - RAID degraded - https://phabricator.wikimedia.org/T103159#1383589 (Dzahn) NEW
[19:08:58] <ottomata>	 godog: https://gerrit.wikimedia.org/r/#/c/219430/
[19:08:59] <ottomata>	 ?
[19:09:24] <icinga-wm>	 RECOVERY - puppet last run on cp2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:09:25] <mutante>	 andrewbogott: oh, it seems the pattern with the "odd" hosts i find is all "they are Cisco", not just in labs
[19:09:40] <grrrit-wm>	 (PS2) Ottomata: Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - https://gerrit.wikimedia.org/r/219430
[19:09:44] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute anomaly on cp1068 is CRITICAL Anomaly detected: 54 data above and 43 below the confidence bounds
[19:09:45] <Krinkle>	 https://www.google.co.uk/search?q=http+666 -> https://en.wikipedia.org/?title=666_(number) 
[19:09:49] <Krinkle>	 it's more and more common
[19:09:53] <Krinkle>	 where do these come from ...
[19:10:17] <andrewbogott>	 mutante: curious!
[19:10:45] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - https://gerrit.wikimedia.org/r/219430 (owner: Ottomata)
[19:11:14] <icinga-wm>	 RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[19:11:22] <grrrit-wm>	 (CR) Ottomata: [C: 2] Switch back to varnishkafka graphite_threshold with percentage check, disable graphite_anomaly [puppet] - https://gerrit.wikimedia.org/r/219430 (owner: Ottomata)
[19:11:34] <MaxSem>	 and why do we have an index.php in webserver root in the first place???
[19:11:46] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp3030 is OK No anomaly detected
[19:12:11] <MaxSem>	 api.php
[19:12:17] <MaxSem>	 err, wrong window
[19:13:33] <Krinkle>	 MaxSem: We don't, it's just rewrite handling
[19:13:43] <MaxSem>	 yup, just found out
[19:13:49] <Krinkle>	 MaxSem: OK. I've got an idea for a fix
[19:13:51] <MaxSem>	 too many ways to screw up
[19:14:15] <YuviPanda>	 Nemo_bis: hi?
[19:14:24] <YuviPanda>	 Nemo_bis: I'm disabling NFS on the ttmserver project and bringing it back up
[19:14:41] <YuviPanda>	 Nemo_bis: you are the only person with files in their homedir
[19:15:35] <icinga-wm>	 PROBLEM - Host labcontrol1001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:15:55] <icinga-wm>	 PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[19:16:54] <mutante>	 heh, new icinga issues show up faster than we can ack them
[19:17:00] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp1048 is OK No anomaly detected
[19:17:03] <mutante>	 the labs stuff is maintenance?
[19:17:19] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds
[19:17:19] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute anomaly on cp3031 is OK No anomaly detected
[19:17:28] <icinga-wm>	 RECOVERY - Host labcontrol1001 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[19:17:44] <wikibugs>	 operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1383626 (Tgr) > Just tell people to trust their OS' CA store, anything else is just insecure.  Telling people to trust t...
[19:17:44] <mutante>	 etherpad too .. wow
[19:17:56] <chasemp>	 mutante: see _security for some of it
[19:17:59] <icinga-wm>	 ACKNOWLEDGEMENT - RAID on graphite1002 is CRITICAL 1 failed LD(s) (Degraded) daniel_zahn https://phabricator.wikimedia.org/T103159
[19:18:08] <icinga-wm>	 RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms
[19:18:28] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7924 bytes in 0.062 second response time
[19:18:40] <mutante>	 chasemp: ok :) thx
[19:22:47] <grrrit-wm>	 (PS2) Dzahn: labcontrol2001,nembus,neptunium: set cluster in hiera [puppet] - https://gerrit.wikimedia.org/r/219079
[19:23:37] <grrrit-wm>	 (PS3) Dzahn: labcontrol2001,nembus,neptunium: set cluster in hiera [puppet] - https://gerrit.wikimedia.org/r/219079
[19:23:52] <grrrit-wm>	 (CR) Dzahn: "FWIW - these were NOT in ganglia before this change" [puppet] - https://gerrit.wikimedia.org/r/219079 (owner: Dzahn)
[19:24:11] <grrrit-wm>	 (CR) Dzahn: [C: 2] "FWIW - these were NOT in ganglia before this change" [puppet] - https://gerrit.wikimedia.org/r/219079 (owner: Dzahn)
[19:24:55] <grrrit-wm>	 (Abandoned) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831) (owner: Dzahn)
[19:25:24] <wikibugs>	 operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1383664 (Dzahn) so all is left here is OTRS it seems
[19:26:12] <cajoel>	 have any of you folks been to puppetconfs?
[19:26:13] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds
[19:26:26] <cajoel>	 byron is interested in going, and wants to ask what people thought
[19:26:32] <cajoel>	 this year is in Oregon
[19:26:37] <byron>	 Thoughts?
[19:28:02] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7936 bytes in 7.988 second response time
[19:28:55] <grrrit-wm>	 (PS2) Dzahn: ganglia: set "virt" cluster for all in regex [puppet] - https://gerrit.wikimedia.org/r/219429
[19:30:13] <icinga-wm>	 PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds
[19:31:46] <grrrit-wm>	 (PS1) Ottomata: Update jmxmodule for jessie package, change ganglia host and port for zookeeper on conf100[123] [puppet] - https://gerrit.wikimedia.org/r/219433
[19:31:53] <icinga-wm>	 RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.412 second response time
[19:34:32] <icinga-wm>	 PROBLEM - puppet last run on zirconium is CRITICAL puppet fail
[19:35:53] <icinga-wm>	 PROBLEM - puppet last run on mw2210 is CRITICAL puppet fail
[19:35:55] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Update jmxmodule for jessie package, change ganglia host and port for zookeeper on conf100[123] [puppet] - https://gerrit.wikimedia.org/r/219433 (owner: Ottomata)
[19:39:52] <icinga-wm>	 RECOVERY - puppet last run on conf1002 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[19:41:00] <icinga-wm>	 RECOVERY - puppet last run on conf1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:41:20] <icinga-wm>	 RECOVERY - puppet last run on zirconium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:41:32] <ottomata>	 phew, thanks godog, icinga cleaner at the moment, we will see if the intermittent alerts stop happening now
[19:42:00] <godog>	 ottomata: dr_err all the things!
[19:42:22] <mutante>	 where did the "zirconium enabled" come from?
[19:43:31] <grrrit-wm>	 (CR) Dzahn: [C: 2] ganglia: set "virt" cluster for all in regex [puppet] - https://gerrit.wikimedia.org/r/219429 (owner: Dzahn)
[19:43:51] <grrrit-wm>	 (PS3) Dzahn: ganglia: set "virt" cluster for all in regex [puppet] - https://gerrit.wikimedia.org/r/219429
[19:44:03] <grrrit-wm>	 (PS4) Dzahn: ganglia: set "virt" cluster for all labs hosts in regex [puppet] - https://gerrit.wikimedia.org/r/219429
[19:47:26] <godog>	 ottomata: did you see legitimate dr_err in the past for actual problems btw? I'm wondering if it isn't bound to fire at all now
[19:50:04] <ottomata>	 godog: they are real drerrs, but
[19:50:14] <ottomata>	 i am not currently interested in short spikes of them
[19:50:27] <ottomata>	 i would like to solve that problem, but it is low priority and kind of hard.
[19:50:41] <ottomata>	 so the intermittent alerts were just annoying and ignored by all
[19:50:44] <ottomata>	 which is not a useful alert
[19:51:10] <godog>	 agreed, did you see persistent drerr too in the past?
[19:51:30] <ottomata>	 oh, i see, um, i can't remember ever seeing any, but they would happen if there were serious network problems, or if kafka brokers had problems
[19:52:35] <ottomata>	 hm, i think i remember occasional periods where all of esams would get a big latency hit for some reason and cause all the VKs there to drerr, but those were still shortish periods, only a few minutes long max
[19:52:42] <ottomata>	 but, that was a while ago
[19:52:47] <ottomata>	 hasn't happened recently
[19:53:00] <icinga-wm>	 PROBLEM - jmxtrans on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar
[19:53:01] <godog>	 kk, well that should alarm only on longer timespans now
[19:53:17] <grrrit-wm>	 (PS4) Dzahn: mysql: set ganglia cluster in hiera, not site.pp [puppet] - https://gerrit.wikimedia.org/r/219074
[19:53:19] <ottomata>	 hopefully, yes.
[19:53:30] <icinga-wm>	 RECOVERY - puppet last run on mw2210 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[19:53:58] <grrrit-wm>	 (CR) Dzahn: [C: 2] mysql: set ganglia cluster in hiera, not site.pp [puppet] - https://gerrit.wikimedia.org/r/219074 (owner: Dzahn)
[19:55:01] <icinga-wm>	 RECOVERY - Cassandra database on restbase1007 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[19:57:50] <icinga-wm>	 RECOVERY - Cassanda CQL query interface on restbase1007 is OK: TCP OK - 0.003 second response time on port 9042
[19:59:00] <icinga-wm>	 PROBLEM - jmxtrans on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar
[20:01:01] <icinga-wm>	 PROBLEM - jmxtrans on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar
[20:02:06] <ottomata>	 OO
[20:02:07] <ottomata>	 interesting.
[20:03:51] <icinga-wm>	 PROBLEM - jmxtrans on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args -jar jmxtrans-all.jar
[20:04:39] <grrrit-wm>	 (PS2) Dzahn: analytics_kafka: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/219075
[20:05:21] <grrrit-wm>	 (CR) Dzahn: "status of ganglia in fundraising is unclear" [puppet] - https://gerrit.wikimedia.org/r/219077 (owner: Dzahn)
[20:07:15] <mutante>	 oh come on
[20:09:28] <Mjbmr>	 did someone comment out all my crontabs on tools-labs?
[20:09:34] <icinga-wm>	 PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:44] <icinga-wm>	 PROBLEM - DPKG on analytics1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:44] <icinga-wm>	 PROBLEM - DPKG on analytics1034 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:55] <icinga-wm>	 PROBLEM - DPKG on analytics1031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:55] <icinga-wm>	 PROBLEM - DPKG on analytics1037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:55] <icinga-wm>	 PROBLEM - DPKG on analytics1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:55] <icinga-wm>	 PROBLEM - DPKG on analytics1040 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:55] <icinga-wm>	 PROBLEM - DPKG on analytics1041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:56] <icinga-wm>	 PROBLEM - DPKG on analytics1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:09:57] <YuviPanda>	 Mjbmr: yes, all crontabs were commented out.
[20:09:58] <ottomata>	 ^ me again, jmxtrans package 
[20:10:15] <icinga-wm>	 PROBLEM - DPKG on analytics1019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:10:19] <Mjbmr>	 YuviPanda: for everyone?
[20:10:24] <icinga-wm>	 PROBLEM - DPKG on analytics1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:10:24] <icinga-wm>	 PROBLEM - DPKG on analytics1036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:10:34] <icinga-wm>	 PROBLEM - DPKG on analytics1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:10:41] <YuviPanda>	 Mjbmr: yes. see https://wikitech.wikimedia.org/wiki/Incident_documentation/20150617-LabsNFSOutage and the linked labs mailing list post
[20:10:45] <icinga-wm>	 PROBLEM - DPKG on analytics1010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:10:55] <icinga-wm>	 PROBLEM - DPKG on analytics1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:10:55] <icinga-wm>	 PROBLEM - DPKG on analytics1033 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:11:05] <icinga-wm>	 PROBLEM - DPKG on analytics1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:11:07] <Mjbmr>	 YuviPanda: how do I know which ones were already commented out?
[20:11:14] <icinga-wm>	 PROBLEM - DPKG on analytics1030 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:11:15] <icinga-wm>	 PROBLEM - DPKG on analytics1035 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:11:15] <icinga-wm>	 RECOVERY - DPKG on analytics1003 is OK: All packages OK
[20:11:25] <YuviPanda>	 Mjbmr: ah, I have a backup I can provide you
[20:11:40] <YuviPanda>	 Mjbmr: can you open a bug with the list of tools you want the crontabs for and I'll link you to them?
[20:11:42] <Bsadowski1>	 what the heck is this about
[20:11:44] <icinga-wm>	 RECOVERY - DPKG on analytics1015 is OK: All packages OK
[20:11:44] <icinga-wm>	 RECOVERY - DPKG on analytics1037 is OK: All packages OK
[20:12:05] <icinga-wm>	 RECOVERY - DPKG on analytics1019 is OK: All packages OK
[20:12:14] <icinga-wm>	 RECOVERY - DPKG on analytics1020 is OK: All packages OK
[20:12:24] <icinga-wm>	 RECOVERY - DPKG on analytics1028 is OK: All packages OK
[20:12:27] <Mjbmr>	 YuviPanda: that's the best you could do? I think I'm gonna find out myself. thanks.
[20:12:36] <icinga-wm>	 RECOVERY - DPKG on analytics1010 is OK: All packages OK
[20:12:36] <YuviPanda>	 you are welcome
[20:12:45] <icinga-wm>	 RECOVERY - DPKG on analytics1029 is OK: All packages OK
[20:12:45] <icinga-wm>	 RECOVERY - DPKG on analytics1033 is OK: All packages OK
[20:12:54] <icinga-wm>	 RECOVERY - DPKG on analytics1026 is OK: All packages OK
[20:12:55] <icinga-wm>	 RECOVERY - DPKG on analytics1030 is OK: All packages OK
[20:13:04] <icinga-wm>	 RECOVERY - DPKG on analytics1035 is OK: All packages OK
[20:13:14] <icinga-wm>	 RECOVERY - DPKG on analytics1027 is OK: All packages OK
[20:13:14] <icinga-wm>	 RECOVERY - DPKG on analytics1034 is OK: All packages OK
[20:13:25] <icinga-wm>	 RECOVERY - DPKG on analytics1031 is OK: All packages OK
[20:13:25] <icinga-wm>	 RECOVERY - DPKG on analytics1040 is OK: All packages OK
[20:13:25] <icinga-wm>	 RECOVERY - DPKG on analytics1041 is OK: All packages OK
[20:13:25] <icinga-wm>	 RECOVERY - DPKG on analytics1039 is OK: All packages OK
[20:13:55] <icinga-wm>	 RECOVERY - DPKG on analytics1036 is OK: All packages OK
[20:15:29] <Mjbmr>	 YuviPanda: newly commented have extra space, that was easy.
[20:16:27] <YuviPanda>	 ah, good to know
[20:19:55] <grrrit-wm>	 (PS1) Ottomata: Fix jmxtrans process check command for newer version of jmxtrans [puppet] - https://gerrit.wikimedia.org/r/219447
[20:20:04] <grrrit-wm>	 (PS2) Ottomata: Fix jmxtrans process check command for newer version of jmxtrans [puppet] - https://gerrit.wikimedia.org/r/219447
[20:20:50] <grrrit-wm>	 (CR) Ottomata: [C: 2] Fix jmxtrans process check command for newer version of jmxtrans [puppet] - https://gerrit.wikimedia.org/r/219447 (owner: Ottomata)
[20:22:17] <icinga-wm>	 RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[20:22:46] <icinga-wm>	 RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[20:22:56] <icinga-wm>	 RECOVERY - jmxtrans on analytics1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[20:23:26] <icinga-wm>	 RECOVERY - jmxtrans on analytics1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
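Editor's sketch: the recoveries above report the check as "regex args -jar.+jmxtrans-all.jar", where the earlier failures reported "args -jar jmxtrans-all.jar". The code below illustrates why a literal args match can stop finding a process while a regex match still does, assuming (this is not confirmed in the log) that the newer package launches the jar with a path in front of the file name. Both command lines and both helper names are hypothetical.

```python
import re


def args_contain(cmdline, needle):
    """Literal match: the argument string must contain `needle` verbatim."""
    return needle in cmdline


def args_match(cmdline, pattern):
    """Regex match: the argument string must contain a match for `pattern`."""
    return re.search(pattern, cmdline) is not None


# Hypothetical command lines; the real ones are not shown in the log.
old_cmdline = "java -jar jmxtrans-all.jar -e -j /etc/jmxtrans"
new_cmdline = "java -server -jar /usr/share/jmxtrans/lib/jmxtrans-all.jar -e"

literal = "-jar jmxtrans-all.jar"
pattern = r"-jar.+jmxtrans-all.jar"

print(args_contain(old_cmdline, literal), args_contain(new_cmdline, literal))
print(args_match(old_cmdline, pattern), args_match(new_cmdline, pattern))
```

Under that assumption, the literal form finds zero processes on the upgraded hosts, while `.+` absorbs whatever sits between `-jar` and the jar name, so the regex form matches both layouts.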
[20:25:59] <mutante>	 ottomata: i'd try switching analytics_kafka to ganglia_new too.. worst case they disappear from ganglia for a while.. i did have issues with the regular analytics cluster though
[20:27:08] <ottomata>	 mutante: i think its fine, the kafka alerts are based on graphite checks anyway
[20:29:18] <mutante>	 ottomata: ok, good
[20:34:05] <grrrit-wm>	 (PS3) Dzahn: analytics_kafka: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/219075
[20:35:03] <grrrit-wm>	 (CR) Dzahn: [C: 2] analytics_kafka: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/219075 (owner: Dzahn)
[20:35:26] <icinga-wm>	 PROBLEM - puppet last run on es2003 is CRITICAL puppet fail
[20:36:37] <icinga-wm>	 PROBLEM - puppet last run on es2002 is CRITICAL puppet fail
[20:37:24] <grrrit-wm>	 (PS1) Jdlrobson: Enable browse prototype on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155)
[20:40:46] <grrrit-wm>	 (CR) Bmansurov: [C: -1] Enable browse prototype on English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: Jdlrobson)
[20:41:54] <grrrit-wm>	 (PS2) Jdlrobson: Enable browse prototype on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155)
[20:43:38] <grrrit-wm>	 (CR) Bmansurov: [C: 1] Enable browse prototype on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: Jdlrobson)
[20:44:12] <jdlrobson>	 w00t
[20:47:47] <wikibugs>	 operations, Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie  and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1383923 (Ottomata)
[20:47:50] <wikibugs>	 operations, Analytics-Cluster: Create jmxtrans Jessie package - https://phabricator.wikimedia.org/T103106#1383921 (Ottomata) Open>Resolved Imported v250 from: http://central.maven.org/maven2/org/jmxtrans/jmxtrans/250/
[20:48:15] <wikibugs>	 operations, Patch-For-Review, discovery-system: Install etcd in multiple rows/racks - https://phabricator.wikimedia.org/T101713#1383924 (Ottomata) jmxtrans package updated and installed.
[20:48:45] <ottomata>	 DOH, I lied mutante
[20:48:56] <ottomata>	 i thought i had done away with monitoring ganglia...i guess not!
[20:49:21] <mutante>	 ottomata: where does that monitoring live?
[20:50:03] <ottomata>	 https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/kafka.pp#L213
[20:50:43] <ottomata>	 it is in graphite though, i'll move them over now
[20:50:50] <ottomata>	 my head is already in this stuff anyway
[20:51:05] <mutante>	 ottomata: ah, i see it in Icinga now, i didnt at first because they are just WARN, not CRIT
[20:51:21] <mutante>	 well, it should recover after puppet ran once on the 4 hosts
[20:51:23] <mutante>	 but ..
[20:51:48] <mutante>	 it doesnt work, unlike with all the other clusters and like the regular analytics
[20:52:06] <ottomata>	 yeah
[20:52:13] <ottomata>	 hm, it doesn't work?
[20:52:33] <mutante>	 nope, same issue i had with analytics
[20:52:52] <mutante>	 the hosts don't show up in ganglia-web after being switched to use ganglia_new
[20:53:10] <mutante>	 firewalling maybe?
[20:53:30] <ottomata>	 oh likely, the ganglia holes were probably specific
[20:53:34] <ottomata>	 you will need those ports open
[20:53:34] <mutante>	 eh, no, i had already checked iptables 
[20:53:38] <ottomata>	 to carbon?
[20:53:38] <ottomata>	 no
[20:53:47] <ottomata>	 it has network VLAN level ACLs
[20:53:48] <mutante>	 yes, now to carbon
[20:53:51] <mutante>	 unlike before
[20:54:03] <mutante>	 that's probably it
[20:54:17] <ottomata>	 ja, anything you want the Analytics VLAN to talk to you have to explicitly ask a network admin to open
[20:54:37] <grrrit-wm>	 (PS1) Dzahn: Revert "analytics_kafka: switch to ganglia_new" [puppet] - https://gerrit.wikimedia.org/r/219464
[20:55:06] <grrrit-wm>	 (CR) Dzahn: [C: 2] Revert "analytics_kafka: switch to ganglia_new" [puppet] - https://gerrit.wikimedia.org/r/219464 (owner: Dzahn)
[20:56:07] <mutante>	 hold on, the icinga checks should recover after that
[20:56:42] <ottomata>	 mutante: ok, but i do want to change these anyway, we don't want to use monitoring::ganglia anymore
[20:57:12] <mutante>	 ottomata: ok, but you still want them in ganglia, besides the monitoring
[20:57:26] <mutante>	 otherwise that would just be an empty cluster
[20:57:47] <ottomata>	 yes, we want the values in ganglia
[20:57:50] <ottomata>	 just not the alerts based on ganglia
[20:58:05] <mutante>	 http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520Kafka%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false
[20:58:07] <mutante>	 there it is again ^
[20:58:14] <mutante>	 see the gap but it continues now
[20:58:58] <wikibugs>	 operations, Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1383943 (JAufrecht)
[20:59:03] <mutante>	 so i need to know the correct port _before_ applying a change
[20:59:15] <mutante>	 but the change picks the right port for me ..hmm
[21:00:42] <grrrit-wm>	 (PS1) Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - https://gerrit.wikimedia.org/r/219465
[21:01:01] <ottomata>	 mutante: haha, is the assigned port not static?
[21:01:03] <ottomata>	 how does it get assigned?
[21:01:24] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - https://gerrit.wikimedia.org/r/219465 (owner: Ottomata)
[21:01:45] <wikibugs>	 operations, Wikimedia-log-errors: Memcached error for key "WANCache:v:enwiki:image_redirect:254363f3d14af58bbe12c644ee69ccf7" on server "/var/run/nutcracker/nutcracker.sock:0": A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T102916#1383954 (aaron) Note that since nutcracker is just a proxy, it coul...
[21:02:09] <grrrit-wm>	 (PS2) Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - https://gerrit.wikimedia.org/r/219465
[21:02:15] <mutante>	 ottomata: something like $base_port + prefix for datacenter + X .. 
[21:02:18] <grrrit-wm>	 (PS3) Ottomata: Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - https://gerrit.wikimedia.org/r/219465
[21:02:29] <ottomata>	 ah but at least you can calculate it?
[21:02:47] <mutante>	 yea, trying to find out
[21:03:01] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Use graphite_threshold instead of ganglia for Kafka alerts [puppet] - https://gerrit.wikimedia.org/r/219465 (owner: Ottomata)
[21:03:19] <mutante>	  $gmond_port = $ganglia_new::configuration::base_port + $id
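Editor's sketch: the Puppet line quoted above adds a per-cluster id to a base port. A small calculation like the one below could work out the gmond port ahead of time, so that it can be named in a network ACL request before the change is applied. The base port and cluster ids here are made-up placeholders, not values from ganglia_new::configuration.

```python
def gmond_port(base_port: int, cluster_id: int) -> int:
    """Mirror of the quoted Puppet expression: base_port + id."""
    return base_port + cluster_id


# Hypothetical values, for illustration only.
BASE_PORT = 9000
CLUSTER_IDS = {"analytics_kafka": 12, "analytics": 7}

ports = {name: gmond_port(BASE_PORT, cid) for name, cid in CLUSTER_IDS.items()}
for name, port in sorted(ports.items()):
    print(f"{name}: gmond port {port}")
```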
[21:03:23] <ottomata>	 mutante: hm, i think i'm going to wait to merge that alert change til next week, almost time for me to peace out :)
[21:03:38] <mutante>	 ottomata: the existing monitoring all recovered 
[21:03:42] <ottomata>	 cool
[21:03:43] <ottomata>	 ok
[21:03:44] <ottomata>	 thank you
[21:03:44] <mutante>	 so yea
[21:04:09] <mutante>	 same here, i will have to ask for the network gear changes next week
[21:04:15] <ottomata>	 ok ja, i'm out then, have a good weekeeeend.  aye :)
[21:04:24] <mutante>	 you too, cya !
[21:17:56] <wikibugs>	 operations, Mail: add kfrancis to legal-tm-vio mail alias - https://phabricator.wikimedia.org/T103029#1383982 (Dzahn)
[21:20:06] <icinga-wm>	 PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.B9l7SsRk/mnt is not accessible: Permission denied
[21:21:01] <grrrit-wm>	 (PS1) BBlack: Mobile redirects for non-canonical article URLs [puppet] - https://gerrit.wikimedia.org/r/219471
[21:21:56] <icinga-wm>	 RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[21:22:08] <icinga-wm>	 RECOVERY - puppet last run on es2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:22:45] <wikibugs>	 operations, Mail: add kfrancis to legal-tm-vio mail alias - https://phabricator.wikimedia.org/T103029#1383995 (Dzahn) Hi,  done, based on:  https://wikimediafoundation.org/wiki/User:KFrancis_%28WMF%29 and the Staff and contractors page.  before: -legal-tm-vio:  slaporte, ywelinder, rstallman, mbrar, jroger...
[21:23:01] <wikibugs>	 operations, Mail: add kfrancis to legal-tm-vio mail alias - https://phabricator.wikimedia.org/T103029#1383996 (Dzahn) Open>Resolved
[21:23:26] <icinga-wm>	 RECOVERY - puppet last run on es2002 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures
[21:27:29] <grrrit-wm>	 (CR) MaxSem: [C: 1] Mobile redirects for non-canonical article URLs [puppet] - https://gerrit.wikimedia.org/r/219471 (owner: BBlack)
[21:29:32] <grrrit-wm>	 (PS2) BBlack: Mobile redirects for non-canonical article URLs [puppet] - https://gerrit.wikimedia.org/r/219471 (https://phabricator.wikimedia.org/T103158)
[21:29:43] <bblack>	 ^ just fixed commitmsg for bug ref
[21:30:40] <grrrit-wm>	 (CR) BBlack: [C: 2] Mobile redirects for non-canonical article URLs [puppet] - https://gerrit.wikimedia.org/r/219471 (https://phabricator.wikimedia.org/T103158) (owner: BBlack)
[21:36:37] <icinga-wm>	 PROBLEM - puppet last run on cp1052 is CRITICAL Puppet has 1 failures
[21:37:01] <gwicke>	 !log upgraded cassandra on 1003 to 2.1.7 (pre-release, likely going out on Monday)
[21:37:05] <morebots>	 Logged the message, Master
[21:38:47] <icinga-wm>	 PROBLEM - puppet last run on cp1067 is CRITICAL Puppet has 1 failures
[21:39:36] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 1 failures
[21:39:47] <icinga-wm>	 PROBLEM - puppet last run on cp2016 is CRITICAL Puppet has 1 failures
[21:39:56] <icinga-wm>	 PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures
[21:39:56] <icinga-wm>	 PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 1 failures
[21:40:08] <icinga-wm>	 PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 1 failures
[21:40:17] <icinga-wm>	 PROBLEM - puppet last run on cp1054 is CRITICAL Puppet has 1 failures
[21:40:56] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL Puppet has 1 failures
[21:40:56] <icinga-wm>	 PROBLEM - puppet last run on cp3013 is CRITICAL Puppet has 1 failures
[21:41:07] <icinga-wm>	 PROBLEM - puppet last run on cp4009 is CRITICAL Puppet has 1 failures
[21:41:08] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 1 failures
[21:41:17] <bblack>	 blag
[21:41:26] <icinga-wm>	 PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 1 failures
[21:41:37] <icinga-wm>	 PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures
[21:41:47] <icinga-wm>	 PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures
[21:41:47] <ori>	 bblag
[21:41:50] <bblack>	 :P
[21:41:54] <bblack>	 that was a typo'd blarg
[21:42:06] <icinga-wm>	 PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 1 failures
[21:42:07] <icinga-wm>	 PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 1 failures
[21:42:27] <icinga-wm>	 PROBLEM - puppet last run on cp2023 is CRITICAL Puppet has 1 failures
[21:42:38] <icinga-wm>	 PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures
[21:42:57] <icinga-wm>	 PROBLEM - puppet last run on cp2010 is CRITICAL Puppet has 1 failures
[21:42:57] <icinga-wm>	 PROBLEM - puppet last run on cp3006 is CRITICAL Puppet has 1 failures
[21:42:57] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds
[21:43:06] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[21:43:06] <icinga-wm>	 PROBLEM - puppet last run on cp3012 is CRITICAL Puppet has 1 failures
[21:43:36] <icinga-wm>	 PROBLEM - puppet last run on cp4017 is CRITICAL Puppet has 1 failures
[21:43:37] <icinga-wm>	 PROBLEM - puppet last run on cp2019 is CRITICAL Puppet has 1 failures
[21:43:54] <grrrit-wm>	 (PS1) BBlack: Bugfix for d22e8bc6 [puppet] - https://gerrit.wikimedia.org/r/219478
[21:43:56] <icinga-wm>	 PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 1 failures
[21:43:57] <icinga-wm>	 PROBLEM - puppet last run on cp3003 is CRITICAL Puppet has 1 failures
[21:44:06] <icinga-wm>	 PROBLEM - puppet last run on cp4018 is CRITICAL Puppet has 1 failures
[21:44:07] <icinga-wm>	 PROBLEM - puppet last run on cp3004 is CRITICAL Puppet has 1 failures
[21:44:11] <grrrit-wm>	 (CR) BBlack: [C: 2 V: 2] Bugfix for d22e8bc6 [puppet] - https://gerrit.wikimedia.org/r/219478 (owner: BBlack)
[21:44:17] <icinga-wm>	 PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 1 failures
[21:44:38] <icinga-wm>	 PROBLEM - puppet last run on cp2007 is CRITICAL Puppet has 1 failures
[21:44:46] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7926 bytes in 5.122 second response time
[21:44:47] <icinga-wm>	 PROBLEM - puppet last run on cp3009 is CRITICAL Puppet has 1 failures
[21:44:47] <icinga-wm>	 PROBLEM - puppet last run on cp3031 is CRITICAL Puppet has 1 failures
[21:44:57] <icinga-wm>	 PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 1 failures
[21:45:07] <icinga-wm>	 PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[21:45:27] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures
[21:45:37] <icinga-wm>	 PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[21:47:06] <icinga-wm>	 RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[21:47:16] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[21:47:26] <icinga-wm>	 RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[21:48:17] <icinga-wm>	 RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[21:48:17] <icinga-wm>	 RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[21:48:18] <bblack>	 isn't it cool how the set of puppetfails covers cp[1234] now instead of just cp[134] though?
[21:48:26] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[21:48:37] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[21:48:46] <icinga-wm>	 RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:07] <icinga-wm>	 RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:16] <icinga-wm>	 RECOVERY - puppet last run on cp1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:16] <icinga-wm>	 RECOVERY - puppet last run on cp1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:16] <icinga-wm>	 RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures
[21:49:17] <icinga-wm>	 RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:49:27] <icinga-wm>	 RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:27] <icinga-wm>	 RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:38] <icinga-wm>	 RECOVERY - puppet last run on cp1067 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[21:49:38] <icinga-wm>	 RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[21:49:48] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:49:48] <icinga-wm>	 RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:50:16] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds
[21:50:16] <icinga-wm>	 RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[21:50:38] <icinga-wm>	 RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[21:50:38] <icinga-wm>	 RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:50:46] <icinga-wm>	 RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[21:50:57] <icinga-wm>	 RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:50:58] <icinga-wm>	 RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:51:37] <icinga-wm>	 RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[21:51:49] <wikibugs>	 operations, Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1384106 (csteipp) Yeah, should be fine.
[21:51:56] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7928 bytes in 2.853 second response time
[21:51:56] <icinga-wm>	 RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[21:51:56] <icinga-wm>	 RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:51:56] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:51:57] <icinga-wm>	 RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:52:06] <icinga-wm>	 RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[21:52:07] <icinga-wm>	 RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:52:08] <icinga-wm>	 RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[21:52:27] <icinga-wm>	 RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[21:53:07] <icinga-wm>	 RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[21:53:25] <wikibugs>	 operations, ops-eqiad, RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1384112 (fgiunchedi)
[21:53:38] <wikibugs>	 operations, ops-eqiad, RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1384114 (fgiunchedi) a:fgiunchedi>Cmjohnson
[21:54:07] <icinga-wm>	 RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:54:16] <icinga-wm>	 RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[22:02:57] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds
[22:03:34] <wikibugs>	 operations, Wikimedia-Bugzilla: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1384131 (Dzahn) NEW
[22:04:21] <wikibugs>	 operations: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384141 (Niedzielski) NEW
[22:04:22] <wikibugs>	 operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1384138 (Dzahn)
[22:04:41] <wikibugs>	 operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1182562 (Dzahn)
[22:05:12] <grrrit-wm>	 (PS1) Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - https://gerrit.wikimedia.org/r/219481
[22:05:23] <wikibugs>	 operations, Wikimedia-Bugzilla: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1384131 (Dzahn)
[22:05:43] <wikibugs>	 operations: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384153 (hashar)
[22:05:54] <grrrit-wm>	 (PS3) Dzahn: switch old-bugzilla to apache cluster [dns] - https://gerrit.wikimedia.org/r/216736 (https://phabricator.wikimedia.org/T103190) (owner: John F. Lewis)
[22:05:56] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] WIP: Setup a node pool from etcd for lvs cluster [puppet] - https://gerrit.wikimedia.org/r/219481 (owner: Rush)
[22:06:02] <wikibugs>	 operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1384154 (faidon) Our CA for production is GlobalSign. It is one of the big (in terms of websites using it) and oldest CA...
[22:06:29] <grrrit-wm>	 (PS1) Rush: WIP: lvs 'text' and 'text-https' for etcd [puppet] - https://gerrit.wikimedia.org/r/219482
[22:06:36] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7926 bytes in 8.135 second response time
[22:06:56] <grrrit-wm>	 (PS6) Dzahn: redirect old- to static-bugzilla [puppet] - https://gerrit.wikimedia.org/r/216734 (https://phabricator.wikimedia.org/T103190) (owner: John F. Lewis)
[22:07:58] <wikibugs>	 Ops-Access-Requests, operations: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384159 (hashar) That will let @Niedzielski get access to the Jenkins configuration which rely on users being in the `wmf` LDAP group :-}
[22:08:03] <wikibugs>	 operations, ops-eqiad, RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1384161 (fgiunchedi) @cmjohnson I'd like to test a theory re: sdb, can you swap sda and sdb? I'd like to see if the error moves too  ``` [14908.351693] ata2.00: exception Emask 0x0 SAct 0x...
[22:08:26] <icinga-wm>	 PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.02USaYFV/mnt/tmp/ccache is not accessible: Permission denied
[22:09:21] <wikibugs>	 operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1384162 (ori)
[22:10:01] <godog>	 hashar: ^ (nodepool disk)
[22:10:05] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384164 (Krenair)
[22:10:08] <icinga-wm>	 RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[22:10:49] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384141 (Krenair) I don't think you need the 'nda' group, just the 'wmf' one? I think 'nda' is only really for people who signed the volunteer NDA...
[22:12:31] <grrrit-wm>	 (PS2) Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - https://gerrit.wikimedia.org/r/219481
[22:13:53] <grrrit-wm>	 (PS3) Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - https://gerrit.wikimedia.org/r/219481
[22:14:36] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384172 (Niedzielski) Ok, let's try wmf then. Thanks!  //The first rule of NDA is: you do not talk about NDA.//
[22:17:23] <wikibugs>	 operations, Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384181 (Dzahn) NEW
[22:17:56] <wikibugs>	 operations, Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384191 (Dzahn) a:Dzahn
[22:18:18] <wikibugs>	 operations, Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384181 (Dzahn)
[22:18:21] <wikibugs>	 operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1384194 (Dzahn)
[22:20:41] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" and "nda" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384199 (hashar)
[22:21:56] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384200 (Krenair)
[22:22:49] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384203 (Krenair)
[22:23:01] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384204 (hashar) p:Triage>Normal Thanks for the task!  Lets wait for {T103191}. Once confirmed, you can be added via https:/...
[22:24:17] <wikibugs>	 operations, Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1384222 (Dzahn)
[22:24:47] <icinga-wm>	 PROBLEM - Cassanda CQL query interface on restbase1009 is CRITICAL: Connection refused
[22:26:34] <awight>	 Anyone available to help with a simple file permissions thing on lutetium (in the fundraising cluster)?
[22:27:02] <awight>	 I need chmod -R g+w(s) in /srv/org.wikimedia.civicrm ...
[22:28:36] <icinga-wm>	 PROBLEM - Cassandra database on restbase1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[22:29:53] <wikibugs>	 operations, Mobile-Apps, RESTBase, Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1384229 (GWicke) > These pertain to MobileWeb, not MobileApps from what I can tell.  My understanding is that they are relevant to both...
[22:30:19] <gwicke>	 the 1009 alert is expected, no need to worry
[22:34:30] <Nemo_bis>	 YuviPanda: ok, thanks; IIRC they're not important things but I don't remember.
[22:34:39] <YuviPanda>	 Nemo_bis: :)
[22:34:43] <YuviPanda>	 not important at all then
[22:40:49] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1384252 (Niedzielski) Resolved>Open
[22:41:14] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1253680 (Niedzielski) Reopened issue as I still can't access stat1002 / stat1003.
[22:41:18] <wikibugs>	 operations, Parsoid, Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1384256 (GWicke) p:Low>Normal
[22:42:22] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384258 (Dzahn)
[22:43:18] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384141 (Dzahn)  refs:  https://wikimediafoundation.org/wiki/User:SNiedzielski_%28WMF%29          https://wikimediafoundation.org/w/index.php?title=Staff_and_con...
[22:44:15] <wikibugs>	 operations, RESTBase-Cassandra, Patch-For-Review: configure less aggressive cassandra log rotation / send cassandra logs to logstash - https://phabricator.wikimedia.org/T100970#1384267 (GWicke) > The only reason that I've leaned toward TCP here, is that our Cassandra nodes are quite a bit chattier than...
[22:44:44] <wikibugs>	 operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: configure less aggressive cassandra log rotation / send cassandra logs to logstash - https://phabricator.wikimedia.org/T100970#1384270 (GWicke)
[22:50:33] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384280 (Dzahn) caveat here:  Sniedzielski is the WMF account:   mail: sniedzielski@wikimedia  cn: Sniedzielski         uidNumber: 12119  Niedzielski is the priv...
[22:51:37] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure: Request Jenkins shell access for account "sniedzielski" - https://phabricator.wikimedia.org/T103192#1384283 (Dzahn)
[22:51:41] <wikibugs>	 Ops-Access-Requests, operations, LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1384281 (Dzahn) Open>Resolved a:Dzahn
[23:11:43] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1384322 (Dzahn) @Niedzielski  Ohh, hi! Sorry, i didn't see this comment until the ticket was reopnened, so you did right.  First let me confirm that:  - on stat...
[23:18:25] <Mjbmr>	 who was going to disable my account?
[23:18:56] <wikibugs>	 operations, Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384345 (Dzahn) checked access logs on silver. yes, wikitech-static tries getting the files:   ``` 16:06 <mutante> wikitech-stat...
[23:20:08] <grrrit-wm>	 (PS1) Filippo Giunchedi: cassandra: add restbase1009 to cassandra::seeds [puppet] - https://gerrit.wikimedia.org/r/219496 (https://phabricator.wikimedia.org/T102015)
[23:21:15] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1009 to cassandra::seeds [puppet] - https://gerrit.wikimedia.org/r/219496 (https://phabricator.wikimedia.org/T102015) (owner: Filippo Giunchedi)
[23:21:53] <wikibugs>	 operations, Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384349 (Dzahn) on the wikitech-static side:  in /srv/imports   ```    0 -rw-r--r-- 1 root root    0 Jun 13 16:43 labswiki-20150...
[23:22:22] <wikibugs>	 operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1384350 (Tgr)
[23:23:27] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:24:24] <wikibugs>	 operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1367794 (Tgr) @faidon: fair enough (although https://www.ssllabs.com/ssltest/analyze.html?d=en.wikipedia.com claims IE6...
[23:25:07] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING
[23:30:30] <gwicke>	 !log starting cassandra bootstrap on restbase1009
[23:30:35] <morebots>	 Logged the message, Master
[23:30:45] <gwicke>	 godog: ^^
[23:30:56] <icinga-wm>	 RECOVERY - Cassandra database on restbase1009 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[23:31:14] <godog>	 gwicke: ack
[23:31:39] <wikibugs>	 operations, Deployment-Systems: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1384356 (Mattflaschen) NEW
[23:31:46] <gwicke>	 failed quickly while streaming from 1006
[23:31:56] <matt_flaschen>	 I noticed that tin doesn't have access to the same memcached as other servers: https://phabricator.wikimedia.org/T103198
[23:32:43] <wikibugs>	 operations, MediaWiki-File-management, MediaWiki-Tarball-Backports, Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1384366 (BBlack) @Tgr that's because we don't speak the SSLv3 protocol anymore, because of the [[ https://en.wikipedia.o...
[23:32:46] <gwicke>	 !log upgraded restbase1006 to cassandra 2.1.7
[23:32:50] <morebots>	 Logged the message, Master
[23:33:06] <wikibugs>	 operations, Deployment-Systems: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1384376 (EBernhardson) tin config: ``` memcached:    auto_eject_hosts: true    distribution: ketama    hash: md5...
[23:42:10] <wikibugs>	 operations, Labs, Labs-Infrastructure, Wikimedia-Apache-configuration, wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1384404 (Dzahn) the import script was running several times:   ``` root@wikitech-static:/srv/imports# ps aux | grep import-wikit...
[23:47:27] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[23:50:36] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[23:59:27] <icinga-wm>	 PROBLEM - puppet last run on mw2060 is CRITICAL puppet fail