[00:00:02] !log mw1293 - restart hhvm [00:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:28] ohh yeh sorry got distracted :) [00:00:29] (03Merged) 10jenkins-bot: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348176 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [00:00:39] (03CR) 10jenkins-bot: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348176 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [00:00:40] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181989 (10Andrew) I think these 'failed to parse' issues are something about the host on which the puppet compiler is running. '/conftool/v1/v1/pools is not a directory' [00:01:12] 308 Undefined variable: wmgRelatedArticlesFooterBlacklistedSkins in /srv/mediawiki/wmf-config/CommonSettings.php on line 2878 [00:01:24] oh you've already have it [00:02:13] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Remove use of blacklist for related pages feature (T162201) (duration: 00m 41s) [00:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:19] T162201: Cleanup artifacts of related pages desktop beta feature - https://phabricator.wikimedia.org/T162201 [00:03:11] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Remove use of blacklist for related pages feature (T162201) (duration: 00m 41s) [00:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:22] jdlrobson: Synced everywhere. [00:03:44] !log mw1297 - restart hhvm/apache [00:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:19] Niharika: may I add to SWAT https://gerrit.wikimedia.org/r/348174 - Fix Abuse Filter configuration for tr.wikipedia? It's a follow-up for a change deployed earlier. [00:05:20] Niharika: wooop [00:05:21] thank you [00:05:47] !log niharika29@tin Started scap: Reword ORES preferences (T162831), Put ORES r behind a preference (T162831), Deploy Special:Autoblocklist (T146414) [00:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:56] T162831: Tweak ORES-Related Preferences for Watchlist and RC Page ahead of next release - https://phabricator.wikimedia.org/T162831 [00:05:56] T146414: Create Special:AutoblockList - https://phabricator.wikimedia.org/T146414 [00:06:40] ebernhardson: Hey, One quick question since I want to fix https://phabricator.wikimedia.org/T161563. Does logstash gzip compressed logs? [00:06:51] *Does accept [00:09:34] bd808: Who's the other designer "jgs" behind the scap piggy? [00:10:04] Niharika: https://en.wikipedia.org/wiki/Joan_Stark [00:10:05] Joan G. Stark [00:10:09] Dereckson: Sure. 
[00:10:24] (03CR) 10Niharika29: [C: 032] Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [00:10:34] Amir1: hmm, lemme check what it does [00:10:36] that's a good edit for Wikidata , occupation: ASCII artist, heh [00:11:03] Amir1: "A GELF message is a GZIP’d or ZLIB’d JSON string with the following fields: [00:11:31] Great [00:11:46] Niharika: scappy started life as the flying pig from https://web.archive.org/web/20091027211201/http://www.geocities.com/SoHo/7373/farm.htm#pig [00:12:20] bd808: You colored it and added the "MW"? [00:12:32] (03Merged) 10jenkins-bot: Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [00:12:40] Amir1: i suppose a semi-easy way to check, it looks like logstash expects the message to start with either: 0x78 0x9c (zlib), or 0x1f 0x8b (gzip) [00:12:46] (03CR) 10jenkins-bot: Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [00:12:49] those headers should come from the library itself afaik [00:13:02] Niharika: among other things yeah. there were a lot of little tweaks from the original. [00:13:44] :) Why did you pick the pig? [00:13:57] because pigs might fly [00:14:05] ebernhardson: hmm, I'm not sure if I have access to logstash inputs (server-side) [00:14:24] because the original bash scripts were a horrible mess and I dressed them up and made them fly [00:14:39] but still a mess at heart [00:14:46] Ah. :) [00:14:50] quiddity: what do you mean with the chan will get moderated? [00:14:57] Niharika: thanks [00:15:05] also pigs are second only to unicorns in terms of awesome animals :) [00:15:24] Amir1: probably not, but if you were testing locally you could probably check with wireshark or something to see what udp messages are sending, using something simple like `nc -l -u >/dev/null` or some such to receive the messages [00:15:40] * ebernhardson realizes that needs a port too...bad example :P [00:17:14] did the pig get approval from the BoC of the WCA ? 
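The GELF detail quoted above (a GZIP'd or ZLIB'd JSON string, recognisable by its first two bytes, 0x1f 0x8b for gzip or 0x78 0x9c for zlib) is easy to verify offline. The following is a minimal sketch of that magic-byte check, assuming only Python 3; the function names are illustrative and say nothing about the actual logstash plugin internals.

```python
# Sketch (not part of the original log): classify a raw UDP payload by its
# leading magic bytes, the way the conversation above describes the GELF
# input doing it. 0x1f 0x8b is the gzip magic number; zlib streams start
# with 0x78 (0x9c being the default-compression flag byte quoted above).
import gzip
import json
import zlib


def classify_payload(payload: bytes) -> str:
    """Return 'gzip', 'zlib' or 'unknown' based on the first bytes."""
    if payload[:2] == b"\x1f\x8b":
        return "gzip"
    if payload[:1] == b"\x78":
        return "zlib"
    return "unknown"


def decode_payload(payload: bytes) -> dict:
    kind = classify_payload(payload)
    if kind == "gzip":
        return json.loads(gzip.decompress(payload))
    if kind == "zlib":
        return json.loads(zlib.decompress(payload))
    raise ValueError("payload is neither gzip nor zlib compressed")


if __name__ == "__main__":
    sample = zlib.compress(json.dumps({"short_message": "hello"}).encode())
    print(classify_payload(sample), decode_payload(sample))
```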
[00:17:19] Sagan, I assume that means it will be made "/mode +m" ("-Only opped/voiced users may talk in channel.") [00:17:41] quiddity: ok :) [00:17:46] the bots shall inherit this channel [00:21:02] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6158/" [puppet] - 10https://gerrit.wikimedia.org/r/348172 (https://phabricator.wikimedia.org/T162183) (owner: 10Dzahn) [00:21:09] (03PS2) 10Dzahn: tendril: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348172 (https://phabricator.wikimedia.org/T162183) [00:24:39] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987460 keys, up 21 days 8 hours - replication_delay is 0 [00:26:48] RECOVERY - DPKG on naos is OK: All packages OK [00:26:49] (03PS1) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [00:26:58] RECOVERY - Check size of conntrack table on naos is OK: OK: nf_conntrack is 0 % full [00:26:58] RECOVERY - salt-minion processes on naos is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:26:59] RECOVERY - dhclient process on naos is OK: PROCS OK: 0 processes with command name dhclient [00:27:08] RECOVERY - Disk space on naos is OK: DISK OK [00:27:26] ebernhardson: Where can I find error reports of logstash in beta cluster? [00:27:38] RECOVERY - Check the NTP synchronisation status of timesyncd on naos is OK: OK: synced at Fri 2017-04-14 00:27:29 UTC. [00:27:38] RECOVERY - Check whether ferm is active by checking the default input chain on naos is OK: OK ferm input default policy is set [00:27:43] I want to cherry-pick https://gerrit.wikimedia.org/r/348184 and see if it fixes [00:27:48] RECOVERY - MD RAID on naos is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [00:28:15] RainbowSprinkles: We were going to wait, but Niharika wanted to try deploying a real feature as part of her deployment training [00:28:24] Amir1: deployment-logstash2 [00:28:34] kaldari: Eh, ok I guess. [00:29:11] bd808: okay, but in any particular directory? [00:29:15] RainbowSprinkles: And since I already had the backport sitting there it seemed convenient :P [00:29:23] /var/log/logstash [00:29:37] nice, in it [00:29:48] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3182142 (10Dzahn) 05Resolved>03Open re-opening since Icinga has many alerts: https://icinga.wikimedia.org/cgi... [00:30:32] !log niharika29@tin Finished scap: Reword ORES preferences (T162831), Put ORES r behind a preference (T162831), Deploy Special:Autoblocklist (T146414) (duration: 24m 44s) [00:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:40] T162831: Tweak ORES-Related Preferences for Watchlist and RC Page ahead of next release - https://phabricator.wikimedia.org/T162831 [00:30:40] T146414: Create Special:AutoblockList - https://phabricator.wikimedia.org/T146414 [00:31:10] Amir1: the errors are on the logstash machines themselves unfortunately, we didn't want to make some loop where logstash logs to itself [00:31:45] Dereckson: Your patch is on mwdebug1002. [00:31:50] Anything to check? [00:32:34] Amir1: so like logtash1002.eqiad.wmnet:/var/log/logstash/logstash-plain.log is where they would end up [00:33:03] ebernhardson: even in beta cluster? 
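For the local-testing idea above, where the `nc -l -u` one-liner was missing a port, here is a sketch of an equivalent UDP listener that also shows whether senders are emitting compressed or plain-JSON payloads. It assumes Python 3; port 12201 is used only because it is the GELF port mentioned in the discussion, any free port works.

```python
# Sketch (not part of the original log): a tiny UDP listener standing in for
# the `nc -l -u` suggestion above, with the missing port made explicit. It
# prints the first bytes of every datagram so you can tell compressed
# (1f8b / 78..) payloads apart from raw JSON.
import socket

LISTEN_ADDR = ("0.0.0.0", 12201)  # assumption: testing the GELF/UDP port locally


def main() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)
    print(f"listening on udp://{LISTEN_ADDR[0]}:{LISTEN_ADDR[1]}")
    while True:
        data, peer = sock.recvfrom(65535)
        print(f"{peer[0]}:{peer[1]} sent {len(data)} bytes, "
              f"first bytes: {data[:4].hex()}")


if __name__ == "__main__":
    main()
```

Pointing a test sender at this listener (or capturing with tcpdump/wireshark as suggested above) makes it obvious on sight whether the payload starts with a compression header.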
[00:33:24] Amir1: deployment-logstash2.eqiad.wmflabs i believe [00:33:32] 06Operations, 10DBA, 10Icinga, 10Monitoring, 13Patch-For-Review: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3182160 (10Dzahn) fixed. false positives are gone, the real check stays and is OK https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_stri... [00:33:32] yeah [00:33:38] 06Operations, 10DBA, 10Icinga, 10Monitoring, 13Patch-For-Review: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3182163 (10Dzahn) 05Open>03Resolved [00:33:38] RECOVERY - nutcracker port on naos is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [00:33:38] RECOVERY - nutcracker process on naos is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [00:33:54] I'm in that folder now but there is no ores in it [00:34:01] Dereckson: Are you around? [00:34:01] I'm trying to produce some :D [00:34:08] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154666 (10Dzahn) [00:35:55] Amir1: the messages that look like this are (most likely) ores: [2017-04-14T00:31:04,915][WARN ][logstash.inputs.gelf ] Gelfd failed to parse a message skipping {:exception=>#, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/1.9/gems/gelfd-0.2.0/lib/gelfd/parser.rb:14:in `parse'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-gelf- [00:35:55] ebernhardson: it seems the beta cluster is trying to write over production: https://phabricator.wikimedia.org/T161563 [00:36:10] "host":"deployment-sca03" [00:36:45] Amir1: i had to find those messages with tcpdump, logstash logging doesn't actually report what the invalid message ways (partially because it expects it to have been binary, but malformed i suppose) [00:37:05] okay [00:37:35] Niharika: ping [00:37:51] logs look good to me [00:37:54] Amir1: uhh.. are you jsut sending to the wrong port? does uwsgi really know how to do GELF? [00:37:59] Dereckson: Okay. Syncing. [00:38:14] bd808: it seems it's not compressed [00:38:34] bd808: it doesn't really do gelf, it's just sending json to udp 12201 [00:38:51] okay now it's time to cherry-pick it [00:38:54] bd808: so it won't chunk correctly, but the main problem right now is its also not sending it compressed, and logstash requires the udp messages to be compressed [00:39:25] !log niharika29@tin Synchronized wmf-config/abusefilter.php: Fix Abuse Filter configuration for tr.wikipedia (T161960) (duration: 00m 42s) [00:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:33] T161960: Enable the blocking feature of AbuseFilter on trwiki - https://phabricator.wikimedia.org/T161960 [00:39:43] Looks like https://dbtree.wikimedia.org/ is down and also I get 401 Unauthorized error when I try to go to Kibana. [00:39:49] All done! Woohoo. [00:40:06] Thanks for the deploy Niharika [00:40:14] kaldari: hmm, fwiw kibana loaded for me. But that doesn't mean a whole lot [00:40:14] ebernhardson: yeah, but it should really be sending to port 11514 [00:40:24] Thanks to RoanKattouw. 
:) [00:40:30] which is line oriented json [00:40:33] Niharika: \m/ [00:40:34] (03PS2) 10Dzahn: Use new pageassessments dblist to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/348173 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [00:40:37] bd808: oh, i didn't realize that [00:41:10] ebernhardson: Are you using https://logstash.wikimedia.org/app/kibana ? [00:41:21] kaldari: yes [00:41:25] Amir1: I'm pretty sure your port is the problem [00:41:25] weird [00:41:41] striker sends to 11514 [00:42:15] there are a gazillion different input ports on the logstash boxes for different input types [00:42:18] (03CR) 10Dzahn: [C: 032] Use new pageassessments dblist to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/348173 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [00:42:24] it's strange because the port is coming from hiera [00:42:25] and 11514 is line oriented json [00:42:41] somebody put the wrong value in :) [00:42:58] hiera isn't magic, gut config [00:43:05] *just config [00:43:26] I mean the weird part is someone didn't notice it before [00:43:48] ebernhardson: even weirder. It loads fine in Safari, but not Firefox. Guess I'll go file a bug. [00:44:14] kaldari: works in FF for me [00:44:17] I swear it worked a couple weeks ago [00:44:32] I've got FF 52.0.2 [00:44:45] from the ESR channel [00:45:12] kaldari: hmm, probably worthwhile i suppose. That authorization denied would be coming from the apache instance, rather than kibana itself. http auth is very standardized so not sure what could be wrong [00:46:05] Amir1: it's using service::configuration::logstash_port_logback which is the GELF port that the nodejs app use [00:46:52] shoot, we should fix it [00:47:33] (03CR) 10Dzahn: "this is a bit tricky. I _do_ agree that using regex to match an exact commandline is normally better to monitor a running service and we d" [puppet] - 10https://gerrit.wikimedia.org/r/348165 (owner: 10Paladox) [00:52:48] Amir1: uhhh well... modules/service/manifests/configuration.pp says "$logstash_port_logback = 11514" so I don't know how you are getting to port 12201 [00:52:50] (03PS2) 10Dzahn: standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 [00:53:21] last kaldari [00:53:25] oops [00:53:39] i just wanted to say the pageassesment thing is merged [00:53:40] (03PS2) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [00:53:51] and look at auth issue [00:56:20] kaldari: kibana should really let you login, does it ask you for login? the wikitech credentials should do it since you are in "wmf" group [00:57:01] and dbtree loads for me.. maybe a little slow at the beginning [00:57:05] mutante: It isn't even asking for credentials, it just immediately gives me an Unauthorized. Lemme try rebooting FF. [00:57:42] mutante: When I try to go to https://dbtree.wikimedia.org/ it says "database connection to tendril on tendril-backend.eqiad.wmnetfailed" [00:58:11] weird, i dont see that error [00:58:17] and i should also be using eqiad [00:58:34] mutante: I'm just not having any luck today :) [00:59:08] yea.. hmm.. works for me.. but that's not an error that sounds like local [00:59:40] mutante: OK, rebooted FF and now Kibana is working fine for me [00:59:56] ok! 
that's something:) [01:00:03] mutante: https://dbtree.wikimedia.org/ is still broken though [01:00:03] dbtree is working for me kaldari [01:00:34] bd808: found the issue behind it. It was an old cherry-pick in beta cluster [01:00:46] kaldari: if possible goto cmd prompt and type ipconfig /flushdns and then ipconfig /renew [01:01:06] i had the issue with enwiki before and it fixed it [01:01:25] dbtree is busted for me too. same "database connection to tendril on tendril-backend.eqiad.wmnetfailed" message [01:01:45] x-cache header says "cp2006 miss, cp4001 hit/2, cp4003 hit/3" [01:01:52] so dbtree uses misc-varnish, the director is "noc" [01:01:59] "noc" has 2 backends, terbium and wasat [01:02:09] hah, I'm not crazy! [01:02:15] :) [01:02:35] the varnish head I'm hitting is [2620:0:863:ed1a::3:d]:443 [01:03:08] RECOVERY - Check systemd state on naos is OK: OK - running: The system is fully operational [01:04:09] my x-cache:cp1058 miss, cp1058 hit/4 [01:04:19] tendril-backend.eqiad = db1011 [01:04:24] there is no tendril-backend.codfw [01:04:37] so we are not talking to different backends.. uhmm [01:04:53] my x-varnish:36791322, 6094755 6124425 [01:04:57] we should make another ticket :p [01:05:46] db1011 appears to be running normal afaict [01:05:58] cp2* is codfw and then what is cp4*? sf? [01:06:05] yeah [01:06:06] yes, 4 = ulsfo [01:06:26] so it looks like the sf varnish is the bad one [01:06:34] cp4003 [01:06:48] i take it cp1* is eqiad? [01:06:54] yes [01:06:58] and 2 is esams [01:07:03] uh, 3 is [01:07:09] but why would it be a varnish problem if it is "database connection to tendril" [01:07:14] and cp5* will be somewhere in asia :) [01:07:24] heh, yea [01:07:29] we are waiting for the name :) [01:07:50] we have a german one? [01:08:05] Last I checked, Amsterdam is not in Germany [01:08:11] we know it will start with "sin" [01:08:22] Zppix: https://wikitech.wikimedia.org/wiki/Esams_cluster [01:08:22] end, surely? ;) [01:08:24] eh, end [01:08:40] im shocked not having a server in germany is like not having servers at all :D [01:08:58] Zppix why? [01:09:08] ebernhardson bd808: After cherry-picking and fixing the port by removing old cherry-pick, now errors have stopped [01:09:16] Amir1: w00t! [01:09:19] germany is usually where al lthe servers are hosted in the EU atleast that ive seen [01:09:31] I think you're very wrong on that [01:09:35] yeah [01:09:40] There's some big german based providers, sure, hetzner etc [01:09:42] "that i've seen" [01:10:30] hmm, the cabeling demoed on the esams page is a bit messy :P (but don't look behind my tv either...) [01:10:37] cabling [01:11:04] It's a few years ago :P [01:11:07] ebernhardson: blame m.ark ;) [01:11:12] Zppix actually they are all spread out [01:11:15] (03PS3) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [01:12:36] ebernhardson: codfw is prettier -- https://wikitech.wikimedia.org/wiki/Codfw_cluster#/media/File:Wikimedia_Foundation_Servers_2015-86.jpg [01:18:52] fwiw, comparing Amsterdam IX with Frankfurt CIX https://ams-ix.net/technical/statistics vs https://www.de-cix.net/en/locations/germany/frankfurt/statistics they are both about the same in terms of traffic, over 5Terabyte/s [01:18:58] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:21:32] ooh, codfw is prettier :) [01:22:00] definitely :) that is papaul's work [01:26:54] (03CR) 10Ladsgroup: "I cherry-picked this in beta cluster. The errors have stopped but I couldn't see any logs coming from requests in logsatsh. Maybe still so" [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) (owner: 10Ladsgroup) [01:27:31] ebernhardson: ORES logs are indeed saved in logstash [01:27:32] * Amir1 https://gerrit.wikimedia.org/r/#/c/348184/1 [01:27:48] Sorry, wrong link [01:27:55] https://logstash.wikimedia.org/app/kibana#/dashboard/ORES [01:28:29] It was only beta cluster [01:32:38] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182225 (10Dzahn) [01:33:38] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [01:33:48] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [01:34:38] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.032 second response time [01:34:48] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.048 second response time [01:34:53] Amir1: hmm, interesting. I'm sure i had found a bunch of error messages on the prod machines [01:35:13] (03PS4) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [01:37:00] ebernhardson: if you find them again. Can you inform me? maybe it's another issue [01:37:15] Amir1: checking, i'm not seeing them in any of todays logs, so perhaps something changed since then [01:38:35] Amir1: well, if it's not error now and the logs are in logstash, thats probably good enough [01:39:08] great, I hope it helps in moving the upgrade forward [01:39:46] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182213 (10Paladox) it works for me. [01:47:58] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:48:06] (03PS1) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:49:13] (03CR) 10jerkins-bot: [V: 04-1] contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [01:49:25] (03PS2) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:50:20] (03CR) 10jerkins-bot: [V: 04-1] contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [01:50:24] (03PS3) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:52:13] (03CR) 10Dzahn: [C: 031] "i removed the change in modules/role/manifests/memcached.pp:8 . I could not explain the syntax error. but it's gone without that. 
strange " [puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [01:53:00] (03PS4) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:55:52] (03CR) 10Dzahn: [C: 032] contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [02:01:02] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3182243 (10Dzahn) fixed. gone on 2001, exists on 1001, no more cruft in Icinga https://icinga.w... [02:01:15] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3182246 (10Dzahn) [02:01:17] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3182244 (10Dzahn) 05Open>03Resolved [02:02:06] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2795939 (10Dzahn) once jenkins is running on both servers, don't forget to remove https://gerrit.wikimedia.org/r/#/c/348171/... [02:12:23] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3182269 (10Dzahn) [02:12:25] 06Operations, 10DBA, 10Traffic, 13Patch-For-Review: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#3182270 (10Dzahn) [02:14:09] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#3182272 (10Dzahn) [02:14:12] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154666 (10Dzahn) [02:22:36] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#3182302 (10Dzahn) 05Resolved>03Open re-opening. tendril in DNS is still an alias for einsteinium ``` tendril.wikimedia.org is an alias for einsteinium.wikimedia.org. einsteinium.... 
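To summarise the port mix-up resolved in the exchange earlier (uwsgi emitting plain JSON to the GELF UDP port 12201, which expects compressed payloads, instead of newline-terminated JSON to the line-oriented input on 11514), here is a hedged sketch of the two send paths. The host, ports and event fields are placeholders drawn from the conversation, not a description of the production configuration, and the GELF helper deliberately skips chunking and the full set of required GELF fields, as noted above.

```python
# Sketch (not part of the original log): the two logstash send paths discussed
# above. send_json_line() is the line-oriented JSON input (11514 here);
# send_gelf() shows the compressed payload the GELF input (12201 here) expects
# instead of raw JSON. No GELF chunking or required-field handling is done.
import json
import socket
import zlib

LOGSTASH_HOST = "127.0.0.1"  # placeholder; point at your logstash host


def _send_udp(payload: bytes, host: str, port: int) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def send_json_line(event: dict, host: str = LOGSTASH_HOST, port: int = 11514) -> None:
    """Line-oriented JSON input: one uncompressed JSON object per line."""
    _send_udp((json.dumps(event) + "\n").encode("utf-8"), host, port)


def send_gelf(event: dict, host: str = LOGSTASH_HOST, port: int = 12201) -> None:
    """GELF/UDP input: the payload must be a compressed JSON document."""
    _send_udp(zlib.compress(json.dumps(event).encode("utf-8")), host, port)


if __name__ == "__main__":
    evt = {"host": "deployment-sca03", "type": "ores", "message": "request served"}
    send_json_line(evt)  # what the uwsgi logger effectively needed
    send_gelf(evt)       # what the GELF port expects instead of raw JSON
```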
[02:23:39] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3182307 (10Dzahn) [02:26:53] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2990878 keys, up 21 days 10 hours - replication_delay is 635 [02:45:53] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2986371 keys, up 21 days 10 hours - replication_delay is 0 [03:00:11] (03PS1) 10Dzahn: ci/labs/tendril: add some comments/FIXMEs about moving Hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/348194 [03:01:38] (03CR) 10Dzahn: [C: 032] "only comments" [puppet] - 10https://gerrit.wikimedia.org/r/348194 (owner: 10Dzahn) [03:19:44] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3182333 (10faidon) Sure, that's OK. [03:32:53] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:32:53] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:53] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:35:53] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [03:40:56] (03PS1) 10Dzahn: base::kernel: add mod blacklist specific to R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:43:29] (03PS2) 10Dzahn: base::kernel: add mod blacklist specific to R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:44:45] (03PS3) 10Dzahn: base::kernel: add mod blacklist specific to R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:47:58] (03Abandoned) 10Dzahn: base: blacklist acpi_pad kernel module [puppet] - 10https://gerrit.wikimedia.org/r/348016 (owner: 10Dzahn) [03:50:44] (03PS4) 10Dzahn: base::kernel: mod blacklist for Dell R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:56:23] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6161/ (praseo* and xeon are R320, the other 2 are random others as control)" [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) (owner: 10Dzahn) [04:09:23] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=461.40 Read Requests/Sec=579.20 Write Requests/Sec=26.70 KBytes Read/Sec=38439.20 KBytes_Written/Sec=158.40 [04:15:24] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=4.40 Read Requests/Sec=0.80 Write Requests/Sec=2.90 KBytes Read/Sec=3.20 KBytes_Written/Sec=70.00 [04:49:17] 06Operations, 10DBA: dbtree broken (for some users?) 
- https://phabricator.wikimedia.org/T162976#3182380 (10Peachey88) [05:57:41] (03PS1) 10Matthias Mullie: Full path to xvfb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348199 [06:31:29] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3182388 (10elukey) From http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Replication.Redis.Versions.html: ``` Redis Versions Prior to 2.8.22 Redis bac... [06:37:53] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [07:05:54] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:10:21] (03PS1) 10Legoktm: Create view for "linter" table on Labs [puppet] - 10https://gerrit.wikimedia.org/r/348201 (https://phabricator.wikimedia.org/T160611) [07:23:53] !log executed CONFIG SET appendfsync no on redis2005:6780 as performance test [07:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:30] I started this test yesterday but got stopped by the unrelated replica lagging [07:25:21] Redis latency doctor is suggesting that all the redis jobqueues are showing spikes in latency for AOF related activities, that with the current (Default) config involve fsync() every 1s [07:26:01] what I am trying to test is if avoiding fsync could remove latency spikes registered [07:26:15] not suggesting to remove it everywhere of course, this is only to prove a point :) [08:00:13] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182442 (10elukey) @hashar nice finding! So let me recap the errors that we are seein... [08:02:59] 06Operations, 10Traffic, 10netops: Network equipment order for SIN - https://phabricator.wikimedia.org/T162984#3182444 (10ayounsi) [08:03:53] 06Operations, 10Traffic, 10netops, 10procurement: Network equipment order for SIN - https://phabricator.wikimedia.org/T162984#3182458 (10ayounsi) [08:09:39] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182459 (10elukey) While reviewing the above data and graphite metrics I realized that... [08:24:43] PROBLEM - swift-container-replicator on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:53] PROBLEM - swift-container-auditor on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
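The appendfsync experiment logged above (set to `no` on rdb2005 as a test, restored to `everysec` afterwards) can be repeated and rolled back safely with a few lines of client code. This is a minimal sketch assuming redis-py and a reachable test instance; it measures PING latency only as a rough proxy for the fsync-related spikes reported by the latency doctor, and, as the log itself stresses, it is not a suggestion to disable fsync everywhere.

```python
# Sketch (not part of the original log): the CONFIG SET appendfsync experiment
# above, expressed with redis-py so it is easy to repeat and always roll back.
# Host and port are placeholders for a test instance.
import time

import redis  # assumption: redis-py is installed

r = redis.StrictRedis(host="127.0.0.1", port=6380, decode_responses=True)


def p99_ping_latency_ms(samples: int = 200) -> float:
    """Rough p99 PING latency in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        r.ping()
        times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[int(len(times) * 0.99) - 1]


if __name__ == "__main__":
    original = r.config_get("appendfsync")["appendfsync"]
    print("baseline p99 (ms):", p99_ping_latency_ms(), "appendfsync =", original)
    try:
        r.config_set("appendfsync", "no")      # the temporary experiment
        print("no-fsync p99 (ms):", p99_ping_latency_ms())
    finally:
        r.config_set("appendfsync", original)  # always restore, as !log'd above
```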
[08:25:33] RECOVERY - swift-container-replicator on ms-be2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:25:43] RECOVERY - swift-container-auditor on ms-be2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:32:03] !log restored appendfsync to 'everysec' on Redis rdb2005:6380 (end of performance experiment) [08:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:46] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182473 (10elukey) We removed the persistent connections in T129517#2113526 @aaron -... [09:18:53] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [09:19:53] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2978037 keys, up 21 days 17 hours - replication_delay is 0 [09:23:18] I know I know Redis you are not happy with the buffers [09:25:13] PROBLEM - swift-object-updater on ms-be2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:26:03] RECOVERY - swift-object-updater on ms-be2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:43:34] !log temporarily set sysctl -w net.ipv4.ip_local_port_range="15000 64000" on mw1306 (jobrunner) as test - (rollback: sysctl -w net.ipv4.ip_local_port_range="32768 60999") - T157968 [09:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:42] T157968: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968 [09:44:53] PROBLEM - Check size of conntrack table on mw1306 is CRITICAL: CRITICAL: nf_conntrack is 95 % full [09:45:04] elukey: contract is full :( [09:45:29] yeah makes sense [09:45:52] more different (src, src_port, dest, dest_port) to keep track of I guess [09:46:08] then since tcp reuse is around, I guess we can try lowering the local port range [09:47:11] so we have net.netfilter.nf_conntrack_max = 262144 maximum [09:47:12] mmm [09:47:53] RECOVERY - Check size of conntrack table on mw1306 is OK: OK: nf_conntrack is 64 % full [09:48:19] elukey@mw1306:~$ sudo sysctl net.netfilter.nf_conntrack_max [09:48:19] net.netfilter.nf_conntrack_max = 262144 [09:48:19] elukey@mw1306:~$ sudo sysctl net.netfilter.nf_conntrack_count [09:48:20] net.netfilter.nf_conntrack_count = 173125 [09:48:35] so yeah that value would need to be bumped a bit as well [09:50:26] could we set the local port range per process? [09:50:49] !log temporarily set sysctl -w net.netfilter.nf_conntrack_max=524288 on mw1306 (jobrunner) as test - (rollback: sysctl -w net.netfilter.nf_conntrack_max=262144") [09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:47] forget me [09:51:51] I am mumbling really [09:52:22] the thing thing is that if the connections are reused, I am not sure why we would lack sockets [09:52:56] well not all of them are reused [09:53:03] only the ones that can be reused safely [09:56:46] so total conns in TIME_WAIT are now 102287 [09:59:54] hashar: what should I use as filter string in kibana to isolate redis messages and host:mw1306 [09:59:57] ? 
[10:00:10] yup [10:00:11] https://logstash.wikimedia.org/goto/0189564b88c008041dfd7eddb7c9e1a7 [10:00:17] and did channels.raw: "redis" [10:00:25] was actually looking at that [10:00:42] thanks :) [10:01:13] the thing is we have multiple errors [10:03:50] TCP: 110017 (estab 88, closed 109889, orphaned 0, synrecv 0, timewait 109886/0), ports 0 [10:03:57] this is a bit crazy to look on a client [10:05:07] then give the client reuses tcp connections [10:05:22] I am not sure how it would end up falling to connect due to lack of local fd [10:05:33] maybe there is a similar issue on the server side as well [10:05:44] I am lost really :( [10:06:17] I am not super expert in this particular case but I suspect that whatever is listed as TIME_WAIT is not recycled for some reason [10:06:43] otherwise if I got it correctly the socket should stay in TIME_WAIT only for a couple of second [10:06:46] *seconds [10:07:14] that option helps but it is probably not the silver bullet [10:07:29] the bigger issue in my opinion is that we lack proper connection pooling to Redis [10:08:13] supposedly that is twice a value of 60 seconds [10:08:21] or 2 minutes locked in TIME_WAIT stte [10:08:38] but I guess recycle actually reuse that [10:09:49] I am off for lunch. [10:14:13] will leave these settings for a bit and then rollback [10:14:20] I don't see a massive change [10:29:43] !log rollback systctl settings on mw1306 after experiment (stop jobchron/runner, stop hhvm, restore systctl settings, restart hhvm and job* daemons) [10:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:31] didn't have to stop hhvm at all, only job* daemons [10:34:16] will keep an eye on mw1306 [10:35:21] I really hoped for a better result :/ [10:42:27] 06Operations, 10Domains, 10Education-Program-Dashboard, 10Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3182571 (10Yury_Bulka) Is it likely to have this implemented by September? [10:49:27] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182578 (10elukey) So on every mw host we set `net.ipv4.tcp_tw_reuse=1`, this is proba... [11:04:11] (03PS1) 10Elukey: Fix Zookeeper's alarm for heap usage [puppet] - 10https://gerrit.wikimedia.org/r/348206 (https://phabricator.wikimedia.org/T157968) [11:13:54] (03CR) 10Elukey: [C: 032] Fix Zookeeper's alarm for heap usage [puppet] - 10https://gerrit.wikimedia.org/r/348206 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [11:16:32] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3182620 (10ayounsi) Here are the changes I suggest to push to get this going, a 1:1 copy of the setup in eqiad (except IPs). > ayounsi@mr1-codfw# commit check > configuration check succeeds Paste... [11:23:31] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3182655 (10ayounsi) I suggest we change this alert to only triggers on core/transit/peering/cust links and add a more global alert for all the other kind of links... 
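The numbers being juggled above (sockets parked in TIME_WAIT for roughly 60 seconds, the widened local port range, and the conntrack table fill level) can all be read directly from /proc without extra tooling. A minimal sketch follows, assuming a standard Linux host such as a jobrunner; the conntrack files only exist while the nf_conntrack module is loaded.

```python
# Sketch (not part of the original log): read the values compared above
# straight from /proc. State code 06 in /proc/net/tcp* is TIME_WAIT.
from pathlib import Path

TIME_WAIT = "06"  # TCP socket state code used in /proc/net/tcp and tcp6


def count_time_wait() -> int:
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        lines = Path(name).read_text().splitlines()[1:]  # skip the header row
        total += sum(1 for line in lines if line.split()[3] == TIME_WAIT)
    return total


def read_proc(path: str) -> str:
    return Path(path).read_text().strip()


if __name__ == "__main__":
    lo, hi = read_proc("/proc/sys/net/ipv4/ip_local_port_range").split()
    print("TIME_WAIT sockets :", count_time_wait())
    print("local port range  :", lo, "-", hi, f"({int(hi) - int(lo)} ports)")
    print("conntrack usage   :",
          read_proc("/proc/sys/net/netfilter/nf_conntrack_count"), "/",
          read_proc("/proc/sys/net/netfilter/nf_conntrack_max"))
```

Run periodically during an experiment like the one above, this gives the same picture as `ss -s` / `netstat -n -o` plus the conntrack gauge, and makes it easy to see that widening the port range mostly spreads the TIME_WAIT sockets rather than eliminating them.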
[11:25:45] elukey: so widening the local port range just allowed for the connections to spread on a larger ranger [11:25:56] leading to X * more connections in TIME_WAIT [11:27:05] and a raise of conntrack [11:28:47] hashar: yep, but that's was expected [11:29:22] and with netstat -n -o (which shows the time spent in a state) [11:29:31] the TIME_WAIT seems to idle for 60 secs [11:29:59] ah nice I didn't know that option [11:30:25] neither did I until I opened the man page :-} [11:31:35] this is kinda supporting my point that tw_reuse is not applicable to all the TIME_WAIT sockets [11:31:47] it helps but it is not the silver bullet [11:34:22] and netstat -n --statistics has a bunch of tcp related stats [12:34:46] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:46:10] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3182747 (10elukey) Multiple PEBKACs from my side: 1) I acked permanently the alarms in Icinga without realizing i... [12:59:12] (03PS1) 10Elukey: Fix thresholds for Zookeeper Heap usage alarms [puppet] - 10https://gerrit.wikimedia.org/r/348214 (https://phabricator.wikimedia.org/T157968) [13:00:47] (03CR) 10Elukey: [C: 032] Fix thresholds for Zookeeper Heap usage alarms [puppet] - 10https://gerrit.wikimedia.org/r/348214 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [13:03:46] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:04:37] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:31] (03CR) 10MarkTraceur: [C: 031] Full path to xvfb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348199 (owner: 10Matthias Mullie) [13:31:32] (I'm not sure here is proper place ;) )Could you please help me rename works? Accroding to GRP, to rename account with more than 50,000 edits should not be performed without supervision of a sysadmin. [13:32:54] Sounds about right [13:32:58] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:34:26] Oh, it's good luck. [13:35:18] What should I do for it? [13:40:01] Sotiale: I think the usual reason is to make sure people are around incase something goes wrong (or it causes big problems on one of the wikis) [13:43:58] (03PS3) 10Jgreen: exim/fundraising: barium -> civi1001, donate mails to civicrm [puppet] - 10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) (owner: 10Dzahn) [13:45:37] (03CR) 10Jgreen: [C: 032] exim/fundraising: barium -> civi1001, donate mails to civicrm [puppet] - 10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) (owner: 10Dzahn) [13:46:00] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3182893 (10elukey) 05Open>03Resolved @Dzahn thanks a lot for the heads up, I should have fixed the issues. My... [13:47:56] @Reedy: I already notified information about expected problems(ex. account login errors, etc..). As a result, He/She understood it and He/She said he/she would not object to any problem related to the problems. 
[13:48:13] Well, it's more if you cause database replication problems [13:48:18] Which has been know to happen in the past [13:48:31] I don't know how much better this is now as to how it was previously [13:53:16] from memory, legoktm recommends filing a task in phabricator and will look at the request [13:53:38] there are some that due to the number of edits can't be preformed safely [14:22:27] (03Abandoned) 10Hashar: Revert "ldap: Add warning to ldaplist" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) (owner: 10Hashar) [14:23:00] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#3182966 (10elukey) [14:26:58] (03Draft33) 10Hashar: (WIP) Crazy rspec for the role module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307425 [14:27:27] (03Abandoned) 10Hashar: (WIP) Crazy rspec for the role module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307425 (owner: 10Hashar) [14:32:10] (03Abandoned) 10Hashar: check_graphite anomaly option to set minimum upper band [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [14:33:17] (03Abandoned) 10Hashar: mediawiki-firejail: lint python scripts [puppet] - 10https://gerrit.wikimedia.org/r/338978 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [14:33:20] (03Abandoned) 10Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [14:33:43] (03Abandoned) 10Hashar: mediawiki-firejail: quiet firejail [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [14:47:38] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182213 (10Marostegui) Works for me [14:51:57] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3183025 (10elukey) [14:56:06] 06Operations, 13Patch-For-Review, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3183050 (10hashar) I have found a straightforward case: MediaWiki invokes `convert --version` bu... [14:59:13] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.56 seconds [15:02:13] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [15:47:43] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:15:33] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:54:37] 06Operations, 10Domains, 10Education-Program-Dashboard, 10Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3183349 (10Ragesoss) For a wikimedia.org link, very unlikely. Getting all of the requirements in place to move this to WMF production servers is... [17:07:06] !log deployed phabricator hotfix for T162943 [17:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:16] T162943 [17:09:30] bd808: is task detection link thing broken? [17:10:13] oh nevermind its a restricted task [17:13:24] doh! 
https://twitter.com/usrbingrump/status/852173471949914112 [17:15:08] twentyafterfour: when in doubt revert to the commit called "Initalise Repo" that commit always seems to work for some reason :P [17:37:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [17:41:17] 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3183419 (10Halfak) [17:41:58] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3183439 (10Dzahn) @elukey thank you for fixing :) They all look green now. I'll comment if i see them again. [17:42:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 2.678 second response time [17:44:05] 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142263 (10RobH) [17:44:36] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH) [17:45:06] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) [17:46:44] (03CR) 10Dzahn: [C: 04-1] "i think the best monitoring is probably if we add a second check, leave the current one as it is, but add a second one to specifically jus" [puppet] - 10https://gerrit.wikimedia.org/r/348165 (owner: 10Paladox) [17:48:06] !log mw1297 - restarted hhvm and apache [17:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:03] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:54:53] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:56:34] (03PS3) 10Paladox: Phabricator: Update nrpe command for checking if phd is running [puppet] - 10https://gerrit.wikimedia.org/r/348165 [17:57:28] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) [17:57:43] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) [18:07:43] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:53] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:33] PROBLEM - zotero on sca1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:43] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [18:08:43] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:43] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:10:23] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.007 second response time [18:10:33] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:10:33] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:10:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:01:42] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3183666 (10jcrespo) 05Open>03stalled Most likely a one-time error that got cached for some time? Tendril db tends to fail quite regularly due to large queries asking for large reports (but that is mostly ok). W... [19:22:29] (03PS1) 10Hashar: swift: make rewrite_thumb_server optional [puppet] - 10https://gerrit.wikimedia.org/r/348236 [19:30:02] 06Operations, 10hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3141925 (10RobH) [19:35:16] Dereckson: are you around? Trwiki is having some abusefilter issues agaiin [19:35:56] ^^ or if anyone else is around to assist with that i would be grateful [19:36:09] (03PS2) 10Hashar: swift: feature flag the proxy rewriting [puppet] - 10https://gerrit.wikimedia.org/r/348236 [19:39:53] (03PS3) 10Hashar: swift: feature flag the proxy rewriting [puppet] - 10https://gerrit.wikimedia.org/r/348236 [19:40:13] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:42:45] (03PS1) 10Aklapper: Phabricator monthly email: Also include Differential user activity [puppet] - 10https://gerrit.wikimedia.org/r/348238 [19:44:48] (03CR) 10Aklapper: "Probably needs additional permissions to allow the script to access the phabricator_differential DB (which the script did not access befor" [puppet] - 10https://gerrit.wikimedia.org/r/348238 (owner: 10Aklapper) [19:53:13] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [19:53:47] jouncebot: now [19:53:47] No deployments scheduled for the next 229 hour(s) and 6 minute(s) [19:54:13] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2970804 keys, up 22 days 3 hours - replication_delay is 0 [19:54:30] i know theres no deployments today but is there a chance i could have something deployed its a mediawiki/config change... https://gerrit.wikimedia.org/r/#/c/347807/ [19:56:34] it's not a mediawiki config change [19:56:38] it's a localisation change [19:57:32] It also doesn't look like it's anything actually broken, it's an enhancement [20:00:56] thats my fault i misread the repo [20:01:05] thats what happens when you look at mutiple gerrit changes [20:01:07] at once [20:02:43] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:03:17] Zppix: we usually dont deploy anything on friday [20:03:39] and next week is on deployment freeze because the service is going to be switched from a datacenter to another [20:03:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:03:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:04:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:05:43] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [20:05:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:07:43] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:07:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:08:13] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:09:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:09:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:10:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:11:43] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:11:43] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:11:43] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:12:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:14:32] (03PS1) 10Reedy: Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) [20:14:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:14:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:15:43] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:27] (03CR) 10TerraCodes: [C: 031] Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:18:23] (03CR) 10Reedy: [C: 032] Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:18:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:18:49] (03CR) 10Luke081515: [C: 031] Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:18:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:18:55] uhm, to slow :o [20:19:12] *to [20:19:14] *too [20:19:32] (03Merged) 10jenkins-bot: Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:19:42] (03CR) 10jenkins-bot: Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:19:43] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:19:43] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:19:53] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 1055468 msg (=800000 warning): ocg_render_job_queue 3044 msg (=3000 critical) [20:20:13] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 1047741 msg (=800000 warning): ocg_render_job_queue 3063 msg (=3000 critical) [20:21:00] wonder why ^^ that keeps happening [20:21:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:21:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:22:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:24:23] paladox: OCG exploded again :( [20:24:28] oh [20:24:39] thanks for explaning :) [20:25:24] https://grafana.wikimedia.org/dashboard/db/ocg?orgId=1 [20:25:28] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#2921133 (10RobH) [20:25:39] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3142232 (10RobH) [20:26:11] !log reedy@tin Synchronized wmf-config/abusefilter.php: abusefilter-modify-restricted for trwiki T161960 (duration: 01m 38s) [20:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:19] T161960: Enable the blocking feature of AbuseFilter on trwiki - https://phabricator.wikimedia.org/T161960 [20:27:33] oh [20:27:43] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:30:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:30:43] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:33:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:33:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:33:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:33:58] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3183833 (10bd808) ``` Accept-Ranges: bytes Age: 10 Content-Encoding: gzip Content-Length: 76 Content-Type: text/html; charset=UTF-8 Date: Fri, 14 Apr 2017 20:31:37 GMT Server: Apache Strict-Transport-Security: max-... [20:34:39] PROBLEM - MariaDB Slave SQL: s4 on db1084 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:34:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:34:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:05] What's up? [20:35:16] checking [20:35:25] <_joe_> hey [20:35:28] RECOVERY - MariaDB Slave SQL: s4 on db1084 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:35:32] good evening [20:35:34] <_joe_> I'm checking mobileapps now [20:35:36] and it's already back [20:35:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:35:43] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [20:35:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:36:04] <_joe_> uhm [20:36:41] db1084 is at 34 of loadavg with 16 cores and 3k connections [20:37:21] let me check the other DB too [20:37:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:39:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:40:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:41:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:41:53] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:44] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:44:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:18] PROBLEM - MariaDB Slave SQL: s4 on db1081 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:47:08] RECOVERY - MariaDB Slave SQL: s4 on db1081 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:47:14] what is going on with that one [20:47:53] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:47:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:47:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:48:46] marostegui: see "query" [20:49:53] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:52] (03PS1) 10Reedy: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 [20:50:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
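For the db1084 spike discussed above (load average around 34 on a 16-core host with roughly 3k connections), first-pass triage is generic MariaDB tooling rather than anything Wikimedia-specific. A sketch of the sort of commands one might run on the host, assuming client credentials are already set up; none of this is taken from the operators' actual session:

```bash
uptime                                                   # 1/5/15-minute load averages vs. core count
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"   # how many of the ~3k connections are still open
# Who holds the connections, busiest first:
mysql -e "SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
          FROM information_schema.processlist
          GROUP BY user, client ORDER BY conns DESC LIMIT 10"
# Long-running statements that could explain the load and the flapping slave checks:
mysql -e "SELECT id, time, state, LEFT(info, 80) AS query_snip
          FROM information_schema.processlist
          WHERE command <> 'Sleep' ORDER BY time DESC LIMIT 10"
```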
[20:51:00] (03PS2) 10Reedy: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 [20:51:02] (03PS1) 10Volans: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348242 [20:51:04] (03CR) 10Reedy: [C: 032] Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 (owner: 10Reedy) [20:51:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:51:53] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [20:51:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:15] (03Merged) 10jenkins-bot: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 (owner: 10Reedy) [20:52:29] (03CR) 10jenkins-bot: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 (owner: 10Reedy) [20:52:37] (03Abandoned) 10Volans: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348242 (owner: 10Volans) [20:53:38] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Disable Linter on larger wikis T148609 (duration: 00m 41s) [20:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:46] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [20:53:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:53] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:54:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:54:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:57:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [20:59:23] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [21:15:48] 06Operations, 10Gerrit, 07LDAP: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792#1676037 (10bd808) We may be able to do something about this trivially after completing {T161859}. As @hashar points out in the summary, today we... [21:24:59] Okay [21:25:02] Why was Linter pulled? [21:25:36] ShakespeareFan00: https://phabricator.wikimedia.org/T148609#3183893 [21:26:59] So it broke the wiki [21:27:06] Why doesn't this surprise me?
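The revert above follows the usual mediawiki-config pattern: merge the revert in Gerrit, pull it onto the deployment host, and sync the touched file. A hedged sketch of the deploy-host half (paths and the sync-file invocation are the conventional ones, not a transcript of Reedy's session):

```bash
# On the deployment host (tin), after https://gerrit.wikimedia.org/r/348241 has merged.
cd /srv/mediawiki-staging
git pull                          # fast-forward to the merged revert
scap sync-file wmf-config/InitialiseSettings.php 'Disable Linter on larger wikis T148609'
```

The "Synchronized wmf-config/InitialiseSettings.php: …" line in the log above is the SAL entry scap records for that sync.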
[21:27:08] XD [21:27:30] it was creating issues; it surely needs more investigation to understand what triggered it [21:38:13] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:43] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:04] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.726 second response time [21:43:34] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.707 second response time [21:45:43] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [21:49:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 7.791 second response time [21:51:33] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [21:52:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:41] Wonder why they are going off now [21:54:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [21:55:02] so far it's just those 3 and they flap back [21:55:04] but it's odd [21:56:20] oh yep [21:58:43] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:04:33] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.756 second response time [22:05:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:04] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:13] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:12:35] robh ^^ more it seems [22:12:41] different mw number [22:12:53] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:13:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database pawikisource: database exists on query. Default database: pawikisource. [Query snipped] [22:13:58] ... [22:15:22] Wheee, more bugs [22:16:13] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 7.735 second response time [22:16:28] !log created linter tables on pawikisource T148609 [22:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:36] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [22:17:26] !log created linter tables on wbwikimedia T148609 [22:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:00] Reedy any more bugs? [22:23:13] Parsoid is causing a load of spam in the logs [22:23:19] oh. [22:23:27] Is that because of linter?
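The two "!log created linter tables …" entries above record the outcome but not the method. One common way to create an extension's missing tables for a single wiki is MediaWiki's update.php via the mwscript wrapper; a DBA may instead apply the extension's SQL schema file directly, and the log does not say which approach was used here:

```bash
# Illustrative only; the actual commands behind the !log entries are not recorded.
mwscript update.php --wiki=pawikisource --quick
mwscript update.php --wiki=wbwikimedia --quick
```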
[22:30:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:34:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:37:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.005 second response time [22:39:02] ah, it's video scalers [22:39:13] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:51] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3184042 (10RobH) [22:40:03] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3142249 (10RobH) [22:45:16] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3029754 (10RobH) [22:45:27] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3184051 (10RobH) [22:48:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.014 second response time [22:49:01] !log restarting parsoid to get the disable linter change T148609 [22:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:08] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [22:58:28] !log skipping CREATE DATABASE pawikisource on dbstore2001- duplicate declaration due to multi-source [22:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:23] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [23:12:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database wbwikimedia: database exists on query. Default database: wbwikimedia. [Query snipped] [23:12:59] jynus: ^^^ [23:14:06] ha ha [23:14:39] !log skipping CREATE DATABASE wbwikimedia on dbstore2001- duplicate declaration due to multi-source [23:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:23] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [23:19:56] (03CR) 10Jcrespo: "It may have, but the previous was was scheduled long time ago?/puppet wasn't run properly? I didn't see alerts this time. However, we got " [puppet] - 10https://gerrit.wikimedia.org/r/347996 (owner: 10Jcrespo) [23:28:39] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3184078 (10jcrespo) [23:28:42] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3184077 (10jcrespo) [23:45:30] 06Operations, 10DBA, 10Traffic: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3184091 (10jcrespo) 05stalled>03Open I assume that is a hit of an error message? Traffic: What is tendril.wikimedia.org's caching policy so that this can happen? I would expect a smaller TTL than... 
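The "skipping CREATE DATABASE … duplicate declaration due to multi-source" entries above amount to skipping one replicated event on a single named connection. On a multi-source MariaDB replica, sql_slave_skip_counter acts on the connection selected by default_master_connection, so the sequence looks roughly like this (run on dbstore2001 itself; the connection name 's3' comes from the alert, and this is a sketch of the mechanism, not the operator's exact commands):

```bash
mysql <<'SQL'
STOP SLAVE 's3';
SET @@default_master_connection = 's3';   -- scope the skip counter to the s3 stream
SET GLOBAL sql_slave_skip_counter = 1;    -- skip the duplicate CREATE DATABASE event
START SLAVE 's3';
SHOW SLAVE 's3' STATUS\G
SQL
```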
[23:50:24] (03CR) 10Krinkle: Move contribution tracking config to CommonSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad)