[00:00:04] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150116T0000). Please do the needful. [00:01:25] Just an FYI for swatters, logstash has been a bit wonky today so you may want to monitor logs on fluorine for errors [00:02:12] (03PS3) 10Dzahn: etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (https://phabricator.wikimedia.org/T85788) (owner: 10John F. Lewis) [00:03:06] (03PS4) 10Dzahn: etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (https://phabricator.wikimedia.org/T85788) (owner: 10John F. Lewis) [00:03:59] (03CR) 10Dzahn: [C: 032] etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (https://phabricator.wikimedia.org/T85788) (owner: 10John F. Lewis) [00:04:09] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:04:58] greg-g, bd808, if you are done deploying, i would like to push out zeroportal [00:05:26] yurikR: It's time for swat but I don't know if there are patches or a deployer for it [00:05:45] no patches in wikitech [00:06:56] bd808, oh, it was showing as friday [00:07:10] RoanKattouw, ^demon|away, are you deploying swat? [00:07:23] i could add my patch to it [00:07:24] It is Friday :) get on UTC time man :) [00:07:30] ))) [00:08:14] <^demon|away> I don't even have a terminal open :p [00:08:14] i follow my own time... internet time [00:08:15] hrmm [00:08:25] 15:37 -tomaw(tom@freenode/staff/tomaw)- [Global Notice] Hi all. Yes, it seems we erred with a firewall rule there. Everything should be back to normal now. [00:08:44] ok, seems like i could just do my own depl instead of swat [00:09:02] or RoanKattouw wants to deploy? :D [00:11:26] I think you're it yurikR. Note my previous warning that you shouldn't necessarily trust logstash at the moment [00:11:41] bd808, what's the best way to track health? [00:11:44] atm [00:12:38] yurikR: Probably /usr/local/bin/fatalmonitor on fluorine [00:13:29] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [00:13:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [00:13:57] ^ trying to fix that, something messed up permissions [00:14:08] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [00:15:58] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [00:18:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:22:50] !log on both puppetmasters: chown gitpuppet /var/lib/git/operations/puppet/.git/logs/refs/heads/production & .git/logs/HEAD & .git/logs/refs/remotes/origin to fix puppet-merge. git pulled on strontium [00:22:55] hmm, this is weird - is there a reason why git pull on tin shows a new gerrit fingerprint? [00:23:02] bd808, ^ [00:23:29] yurikR: no idea [00:24:19] 5e:14:27:23:d2:20:69:cb:38:09:7d:5f:87:1d:16:2c ? [00:25:04] !log log bot , are you here? [00:25:34] mutante: nope. it didn't come back from the netsplit [00:25:51] hrmm.. ok.
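For reference, the puppet-merge fix !log'd at 00:22:50 above, expanded into explicit commands. This is a sketch reconstructed from that log entry; the entry abbreviates the three paths with "&", so the per-path expansion and exact flags are assumptions:

```bash
# On each puppetmaster (palladium, strontium): hand the git ref logs back
# to gitpuppet so puppet-merge can update them again
cd /var/lib/git/operations/puppet
chown gitpuppet .git/logs/refs/heads/production
chown gitpuppet .git/logs/HEAD
chown -R gitpuppet .git/logs/refs/remotes/origin  # -R is an assumption; this path is a directory
# then pull the outstanding commits (done on strontium, per the log)
sudo -u gitpuppet git pull
```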
freenode messed up with a firewall rule, heh [00:25:58] brb [00:26:43] !log yurik Synchronized php-1.25wmf15/extensions/ZeroPortal: zero portal to master (duration: 00m 13s) [00:26:57] 1 apache public key error: [00:27:17] !log yurik Synchronized php-1.25wmf15/extensions/ZeroPortal: zero portal to master - retry (duration: 00m 06s) [00:27:29] ok, fixed [00:36:58] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [00:45:48] PROBLEM - puppet last run on es2005 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:38] !log restarted morebots [00:46:46] Logged the message, Master [00:46:57] !log on both puppetmasters: chown gitpuppet /var/lib/git/operations/puppet/.git/logs/refs/heads/production & .git/logs/HEAD & .git/logs/refs/remotes/origin to fix puppet-merge. git pulled on strontium [00:47:01] Logged the message, Master [00:50:09] ACKNOWLEDGEMENT - Apache HTTP on mw1062 is CRITICAL: Connection refused daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - DPKG on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - Disk space on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - HHVM processes on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - HHVM rendering on mw1062 is CRITICAL: Connection refused daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - NTP on mw1062 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - RAID on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:10] ACKNOWLEDGEMENT - configured eth on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:11] ACKNOWLEDGEMENT - dhclient process on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:11] ACKNOWLEDGEMENT - nutcracker port on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:11] ACKNOWLEDGEMENT - nutcracker process on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:12] ACKNOWLEDGEMENT - salt-minion processes on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:51:57] ori: HHVM monitoring broke recently, somehow [00:52:01] or is it new [00:52:30] further down on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&hoststatustypes=3&serviceprops=2097162&nostatusheader [00:52:59] ah.. hmm: Got status 502 from the graphite server at [00:53:08] RECOVERY - puppet last run on es2005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:53:40] but a link like this seems ok http://graphite.wikimedia.org/render?format=json&from=-10min&target=servers.mw1186.hhvmHealthCollector.queued.value [00:58:59] (03CR) 10BryanDavis: "Log volume in logstash went up from 68,723,096 events on 2015-01-13 to 123,314,490 events on 2015-01-14 when group1 was switched to loggin" [puppet] - 10https://gerrit.wikimedia.org/r/185222 (owner: 10BryanDavis) [01:02:08] morebots, there? [01:02:08] I am a logbot running on tools-exec-03. [01:02:08] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [01:02:08] To log a message, type !log . 
[01:02:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [01:02:56] andrewbogott: i restarted it, it was alive but on the other side of the netsplit [01:03:11] ok [01:13:37] (03PS1) 10Dzahn: wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) [01:14:23] (03PS2) 10Dzahn: wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) [01:15:16] (03PS3) 10Dzahn: wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) [01:16:46] (03CR) 10Dzahn: [C: 032] wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) (owner: 10Dzahn) [01:19:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:37:48] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 1 failures [01:55:48] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [02:00:28] (03PS1) 10Ori.livneh: admin: update my deployment script [puppet] - 10https://gerrit.wikimedia.org/r/185374 [02:03:40] !log ori Synchronized php-1.25wmf15/extensions/EventLogging: (no message) (duration: 00m 06s) [02:03:46] !log ori Synchronized php-1.25wmf14/extensions/EventLogging: (no message) (duration: 00m 05s) [02:03:52] Logged the message, Master [02:03:56] Logged the message, Master [02:06:08] (03PS2) 10Ori.livneh: admin: update my deployment script [puppet] - 10https://gerrit.wikimedia.org/r/185374 [02:06:53] !log EventLogging syncs were of I335ad42bb: JsonSchemaContent: Fix html rendering of objects and arrays [02:06:57] Logged the message, Master [02:19:00] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s) [02:19:05] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-16 02:19:04+00:00 [02:19:10] Logged the message, Master [02:19:14] Logged the message, Master [02:31:33] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s) [02:31:37] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-16 02:31:37+00:00 [02:31:40] Logged the message, Master [02:31:45] Logged the message, Master [02:43:58] PROBLEM - Mediawiki Apple Dictionary Bridge on terbium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia_Foundation not found on https://search.wikimedia.org:443https://search.wikimedia.org/?lang=ensite=wikipediasearch=Wikimedia_Foundationlimit=1 - 3389 bytes in 0.094 second response time [02:46:55] https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wikimedia_Foundation&limit=1 is spitting out PHP for me... 
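For reference, the "string Wikimedia_Foundation not found" alert above can be reproduced from any shell. A minimal sketch of what the check asserts, using the URL and expected string from the alert text:

```bash
# Fetch the Apple dictionary bridge and test for the string the Icinga
# check expects ("Wikimedia_Foundation", per the CRITICAL message above)
curl -sk 'https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wikimedia_Foundation&limit=1' \
    | grep -q 'Wikimedia_Foundation' && echo OK || echo CRITICAL
```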
[02:47:16] http://fpaste.org/170327/21376442/raw/ [02:48:12] * legoktm has to run [02:48:41] i guess Apple caches the results, because i'm still getting them from osx dictionary [02:49:12] i know nothing about that url :( [02:58:29] if i drop &limit=1 i get a more reasonable looking output [03:01:53] !log xtrabackup clone db1020 to db1046 [03:02:03] Logged the message, Master [03:09:26] !log ori Synchronized php-1.25wmf14/includes/content/JsonContent.php: I2f4f9cb343: Let subclasses specify content model in JsonContent (duration: 00m 06s) [03:09:34] Logged the message, Master [03:17:40] (03PS1) 10Springle: db1020 is primary [puppet] - 10https://gerrit.wikimedia.org/r/185384 [03:20:06] (03CR) 10Springle: [C: 032] db1020 is primary [puppet] - 10https://gerrit.wikimedia.org/r/185384 (owner: 10Springle) [03:34:11] (03Abandoned) 10OliverKeyes: Change the URLs used by Pybal to simplify tracking for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [03:36:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [03:39:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [03:40:08] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 17.18 ms [03:46:26] Host google? [03:47:46] presumably a sanity check for internet connectivity from the monitoring server [03:47:49] PROBLEM - haproxy process on dbproxy1002 is CRITICAL: PROCS CRITICAL: 2 processes with command name haproxy [03:47:58] google safe browsing check? [03:48:38] (modules/icinga/manifests/gsbmonitoring.pp) [03:48:44] okay. . . [03:49:03] ACKNOWLEDGEMENT - haproxy process on dbproxy1002 is CRITICAL: PROCS CRITICAL: 2 processes with command name haproxy Sean Pringle me [03:49:16] huh yeah safe browsing check, i hadn't looked at this before [03:49:53] but host down would indicate failure to reach their service to query it [03:49:58] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:14:13] (03Abandoned) 10TTO: Allow import from any WMF project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://bugzilla.wikimedia.org/15583) (owner: 10TTO) [04:22:39] !log on mw1228 doing some tests to figure out why incorrect Expires header is being sent on requests for /images/* [04:22:46] Logged the message, Master [04:27:48] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:29:08] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 20.26 ms [04:34:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:36:28] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 15.98 ms [04:37:09] fwiw the google safe browsing checks have existed since at least 2011. icinga says downtime is 2m10s today, no other problems in the past week. it does say down rather than unreachable. i haven't found a google page reporting on the status of the service.
[04:40:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jan 16 04:40:10 UTC 2015 (duration 40m 9s) [04:40:19] Logged the message, Master [04:45:28] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [04:54:09] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:54:59] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.00 ms [05:00:59] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [05:01:32] i am still able to perform the test manually from neon even though it's reported to be in the down state [05:01:39] curl "www.google.com/safebrowsing/diagnostic?site=mediawiki.org/" | grep -i --color "not currently listed as suspicious" [05:01:58] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 15.97 ms [05:02:07] hmph [05:04:48] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:18:28] 3operations: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#981163 (10tstarling) 3NEW [05:29:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [05:30:01] i wonder why the google safe browsing checks hardcode an IP (74.125.225.84) for the host check instead of using www.google.com, which is what the actual service checks use. [05:31:58] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 17.59 ms [05:32:26] jgage: does hashar's name appear in the git-log? [05:33:22] he loves optimizing away hostname resolution by replacing stable and readable names with IP addresses [05:34:39] i didn't trace it back further than this commit by peter in 2011, which features the IP: https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/e0eac18323f8241b47e8005851962959ca4db969 [05:37:26] mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com. [05:54:27] !log mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com [05:54:33] Logged the message, Master [05:55:06] Reedy: can you /cs access #wikipedia-userscripts add Technical_13 helper [05:55:53] Almost no-one is ever there. Thanks. [06:07:52] thanks, i often forget about SAL. i wonder who reads it. [06:08:09] (03Abandoned) 10KartikMistry: WIP: Use SSL in cxserver config [puppet] - 10https://gerrit.wikimedia.org/r/185157 (owner: 10KartikMistry) [06:09:11] well, i do [06:15:48] !log Icinga test of Mediawiki Apple Dictionary Bridge as https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wikimedia_Foundation&limit=1 returns an error since shortly after l10n update at 02:31 UTC, though URL works without &limit=1 and end user osx dictionary lookups are still working.
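For reference, the service check runs by hostname while the "Host google" check pings the hardcoded address jgage mentions at 05:30, which is why the host can flap while the service stays reachable. A sketch; the ping invocations approximate what the Icinga host check does, not its literal command line:

```bash
# Service check, by hostname (quoted verbatim at 05:01 above):
curl "www.google.com/safebrowsing/diagnostic?site=mediawiki.org/" \
    | grep -i --color "not currently listed as suspicious"

# Host check equivalent: a ping against the IP hardcoded in gsbmonitoring.pp.
# Loss towards this single address flaps "Host google" even while the
# service check above still succeeds.
ping -c 5 74.125.225.84
ping -c 5 www.google.com   # what it arguably should probe instead
```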
[06:15:54] Logged the message, Master [06:16:29] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [06:17:09] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 17.25 ms [06:29:09] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:29] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:38] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:08] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:49] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:51] (03CR) 10Gage: [C: 032] Exclude most udp2log messages from logstash [puppet] - 10https://gerrit.wikimedia.org/r/185222 (owner: 10BryanDavis) [06:37:58] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [06:40:09] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 17.11 ms [06:44:11] <_joe_> uhm [06:45:49] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:39] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:39] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [07:09:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [07:13:10] 3operations: DRIVE YOUR CAR AND GET PAID ADVERTISING FOR MONSTER ENERGY DRINK.($400 Weekly) - https://phabricator.wikimedia.org/T86999#981250 (10emailbot) [07:13:48] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:14:49] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 19.80 ms [07:27:58] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:28:49] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 19.96 ms [07:30:36] 3operations, Spam-Spam: DRIVE YOUR CAR AND GET PAID ADVERTISING FOR MONSTER ENERGY DRINK.($400 Weekly) - https://phabricator.wikimedia.org/T86999#981254 (10yuvipanda) [07:32:00] <_joe_> lol [07:32:13] I don't drive :( [07:32:18] <_joe_> what's with this packet loss? [07:32:29] <_joe_> ori: as in you don't have a driver license? [07:32:41] mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com [07:32:44] Yeah. [07:33:26] <_joe_> good for you, I guess this limits the places you can live in the US [07:33:31] I could pose with the cans on the roof of the car if someone else does the driving. [07:33:44] But safety would be a concern. [07:34:39] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:36:09] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 16.48 ms [07:37:00] re: limits the places you can live in the US -- yeah. It doesn't even work very well in SF. It wasn't an issue in New York, though. 
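For reference, jgage's mtr diagnosis above ("packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21") can be captured non-interactively for a log entry; a sketch:

```bash
# 100 probe cycles, printed as a per-hop loss report (run from an eqiad host)
mtr --report --report-cycles 100 206.126.236.21
```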
[07:37:19] <_joe_> yeah new york is _definitely_ a walking city [07:39:06] <_joe_> but well, I have to say I'm managing to take the car once a month in Rome, so it's doable basically everywhere with a public transportation system (and the one here is really bad) [07:40:47] it's doable here, but it means forfeiting on the few truly nice things about the bay area, which is the nearness of some really breathtaking natural beauty [07:41:00] *on one of the [07:41:27] <_joe_> well, back to packaging! https://gerrit.wikimedia.org/r/#/c/185187/ [07:41:40] <_joe_> I also have a nutcracker package for precise [07:42:11] oh wow, pcre cache and the leak fix [07:42:17] nice [07:42:17] <_joe_> yes [07:42:33] <_joe_> we'll take it to production when I'm there next week :) [07:44:12] nice work [07:45:08] * YuviPanda also doesn’t drive for safety concerns [07:45:23] * YuviPanda waves [07:45:36] hi Yuvi [07:45:41] hi ori [07:45:56] I’ll see you again in a few days! Sqeee! :) [07:46:04] <_joe_> YuviPanda: :)) [07:46:35] <_joe_> I have to say I hate travelling for work, but at least this time I have the incentive of meeting with quite a lot of you guys in person [07:46:56] yup. [07:47:20] <_joe_> next year I will probably be complaining for weeks :P [07:47:39] ‘eugh, have to see *those* guys again. Hope I do not end up punching anyone’? :) [07:48:38] <_joe_> nah, I am peaceful :) [07:49:49] <_joe_> no, I hate work travel because it means being away from my family, working in hotel rooms, not having the opportunity to truly visit the place I'm in, and I usually come back really tired [07:49:58] <_joe_> Athens ops meeting was a nice exception [07:50:26] _joe_: exactly [07:50:28] 09:42 < _joe_> we'll take it to production when I'm there next week :) [07:50:31] lol [07:50:41] you think you'll work the next two weeks? [07:51:01] <_joe_> paravoid: if I sadly know myself, I'll be up around 3 AM every day [07:51:15] * _joe_ needs sleeping pills [07:52:07] _joe_: you could join me and ori in getting 2AM subway sandwiches! [07:52:10] if ori still does those [07:52:38] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:52:44] <_joe_> I'm more the "man vs food" type [07:52:44] heh [07:53:02] what's with google [07:53:08] <_joe_> I see no package loss from neon to google [07:53:11] <_joe_> btw [07:53:29] <_joe_> mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com [07:53:33] <_joe_> but I don't see that [07:53:51] me neither [07:54:12] * YuviPanda stays off betalabs today, goes to add proper nodejs support on toollabs [07:54:29] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 18.66 ms [07:56:07] hardcoded IP address in puppet... [07:56:33] <_joe_> paravoid: srsly? [07:56:43] <_joe_> sigh [07:57:35] oh so TimStarling's PCRE cache work landed in HHVM I see? [07:57:36] nice! 
[07:58:03] (03PS1) 10Springle: repool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185394 [07:59:08] (03CR) 10Springle: [C: 032] repool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185394 (owner: 10Springle) [07:59:12] (03Merged) 10jenkins-bot: repool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185394 (owner: 10Springle) [08:00:59] !log springle Synchronized wmf-config/db-eqiad.php: repool db1051 db1056, warm up (duration: 00m 10s) [08:01:05] Logged the message, Master [08:01:59] RECOVERY - Mediawiki Apple Dictionary Bridge on terbium is OK: HTTP OK: HTTP/1.1 200 OK - 748 bytes in 0.156 second response time [08:02:14] <_joe_> ok, who did what? [08:02:15] <_joe_> :) [08:07:08] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [08:07:29] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 19.04 ms [08:08:07] springle: there's a dbproxy1002 check_failover alert [08:21:30] (03PS1) 10Giuseppe Lavagetto: mediawiki: use HHVM for the apple search dictionary [puppet] - 10https://gerrit.wikimedia.org/r/185396 [08:21:35] <_joe_> paravoid: ^^ [08:22:42] (03PS2) 10Giuseppe Lavagetto: mediawiki: use HHVM for the apple search dictionary [puppet] - 10https://gerrit.wikimedia.org/r/185396 [08:24:47] (03CR) 10Giuseppe Lavagetto: "tested on testwiki, it does in fact use HHVM correctly." [puppet] - 10https://gerrit.wikimedia.org/r/185396 (owner: 10Giuseppe Lavagetto) [08:25:37] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use HHVM for the apple search dictionary [puppet] - 10https://gerrit.wikimedia.org/r/185396 (owner: 10Giuseppe Lavagetto) [08:25:38] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.002428770065 secs [08:26:45] mixing tabs and spaces [08:26:46] (03CR) 10Ori.livneh: "Already merged, but +1 anyway -- this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/185396 (owner: 10Giuseppe Lavagetto) [08:27:04] paravoid: I saw but chose not to complain :P [08:27:17] <_joe_> paravoid: I already know, I'll fix that [08:27:26] <_joe_> I just wanted to fix it in prod ASAP [08:27:40] <_joe_> all those poor iphone users... [08:27:45] so [08:27:53] can we encode the server in a header? [08:27:59] <_joe_> yes [08:28:18] we currently send "Server: Apache" and X-Powered-By [08:28:36] X-Cache has been very useful in debugging varnish issues [08:28:59] perhaps we want... Server: mw1052 (Apache, HHVM) or something? [08:29:31] or even more arbitrary tags perhaps, e.g. trusty, jessie etc. [08:29:38] but hostname for sure [08:29:52] <_joe_> I think the hostname is enough [08:29:53] (03PS1) 10Giuseppe Lavagetto: mediawiki: retab of the virtualhost for search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/185397 [08:30:22] well, you have more in X-P-By [08:30:26] and there's no point in two headers [08:30:55] <_joe_> X-P-By is set by HHVM in most cases, we can always tamper with it of course [08:32:12] <_joe_> the Server: header is set by apache internally [08:32:26] <_joe_> I think we can just change ServerTokens for that [08:32:39] <_joe_> meaning we can turn it off [08:33:00] there's always varnish :) [08:33:20] <_joe_> well, varnish doesn't know the hostname [08:33:38] no, but can replace Server with the value of another header [08:33:41] <_joe_> or are you suggesting to mangle headers so that we suppress one and unify it? [08:34:08] <_joe_> yeah, I'd prefer not to if possible. 
I'm taking a look [08:34:12] well let's first agree what we want to do [08:34:15] if anything [08:35:33] <_joe_> I'd say having one header that tells you a) it is apache b) the hostname serving the request and c) if it was served by HHVM, which version; and if it was a static content, state it [08:35:39] <_joe_> so something like [08:35:57] <_joe_> Server: Apache (mw1053) - HHVM/3.3.1 [08:36:15] <_joe_> or Server: Apache (mw1053) - static [08:36:43] <_joe_> and we can remove the X-Powered-By header too [08:38:48] <_joe_> interestingly enough, it seems the Server: header can't be unset in apache 2.4 [08:38:55] <_joe_> but we can probably overwrite it [08:39:59] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [08:40:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 17.02 ms [08:41:35] (03CR) 10Faidon Liambotis: [C: 04-2] "Why isn't ferm::rule enough? I'd rather not." [puppet] - 10https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713) (owner: 10Dzahn) [08:47:40] YuviPanda: there? [08:48:38] kart_: ‘sup [08:49:31] wtf firefox. can’t resolve gerrit? [08:49:35] chrome can [08:49:37] * YuviPanda mumbles [08:50:39] YuviPanda: you merged role for cxserver beta/production [08:50:43] but, https://gerrit.wikimedia.org/r/#/c/180125/6/manifests/role/cxserver.pp [08:50:44] yup [08:50:54] We sometimes need different config :) [08:51:05] especially: see the above PS [08:51:15] kart_: hiera! [08:51:27] YuviPanda: ouch [08:51:29] :) [08:51:55] kart_: so you either let the base class (in this case ::cxserver) have no params or defaults to prod, and override for labs [08:51:59] kart_: see the overrides for labs at http://wikitech.wikimedia.org/wiki/Hiera:deployment-prep [08:52:02] well, betalabs [08:52:19] more hiera documentation at http://wikitech.wikimedia.org/wiki/Hiera [08:52:44] Nods. Thanks. [08:52:53] _joe_: also, from ^ I don’t actually see anything that looks up based on $::realm [08:53:05] should / could we add one, for things that are ‘if you are a labs instance, do this' [08:53:19] <_joe_> YuviPanda: YuviPanda for a simple reason - labs has its own hiera config [08:53:19] that or we could already do this and I’m missing it [08:53:30] <_joe_> so it's naturally separated [08:53:49] _joe_: you mean the wikitech ones? [08:53:53] or is it there somewhere else too? [08:54:12] <_joe_> YuviPanda: I need to document this, you're right [08:54:15] (03CR) 10Alexandros Kosiaris: "You mean gerrit is not configurable enough to make it listen on both ports ?" [puppet] - 10https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713) (owner: 10Dzahn) [08:54:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [08:54:22] _joe_: yup. [08:54:35] ctrl-f realm gives me [08:54:35] hieradata/eqiad/admin.yaml ($::realm) [08:54:37] <_joe_> YuviPanda: you can have hieradata/labs.yaml [08:54:39] <_joe_> for example [08:54:40] but I’m pretty sure that’s $::site [08:54:46] _joe_: aha! that’s what I was looking for :) [08:54:59] _joe_: oh, it already exists [08:55:02] * YuviPanda looks sheepish now [08:55:09] <_joe_> YuviPanda: modules/puppetmaster/files/labs.hiera.yaml [08:55:18] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 18.19 ms [08:55:39] right. [08:56:23] _joe_: makes sense now. thanks.
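Back on the Server-header thread above: the three headers being debated are easy to compare on a live response. A sketch; the URL is just an illustrative production page:

```bash
# Server and X-Powered-By come from apache/HHVM; X-Cache is added by varnish
curl -sI 'https://en.wikipedia.org/wiki/Main_Page' \
    | grep -iE '^(server|x-powered-by|x-cache):'
```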
I should probably migrate at least *some* of the Hiera:deployment-prep overrides to a yaml file in ops/puppet [08:58:15] YuviPanda: that would be nice [08:58:30] (03PS1) 10KartikMistry: Moved comment at right place! [puppet] - 10https://gerrit.wikimedia.org/r/185400 [08:58:35] I am not opposed to the wiki page, but for beta it is probably better to have changes go through the Gerrit review :D [08:58:45] hashar: yeah, I agree [08:59:10] though I have no idea whether our hiera file hierarchy would support inheritance such as labs -> deployment-prep [08:59:27] (03PS2) 10Yuvipanda: cxserver: Move comment to correct place [puppet] - 10https://gerrit.wikimedia.org/r/185400 (owner: 10KartikMistry) [08:59:33] hashar: it does, I think. [08:59:40] mwyaml overrides nuyaml, I think [08:59:45] so you can still make changes to it on wikitech [09:00:02] (03CR) 10Hashar: [C: 031] cxserver: Move comment to correct place [puppet] - 10https://gerrit.wikimedia.org/r/185400 (owner: 10KartikMistry) [09:00:12] <_joe_> YuviPanda, hashar I strongly object to this [09:00:19] <_joe_> why gerrit? [09:00:26] so we can review? [09:00:27] <_joe_> wiki has revision history [09:00:35] and attach the change to a bug [09:00:35] well, primarily so when someone changes something and git greps, they find this as well [09:00:45] <_joe_> so that you need to wait for me to give +2 to you? [09:00:47] if you change the name of a param, for example. [09:01:00] <_joe_> srsly? [09:01:00] well on beta we can cherry pick the patch on the local puppetmaster [09:01:03] _joe_: no, they can still override it via wikitech (if I understood the hierarchy correctly) [09:01:12] but yeah, that is some more patches that will have to be +2ed by ops eventually [09:01:13] <_joe_> hashar: how is that better than editing the wiki page? [09:01:19] <_joe_> you're still skipping review then [09:01:20] <_joe_> :P [09:01:27] <_joe_> I really don't get it guys [09:01:29] on the wiki one can make a change without any peer review [09:01:53] <_joe_> even on the puppetmaster by cherry-picking changes that have -1 or -2 on gerrit [09:01:53] whereas we can propose the patch in Gerrit, wait for review and once reviewed deploy/cherry pick [09:01:57] <_joe_> or editing files by hand [09:02:17] <_joe_> I think file-based hiera in labs should just be for very general things [09:02:50] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Move comment to correct place [puppet] - 10https://gerrit.wikimedia.org/r/185400 (owner: 10KartikMistry) [09:02:52] <_joe_> but well, it's your poisoned well :) Just don't come to complain when puppet breaks there. [09:02:54] _joe_: well, re: cherrypicking them, sometimes they get -2d with ‘do not do this’ and no alternative is offered. [09:02:56] https://phabricator.wikimedia.org/T78076 for example [09:03:29] (03PS1) 10KartikMistry: admin: Add dotfiles for kartik [puppet] - 10https://gerrit.wikimedia.org/r/185401 [09:03:42] _joe_: I’m also killing all cherry-picks that aren’t there just for testing. there were 3, now there’s 1 (the one I just linked to) [09:03:46] that I’m not fully sure how to tackle. [09:04:04] I really hate when bugs make it to production :/ [09:04:23] <_joe_> well, reimage from scratch the appservers in beta :) [09:04:45] _joe_: and sometimes we get a rough patch deployed which is merely to unblock us, then we iterate with ops until the patch is production grade [09:05:16] that was the case to get hhvm auto update on the CI slaves. I originally used ensure => latest, got that deployed which got me hhvm.
Then iterated with ops to get the patch to use Debian unattended upgrade instead [09:05:20] but at least, I got hhvm installed :] [09:05:37] man, I was going to stay out of this today and work on toollabs instead. Totally failing on that now. [09:05:41] <_joe_> hashar: so how is having it in gerrit (the hiera data) any different than having them in wikitech [09:05:47] <_joe_> if you don't care about the review? [09:06:01] <_joe_> YuviPanda: go back to toollabs [09:06:02] <_joe_> :) [09:06:10] _joe_: I think it’s not ‘they do not care about review (always)’. They care about it less than ops does, but it’s not binary. [09:06:16] <_joe_> hashar: my point is that hiera data don't matter that much in a review [09:06:26] _joe_: I primarily want it for git-grepping [09:06:26] <_joe_> in general [09:06:35] (03PS1) 10Faidon Liambotis: Remove decom'ed server "sanger" from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/185402 [09:06:37] (03PS1) 10Faidon Liambotis: ldap: cleanup unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/185403 [09:06:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:06:59] (03CR) 10Faidon Liambotis: [C: 032] "Fairly obvious." [puppet] - 10https://gerrit.wikimedia.org/r/185402 (owner: 10Faidon Liambotis) [09:07:00] woo, LDAP cleanup [09:07:04] yeah a bit [09:07:06] more are coming [09:07:09] not much though [09:07:16] this is primarily a certs.pp cleanup [09:07:25] really the whole ldap module should go [09:07:28] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.37 ms [09:07:29] and replaced with OpenLDAP [09:07:37] and role classes [09:07:59] (and we already have a module for openldap) [09:08:09] _joe_: going back to toollabs now. All of this definitely needs to be talked about in person over the next two weeks. [09:08:35] (03CR) 10Faidon Liambotis: [C: 032] ldap: cleanup unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/185403 (owner: 10Faidon Liambotis) [09:08:41] _joe_: yeah I understand your point. Maybe I am overthinking it :-] [09:09:04] merges? [09:09:07] dammit [09:09:47] we should get you guys a dedicated Zuul setup that would be allowed to merge changes [09:09:53] no thanks [09:10:18] that saves you a click! [09:10:49] (actually not at all) [09:11:11] (03PS1) 10Faidon Liambotis: mailman: move into a new, separate module [puppet] - 10https://gerrit.wikimedia.org/r/185404 [09:11:20] what do people think of ^^ [09:11:24] should this be named "lists"? [09:11:27] or mailman is fine? [09:11:33] it's fairly wmf-specific [09:12:00] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: puppet fail [09:12:23] which blurs the line with the role class as well, e.g. all those monitoring checks could be folded into the module [09:13:09] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:13:16] _joe_ / akosiaris? [09:13:19] (03CR) 10Alexandros Kosiaris: [C: 032] admin: Add dotfiles for kartik [puppet] - 10https://gerrit.wikimedia.org/r/185401 (owner: 10KartikMistry) [09:14:20] paravoid: mailman is fine, the role should have the monitoring checks and named lists :-) [09:14:33] confused ya enough ? 
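To make YuviPanda's git-grepping argument from above concrete: with beta's hiera data living in ops/puppet, a parameter rename becomes a single search over code and data. A sketch; the parameter name is hypothetical:

```bash
# One pass over manifests, modules and hiera data when renaming a parameter
# ("apertium_url" is a made-up name used only for illustration)
git grep -n 'apertium_url' -- manifests/ modules/ hieradata/
# Overrides that live only on wikitech (Hiera:deployment-prep) never show up here.
```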
well the module has lists.wikimedia.org hardcoded in it [09:14:59] and you can't get away from it much [09:15:07] we have HTML templates, for instance [09:15:50] <_joe_> if we have a lot of wmf-specific things, lists is probably less misleading for other people [09:16:06] if you feel pedantic enough you can move the wmf-specific content to the role [09:16:18] no, almost everything is wmf-specific [09:16:25] <_joe_> but well I don't care much, mailman is fine anyways [09:17:02] <_joe_> gee I forgot why I separated ini directives for hhvm. sigh [09:17:08] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:17:14] ? [09:17:49] OK, so what are we supposed to do about google being down according to icinga ? [09:18:04] my guess is nothing but I am curious [09:18:08] (03PS2) 10Faidon Liambotis: mailman: move into a new, separate module [puppet] - 10https://gerrit.wikimedia.org/r/185404 [09:18:08] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 16.00 ms [09:18:15] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mailman: move into a new, separate module [puppet] - 10https://gerrit.wikimedia.org/r/185404 (owner: 10Faidon Liambotis) [09:18:17] <_joe_> akosiaris: it's the check that is wrong, it's using a hardcoded IP [09:18:38] <_joe_> I'll fix it when I'm done with HHVM if no one got to it [09:18:53] yeah, that should be fixed, but still that does not change my point [09:19:19] <_joe_> akosiaris: well, it's like a poor man's probe of network reachability [09:19:26] <_joe_> I guess [09:19:57] yeah I get the idea. I always setup those as well [09:20:06] I just don't have them notify [09:20:30] it's more like an indication for ops when they login into icinga web ui that something is terribly wrong [09:20:51] (03PS1) 10KartikMistry: WIP: cxserver: Use different URL for apertium in BetaLabs [puppet] - 10https://gerrit.wikimedia.org/r/185406 [09:21:09] akosiaris: Tried to do it, but feel free to fix ^^ :) [09:24:31] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:25:39] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 20.33 ms [09:35:34] <_joe_> kart_: that won't work [09:35:45] <_joe_> in labs you have a different hiera setup [09:36:40] (03Abandoned) 10Alexandros Kosiaris: WIP: cxserver: Use different URL for apertium in BetaLabs [puppet] - 10https://gerrit.wikimedia.org/r/185406 (owner: 10KartikMistry) [09:37:04] _joe_: Labs = Beta Cluster, you mean? [09:37:53] akosiaris: thanks. [09:37:57] _joe_: got it. [09:38:23] I should've read the discussion more carefully :) [09:38:53] I think it is already fixed kart_ [09:39:02] yesterday by YuviPanda [09:39:36] it is just that the old config.js was a link and not the config file itself [09:39:55] yup, I did that. [09:40:01] I also cleaned out the config.js file [09:40:06] so it should work? [09:40:37] YuviPanda: which one ? the cxserver/config.js or the cxserver/cxserver/config.js ? [09:40:44] I rm’d both [09:40:49] latter was a symlink to former [09:41:00] so the symlink was there just a few mins ago [09:41:07] I removed it and ran puppet [09:41:12] and now it is OK [09:41:26] hmm [09:41:30] now I wonder how that came back [09:41:33] so perhaps you were caught in a race ? [09:41:40] with? [09:42:03] between merging, puppet running and you rming the files ? [09:42:10] oh, that’s possible [09:42:18] not the most likely scenario but one that explains it [09:42:20] but puppet should’ve run again since?
hmm [09:42:24] at least if it does not happen again [09:42:39] puppet would have refused to remove the symlink [09:42:50] hmm [09:42:51] and populate it with a new file (unfortunately) [09:43:06] i think force => true would have done it though [09:43:09] hmm [09:43:17] or I wonder if jenkins or somesuch is making it a symlink [09:43:32] everything was owned by jenkins except config.js which puppet keeps setting to root [09:43:48] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:45:39] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 18.96 ms [09:48:29] akosiaris: sort of a security-related question, would any ops mind if i enable ptrace in labs ? [09:49:43] as in ? your VMs ? [09:49:46] or in general ? [09:49:59] and what do you mean by enable ptrace btw... [09:50:30] akosiaris: preferably, generally. I mean : sysctl sys.kernel.yama.ptrace_scope=0 [09:51:51] akosiaris: the reason behind it is i want to install reptyr labs-wide, it is very handy for senile people like me who forget to prefix commands with screen [09:52:00] and then log-out and cry [09:52:34] the problem is reptyr relies on ptrace, which is off by default in ubuntu [09:52:56] it is not off. It is disallowed only for non-child processes of the same uid [09:53:10] root bypasses anything by default anyway [09:53:15] same same :) [09:53:33] i'm not root in toollabs, only on some boxes [09:54:02] 3Wikimedia-Labs-wikitech-interface, operations: Interwiki map broken on wikitech - https://phabricator.wikimedia.org/T43786#981362 (10jayvdb) 5Open>3Resolved a:3jayvdb Appears the local interwikis are now working and in the sites interwikimap. [09:54:24] matanya: have you read this ? https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection [09:54:26] ? [09:55:02] (03PS1) 10Yuvipanda: beta: Ensure that mw related users are present in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/185409 (https://phabricator.wikimedia.org/T67591) [09:55:16] now I have [09:55:52] so you advise against akosiaris ? [09:55:56] so, I am leaning towards saying it is OK for labs, but I do have some reservations [09:56:06] (03CR) 10Yuvipanda: "Related patchset: https://gerrit.wikimedia.org/r/#/c/185409/" [puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [09:56:24] I'd advise against it [09:56:33] was waiting for this :D [09:56:34] it's an unnecessary deviation of prod from labs [09:57:08] and for toollabs, it's degraded security [09:57:51] paravoid: toollabs ? [09:58:12] I am unclear on the attack vectors for toollabs [09:58:29] then again I am always unclear on the entire environment for toollabs tbh [09:58:55] :) if you think betalabs is terrible... [09:58:55] not arguing with you though on the unnecessary deviation of prod from labs [09:59:39] YuviPanda: I dont. I am not overly familiar with the grid engine, that's all [10:00:16] akosiaris: yeah, grid engine has terrible documentation, and is just overall terrible for administering [10:00:29] unless you have spent ages doing it already, like Coren has. [10:00:51] I was afraid of that [10:01:32] outside the grid, it’s not so bad. it’s fairly well puppetized. [10:05:50] paravoid: so, you abandoned https://gerrit.wikimedia.org/r/#/c/134519/ a long time ago (from bd808) (for good reason). It lived on as a cherry-pick on beta-labs since, and I’ve fixed enough other things to be able to remove it now.
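For reference, matanya's request above in shell form. Note the canonical sysctl key is kernel.yama.ptrace_scope; the "sys." prefix in the quoted command belongs to the /proc path, not the sysctl name. The reptyr usage below is a generic sketch:

```bash
# Yama scope: 1 (Ubuntu default) limits ptrace to direct children;
# 0 restores classic same-uid attach, which reptyr needs
cat /proc/sys/kernel/yama/ptrace_scope
sudo sysctl kernel.yama.ptrace_scope=0   # note: no "sys." prefix in the key

# Typical rescue of a command started outside screen: open a new screen
# session, then steal the forgotten process by pid
screen
reptyr 12345   # pid is illustrative
```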
Just needs https://gerrit.wikimedia.org/r/#/c/185409/ which is okayish, I think (though I hope at some point the beta module [10:05:50] doesn’t exist). [10:06:01] I’ll wait for you or ori to +1 before merging. [10:07:09] and then I can get rid of that local cherrypick on beta [10:07:53] that's fine [10:08:33] alright then [10:08:36] I don't really care about what's under modules/beta [10:09:04] (03CR) 10Yuvipanda: [C: 032] beta: Ensure that mw related users are present in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/185409 (https://phabricator.wikimedia.org/T67591) (owner: 10Yuvipanda) [10:09:45] hmm, I hope we can kill the beta/ module at some point [10:10:07] I’m supposed to be working on making beta better this quarter [10:10:36] that's nice [10:11:04] yeah. monitoring + unification [10:11:06] as long as it's not if $::realm hacks all over the place, I'm onboard [10:11:22] yeah, no realm branches [10:11:26] hiera. hiera everywhere [10:11:39] careful about that too [10:12:07] if parameters can stay the same that's even better [10:12:15] sometimes they can, it's just that people took a shortcut [10:12:30] by the end of it, I’d ideally want an alert on betalabs puppetmaster that screamed if there was an unmerged cherry-picked patch on it for more than, say, a day [10:12:41] paravoid: true. I think *the* biggest culprit is /var being fucking 2G on labs [10:12:44] which is terrible and stupid [10:12:58] and a huge divergence from prod [10:13:03] oh god finally [10:13:11] a labs person that agrees with me on this [10:13:30] on the labs jessie thread I've basically said "get rid of the separate /var" very persistently [10:13:37] paravoid: sadly I’m also the only labs person who thinks this way rather than ‘it should be fixed by not polluting /var/log' [10:13:49] well, I really want us to just LVM / [10:13:54] I'm okay with that [10:14:00] (03PS2) 10Giuseppe Lavagetto: [WMF] New package with additional patches and fixes to the ini files and to the upstart/init scripts [debs/hhvm] - 10https://gerrit.wikimedia.org/r/185187 [10:14:00] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [10:14:17] it would be ok if prod also has tiny /vars, but no it does not [10:14:18] although really, a non-LVM / + reasonable base image space should be fine [10:14:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 15.40 ms [10:15:37] paravoid: yeah. let’s hope I can convince Coren and andrewbogott_afk over the next two weeks :) [10:15:58] and yay, local hacks on betalabs puppetmaster down to 1 [10:16:00] from 3 [10:16:41] you know, the image's size doesn't correspond to what's actually on disk [10:16:52] raw image files are sparse and qcow2 is even smarter than that [10:17:27] so even if the base image is 30G, unless something actually uses this, it won't take up 30G [10:18:20] if something uses 30G though, and removes it, I'm not sure if space is reclaimed [10:18:32] years ago there was work for this to happen using SCSI TRIM commands (discard) [10:18:33] yup, yup, yup.
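For reference, paravoid's point at the end of this exchange, apparent size versus space actually consumed, is visible directly on an image file. A sketch; the filename is illustrative:

```bash
# Sparse raw and qcow2 images only consume what the guest has written
qemu-img info base.qcow2            # virtual size vs "disk size" in use
du -h base.qcow2                    # blocks actually allocated on the host
du -h --apparent-size base.qcow2    # the nominal (e.g. 30G) size
```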
There’s literally 0 reasons for our current allocation [10:18:44] and most of our instances have lots of unallocated space that they never use [10:18:51] and if they do, it’s with the labs srv role, that just puts it all in /srv [10:19:02] ok [10:19:04] let's clean it up then [10:19:20] I have a very strong opinion against a separate /var [10:19:26] as it happens, it also fails with jessie right now [10:20:00] well, if we don’t have a separate /var [10:20:04] it would just use up space from / [10:20:05] but [10:20:10] /dev/vda1 9.3G 5.9G 3.0G 67% / [10:20:14] root itself is smallish [10:20:15] exactly [10:20:24] omg I've told the exact same things to Coren [10:20:30] so that’s terrible to [10:20:31] *too [10:21:01] so even if you merge 2G from /var and 2G from /var/log (the latter happened recently after lots of complaining), that’s still only 14G [10:21:04] not enough, I’d think [10:21:05] at all [10:21:25] well you get all the unused spare space from / too though [10:21:37] free space is currently fragmented [10:21:38] that's silly [10:21:56] so if I look at this instance [10:21:57] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000006a5.eqiad.wmflabs [10:22:02] it says 40G [10:22:03] <_joe_> I have a deja vu [10:22:06] but of course, it’s not used at all [10:22:17] <_joe_> like this discussion has been done 10 times at least :) [10:22:30] <_joe_> I do agree completely [10:22:43] <_joe_> lemme add that disk space is the cheapest commodity we have nowadays [10:22:46] how about we all corner Coren and andrewbogott_afk next week and make this happen? :) [10:23:11] not just for jessie (but that would be a start) but also rebuild trusty instances. [10:23:22] and then I can go around killing all the places I had to make log file paths configurable [10:23:26] that’s just a waste of time [10:23:40] although it’s been months since a /var/log alert now, which is nice [10:23:55] my proposal would just be [10:24:02] 30G / no LVM [10:24:13] and LVM on /srv [10:24:22] <_joe_> +1 [10:24:34] if people want to write a bunch of crazy logs, they can always symlink /var/log/crap to /srv/crap [10:24:41] or log to /srv/crap directly [10:24:44] yeah, unified / [10:24:50] plus LVM [10:24:52] seems nice. [10:25:00] <_joe_> "/srv/crap" is defined in FHS? :P [10:25:01] or mount vg-crap to /var/log/crap [10:25:05] <_joe_> it sounds nice [10:25:12] or bind-mount /var/log/crap to /srv/crap [10:25:15] or WHATEVER [10:25:19] it's not rocket science [10:25:25] <_joe_> "a general container for java applications" [10:25:51] yup, yup [10:25:52] also, if you need more than say, 10G for logs, you're doing something wrong [10:26:09] but needing more than 2G for /var is completely sane, I think [10:26:19] you either log for a very long time, which is bad from a privacy perspective [10:26:29] or you log in a very high rate, which is bad from an IOPS perspective [10:26:35] (for labs, that is) [10:26:39] yes, 2GB for /var is just bonkers [10:26:56] how about instead I just write to an sqlite file on NFS? :) [10:27:24] it’s not like that would saturate links or anything. 
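The layout proposed above (plain 30G root, LVM only where growth is expected) would look roughly like this at instance build time. A sketch under stated assumptions: the spare disk /dev/vdb and the volume group name are illustrative:

```bash
# Non-LVM / stays as built; put all flexible space in an LVM-backed /srv
pvcreate /dev/vdb
vgcreate vd /dev/vdb
lvcreate -n srv -l 100%FREE vd
mkfs.ext4 /dev/vd/srv
mkdir -p /srv
mount /dev/vd/srv /srv
```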
[10:27:54] labstore1001's ganglia is broken again I see [10:27:54] *sigh* [10:28:24] <_joe_> that may be my fault [10:29:16] I have filed https://phabricator.wikimedia.org/T87003 for this [10:29:52] :D [10:31:06] edited description to be better [10:32:08] hmm [10:32:14] I can’t join a channel [10:35:04] 3Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#981453 (10yuvipanda) 5Open>3Resolved a:3yuvipanda They have a sane shell now! \o/ [10:35:05] 3Beta-Cluster, operations: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#981456 (10yuvipanda) [10:35:53] 3Beta-Cluster, operations: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#981459 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This and the associated issues (different shell, etc) have been fixed. prod and beta are unified on mwdepl... [10:41:53] akosiaris: re: https://phabricator.wikimedia.org/T86143#981469 [10:42:01] akosiaris: yes! I opened up ferm rules for those, and it still doesn’t work! [10:42:15] akosiaris: port 22 is blocked from shinken-01 and shinken-server-01 to deployment-mediawiki02, for example [10:47:01] ok, I am having a look [10:48:03] (03PS2) 10Giuseppe Lavagetto: mediawiki: retab of the virtualhost for search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/185397 [10:48:05] (03PS1) 10Giuseppe Lavagetto: monitoring: allow host to check based on the fqdn of a host [puppet] - 10https://gerrit.wikimedia.org/r/185414 [10:48:52] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: retab of the virtualhost for search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/185397 (owner: 10Giuseppe Lavagetto) [10:56:11] (03PS3) 10Giuseppe Lavagetto: [WMF] New package with additional patches and fixes to the ini files and to the upstart/init scripts [debs/hhvm] - 10https://gerrit.wikimedia.org/r/185187 [11:02:54] YuviPanda: deployment-mediawiki02 does not include the ferm class (anymore?) [11:02:54] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#835083 (10yuvipanda) [11:03:07] that is why it is not updating the rules [11:03:25] now why did it used to include ferm and no longer does ? [11:03:27] searching [11:04:09] (03PS1) 10Springle: require at least one haproxy proxy, but allow multiple. [puppet] - 10https://gerrit.wikimedia.org/r/185416 [11:04:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [11:04:38] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 16.73 ms [11:05:31] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#981511 (10mark) >>! In T78076#976105, @bd808 wrote: >>>! In T78076#975352, @yuvipanda wrote: >> Why is this needed again? T76086 seems to have fixed T75206. And as @ori said, we should be agnostic about the... [11:05:53] (03CR) 10Springle: [C: 032] require at least one haproxy proxy, but allow multiple. [puppet] - 10https://gerrit.wikimedia.org/r/185416 (owner: 10Springle) [11:06:02] akosiaris: aha! that makes sense, yeah. [11:06:07] I’ll let you dig :) [11:12:21] * YuviPanda goes afk for food [11:18:49] 3Release-Engineering, operations, Continuous-Integration: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#981523 (10hashar) I have no spare cycles to implement the feature in Zuul.
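The blocked-port symptom YuviPanda reports just above is quick to confirm with a plain TCP probe from the Shinken hosts; a sketch:

```bash
# From shinken-01 / shinken-server-01: is TCP/22 on the target reachable?
nc -zvw2 deployment-mediawiki02 22
# or, closer to what an ssh service check exercises:
ssh -o ConnectTimeout=2 -o BatchMode=yes deployment-mediawiki02 true
```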
That is straight python, should not be too hard for anyone to realize it. [11:22:48] RECOVERY - haproxy process on dbproxy1002 is OK: PROCS OK: 2 processes with command name haproxy [11:23:32] _joe_: mw1244 is throwing 503s [11:23:44] and mw1017 has puppet disabled but I guess you know that [11:24:05] apergos: dataset1001 disk space alerts again [11:24:41] YuviPanda: updated https://phabricator.wikimedia.org/T86143, it seems the base::firewall was removed. It needs to be re-added from ferm to pick up changes [11:24:48] s/from/for/ [11:26:43] <_joe_> paravoid: yes mw1017 I know [11:27:17] <_joe_> (and I've put in a comment) [11:27:23] akosiaris: ah, hmm. [11:27:31] what’s the status of base::firewall in our prod hosts? [11:28:26] included selectively [11:28:52] do the mw hosts have them? [11:28:58] don't think so [11:29:37] hmm [11:29:48] <_joe_> I don't think so, no [11:33:18] akosiaris: so I’m not sure why / when it was added at all [11:33:24] oh [11:33:36] hmm [11:33:53] I suppose the way I’d want to fix this is by including base::firewall on mw* hosts in general [11:34:01] I suppose eventually we’d want base::firewall on all hosts [11:34:30] <_joe_> why? [11:34:33] no [11:34:42] <_joe_> we don't actually [11:34:47] not all hosts need it and it can impact performance [11:36:59] hmm [11:37:00] ok [11:37:10] * YuviPanda doesn’t know much about firewalls other than trivial iptables rules [11:37:23] that's all we have :) [11:37:25] so far [11:37:29] just trivial iptables rules [11:37:43] they're just encapsulated in a couple abstraction layers [11:38:08] oh sure, ferm I understand. But I suppose I don’t fully understand why we firewall off some internal-only services and not others. [11:38:16] or if we do that at all [11:38:17] actually [11:38:31] well we started with no firewalls [11:38:46] then there were some horrible iptables.pp scripts [11:38:59] since ferm, we've been trying to expand the usage of firewalls [11:39:21] public services (public IPs, that is) are obviously higher priority than private ones [11:39:29] akosiaris: so I see that the problem is in all hosts where natfix *was* applied at some point (deployment-prep and integration). And I don’t want to apply base::firewall in beta when prod won’t have it (at least for now). perhaps ‘solution’ is to hand drop the default drop policy? [11:39:48] right. I understand why we do that on hosts with a public IP [11:40:04] but not sure why we do it for private hosts. [11:41:19] well, there's no reason for one private host to be able to SSH to another, for example [11:41:39] there's not? [11:41:40] but that's a lower priority for sure, I'm not sure which private hosts you're referring to specifically [11:41:43] how do you copy stuff between hosts? :P [11:41:53] hmm, so just defence in depth? [11:42:02] paravoid: sca* hosts seem to have base::firewall applied. [11:42:10] well, at least they have ferm rules for opening up particular ports [11:42:29] that are served by cxserver, apertium, etc [11:42:33] mark: scp -3? :) [11:42:45] what's that? [11:43:04] copy from one remote to the other via the localhost, but it was mostly a joke [11:43:17] oh that [11:45:08] imho firewalling of purely internal hosts is a discussion we may want to have, but later [11:46:04] yeah, I think the solution to my current problem is to make beta more like prod by removing remnants of old base::firewall [11:46:26] what is beta-specific here? [11:46:40] or how does this relate to beta?
[11:46:50] paravoid: https://phabricator.wikimedia.org/T86143 [11:46:58] I want to have ssh checks on labs. [11:47:08] default security group already opens ssh to everywhere internally in labs [11:47:23] but beta hosts had base::firewall applied a long time ago [11:47:30] for a very, very kludgy NAT fix [11:47:36] ohgod [11:47:42] because labs instances can’t actually hit their own public IPs [11:48:06] akosiaris: removed the natfix and related kludginess (we moved the kludginess to the DNS server instead) [11:48:14] paravoid: but the remnants of base::firewall still exist [11:48:27] oh wait, *THAT* is why there are ferm rules in the sca* roles [11:48:36] the problem aiui is that there is an *unmanaged* firewall on those hosts [11:48:42] yup [11:48:44] i.e. there is ferm config lying around that isn't managed by puppet [11:48:48] oh [11:48:51] no, I think not [11:48:56] afaik, at least [11:49:05] it just wasn’t ensure => absent (or equivalent) when removed [11:49:12] well exactly [11:49:13] oh, I see. we might be saying the same things [11:49:30] if you include base::firewall, this installs ferm and realizes all the ferm::rules [11:49:34] right, I first read that as the firewall is being hand-managed [11:49:37] :) [11:49:38] = creates /etc/ferm/* [11:49:39] right [11:49:46] but assuming that we won’t be doing this in prod anytime soon [11:49:54] removing the class doesn't actually undo this (because... puppet) [11:49:57] my inclination is to just hand-remove the old rules. [11:49:58] yeah [11:50:06] yeah [11:50:08] just dpkg -P ferm [11:50:14] oh, that should do? [11:50:29] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [11:50:30] I don't remember if postrm restores the set but [11:50:45] dpkg -P ferm; iptables -P INPUT ACCEPT; iptables -F INPUT [11:50:51] should do it :) [11:50:51] sudo service ferm stop ; dpkg -P ferm should do it anyway [11:51:04] hmm [11:51:08] dpkg -P ferm errors out [11:51:14] with [11:51:14] no such variable: $BASTION_HOSTS [11:51:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 17.50 ms [11:51:33] sigh [11:51:36] rm /etc/ferm/* first? [11:51:39] yeah [11:52:05] hmm [11:52:08] there’s still [11:52:08] a [11:52:13] DROP tcp -- anywhere anywhere state NEW tcp flags:!FIN,SYN,RST,ACK/SYN [11:52:18] although I can’t actually read that rule properly yet [11:52:23] is that the default drop? 
[11:52:35] no [11:52:39] hmm [11:52:46] that’s the only DROP [11:52:53] just don't logout [11:52:58] :) [11:53:05] heh [11:53:07] it will drop packets in NEW state and without FIN,SYN,RST,ACK/SYN [11:53:14] TCP flags set [11:53:14] iptables -P INPUT ACCEPT; iptables -F INPUT [11:53:55] btw, this needs to be done on all beta hosts [11:54:04] so automate it as much as possible :-) [11:54:08] woo [11:54:10] that did it [11:54:15] akosiaris: yeah, can do a salt command [11:54:27] just be extra carefut [11:54:29] careful [11:54:34] I probably shouldn’t do it on a friday evening [11:54:37] with no other labsen around [11:54:44] or no other releng people around [11:55:06] let me just note this on the phab task [11:55:07] and let it be [11:55:32] I can do it if you want [11:55:42] akosiaris: that would be awesome :) [11:57:07] * YuviPanda adds ‘firewalls’ to list of opsy-fundamental-things he should fundamentally understand [11:57:59] akosiaris: do log in -releng [12:00:29] (03PS4) 10Yuvipanda: shinken: Add ssh checks for all monitored hosts [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) [12:01:55] YuviPanda: might wanna check now and perhaps close https://phabricator.wikimedia.org/T86143 [12:02:26] akosiaris: yup, I cherrypicked the patch to http://shinken-test.wmflabs.org/problems [12:02:35] and it’s churning through ssh checks now [12:02:51] http://shinken-test.wmflabs.org/all?global_search=ssh# [12:02:53] (guest/guest) [12:03:21] akosiaris: I’m also going to move shinken to jessie once we have that labs image available. jessie has 2.x and then I can work on getting the web ui fixed / a proper auth provider [12:04:14] yay! [12:04:22] sooo many things to do [12:04:52] there’s also labsdb-audit, but that’s blocked-ish on core. and me and springle are going to drop a lot of databases / tables during the next weeks from labsdb! [12:05:16] brb in about 15mins, fooood [12:05:19] and postgres :P [12:05:28] akosiaris: yup. I’ve the code written, but haven’t found time to test it. [12:05:33] have a nice dinner ? [12:05:39] akosiaris: also need to figure out where exactly it’ll run. probably the NFS machine [12:05:39] it is dinner time over there isn't it ? [12:05:40] I think YuviPanda needs more tasks [12:05:48] akosiaris: it’s 5:40PM, and it’s lunch... [12:06:04] 3:30 hours away only... [12:06:05] akosiaris: I’ve sortof standardized on an europeanish timezone now, where I wake up just about when _joe_ starts working :) [12:06:10] ahahahah [12:06:27] which weirdly is always earlier than me [12:06:29] much better than my previous mid-atlantic timezone where I’d wake up when _joe_ goes to lunch [12:06:39] and I am one tz before him [12:06:52] although my trick when going to SF has always been ‘switch to SF timezone for a week before' [12:06:56] you were in a midatlantic tz ? [12:07:03] and that usually implies a shift of about 2-3h [12:07:10] who were you working for ? iceland ? [12:07:11] :P [12:07:20] akosiaris: well, tz as calculated by ‘whenever you wake up, presume that is 8AM' [12:07:26] <_joe_> lol [12:07:27] ‘and wherever it is 8AM at that time, is your tz' [12:07:39] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:07:48] fixing this idiotic google thing ! [12:07:55] <_joe_> YuviPanda: you wake up at 7:00Z [12:07:59] YuviPanda: go to lunch ! 
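Returning to the ferm cleanup: the "automate it as much as possible" suggestion above would look something like a single salt cmd.run across the project (the deployment-prep target glob here is an assumption):

    # hypothetical one-shot cleanup across all beta (deployment-prep) hosts
    salt 'deployment-*' cmd.run \
        'rm -f /etc/ferm/*; service ferm stop; dpkg -P ferm; iptables -P INPUT ACCEPT; iptables -F INPUT'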
[12:08:00] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.47 ms [12:08:02] :P [12:08:03] <_joe_> akosiaris: there is a CR from me [12:08:07] :D [12:08:32] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/#/c/185414/ [12:08:40] it’s also partly the team change. Apps team, once we hired more engineers, closest I had was east coast US [12:08:46] before that everyone I worked with was in west coast. [12:09:03] but that meant I’ll get bored in the evenings and do analytics / labsy things, which turned out to be a good thing [12:09:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:09:43] anyway, foooood [12:10:04] hmpf, I was supposed to have added nodejs support in toollabs, and instead got caught up on beta again [12:10:28] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 19.99 ms [12:11:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] monitoring: allow host to check based on the fqdn of a host (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/185414 (owner: 10Giuseppe Lavagetto) [12:12:45] _joe_: all that if case there has me looking at it multiple times tbh [12:12:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [12:13:00] seems more complicated than it should be [12:13:29] <_joe_> akosiaris: maybe it is :) [12:14:16] <_joe_> akosiaris: I tried not to remove existing safeguards, but they are in fact useless [12:14:43] remove them, I won't mind [12:17:39] (03PS3) 10Yuvipanda: deployment: Unify salt_masters role for prod / labs [puppet] - 10https://gerrit.wikimedia.org/r/185137 (https://phabricator.wikimedia.org/T86885) [12:17:45] anyone wanna CR ^? [12:18:31] (note that the inclusion in virt1000 is wrong, and should be removed in a later patch. I’ve noted this in gerrit comments) [12:19:26] (03PS2) 10Giuseppe Lavagetto: monitoring: allow host to check based on the fqdn of a host [puppet] - 10https://gerrit.wikimedia.org/r/185414 [12:19:33] <_joe_> YuviPanda: I will [12:19:43] thanks [12:21:02] (03CR) 10Giuseppe Lavagetto: monitoring: allow host to check based on the fqdn of a host (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/185414 (owner: 10Giuseppe Lavagetto) [12:21:27] <_joe_> akosiaris: I hope the new patchset is ok [12:21:36] <_joe_> the former one was pretty wrong btw [12:22:23] <_joe_> I'm a bit tired today [12:24:03] (03CR) 10Yuvipanda: [C: 032] "Firewall issues fixed!!!1" [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) (owner: 10Yuvipanda) [12:25:54] akosiaris: so now this is fixed :) [12:26:02] akosiaris: I’m wondering if we should remove the ferm:: rules from *oid [12:26:37] akosiaris: because base::firewall isn’t included anywhere now that I checked, and I think the ferm rules were supposed to be there for beta since it included NAT rules in the past but somehow people got confused and put them in prod *and* beta? 
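The "iptables -L says there are no rules" observation above is the quick way to confirm the declared ferm::rules are no-ops on the sca* hosts: without base::firewall the rules exist in puppet but are never realized, so the kernel tables stay empty. For example:

    iptables -S        # ruleset in iptables-save form; nothing but the
                       # default '-P INPUT ACCEPT' etc. means no firewall
    iptables -L -n -v  # the verbose listing referenced in the discussion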
[12:26:47] indeed, sca* hosts have no iptables [12:28:01] (03PS1) 10Alexandros Kosiaris: txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 [12:28:43] (03CR) 10jenkins-bot: [V: 04-1] txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 (owner: 10Alexandros Kosiaris) [12:28:49] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:32:25] (03CR) 10Истенный: "https://git.wikimedia.org/git/operations/puppet.git" [puppet] - 10https://gerrit.wikimedia.org/r/185416 (owner: 10Springle) [12:32:58] (03PS1) 10Glaisher: Set wgCategoryCollation to 'uca-pl' on plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185426 (https://phabricator.wikimedia.org/T86821) [12:33:27] 3ops-core: Build a new HHVM package - https://phabricator.wikimedia.org/T86906#981636 (10Joe) Package built and manually deployed on beta [12:34:04] (03PS1) 10Yuvipanda: *oid: Remove useless ferm declarations [puppet] - 10https://gerrit.wikimedia.org/r/185428 [12:34:10] akosiaris: ^ more *oid cleanup [12:34:20] and removes some kludginess too [12:35:08] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:35:10] (03PS2) 10Alexandros Kosiaris: txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 [12:35:39] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.20 ms [12:36:36] (03CR) 10Alexandros Kosiaris: "On a side note, the provisioning of systemd unit files means a systemctl daemon-reload needs to be issued for systemd to wake up and get t" [puppet] - 10https://gerrit.wikimedia.org/r/185424 (owner: 10Alexandros Kosiaris) [12:36:59] YuviPanda: those are not useless [12:37:31] it might have been related to beta at some point but those rules are not useless at all [12:38:06] akosiaris: well, base::firewall isn’t applied in the sca* hosts [12:38:12] and iptables -L says there are no rules [12:38:20] akosiaris: of course, the alternative is we apply base::firewall to sca* hosts [12:38:29] which is another way of making these useful :) [12:38:32] but right now they’re noops [12:38:32] yeah, but when it is applied they will be realized [12:38:38] exactly [12:38:48] but should we apply base::firewall to sca* hosts? [12:38:50] so when the time comes, you just have to turn a switch [12:38:54] hmm [12:38:55] I feel that we should [12:38:59] mark is gonna say no :P [12:39:06] haha :) [12:39:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Feel free to kill the comments that mislead and talk about Beta and no longer relevant ticket but the rules themselves are not useless." [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [12:39:46] in fact... [12:39:48] akosiaris: yeah, the comments at least should be killed. and parsoid has no ferm rule [12:39:53] in ::production [12:40:01] the others, half did and half didn’t and I unified them to have it [12:40:14] yeah. I appreciated that :-) [12:40:53] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [12:41:04] hopefully we can get rid of our prod lucid instances at some point :) [12:41:04] lucid ???? 
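On the "we are like 99.999% free" of lucid point: a hypothetical way to double-check the stragglers from the salt master is a grain match, assuming the remaining boxes (sodium, ms1004, nescio) even run salt minions:

    # -G matches on grains; lsb_distrib_codename is a standard salt grain
    salt -G 'lsb_distrib_codename:lucid' test.ping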
[12:41:08] so I can kill that instance [12:41:12] akosiaris: yes, apergos uses it for salt testing [12:41:15] we have only 3 [12:41:19] it can’t die until we are lucid free in prod [12:41:19] actually 2 [12:41:29] we are like 99.999% free [12:41:40] its sodium, a dead box called ms1004 and nescio [12:41:42] * YuviPanda brings democracy to Lucid [12:42:43] akosiaris: right. but still - apergos wants to test the new packages there still :) so we have a lone lucid labs box, that you can only ssh into as root and where puppet doesn’t even run I think... [12:44:14] * akosiaris sigh [12:46:14] (03PS1) 10Alexandros Kosiaris: Add base::firewall to Service Cluster A [puppet] - 10https://gerrit.wikimedia.org/r/185429 [12:46:20] YuviPanda: see? ^ [12:46:24] :P [12:47:35] (03CR) 10Yuvipanda: [C: 04-1] "parsoid::production doesn't have ferm rules." [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [12:47:55] but parsoid is not in service cluster a [12:47:55] (03CR) 10Yuvipanda: "Should also remove the comments about beta from the ferm rules" [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [12:47:59] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:48:19] ok, I will do that but parsoid in not in sca [12:48:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 17.40 ms [12:48:44] (03CR) 10Yuvipanda: "I4e5f91eceba3d4894430ba5fbdb9f3945b99d2de is the other approach, which removes the ferm rules as noops :)" [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [12:48:46] akosiaris: oh [12:48:51] akosiaris: well, right. [12:49:01] I will merge both our patches [12:49:08] well.. the parts I like :-) [12:49:11] hehe :) [12:49:17] comment removal + base::firewall? [12:49:47] yeah [12:49:48] I’ll not comment on wether we should firewall or not, because I don’t know enough about that to have an informed opinion :) [12:53:43] anyway, am off to meet some friends. Might be back later tonight, hopefully working on toollabs stuff [12:53:45] * YuviPanda waves [12:54:02] have a nice time [12:55:18] (03CR) 10Alexandros Kosiaris: "I removed the various comments I was talking about in https://gerrit.wikimedia.org/r/#/c/185429/" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [12:55:30] (03PS2) 10Alexandros Kosiaris: Add base::firewall to Service Cluster A [puppet] - 10https://gerrit.wikimedia.org/r/185429 [13:06:57] (03CR) 10Alexandros Kosiaris: [C: 032] Add base::firewall to Service Cluster A [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [13:17:15] (03PS2) 10Alexandros Kosiaris: *oid: Remove useless ferm declarations [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [13:19:23] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. As a side note and for future reference, the host_fqdn should not be overused as it sets a dependency on the DNS service working cor" [puppet] - 10https://gerrit.wikimedia.org/r/185414 (owner: 10Giuseppe Lavagetto) [13:20:22] (03CR) 10Alexandros Kosiaris: [C: 032] "This ended up being the removal of parsoid ferm rule only. 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [13:31:15] (03PS2) 10JanZerebecki: Beta Features: Disable the Compact Personal Bar feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 (https://phabricator.wikimedia.org/T85541) (owner: 10Jforrester) [13:32:04] (03CR) 10JanZerebecki: [C: 031] Beta Features: Disable the Compact Personal Bar feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 (https://phabricator.wikimedia.org/T85541) (owner: 10Jforrester) [13:41:52] (03PS1) 10Alexandros Kosiaris: Setup holmium as a backup::host [puppet] - 10https://gerrit.wikimedia.org/r/185432 [13:45:33] (03CR) 10Alexandros Kosiaris: [C: 032] Setup holmium as a backup::host [puppet] - 10https://gerrit.wikimedia.org/r/185432 (owner: 10Alexandros Kosiaris) [13:46:52] 3ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#981801 (10akosiaris) https://gerrit.wikimedia.org/r/185432 has been merged. Starting a full backup of /srv/org/wikimedia/blog soon (waiting for puppet for complete on both hosts). @... [14:06:11] (03PS1) 10Alexandros Kosiaris: Use hiera to have ytterbium listen only on it's IP address [puppet] - 10https://gerrit.wikimedia.org/r/185434 [14:06:49] s/it's/its/ [14:06:57] but I don't like that much [14:08:03] (03CR) 10Faidon Liambotis: [C: 04-1] "I don't like this much. It seems a bit error-prone, e.g. what happens if we renumber the server, will we remember to change this? What abo" [puppet] - 10https://gerrit.wikimedia.org/r/185434 (owner: 10Alexandros Kosiaris) [14:08:34] I am not even sure it will work yet tbh [14:09:21] (03CR) 10JanZerebecki: [C: 031] "Yes, please, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [14:09:29] heh, it actually works [14:10:04] I have to say that in general, I don't see why we would change gerrit's port now [14:10:34] also if your plan is to make gerrit listen on port 22... [14:10:36] it won't work [14:10:42] or well, I doubt it will work [14:10:51] I doubt gerrit starts as root then drops privileges, it's a java app [14:11:33] test $UID = 0 && CH_USER="-c $GERRIT_USER" [14:11:33] if start-stop-daemon -S -b $CH_USER \ [14:11:35] there you go [14:12:05] heh, not surprised [14:12:14] so what's the point [14:13:43] people getting a connection refused instead of ytterbium's ssh server ? [14:13:57] just firewall that off if that's your goal [14:14:23] people shouldn't be getting ytterbium's ssh server even if they hit ytterbium's IP anyway :) [14:14:38] that's true [14:22:46] (03PS1) 10Alexandros Kosiaris: Followup commit for c914851 [puppet] - 10https://gerrit.wikimedia.org/r/185438 [14:23:05] 3operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#981861 (10faidon) My comment was/is that I have a slight preference against non-human users having a home directory under /home. I called those "system" users and mwdeploy does have system =>... [14:24:33] (03CR) 10Alexandros Kosiaris: [C: 032] Followup commit for c914851 [puppet] - 10https://gerrit.wikimedia.org/r/185438 (owner: 10Alexandros Kosiaris) [14:30:59] (03PS1) 10Aude: Update client lists for test / wikidata change dispatching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185439 [14:33:26] is there any lucid server I could test a ssh client config change against? 
(don't need any access onto that server as the cipher and so on is done before the public key is checked) [14:45:48] sodium ? [14:46:15] yeah [14:46:19] ms1004 or sodium [14:53:28] (03CR) 10JanZerebecki: "Also tested with sodium.wikimedia.org which is on Ubuntu lucid." [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [14:55:06] 3ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#981928 (10akosiaris) The backup has finished and is stored in the Archive pool that has a maximum lifetime of 5 years. @RobH the only question that remains is the database, otherwis... [14:55:21] (03PS1) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [14:56:05] (03CR) 10jenkins-bot: [V: 04-1] Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [14:57:46] (03CR) 10JanZerebecki: [C: 031] "Works fine when set on the client and connecting to sodium.wikimedia.org which is on Ubuntu lucid with sshd version OpenSSH_5.3p1 Debian-3" [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn) [14:58:08] yea sodium works fine for testing this. thx [15:01:44] (03CR) 10JanZerebecki: [C: 031] "Tested while being set on client with a server on Ubuntu lucid." [puppet] - 10https://gerrit.wikimedia.org/r/185329 (owner: 10Dzahn) [15:06:03] (03Abandoned) 10Milimetric: Adding 'research' read only user to wikimetrics db [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/180222 (https://phabricator.wikimedia.org/T76109) (owner: 10Nuria) [15:30:39] 3ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10Papaul) 5Open>3Resolved Complete [15:46:18] (03PS1) 10Ottomata: Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 [15:47:18] (03CR) 10jenkins-bot: [V: 04-1] Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 (owner: 10Ottomata) [15:48:02] (03CR) 10BBlack: [C: 04-1] "header.append() does not do what you would logically think it does. It creates a second header with the same name, rather than adding ",n" [puppet] - 10https://gerrit.wikimedia.org/r/184997 (owner: 10Ori.livneh) [15:48:53] (03PS2) 10Ottomata: Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 [15:50:15] (03CR) 10Ottomata: [C: 032] Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 (owner: 10Ottomata) [15:50:31] (03PS1) 10BBlack: Revert "VCL: Use header.append() in more places." [puppet] - 10https://gerrit.wikimedia.org/r/185444 [15:50:43] (03PS2) 10BBlack: Revert "VCL: Use header.append() in more places." [puppet] - 10https://gerrit.wikimedia.org/r/185444 [15:50:52] (03CR) 10BBlack: [C: 032 V: 032] Revert "VCL: Use header.append() in more places." [puppet] - 10https://gerrit.wikimedia.org/r/185444 (owner: 10BBlack) [15:51:13] 3ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#981984 (10Papaul) disks in slot 7 and slot 10 replaced.
Bad disk in shipping for return [15:53:16] RECOVERY - Hadoop Namenode - Stand By on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [15:55:06] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:03:10] bd808, traffic reduction to logstash looks great since i merged your udp2log change last night :) [16:03:35] sweet. Now if I could figure out how to get the index to catch up ... [16:03:50] yeah i was wondering about that [16:03:54] 4.5M events still waiting in redis to index [16:04:31] if we didn't have any events that needed to joined we could make it multithreaded [16:05:00] the only think we are stitching back together now is hhvm crash dumps I think [16:05:28] is there a plan for resolving that to not need stitching? [16:05:30] bd808: Make logstash faaaaaasterrrrrr [16:05:37] RECOVERY - Hadoop Namenode - Stand By on analytics1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [16:05:54] yeah we need to finalize requested harware specs so we can push forward procurement [16:06:02] well we could just ignore them I suppose [16:06:32] that would be good. Did we have an email thread talking about what we want/need? [16:06:36] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:06:40] * bd808 remembers vaguely [16:06:53] it was in https://phabricator.wikimedia.org/T84958 [16:07:08] splitting logstash from elasticsearch would make a big difference I think [16:07:42] !log stopping hadoop cluster [16:07:52] Logged the message, Master [16:08:17] i'll ask robh if we have a standard machine type which would we appropriate [16:08:50] I wonder if the boxes that were reclaimed from lsearchd are beefy enough? [16:09:05] they are out of warranty though [16:09:10] :( [16:09:26] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:09:35] if we are looking to make logging more reliable that's probably not a good direction to take [16:09:46] disk space? hmm. [16:09:48] que decis? [16:09:57] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:08] oh socket timeout? weird [16:10:21] do you speak spanish? [16:10:29] ? [16:10:33] jgage: how many systems? https://wikitech.wikimedia.org/wiki/Server_Spares has some available but if those specs arent good we can get something [16:10:40] OH hdfs mount is being stupid! [16:10:41] ha [16:10:41] ok. [16:10:47] because namenode is down. [16:10:49] interseting. [16:10:56] need to add umounting that to migration steps! [16:11:12] redis in the logging pipeline has always felt a little weird to me [16:11:29] is there any plan for removing it/simplifying the pipeline? [16:12:12] paravoid: we could replace it with kafka [16:12:30] the point of redis is to provide a buffer for incoming evnets [16:12:53] why do we need a buffer? [16:13:03] rather than the old method of flinging udp and hoping the packets get processed [16:13:35] robh, maybe 3-5. i'll get back to you after musing on requirements a bit, let's chat next week. [16:13:49] The thought was basically that having more robust/durable logging was a good thing™ [16:14:32] that sounds good indeed [16:14:39] can't you write to ES directly? 
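For the "4.5M events still waiting in redis to index" figure above: the MediaWiki events sit in a redis list (the key is literally named logstash, per the LTRIM run later in this log), so the indexing backlog is just the list length:

    redis-cli llen logstash    # events queued but not yet indexed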
[16:14:41] T.im and I talked about it a bit in October and he said that he'd never really liked the udp process because of the loss [16:14:44] redis helps not only with buffering but also by making a non-spof endpoint so that events can be ingested by any of the logstash hosts [16:15:01] that just makes redis a spof doesn't it [16:15:12] not if it's a cluster [16:15:49] which is it not at the moment, right now it is 3 separate instances that get randomly appended to [16:16:14] kafka would be the most awesome solution that we currently know how to run I think [16:16:22] just ask! :) [16:16:29] I know very little about how this all works [16:16:29] mmm kafka [16:16:39] but why do we need yet another redundant cluster in front of a redundant cluster? [16:16:42] although, i think you might need your own kafka cluster...not sure analytics would want production logging in the analytics kafka cluster, not sure [16:16:47] https://wikitech.wikimedia.org/wiki/Logstash [16:16:57] there's a pretty picture even [16:17:12] yeah i would argue for a separate kafka cluster, no reason to couple analytics [16:17:14] elasticsearch is supposed to be this redundant thing where any one node can fail, right? :) [16:17:19] jgage: cool [16:17:28] except it is of the older layout. let me find the newer picture [16:17:29] would make debugging the existing analytics kafka problems harder [16:17:45] just let me know as soon as you do (so if its a lot of machines and odd requirements we can get it on the projected purchases asap) [16:17:51] k [16:18:44] Here's the diagram that is more correct now -- https://commons.wikimedia.org/wiki/File:Elk-mw-ha-redis.svg [16:19:06] except the logstash services are only consuming from a single redis right now [16:19:21] 3 redis servers -> 3 logstash servers [16:19:38] each on the same host (which I think is fine actually) [16:20:26] paravoid: To the question of writing directly to elasticsearch, yes we can do that if we are not going to modify the log events at all on the way in [16:21:05] There's not much that we are doing to the Monolog generated json events now but there are a few fixups that logstash applies [16:21:17] (03PS1) 10Ottomata: analytics1001 and analytics1002 are now the hadoop namenodes [puppet] - 10https://gerrit.wikimedia.org/r/185445 [16:21:20] ok [16:21:35] (03PS2) 10Ottomata: analytics1001 and analytics1002 are now the hadoop namenodes [puppet] - 10https://gerrit.wikimedia.org/r/185445 [16:21:48] so how does the existing code handle the failure of one of the redis(es)? [16:21:55] fall back to the next one? [16:22:12] nope :( no logs to logstash for that request [16:22:18] heh [16:22:24] so couldn't you just write to logstash directly? [16:22:32] via e.g. TCP syslog? [16:22:40] I'm not sure, does logstash even listen to syslog? [16:22:45] it can [16:22:49] yes. we do that for apache2 + hhvm logs [16:23:00] (03CR) 10Ottomata: [C: 032] analytics1001 and analytics1002 are now the hadoop namenodes [puppet] - 10https://gerrit.wikimedia.org/r/185445 (owner: 10Ottomata) [16:23:23] but there are limits there. 
syslog lines have a length limit [16:23:44] 3ops-core: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#982002 (10Papaul) [16:23:45] 3ops-codfw: label & setup drac/basic setings for rbf2001 & rbf2002 - https://phabricator.wikimedia.org/T86940#982000 (10Papaul) 5Open>3Resolved Racktable update mgmt setup complete port # info rbf2001 = g-5/0/17 rbf2002 = g-5/0/22 Complete [16:23:49] so big json blobs (stacktraces) can get corrupted [16:24:31] we even see that a bit with the GELF chunked udp transport that node and java apps are using [16:24:50] i'm interested in trying this approach someday: http://untergeek.com/2012/10/11/using-rsyslog-to-send-pre-formatted-json-to-logstash/ [16:25:13] that seems orthogonal to this discussion, no? [16:25:37] fwiw, both syslog-ng and rsyslog are getting kafka modules [16:25:51] cool [16:26:00] So the growing pain we are seeing today is running 2 jvm based apps (logstash and elasticsearch) on nodes with 16G of ram and 6 cores [16:26:08] but while I like kafka and we're clearly investing in it [16:26:26] kafka -> logstash -> elasticsearch just feels... complicated [16:26:34] but maybe i'm just not accustomed to it, dunn [16:27:13] at the log event volume we have today elasticsearch looks to be the bottleneck. It is just not keeping up with the ingest rate [16:27:40] which leads to gc thrash and occasional OOM [16:27:51] input queue -> stream processing -> storage [16:27:54] seems like a good model to me [16:28:21] paravoid: embrace microservices :) [16:28:31] "micro" [16:28:59] 16GB RAM and 6 cores for handling text lines not being enough [16:29:05] * 3 machines [16:29:15] and that's just with 2 jvm apps, before adding the third one [16:29:37] which also needs a fourth set of jvm apps for brokers [16:30:33] not exactly what I call "micro" :) [16:30:34] yesterday we stored and indexed 118,833,106 events at 50.3G per elasticsearch node [16:31:09] so far today we have 128,560,743 events with 62.3G per node [16:31:17] what are all these? [16:32:09] All the things that mediawiki writes to fluorine + parsoid + hadoop + restbase + ... [16:32:23] so.. error logs? [16:32:51] yeah many of which are at much more than error/warn verbosity [16:33:22] it sounds a bit excessive [16:33:49] that's ~1400 events/sec [16:33:52] <^demon|away> I think we can get rid of some of the noise probably. [16:36:27] jgage: The elasticsearch cluster isn't even responsive to curl 'http://localhost:9200/_cat/master' right now :( [16:37:10] ha, i think the 'micro' in the services does not refer to the deployment size or complexity, but to the fact that each service does a tiny thing [16:37:14] tiny very specific [16:37:31] It looks like 1001 (master) OOMed at some point and got stuck. I'm going to bounce it [16:37:38] meh ok [16:37:53] bd808: I find myself wondering how many of those events are simply noise, of no use to anyone. (and if those could be designed out of the traffic somehow) [16:38:13] chrismcmahon: A ton of them are noise. [16:38:17] RECOVERY - RAID on es2010 is OK: OK: optimal, 1 logical, 2 physical [16:38:42] <^d> That I think is the most important thing. [16:38:54] <^d> It makes the service more useful /and/ lowers the ingest rate. 
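On "elasticsearch looks to be the bottleneck ... gc thrash and occasional OOM" above: the stock cat/stats APIs give a quick read on both symptoms, namely whether bulk indexing is queueing or rejecting work, and how hard the JVM is collecting. A sketch using standard endpoints of that elasticsearch generation:

    curl -s 'http://localhost:9200/_cat/thread_pool?v'       # bulk active/queue/rejected
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'  # heap used and gc times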
[16:39:19] !log restarted elasticsearch on logstash1001 [16:39:24] Logged the message, Master [16:39:32] ironically, the best way to determine what's noise seems to be to pump it all into logstash so that we can look through them [16:39:42] <^d> (also, 16gb sounds like barely enough for ES alone, much less logstash + etc etc etc) [16:39:51] ^d: yeah [16:40:09] <^d> We have a bunch of search servers with like 48GB that could be salvaged possibly :p [16:40:10] If the ES layer was robust I think we'd be much happier [16:40:36] [logstash1001] no known master node, scheduling a retry [16:40:40] not goodly [16:41:26] All 3 have oomed [16:41:36] and are basically stuck [16:41:41] gah [16:41:52] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [16:42:00] time for a cold restart :( [16:42:02] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [16:42:02] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [16:42:14] <^d> Can we throw some of the now-decom'd search boxes at it? Even if just as a stopgap? [16:42:16] 1003 was the last known master for today's shard [16:42:38] so I'm going to shutdown all 3, and then start 1003 first followed by 1001 and 1002 [16:43:57] !log shutdown whole elasticsearch cluster for logstash [16:44:00] Logged the message, Master [16:44:45] ^d that sounds great to me, though i don't have time to work on them today. bd808? [16:45:17] I have time but not root powers to change things in puppet [16:45:24] <^d> likewise :) [16:45:27] 3x ram for ES sounds like a very good thing if they've got enough disk [16:45:55] 3ops-core: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) 3NEW a:3RobH [16:46:16] (03PS1) 10RobH: setting mgmt for server procyon [dns] - 10https://gerrit.wikimedia.org/r/185447 [16:46:30] <^d> I'll file a Phab task to reclaim a few of those boxes. [16:48:04] !log Upgraded elasticsearch and restarted on all logstash nodes [16:48:10] Logged the message, Master [16:48:11] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) 3NEW a:3Papaul [16:48:26] !log finished hadoop namenode migration. 
Hadoop cluster is back online [16:48:29] Logged the message, Master [16:48:46] yay ottomata :D [16:48:52] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3 [16:49:01] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3 [16:49:02] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3 [16:49:18] 3ops-core: set server procyon's asset tag mgmt ip info - https://phabricator.wikimedia.org/T87030#982048 (10RobH) 3NEW a:3RobH [16:49:28] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) [16:49:30] 3ops-core: set server procyon's asset tag mgmt ip info - https://phabricator.wikimedia.org/T87030#982056 (10RobH) [16:50:49] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982062 (10Chad) 3NEW [16:50:59] <^d> jgage, bd808 ^ [16:51:23] thanks ^d [16:51:39] 3Analytics, operations: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#982081 (10Ottomata) [16:52:05] 3operations: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#982083 (10Chad) [16:52:06] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982082 (10Chad) [16:52:57] 3Analytics, operations: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#20804 (10Ottomata) analytics1001 and analytics1002 have been provisioned, and the Hadoop NameNode and YARN master services have been migrated off of analytics1010 and analytics1004 (ciscos).... [16:56:25] 3ops-core: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) [16:57:10] hm those lsearchd machines have only 300gb. elasticsearch on logstash100x are currently using 822gb. [16:57:57] we could drop the replica count. Right now we have full data on all boxes [16:58:21] or spread out over more? How many hosts are there? [16:59:27] (03PS1) 10Glaisher: Enable VisualEditor on 'Draft' (118) namespace at hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185449 (https://phabricator.wikimedia.org/T87027) [16:59:44] having trouble finding details because they've been removed from puppet. i'm not even sure what the hostnames were. 
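The "we could drop the replica count" idea above trades redundancy for disk: the logstash indices are daily (logstash-YYYY.MM.DD), and replicas can be lowered on existing indices with a settings update. A sketch, where the wildcard and the count of 1 are illustrative:

    curl -XPUT 'http://localhost:9200/logstash-*/_settings' \
        -d '{"index": {"number_of_replicas": 1}}'
    # with 3 data nodes, going from full data on every box (2 replicas)
    # to 1 replica cuts stored data by a third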
[17:00:08] (03CR) 10Jforrester: [C: 031] Enable VisualEditor on 'Draft' (118) namespace at hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185449 (https://phabricator.wikimedia.org/T87027) (owner: 10Glaisher) [17:01:12] search10xx [17:01:22] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) [17:01:51] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) Please note initial task said to rack in b8-codfw, and I've changed it to a8-codfw, after Mark pointed out that there was only one sandbox vlan so far in codfw. [17:01:57] https://gerrit.wikimedia.org/r/#/c/184620/3/manifests/site.pp,unified [17:03:12] looks like there's 24 of em unless some have already been reused [17:07:56] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982135 (10mark) So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed? [17:11:46] 3ops-requests, WMF-Design: optoutresearch@ list, add recipient - https://phabricator.wikimedia.org/T86551#982148 (10Jgreen) p:5Triage>3Normal [17:12:25] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982150 (10Gage) The current nodes have insufficient RAM and Elasticsearch keeps OOMing. (Details: {T84958}) [17:12:43] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982153 (10bd808) >>! In T87031#982135, @mark wrote: > So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed? I broke it. I got all the changes merge... [17:13:05] (03PS1) 10Chad: Uninstall TitleKey. Cirrus has taken over. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185450 [17:14:41] 3ops-requests, WMF-Design: optoutresearch@ list, add recipient - https://phabricator.wikimedia.org/T86551#982156 (10Jgreen) 5Open>3Resolved [17:15:42] (03PS1) 10Chad: Remove useless profiling of Cirrus inclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185451 [17:16:02] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982159 (10bd808) We really won't know what the new log volume looks like until 2015-01-18T00:00Z. 2015-01-17 will be the first day that we have all redis MW traffic and no extra log2udp relay traffic. If we can l... [17:16:23] (03CR) 10Chad: [C: 032] Uninstall TitleKey. Cirrus has taken over. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185450 (owner: 10Chad) [17:16:29] (03Merged) 10jenkins-bot: Uninstall TitleKey. Cirrus has taken over. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/185450 (owner: 10Chad) [17:16:31] (03CR) 10Chad: [C: 032] Remove useless profiling of Cirrus inclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185451 (owner: 10Chad) [17:16:35] (03Merged) 10jenkins-bot: Remove useless profiling of Cirrus inclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185451 (owner: 10Chad) [17:17:02] !log demon Synchronized wmf-config/: (no message) (duration: 00m 06s) [17:17:08] Logged the message, Master [17:17:38] delete delete delete [17:23:54] (03PS1) 10Chad: Undeploy MarkAsHelpful, has been disabled since like 2012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185456 [17:24:18] (03CR) 10Chad: [C: 032] Undeploy MarkAsHelpful, has been disabled since like 2012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185456 (owner: 10Chad) [17:24:22] (03Merged) 10jenkins-bot: Undeploy MarkAsHelpful, has been disabled since like 2012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185456 (owner: 10Chad) [17:24:45] !log demon Synchronized wmf-config/: (no message) (duration: 00m 05s) [17:24:49] Logged the message, Master [17:27:48] I feel lighter already [17:29:08] <^d> greg-g: It's like how you feel like a new person after your first day at the gym after making your resolution :) [17:29:28] oh, that day feels like crap and then I stop going [17:29:31] this is soooo much better [17:31:10] <^d> I'm sure we've got a few others that are only on like test2?wiki [17:37:55] 3ops-core: set server procyon's asset tag mgmt ip info - https://phabricator.wikimedia.org/T87030#982208 (10Papaul) [17:37:56] 3ops-core: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10Papaul) [17:37:57] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10Papaul) 5Open>3Resolved Racktable update mgmt set-up complete asset tag info; wmf6161 port # ge-8/0/2 [17:39:05] (03PS1) 10Glaisher: Add apache config for m.{project}.org (-wikipedia) [puppet] - 10https://gerrit.wikimedia.org/r/185461 (https://phabricator.wikimedia.org/T78421) [17:40:40] (03PS1) 10BryanDavis: beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 [17:40:42] (03PS1) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [17:41:17] jgage: We have some ideas on how to trim down the load on logstash :) [17:42:31] (03CR) 10Glaisher: "I think this is the easiest way to do this or should we use separate VHosts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/185461 (https://phabricator.wikimedia.org/T78421) (owner: 10Glaisher) [17:44:27] yay [17:44:36] (03CR) 10RobH: [C: 032] setting mgmt for server procyon [dns] - 10https://gerrit.wikimedia.org/r/185447 (owner: 10RobH) [17:45:11] PROBLEM - Hadoop Namenode - Stand By on analytics1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:46:12] shhh [17:48:02] (03CR) 10Legoktm: [C: 031] beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 (owner: 10BryanDavis) [17:50:05] !log depooled amssq42 text cache in esams [17:50:09] Logged the message, Master [17:51:10] (03PS1) 10Ottomata: Removing hadoop::standby role from analytics1004 [puppet] - 10https://gerrit.wikimedia.org/r/185468 [17:51:47] (03CR) 10Ottomata: [C: 032 V: 032] Removing hadoop::standby role from analytics1004 [puppet] - 10https://gerrit.wikimedia.org/r/185468 (owner: 10Ottomata) [17:52:16] 3Wikimedia-Apache-configuration, operations: wikibooks.org redirects to en.wikibooks.org - https://phabricator.wikimedia.org/T87039#982251 (10Glaisher) 3NEW [17:53:02] (03PS2) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [17:53:40] (03PS1) 10BBlack: disable amssq42 esams text cache backend [puppet] - 10https://gerrit.wikimedia.org/r/185469 [17:53:42] (03PS1) 10BBlack: amssq42 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/185470 [17:53:50] (03CR) 10jenkins-bot: [V: 04-1] Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [17:54:00] (03PS2) 10BBlack: disable amssq42 esams text cache backend [puppet] - 10https://gerrit.wikimedia.org/r/185469 [17:54:11] (03PS2) 10BryanDavis: beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 [17:54:14] (03CR) 10BBlack: [C: 032 V: 032] disable amssq42 esams text cache backend [puppet] - 10https://gerrit.wikimedia.org/r/185469 (owner: 10BBlack) [17:54:27] (03CR) 10BryanDavis: [C: 032] beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 (owner: 10BryanDavis) [17:54:47] (03PS2) 10BBlack: amssq42 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/185470 [17:55:03] (03CR) 10BBlack: [C: 032 V: 032] amssq42 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/185470 (owner: 10BBlack) [17:55:18] (03Merged) 10jenkins-bot: beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 (owner: 10BryanDavis) [17:55:38] The authenticity of host '[gerrit.wikimedia.org]:29418 ([2620:0:861:3:208:80:154:81]:29418)' can't be established. [17:55:45] RSA key fingerprint is dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1. [17:55:58] is that just ipv6 wackyness? [17:56:28] it shouldn't be [17:56:37] ssh is smart about that [17:56:59] where do I check the rsa fingerprint vs gerrit? 
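To check a fingerprint like the one above without blindly accepting the prompt, the key can be fetched out-of-band and fingerprinted locally, which is the same verification done just below with a verbose manual connect:

    ssh-keyscan -t rsa -p 29418 gerrit.wikimedia.org > /tmp/gerrit.pub
    ssh-keygen -lf /tmp/gerrit.pub   # prints the MD5 hex fingerprint on
                                     # OpenSSH of that era; compare it to
                                     # dc:e9:68:7b:99:1b:27:d0:...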
[17:57:09] This is on tin btw [17:57:58] bd808: debug1: Server host key: RSA dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1 [17:58:17] ^ is what I get when I manually connect verbosely to gerrit on 29418 from home, and no key mismatch warning on my known_hosts stuff [17:58:21] so that's the right fingerprint [17:58:24] thanks bblack [17:58:28] 3Wikimedia-Apache-configuration, operations: wikibooks.org redirects to en.wikibooks.org - https://phabricator.wikimedia.org/T87039#982263 (10Glaisher) .com as well [17:58:49] (03Abandoned) 10Jgreen: dmarc_parser added (redo) [puppet] - 10https://gerrit.wikimedia.org/r/163881 (owner: 10Jgreen) [17:59:47] !log bd808 Synchronized wmf-config/logging-labs.php: beta: Allow wgDebugLogGroups to exclude logstash append (03c3ab27) (duration: 00m 06s) [17:59:52] Logged the message, Master [18:00:55] (03PS1) 10Jgreen: dmarc parser and database injector [puppet] - 10https://gerrit.wikimedia.org/r/185472 [18:09:26] (03PS1) 10Glaisher: Redirect wikibooks.(org|com) to www.wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) [18:10:47] (03CR) 10John F. Lewis: [C: 04-1] "Please use redirect.dat. (under the redirects directory)" [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [18:11:18] (03CR) 10Glaisher: Redirect wikibooks.(org|com) to www.wikibooks.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [18:13:45] (03CR) 10Glaisher: "From redirects.dat:" [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [18:13:55] !log document count not changing for logstash-2015.01.16 index [18:14:00] Logged the message, Master [18:35:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:38:22] any idea what this appserver network jump was? http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [18:38:30] <^d> Someone able to merge a simple puppet patch? I'm adding my .bashrc [18:38:43] yeah [18:39:06] (03PS2) 10BBlack: Add my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/184783 (owner: 10Chad) [18:39:12] (03CR) 10BBlack: [C: 032 V: 032] Add my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/184783 (owner: 10Chad) [18:39:57] <^d> ty bblack [18:41:36] (03PS3) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [18:45:58] (03PS1) 10RobH: setting rbf2001/2002 base isntall params [puppet] - 10https://gerrit.wikimedia.org/r/185478 [18:46:47] (03CR) 10Ottomata: Mail webrequest partition status summaries to analytics ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [18:46:58] (03CR) 10RobH: [C: 032] setting rbf2001/2002 base isntall params [puppet] - 10https://gerrit.wikimedia.org/r/185478 (owner: 10RobH) [18:48:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:50:15] (03CR) 10CSteipp: "Jeff, what's this being used for?" 
[puppet] - 10https://gerrit.wikimedia.org/r/185472 (owner: 10Jgreen) [19:14:20] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 3 failures [19:14:20] PROBLEM - HTTPS on amssq42 is CRITICAL: Return code of 255 is out of bounds [19:15:19] PROBLEM - Varnish HTTP text-backend on amssq42 is CRITICAL: Connection refused [19:16:30] RECOVERY - Varnish HTTP text-backend on amssq42 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.194 second response time [19:17:50] I hate that. it was in downtime, but reinstall -> remove/re-add from monitoring :p [19:19:20] RECOVERY - HTTPS on amssq42 is OK: SSLXNN OK - 36 OK [19:27:21] amssq42 is? jessie as well? [19:33:00] think its trusty? [19:33:05] i was on it yesterday fixing logster [19:33:12] did I bork it? it seemed fine yesterday [19:34:46] (03PS2) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [19:35:46] 3ops-codfw, ops-eqiad: ship blanking panels from eqiad to codfw - https://phabricator.wikimedia.org/T86082#982687 (10Cmjohnson) Papaul looks like these arrived. Please verify and close ticket. [19:35:57] 3ops-codfw, ops-eqiad: ship blanking panels from eqiad to codfw - https://phabricator.wikimedia.org/T86082#982689 (10Cmjohnson) a:5Christopher>3Papaul [19:36:59] (03PS4) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [19:39:55] (03CR) 10QChris: Mail webrequest partition status summaries to analytics ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [19:40:05] (03CR) 10QChris: Mail webrequest partition status summaries to analytics ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [19:42:23] Hey greg-g, we have a regression preventing people from using links with pipes in UploadWizard descriptions. This caused more ire than I thought it would. I'd appreciate being able to sync a revert. [19:47:03] paravoid: yeah it's jessie now as well [19:47:41] it didn't go as smoothly as I hoped, though. Apparently I failed to account for all my manual hacks on cp1008, so I have to iterate back on varnish + varnishkafka packaging/init stuff a little [19:48:20] ottomata: and no, you didn't bork it, I did :) [19:48:54] ahha [19:49:05] bblack, i *just* packaged logster for trusty to fix that single host! :p [19:52:58] lol [19:53:18] I guess jessie already had logster? I didn't run into any issue on that part [19:56:20] it does [19:56:29] (03PS1) 10BBlack: jessie fixup + bump to 1.0.6-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/185490 [19:56:38] uhhh, but [19:56:45] we have afork of logster that has extra stuff, i thikn... [19:56:49] (03CR) 10BBlack: [C: 032 V: 032] jessie fixup + bump to 1.0.6-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/185490 (owner: 10BBlack) [19:56:56] that's what you get when you fork :) [19:58:56] i ahve upstreamed some of it... [19:59:01] dunno what is in jessie though [20:00:06] also, apparently even our old varnish-3plus stuff will install systemd service unit files when built for jessie, oddly enough [20:00:32] I thought in my initial experiments that those only came from the varnish4 package, and that the old initscripts worked fine, etc. but no :p [20:00:42] oh hah [20:00:44] nice [20:01:01] does it work for -n frontend too? 
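Background on the packaged unit files just mentioned (and the shadowing problem described below): a native unit under /lib/systemd/system always wins over the /etc/init.d script that puppet manages. A sketch of checking which definition is live, plus the "ugly" removal option; the unit name is illustrative:

    systemctl status varnish.service   # the Loaded: line shows the unit file in use
    rm /lib/systemd/system/varnish.service && systemctl daemon-reload
    # systemd's sysv generator then falls back to /etc/init.d/varnish;
    # "ugly" because the package restores the unit file on upgrade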
[20:01:02] well it's problematic because we override the initscript in puppet, but systemd believes the installed systemd unit file over our initscript [20:01:07] and no, not currently [20:01:10] right... [20:01:46] so I'll either have to make puppet remove the package's systemd unit file (ugly), or go ahead and port our puppet initscript templates to systemd service templates (probably better) [20:02:26] probably [20:03:08] fun :) [20:05:03] (03PS3) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [20:07:00] marktraceur: /me nods [20:07:08] KK [20:07:17] Hopefully Jenkins starts cooperating [20:10:41] um, bblack, what is the timeline for upgrading varnishes to jessie? [20:11:26] ottomata: https://phabricator.wikimedia.org/T86648 [20:12:23] (this quarter) [20:12:27] (03CR) 10Legoktm: [C: 031] Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 (owner: 10BryanDavis) [20:13:59] marktraceur: let me know when you're done. I've got a config change to take some of the load off of the logastash servers [20:14:51] bd808: You could do that now, I haven't started [20:14:59] coolio [20:15:00] Still doing backport patches, tgr is still writing a fix for one thing [20:15:20] (03PS4) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [20:15:52] aye, ok. [20:15:58] good to know, thanks [20:15:58] (03CR) 10BryanDavis: [C: 032] Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 (owner: 10BryanDavis) [20:16:00] (03Merged) 10jenkins-bot: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 (owner: 10BryanDavis) [20:17:15] !log bd808 Synchronized wmf-config/logging.php: Allow wgDebugLogGroups to exclude logstash append (e808e690) (duration: 00m 07s) [20:17:24] Logged the message, Master [20:17:51] !log bd808 Synchronized wmf-config/InitialiseSettings.php: Allow wgDebugLogGroups to exclude logstash append (e808e690) (duration: 00m 05s) [20:17:54] Logged the message, Master [20:19:08] marktraceur: all clear [20:19:47] Thanks [20:19:49] We're getting there [20:20:12] If tgr isn't ready by the time I'm thinking "where's my beer?" I'll probably start going with the four patches we have. [20:23:16] marktraceur: Smart. No reason to delay beer:30 on a Friday [20:23:37] (03PS1) 10Chad: Add subversion to Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/185535 [20:24:38] (03CR) 10Rush: [C: 032] Add subversion to Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/185535 (owner: 10Chad) [20:26:40] how come there's no private ip here: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000006d8.eqiad.wmflabs [20:27:04] should that be 10.68.16.120? [20:31:55] Jeff_Green: ^ [20:33:41] maybe ask in the labs channel, I have no idea [20:36:16] ok, thanks [20:36:33] OK we're good to go now [20:36:40] I think I'll scap, because I can, and because it lessens complications somewhat [20:37:14] arlolra: sorry I don't have a better answer [20:38:05] np. I'll get it sorted [20:46:39] greg-g: Mind if I scap rather than worry about details? 
[20:46:50] Not sure if you have an opinion [20:47:06] should be fine but beware it will probably take ~20-30m [20:47:11] I'm fine with that [20:47:16] The other Friday deployers will have to wait [20:47:25] :) [20:47:28] !log marktraceur Started scap: Fix UploadWizard regression and EventLogging errors [20:47:33] Logged the message, Master [20:47:37] :) [20:47:41] I need to stop deploying on Fridays though [20:47:45] Next week I'm off the stuff [20:58:23] 3Beta-Cluster, MediaWiki-Core-Team, operations: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982926 (10hashar) Seems this should go to #operations , #hhvm and #mediawiki-core-team and be rephrased to: "convert work machine (tin, terbium) to Trusty and hhvm usage" + me... [21:03:10] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [21:03:39] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [21:03:49] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [21:04:40] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 1, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 1, number_of_data_nodes: 2 [21:08:10] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 2, u'unassigned_shards': 42, u'timed_out': False, u'active_primary_shards': 40, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 80, u'initializing_shards': 1, u'number_of_data_nodes': 2} [21:08:24] ffs logstash [21:10:54] GWT got in some sort of infinite loop and is creating ~10 log entries per second: https://phabricator.wikimedia.org/T87040 [21:11:20] is that a "meh" volume, or should it be stopped until the bug is fixed?
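The icinga checks flapping above poll elasticsearch's standard cluster-health endpoint; reading it by hand makes the states easier to interpret:

    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # red    = at least one primary shard unassigned (some data unavailable)
    # yellow = all primaries allocated, but some replicas still unassigned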
[21:12:30] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 122, initializing_shards: 0, number_of_data_nodes: 3 [21:13:09] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 1, number_of_data_nodes: 3 [21:14:29] * bd808 scratches head over latest elasticsearch freakout on logstash cluster [21:15:15] [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[logstash1001][inet[/10.64.32.138:9300]][bulk/shard/replica] disconnected]]] [21:17:03] !log OOM for elasticsearch on logstash1001 caused a dropped shard and icinga alerts [21:17:09] Logged the message, Master [21:17:38] jgage: can we just pry those boxes open and stick more ram in them? [21:18:35] !log marktraceur Finished scap: Fix UploadWizard regression and EventLogging errors (duration: 31m 06s) [21:18:39] Logged the message, Master [21:19:08] (03CR) 10Arlolra: "This change seemingly is the cause of https://phabricator.wikimedia.org/T86951" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [21:20:20] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 3, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 118, initializing_shards: 2, number_of_data_nodes: 3 [21:21:19] !log restarted elasticsearch on logstash1001 [21:21:24] Logged the message, Master [21:22:51] (03CR) 10Hashar: "This commit has broken Parsoid on the beta cluster. The deployment-parsoidcache05 has ferm installed for some reason and thus needs the ht" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [21:23:39] (03CR) 10Hashar: "https://phabricator.wikimedia.org/%5486951#982981" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [21:24:07] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982985 (10greg) [21:32:59] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#983029 (10hashar) [21:33:00] (03PS1) 10Ottomata: Point hadoop resoucemanager and hadoop namenode CNAMES to new master namenode. [dns] - 10https://gerrit.wikimedia.org/r/185546 [21:33:17] (03CR) 10Ottomata: [C: 032] Point hadoop resoucemanager and hadoop namenode CNAMES to new master namenode. [dns] - 10https://gerrit.wikimedia.org/r/185546 (owner: 10Ottomata) [21:34:21] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (10hashar) [21:34:55] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (10hashar) Thanks Greg. I have added some steps to the task description. 
[21:46:18] (03CR) 10Jdlrobson: [C: 031] Hygiene: Change wgMFAnonymousEditing to wgMFEditorOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182177 (owner: 10Florianschmidtwelzow) [21:53:55] h [21:54:08] i [22:00:01] (03PS1) 10Ottomata: Temporarily override DNS CNAME entries for hadoop masters [puppet] - 10https://gerrit.wikimedia.org/r/185552 [22:01:36] (03CR) 10Ottomata: [C: 032] Temporarily override DNS CNAME entries for hadoop masters [puppet] - 10https://gerrit.wikimedia.org/r/185552 (owner: 10Ottomata) [22:05:14] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#983124 (10EBernhardson) [22:05:59] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (10EBernhardson) updated description again, to clarify that the scripts don't have any dependency on hhvm, it is being used for its gdb like debug consol... [22:10:45] (03PS2) 10Florianschmidtwelzow: Hygiene: Change wgMFAnonymousEditing to wgMFEditorOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182177 [22:12:49] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [22:14:32] !log restarted elasticsearch on logstash1001; OOM errors [22:14:40] Logged the message, Master [22:15:21] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 38, u'timed_out': False, u'active_primary_shards': 40, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 80, u'initializing_shards': 5, u'number_of_data_nodes': 3} [22:15:44] this is getting really old :( [22:16:10] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 2, number_of_data_nodes: 3 [22:16:39] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 2, number_of_data_nodes: 3 [22:21:22] bd808: you need help? [22:21:38] greg-g: I need ram :/ [22:21:52] can't help there [22:22:06] these boxes and my laptop are roughly equivalent [22:22:14] ...... [22:22:39] it was an experiment [22:22:46] it just worked too well :) [22:22:49] exactly! [22:23:24] * bd808 is trying to figure out if there are some more stop gap fixes that can be made [22:23:57] how was it able to handle udp2log but not monolog? [22:24:02] delete it all? :) [22:24:57] legoktm: 2 things -- we only added some logs via udp2log to the index; and I think we lost a lot via udp errors [22:25:29] :/ [22:25:46] For a couple of days we have been duping all group1 log traffic; then we started duping all wikipedia traffic as well [22:25:51] that is off now [22:26:17] but we have 3 huge indices that are making things sad [22:26:37] if we limp past 00:00UTC I hope things will get better [22:26:59] but we need more ram. we were running at the ragged edge before
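[Editor's note: "running at the ragged edge" here means JVM heap pressure; the repeated OOMs above are the symptom. One way to watch how close each node sits to its heap limit is the standard Elasticsearch node-stats API; the jq formatting below is just one convenient way to read it (it assumes jq is available, and localhost:9200 assumes you are on one of the logstash100x hosts):

    # Per-node JVM heap usage for the cluster; heap_used_percent near 100
    # means the next indexing burst is likely to trigger another OOM.
    curl -s 'http://localhost:9200/_nodes/stats/jvm' |
      jq -r '.nodes[] | "\(.name): \(.jvm.mem.heap_used_percent)% heap used"'
]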
[22:31:54] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#983208 (10Cmcmahon) Jeff, yes please update my key. I'm pretty sure the one in place right now is from a machine that has been wiped. [22:45:26] (03PS2) 10Jgreen: dmarc parser and database injector [puppet] - 10https://gerrit.wikimedia.org/r/185472 [22:45:28] (03PS1) 10Jgreen: update cmcmahon's key, and add him to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/185567 [22:46:47] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#983240 (10Jgreen) Cmcmahon please confirm +1 https://gerrit.wikimedia.org/r/#/c/185567/ if I've got your pub key correct! [22:48:55] (03CR) 10Cmcmahon: [C: 031] "That's my key all right :-)" [puppet] - 10https://gerrit.wikimedia.org/r/185567 (owner: 10Jgreen) [23:23:24] (03PS1) 10BryanDavis: beta: Remove dup of /home/mwdeploy/.ssh [puppet] - 10https://gerrit.wikimedia.org/r/185570 [23:25:37] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt to solve:" [puppet] - 10https://gerrit.wikimedia.org/r/185570 (owner: 10BryanDavis) [23:30:07] (03Draft1) 10BryanDavis: logstash: remove support for most udp2log events [puppet] - 10https://gerrit.wikimedia.org/r/185482 [23:33:06] !log ran `LTRIM logstash -50000 9999999` on redis queues to drop ~4M events in backlog [23:33:13] Logged the message, Master [23:34:39] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [23:35:48] (03PS2) 10BryanDavis: logstash: remove support for most udp2log events [puppet] - 10https://gerrit.wikimedia.org/r/185482 [23:39:32] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#983320 (10greg) a:5greg>3None
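[Editor's note: the `LTRIM logstash -50000 9999999` logged at 23:33 works like this: LTRIM key start stop keeps only the list elements in that index range, a negative start counts back from the tail, and a stop past the end is clamped. Assuming the shippers RPUSH new events onto the tail of the list (the usual logstash-over-redis arrangement; an assumption, not shown in the log), this keeps roughly the newest 50,000 events and discards the ~4M older ones:

    redis-cli LLEN logstash                    # backlog depth before the trim (~4M here)
    redis-cli LTRIM logstash -50000 9999999    # keep indices -50000..end, i.e. the last ~50k
    redis-cli LLEN logstash                    # should now report <= 50000
]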