[00:00:04] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150116T0000). Please do the needful. [00:01:25] Just an FYI for swatters, logstash has been a bit wonky today so you may want to monitor logs on fluorine for errors [00:02:12] (03PS3) 10Dzahn: etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (https://phabricator.wikimedia.org/T85788) (owner: 10John F. Lewis) [00:03:06] (03PS4) 10Dzahn: etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (https://phabricator.wikimedia.org/T85788) (owner: 10John F. Lewis) [00:03:59] (03CR) 10Dzahn: [C: 032] etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (https://phabricator.wikimedia.org/T85788) (owner: 10John F. Lewis) [00:04:09] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:04:58] greg-g, bd808, if you are done deploying, i would like to push out zeroportal [00:05:26] yurikR: It's time for swat but I don't know if there are patches or a deployer for it [00:05:45] no patches in wikitech [00:06:56] bd808, oh, it was showing as friday [00:07:10] RoanKattouw, ^demon|away, are you deploying swat? [00:07:23] i could add my patch to it [00:07:24] It is Friday :) get on UTC time man :) [00:07:30] ))) [00:08:14] <^demon|away> I don't even have a terminal open :p [00:08:14] i follow my own time... internet time [00:08:15] hrmm [00:08:25] 15:37 -tomaw(tom@freenode/staff/tomaw)- [Global Notice] Hi all. Yes, it seems we erred with a firewall rule there. Everything should be back to normal now. [00:08:44] ok, seems like i could just do my own depl instead of swat [00:09:02] or RoanKattouw wants to deploy? :D [00:11:26] I think you're it yurikR. Note my previous warning that you shouldn't necessarily trust logstash at the moment [00:11:41] bd808, what's the best way to track health? [00:11:44] atm [00:12:38] yurikR: Probably /usr/local/bin/fatalmonitor on fluorine [00:13:29] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [00:13:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [00:13:57] ^ trying to fix that, something messed up permissions [00:14:08] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [00:15:58] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [00:18:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:22:50] !log on both puppetmasters: chown gitpuppet /var/lib/git/operations/puppet/.git/logs/refs/heads/production & .git/logs/HEAD & .git/logs/refs/remotes/origin to fix puppet-merge. git pulled on strontium [00:22:55] hmm, this is weird - is there a reason why git pull on tin shows a new gerrit fingerprint? [00:23:02] bd808, ^ [00:23:29] yurikR: no idea [00:24:19] 5e:14:27:23:d2:20:69:cb:38:09:7d:5f:87:1d:16:2c ? [00:25:04] !log log bot , are you here? [00:25:34] mutante: nope. it didn't come back from the netsplit [00:25:51] hrmm.. ok.
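For reference, the puppet-merge fix !log'd at 00:22:50 above, expanded into explicit commands. This is a sketch reconstructed from that log entry; the entry abbreviates the three paths with "&", so the per-path expansion and exact flags are assumptions:

```bash
# On each puppetmaster (palladium, strontium): hand the git ref logs back
# to gitpuppet so puppet-merge can update them again
cd /var/lib/git/operations/puppet
chown gitpuppet .git/logs/refs/heads/production
chown gitpuppet .git/logs/HEAD
chown -R gitpuppet .git/logs/refs/remotes/origin  # -R is an assumption; this path is a directory
# then pull the outstanding commits (done on strontium, per the log)
sudo -u gitpuppet git pull
```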
freenode messed up with a firewall rule, heh [00:25:58] brb [00:26:43] !log yurik Synchronized php-1.25wmf15/extensions/ZeroPortal: zero portal to master (duration: 00m 13s) [00:26:57] 1 apache public key error: [00:27:17] !log yurik Synchronized php-1.25wmf15/extensions/ZeroPortal: zero portal to master - retry (duration: 00m 06s) [00:27:29] ok, fixed [00:36:58] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [00:45:48] PROBLEM - puppet last run on es2005 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:38] !log restarted morebots [00:46:46] Logged the message, Master [00:46:57] !log on both puppetmasters: chown gitpuppet /var/lib/git/operations/puppet/.git/logs/refs/heads/production & .git/logs/HEAD & .git/logs/refs/remotes/origin to fix puppet-merge. git pulled on strontium [00:47:01] Logged the message, Master [00:50:09] ACKNOWLEDGEMENT - Apache HTTP on mw1062 is CRITICAL: Connection refused daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - DPKG on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - Disk space on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - HHVM processes on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - HHVM rendering on mw1062 is CRITICAL: Connection refused daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - NTP on mw1062 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn T86542 [00:50:09] ACKNOWLEDGEMENT - RAID on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:10] ACKNOWLEDGEMENT - configured eth on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:11] ACKNOWLEDGEMENT - dhclient process on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:11] ACKNOWLEDGEMENT - nutcracker port on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:11] ACKNOWLEDGEMENT - nutcracker process on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:50:12] ACKNOWLEDGEMENT - salt-minion processes on mw1062 is CRITICAL: Connection refused by host daniel_zahn T86542 [00:51:57] ori: HHVM monitoring broke recently, somehow [00:52:01] or is it new [00:52:30] further down on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&hoststatustypes=3&serviceprops=2097162&nostatusheader [00:52:59] ah.. hmm: Got status 502 from the graphite server at [00:53:08] RECOVERY - puppet last run on es2005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:53:40] but a link like this seems ok http://graphite.wikimedia.org/render?format=json&from=-10min&target=servers.mw1186.hhvmHealthCollector.queued.value [00:58:59] (03CR) 10BryanDavis: "Log volume in logstash went up from 68,723,096 events on 2015-01-13 to 123,314,490 events on 2015-01-14 when group1 was switched to loggin" [puppet] - 10https://gerrit.wikimedia.org/r/185222 (owner: 10BryanDavis) [01:02:08] morebots, there? [01:02:08] I am a logbot running on tools-exec-03. [01:02:08] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [01:02:08] To log a message, type !log . 
[01:02:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [01:02:56] andrewbogott: i restarted it, it was alive but on the other side of the netsplit [01:03:11] ok [01:13:37] (03PS1) 10Dzahn: wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) [01:14:23] (03PS2) 10Dzahn: wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) [01:15:16] (03PS3) 10Dzahn: wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) [01:16:46] (03CR) 10Dzahn: [C: 032] wikistats: fix Wikia updating [debs/wikistats] - 10https://gerrit.wikimedia.org/r/185357 (https://phabricator.wikimedia.org/T61943) (owner: 10Dzahn) [01:19:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:37:48] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 1 failures [01:55:48] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [02:00:28] (03PS1) 10Ori.livneh: admin: update my deployment script [puppet] - 10https://gerrit.wikimedia.org/r/185374 [02:03:40] !log ori Synchronized php-1.25wmf15/extensions/EventLogging: (no message) (duration: 00m 06s) [02:03:46] !log ori Synchronized php-1.25wmf14/extensions/EventLogging: (no message) (duration: 00m 05s) [02:03:52] Logged the message, Master [02:03:56] Logged the message, Master [02:06:08] (03PS2) 10Ori.livneh: admin: update my deployment script [puppet] - 10https://gerrit.wikimedia.org/r/185374 [02:06:53] !log EventLogging syncs were of I335ad42bb: JsonSchemaContent: Fix html rendering of objects and arrays [02:06:57] Logged the message, Master [02:19:00] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s) [02:19:05] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-16 02:19:04+00:00 [02:19:10] Logged the message, Master [02:19:14] Logged the message, Master [02:31:33] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s) [02:31:37] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-16 02:31:37+00:00 [02:31:40] Logged the message, Master [02:31:45] Logged the message, Master [02:43:58] PROBLEM - Mediawiki Apple Dictionary Bridge on terbium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia_Foundation not found on https://search.wikimedia.org:443https://search.wikimedia.org/?lang=ensite=wikipediasearch=Wikimedia_Foundationlimit=1 - 3389 bytes in 0.094 second response time [02:46:55] https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wikimedia_Foundation&limit=1 is spitting out PHP for me... 
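For reference, the "string Wikimedia_Foundation not found" alert above can be reproduced from any shell. A minimal sketch of what the check asserts, using the URL and expected string from the alert text:

```bash
# Fetch the Apple dictionary bridge and test for the string the Icinga
# check expects ("Wikimedia_Foundation", per the CRITICAL message above)
curl -sk 'https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wikimedia_Foundation&limit=1' \
    | grep -q 'Wikimedia_Foundation' && echo OK || echo CRITICAL
```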
[02:47:16] http://fpaste.org/170327/21376442/raw/ [02:48:12] * legoktm has to run [02:48:41] i guess Apple caches the results, because i'm still getting them from osx dictionary [02:49:12] i know nothing about that url :( [02:58:29] if i drop &limit=1 i get a more reasonable looking output [03:01:53] !log xtrabackup clone db1020 to db1046 [03:02:03] Logged the message, Master [03:09:26] !log ori Synchronized php-1.25wmf14/includes/content/JsonContent.php: I2f4f9cb343: Let subclasses specify content model in JsonContent (duration: 00m 06s) [03:09:34] Logged the message, Master [03:17:40] (03PS1) 10Springle: db1020 is primary [puppet] - 10https://gerrit.wikimedia.org/r/185384 [03:20:06] (03CR) 10Springle: [C: 032] db1020 is primary [puppet] - 10https://gerrit.wikimedia.org/r/185384 (owner: 10Springle) [03:34:11] (03Abandoned) 10OliverKeyes: Change the URLs used by Pybal to simplify tracking for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [03:36:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [03:39:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [03:40:08] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 17.18 ms [03:46:26] Host google? [03:47:46] presumably a sanity check for internet connectivity from the monitoring server [03:47:49] PROBLEM - haproxy process on dbproxy1002 is CRITICAL: PROCS CRITICAL: 2 processes with command name haproxy [03:47:58] google safe browsing check? [03:48:38] (modules/icinga/manifests/gsbmonitoring.pp) [03:48:44] okay. . . [03:49:03] ACKNOWLEDGEMENT - haproxy process on dbproxy1002 is CRITICAL: PROCS CRITICAL: 2 processes with command name haproxy Sean Pringle me [03:49:16] huh yeah safe browsing check, i hadn't looked at this before [03:49:53] but host down would indicate failure to reach their service to query it [03:49:58] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:14:13] (03Abandoned) 10TTO: Allow import from any WMF project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://bugzilla.wikimedia.org/15583) (owner: 10TTO) [04:22:39] !log on mw1228 doing some tests to figure out why incorrect Expires header is being sent on requests for /images/* [04:22:46] Logged the message, Master [04:27:48] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:29:08] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 20.26 ms [04:34:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:36:28] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 15.98 ms [04:37:09] fwiw the google safe browsing checks have existed since at least 2011. icinga says downtime is 2m10s today, no other problems in the past week. it does say down rather than unreachable. i haven't found a google page reporting on the status of the service.
[04:40:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jan 16 04:40:10 UTC 2015 (duration 40m 9s) [04:40:19] Logged the message, Master [04:45:28] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [04:54:09] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [04:54:59] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.00 ms [05:00:59] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [05:01:32] i am still able to perform the test manually from neon even though it's reported to be in the down state [05:01:39] curl "www.google.com/safebrowsing/diagnostic?site=mediawiki.org/" | grep -i --color "not currently listed as suspicious" [05:01:58] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 15.97 ms [05:02:07] hmph [05:04:48] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:18:28] 3operations: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#981163 (10tstarling) 3NEW [05:29:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [05:30:01] i wonder why the google safe browsing checks hardcode an IP (74.125.225.84) for the host check instead of using www.google.com, which is what the actual service checks use. [05:31:58] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 17.59 ms [05:32:26] jgage: does hashar's name appear in the git-log? [05:33:22] he loves optimizing away hostname resolution by replacing stable and readable names with IP addresses [05:34:39] i didn't trace it back further than this commit by peter in 2011, which features the IP: https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/e0eac18323f8241b47e8005851962959ca4db969 [05:37:26] mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com. [05:54:27] !log mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com [05:54:33] Logged the message, Master [05:55:06] Reedy: can you /cs access #wikipedia-userscripts add Technical_13 helper [05:55:53] Almost no-one is ever there. Thanks. [06:07:52] thanks, i often forget about SAL. i wonder who reads it. [06:08:09] (03Abandoned) 10KartikMistry: WIP: Use SSL in cxserver config [puppet] - 10https://gerrit.wikimedia.org/r/185157 (owner: 10KartikMistry) [06:09:11] well, i do [06:15:48] !log Icinga test of Mediawiki Apple Dictionary Bridge as https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wikimedia_Foundation&limit=1 returns an error since shortly after l10n update at 02:31 UTC, though URL works without &limit=1 and end user osx dictionary lookups are still working.
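For reference, the service check runs by hostname while the "Host google" check pings the hardcoded address jgage mentions at 05:30, which is why the host can flap while the service stays reachable. A sketch; the ping invocations approximate what the Icinga host check does, not its literal command line:

```bash
# Service check, by hostname (quoted verbatim at 05:01 above):
curl "www.google.com/safebrowsing/diagnostic?site=mediawiki.org/" \
    | grep -i --color "not currently listed as suspicious"

# Host check equivalent: a ping against the IP hardcoded in gsbmonitoring.pp.
# Loss towards this single address flaps "Host google" even while the
# service check above still succeeds.
ping -c 5 74.125.225.84
ping -c 5 www.google.com   # what it arguably should probe instead
```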
[06:15:54] Logged the message, Master [06:16:29] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [06:17:09] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 17.25 ms [06:29:09] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:29] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:38] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:08] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:49] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:51] (03CR) 10Gage: [C: 032] Exclude most udp2log messages from logstash [puppet] - 10https://gerrit.wikimedia.org/r/185222 (owner: 10BryanDavis) [06:37:58] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [06:40:09] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 17.11 ms [06:44:11] <_joe_> uhm [06:45:49] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:39] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:39] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [07:09:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [07:13:10] 3operations: DRIVE YOUR CAR AND GET PAID ADVERTISING FOR MONSTER ENERGY DRINK.($400 Weekly) - https://phabricator.wikimedia.org/T86999#981250 (10emailbot) [07:13:48] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:14:49] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 19.80 ms [07:27:58] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:28:49] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 19.96 ms [07:30:36] 3operations, Spam-Spam: DRIVE YOUR CAR AND GET PAID ADVERTISING FOR MONSTER ENERGY DRINK.($400 Weekly) - https://phabricator.wikimedia.org/T86999#981254 (10yuvipanda) [07:32:00] <_joe_> lol [07:32:13] I don't drive :( [07:32:18] <_joe_> what's with this packet loss? [07:32:29] <_joe_> ori: as in you don't have a driver license? [07:32:41] mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com [07:32:44] Yeah. [07:33:26] <_joe_> good for you, I guess this limits the places you can live in the US [07:33:31] I could pose with the cans on the roof of the car if someone else does the driving. [07:33:44] But safety would be a concern. [07:34:39] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:36:09] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 16.48 ms [07:37:00] re: limits the places you can live in the US -- yeah. It doesn't even work very well in SF. It wasn't an issue in New York, though. 
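For reference, jgage's mtr diagnosis above ("packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21") can be captured non-interactively for a log entry; a sketch:

```bash
# 100 probe cycles, printed as a per-hop loss report (run from an eqiad host)
mtr --report --report-cycles 100 206.126.236.21
```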
[07:37:19] <_joe_> yeah new york is _definitely_ a walking city [07:39:06] <_joe_> but well, I have to say I'm managing to take the car once a month in Rome, so it's doable basically everywhere with a public transportation system (and the one here is really bad) [07:40:47] it's doable here, but it means forfeiting on the few truly nice things about the bay area, which is the nearness of some really breathtaking natural beauty [07:41:00] *on one of the [07:41:27] <_joe_> well, back to packaging! https://gerrit.wikimedia.org/r/#/c/185187/ [07:41:40] <_joe_> I also have a nutcracker package for precise [07:42:11] oh wow, pcre cache and the leak fix [07:42:17] nice [07:42:17] <_joe_> yes [07:42:33] <_joe_> we'll take it to production when I'm there next week :) [07:44:12] nice work [07:45:08] * YuviPanda also doesn’t drive for safety concerns [07:45:23] * YuviPanda waves [07:45:36] hi Yuvi [07:45:41] hi ori [07:45:56] I’ll see you again in a few days! Sqeee! :) [07:46:04] <_joe_> YuviPanda: :)) [07:46:35] <_joe_> I have to say I hate travelling for work, but at least this time I have the incentive of meeting with quite a lot of you guys in person [07:46:56] yup. [07:47:20] <_joe_> next year I will probably be complaining for weeks :P [07:47:39] ‘eugh, have to see *those* guys again. Hope I do not end up punching anyone’? :) [07:48:38] <_joe_> nah, I am peaceful :) [07:49:49] <_joe_> no, I hate work travel because it means being away from my family, working in hotel rooms, not having the opportunity to truly visit the place I'm in, and I usually come back really tired [07:49:58] <_joe_> Athens ops meeting was a nice exception [07:50:26] _joe_: exactly [07:50:28] 09:42 < _joe_> we'll take it to production when I'm there next week :) [07:50:31] lol [07:50:41] you think you'll work the next two weeks? [07:51:01] <_joe_> paravoid: if I sadly know myself, I'll be up around 3 AM every day [07:51:15] * _joe_ needs sleeping pills [07:52:07] _joe_: you could join me and ori in getting 2AM subway sandwiches! [07:52:10] if ori still does those [07:52:38] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [07:52:44] <_joe_> I'm more the "man vs food" type [07:52:44] heh [07:53:02] what's with google [07:53:08] <_joe_> I see no package loss from neon to google [07:53:11] <_joe_> btw [07:53:29] <_joe_> mtr shows me packet loss between cr2-eqiad.wikimedia.org and 206.126.236.21 aka eqixva-google-gige.google.com [07:53:33] <_joe_> but I don't see that [07:53:51] me neither [07:54:12] * YuviPanda stays off betalabs today, goes to add proper nodejs support on toollabs [07:54:29] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 18.66 ms [07:56:07] hardcoded IP address in puppet... [07:56:33] <_joe_> paravoid: srsly? [07:56:43] <_joe_> sigh [07:57:35] oh so TimStarling's PCRE cache work landed in HHVM I see? [07:57:36] nice! 
[07:58:03] (03PS1) 10Springle: repool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185394 [07:59:08] (03CR) 10Springle: [C: 032] repool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185394 (owner: 10Springle) [07:59:12] (03Merged) 10jenkins-bot: repool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185394 (owner: 10Springle) [08:00:59] !log springle Synchronized wmf-config/db-eqiad.php: repool db1051 db1056, warm up (duration: 00m 10s) [08:01:05] Logged the message, Master [08:01:59] RECOVERY - Mediawiki Apple Dictionary Bridge on terbium is OK: HTTP OK: HTTP/1.1 200 OK - 748 bytes in 0.156 second response time [08:02:14] <_joe_> ok, who did what? [08:02:15] <_joe_> :) [08:07:08] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [08:07:29] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 19.04 ms [08:08:07] springle: there's a dbproxy1002 check_failover alert [08:21:30] (03PS1) 10Giuseppe Lavagetto: mediawiki: use HHVM for the apple search dictionary [puppet] - 10https://gerrit.wikimedia.org/r/185396 [08:21:35] <_joe_> paravoid: ^^ [08:22:42] (03PS2) 10Giuseppe Lavagetto: mediawiki: use HHVM for the apple search dictionary [puppet] - 10https://gerrit.wikimedia.org/r/185396 [08:24:47] (03CR) 10Giuseppe Lavagetto: "tested on testwiki, it does in fact use HHVM correctly." [puppet] - 10https://gerrit.wikimedia.org/r/185396 (owner: 10Giuseppe Lavagetto) [08:25:37] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use HHVM for the apple search dictionary [puppet] - 10https://gerrit.wikimedia.org/r/185396 (owner: 10Giuseppe Lavagetto) [08:25:38] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.002428770065 secs [08:26:45] mixing tabs and spaces [08:26:46] (03CR) 10Ori.livneh: "Already merged, but +1 anyway -- this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/185396 (owner: 10Giuseppe Lavagetto) [08:27:04] paravoid: I saw but chose not to complain :P [08:27:17] <_joe_> paravoid: I already know, I'll fix that [08:27:26] <_joe_> I just wanted to fix it in prod ASAP [08:27:40] <_joe_> all those poor iphone users... [08:27:45] so [08:27:53] can we encode the server in a header? [08:27:59] <_joe_> yes [08:28:18] we currently send "Server: Apache" and X-Powered-By [08:28:36] X-Cache has been very useful in debugging varnish issues [08:28:59] perhaps we want... Server: mw1052 (Apache, HHVM) or something? [08:29:31] or even more arbitrary tags perhaps, e.g. trusty, jessie etc. [08:29:38] but hostname for sure [08:29:52] <_joe_> I think the hostname is enough [08:29:53] (03PS1) 10Giuseppe Lavagetto: mediawiki: retab of the virtualhost for search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/185397 [08:30:22] well, you have more in X-P-By [08:30:26] and there's no point in two headers [08:30:55] <_joe_> X-P-By is set by HHVM in most cases, we can always tamper with it of course [08:32:12] <_joe_> the Server: header is set by apache internally [08:32:26] <_joe_> I think we can just change ServerTokens for that [08:32:39] <_joe_> meaning we can turn it off [08:33:00] there's always varnish :) [08:33:20] <_joe_> well, varnish doesn't know the hostname [08:33:38] no, but can replace Server with the value of another header [08:33:41] <_joe_> or are you suggesting to mangle headers so that we suppress one and unify it? [08:34:08] <_joe_> yeah, I'd prefer not to if possible. 
I'm taking a look [08:34:12] well let's first agree what we want to do [08:34:15] if anything [08:35:33] <_joe_> I'd say having one header that tells you a) it is apache b) the hostname serving the request and c) if it was served by HHVM, which version; and if it was a static content, state it [08:35:39] <_joe_> so something like [08:35:57] <_joe_> Server: Apache (mw1053) - HHVM/3.3.1 [08:36:15] <_joe_> or Server: Apache (mw1053) - static [08:36:43] <_joe_> and we can remove the X-Powered-By header too [08:38:48] <_joe_> interestingly enough, it seems the Server: header can't be unset in apache 2.4 [08:38:55] <_joe_> but we can probably overwrite it [08:39:59] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [08:40:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 17.02 ms [08:41:35] (03CR) 10Faidon Liambotis: [C: 04-2] "Why isn't ferm::rule enough? I'd rather not." [puppet] - 10https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713) (owner: 10Dzahn) [08:47:40] YuviPanda: there? [08:48:38] kart_: ‘sup [08:49:31] wtf firefox. can’t resolve gerrit? [08:49:35] chrome can [08:49:37] * YuviPanda mumbles [08:50:39] YuviPanda: you merged role for cxserver beta/production [08:50:43] but, https://gerrit.wikimedia.org/r/#/c/180125/6/manifests/role/cxserver.pp [08:50:44] yup [08:50:54] We sometimes need different config :) [08:51:05] especially: see the above PS [08:51:15] kart_: hiera! [08:51:27] YuviPanda: ouch [08:51:29] :) [08:51:55] kart_: so you either let the base class (in this case ::cxserver) have no params or defaults to prod, and override for labs [08:51:59] kart_: see the overrides for labs at http://wikitech.wikimedia.org/wiki/Hiera:deployment-prep [08:52:02] well, betalabs [08:52:19] more hiera documentation at http://wikitech.wikimedia.org/wiki/Hiera [08:52:44] Nods. Thanks. [08:52:53] _joe_: also, from ^ I don’t actually see anything that looks up based on $::realm [08:53:05] should / could we add one, for things that are ‘if you are a labs instance, do this' [08:53:19] <_joe_> YuviPanda: YuviPanda for a simple reason - labs has its own hiera config [08:53:19] that or we could already do this and I’m missing it [08:53:30] <_joe_> so it's naturally separated [08:53:49] _joe_: you mean the wikitech ones? [08:53:53] or is it there somewhere else too? [08:54:12] <_joe_> YuviPanda: I need to document this, you're right [08:54:15] (03CR) 10Alexandros Kosiaris: "You mean gerrit is not configurable enough to make it listen on both ports ?" [puppet] - 10https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713) (owner: 10Dzahn) [08:54:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [08:54:22] _joe_: yup. [08:54:35] ctrl-f realm gives me [08:54:35] hieradata/eqiad/admin.yaml ($::realm) [08:54:37] <_joe_> YuviPanda: you can have hieradata/labs.yaml [08:54:39] <_joe_> for example [08:54:40] but I’m pretty sure that’s $::site [08:54:46] _joe_: aha! that’s what I was looking for :) [08:54:59] _joe_: oh, it already exists [08:55:02] * YuviPanda looks sheepish now [08:55:09] <_joe_> YuviPanda: modules/puppetmaster/files/labs.hiera.yaml [08:55:18] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 18.19 ms [08:55:39] right. [08:56:23] _joe_: makes sense now. thanks.
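Back on the Server-header thread above: the three headers being debated are easy to compare on a live response. A sketch; the URL is just an illustrative production page:

```bash
# Server and X-Powered-By come from apache/HHVM; X-Cache is added by varnish
curl -sI 'https://en.wikipedia.org/wiki/Main_Page' \
    | grep -iE '^(server|x-powered-by|x-cache):'
```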
I should probably migrate at least *some* of the Hiera:deployment-prep overrides to a yaml file in ops/puppet [08:58:15] YuviPanda: that would be nice [08:58:30] (03PS1) 10KartikMistry: Moved comment at right place! [puppet] - 10https://gerrit.wikimedia.org/r/185400 [08:58:35] I am not opposed to the wiki page, but for beta it is probably better to have changes go through the Gerrit review :D [08:58:45] hashar: yeah, I agree [08:59:10] though I have no idea whether our hiera file hierarchy would support inheritance such as labs -> deployment-prep [08:59:27] (03PS2) 10Yuvipanda: cxserver: Move comment to correct place [puppet] - 10https://gerrit.wikimedia.org/r/185400 (owner: 10KartikMistry) [08:59:33] hashar: it does, I think. [08:59:40] mwyaml overrides nuyaml, I think [08:59:45] so you can still make changes to it on wikitech [09:00:02] (03CR) 10Hashar: [C: 031] cxserver: Move comment to correct place [puppet] - 10https://gerrit.wikimedia.org/r/185400 (owner: 10KartikMistry) [09:00:12] <_joe_> YuviPanda, hashar I strongly object to this [09:00:19] <_joe_> why gerrit? [09:00:26] so we can review? [09:00:27] <_joe_> wiki has revision history [09:00:35] and attach the change to a bug [09:00:35] well, primarily so when someone changes something and git greps, they find this as well [09:00:45] <_joe_> so that you need to wait for me to give +2 to you? [09:00:47] if you change the name of a param, for example. [09:01:00] <_joe_> srsly? [09:01:00] well on beta we can cherry pick the patch on the local puppetmaster [09:01:03] _joe_: no, they can still override it via wikitech (if I understood the hierarchy correctly) [09:01:12] but yeah, that is some more patches that will have to be +2ed by ops eventually [09:01:13] <_joe_> hashar: how is that better than editing the wiki page? [09:01:19] <_joe_> you're still skipping review then [09:01:20] <_joe_> :P [09:01:27] <_joe_> I really don't get it guys [09:01:29] on the wiki one can make a change without any peer review [09:01:53] <_joe_> even on the puppetmaster by cherry-picking changes that have -1 or -2 on gerrit [09:01:53] whereas we can propose the patch in Gerrit, wait for review and once reviewed deploy/cherry pick [09:01:57] <_joe_> or editing files by hand [09:02:17] <_joe_> I think file-based hiera in labs should just be for very general things [09:02:50] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Move comment to correct place [puppet] - 10https://gerrit.wikimedia.org/r/185400 (owner: 10KartikMistry) [09:02:52] <_joe_> but well, it's your poisoned well :) Just don't come to complain when puppet breaks there. [09:02:54] _joe_: well, re: cherrypicking them, sometimes they get -2d with ‘do not do this’ and no alternative is offered. [09:02:56] https://phabricator.wikimedia.org/T78076 for example [09:03:29] (03PS1) 10KartikMistry: admin: Add dotfiles for kartik [puppet] - 10https://gerrit.wikimedia.org/r/185401 [09:03:42] _joe_: I’m also killing all cherry-picks that aren’t there just for testing. there were 3, now there’s 1 (the one I just linked to) [09:03:46] that I’m not fully sure how to tackle. [09:04:04] I really hate when bugs make it to production :/ [09:04:23] <_joe_> well, reimage from scratch the appservers in beta :) [09:04:45] _joe_: and sometimes we get a rough patch deployed which is merely to unblock us, then we iterate with ops until the patch is production grade [09:05:16] that was the case to get hhvm auto update on the CI slaves. I originally used ensure => latest, got that deployed which got me hhvm.
Then iterated with ops to get the patch to use Debian unattended upgrade instead [09:05:20] but at least, I got hhvm installed :] [09:05:37] man, I was going to stay out of this today and work on toollabs instead. Totally failing on that now. [09:05:41] <_joe_> hashar: so how is having it in gerrit (the hiera data) any different than having them in wikitech [09:05:47] <_joe_> if you don't care about the review? [09:06:01] <_joe_> YuviPanda: go back to toollabs [09:06:02] <_joe_> :) [09:06:10] _joe_: I think it’s not ‘they do not care about review (always)’. They care about it less than ops does, but it’s not binary. [09:06:16] <_joe_> hashar: my point is that hiera data don't matter that much in a review [09:06:26] _joe_: I primarily want it for git-grepping [09:06:26] <_joe_> in general [09:06:35] (03PS1) 10Faidon Liambotis: Remove decom'ed server "sanger" from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/185402 [09:06:37] (03PS1) 10Faidon Liambotis: ldap: cleanup unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/185403 [09:06:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:06:59] (03CR) 10Faidon Liambotis: [C: 032] "Fairly obvious." [puppet] - 10https://gerrit.wikimedia.org/r/185402 (owner: 10Faidon Liambotis) [09:07:00] woo, LDAP cleanup [09:07:04] yeah a bit [09:07:06] more are coming [09:07:09] not much though [09:07:16] this is primarily a certs.pp cleanup [09:07:25] really the whole ldap module should go [09:07:28] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.37 ms [09:07:29] and replaced with OpenLDAP [09:07:37] and role classes [09:07:59] (and we already have a module for openldap) [09:08:09] _joe_: going back to toollabs now. All of this definitely needs to be talked about in person over the next two weeks. [09:08:35] (03CR) 10Faidon Liambotis: [C: 032] ldap: cleanup unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/185403 (owner: 10Faidon Liambotis) [09:08:41] _joe_: yeah I understand your point. Maybe I am overthinking it :-] [09:09:04] merges? [09:09:07] dammit [09:09:47] we should get you guys a dedicated Zuul setup that would be allowed to merge changes [09:09:53] no thanks [09:10:18] that saves you a click! [09:10:49] (actually not at all) [09:11:11] (03PS1) 10Faidon Liambotis: mailman: move into a new, separate module [puppet] - 10https://gerrit.wikimedia.org/r/185404 [09:11:20] what do people think of ^^ [09:11:24] should this be named "lists"? [09:11:27] or mailman is fine? [09:11:33] it's fairly wmf-specific [09:12:00] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: puppet fail [09:12:23] which blurs the line with the role class as well, e.g. all those monitoring checks could be folded into the module [09:13:09] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:13:16] _joe_ / akosiaris? [09:13:19] (03CR) 10Alexandros Kosiaris: [C: 032] admin: Add dotfiles for kartik [puppet] - 10https://gerrit.wikimedia.org/r/185401 (owner: 10KartikMistry) [09:14:20] paravoid: mailman is fine, the role should have the monitoring checks and named lists :-) [09:14:33] confused ya enough ? 
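To make YuviPanda's git-grepping argument from above concrete: with beta's hiera data living in ops/puppet, a parameter rename becomes a single search over code and data. A sketch; the parameter name is hypothetical:

```bash
# One pass over manifests, modules and hiera data when renaming a parameter
# ("apertium_url" is a made-up name used only for illustration)
git grep -n 'apertium_url' -- manifests/ modules/ hieradata/
# Overrides that live only on wikitech (Hiera:deployment-prep) never show up here.
```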
well the module has lists.wikimedia.org hardcoded in it [09:14:59] and you can't get away from it much [09:15:07] we have HTML templates, for instance [09:15:50] <_joe_> if we have a lot of wmf-specific things, lists is probably less misleading for other people [09:16:06] if you feel pedantic enough you can move the wmf-specific content to the role [09:16:18] no, almost everything is wmf-specific [09:16:25] <_joe_> but well I don't care much, mailman is fine anyways [09:17:02] <_joe_> gee I forgot why I separated ini directives for hhvm. sigh [09:17:08] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:17:14] ? [09:17:49] OK, so what are we supposed to do about google being down according to icinga ? [09:18:04] my guess is nothing but I am curious [09:18:08] (03PS2) 10Faidon Liambotis: mailman: move into a new, separate module [puppet] - 10https://gerrit.wikimedia.org/r/185404 [09:18:08] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 16.00 ms [09:18:15] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mailman: move into a new, separate module [puppet] - 10https://gerrit.wikimedia.org/r/185404 (owner: 10Faidon Liambotis) [09:18:17] <_joe_> akosiaris: it's the check that is wrong, it's using a hardcoded IP [09:18:38] <_joe_> I'll fix it when I'm done with HHVM if no one got to it [09:18:53] yeah, that should be fixed, but still that does not change my point [09:19:19] <_joe_> akosiaris: well, it's like a poor man's probe of network reachability [09:19:26] <_joe_> I guess [09:19:57] yeah I get the idea. I always setup those as well [09:20:06] I just don't have them notify [09:20:30] it's more like an indication for ops when they login into icinga web ui that something is terribly wrong [09:20:51] (03PS1) 10KartikMistry: WIP: cxserver: Use different URL for apertium in BetaLabs [puppet] - 10https://gerrit.wikimedia.org/r/185406 [09:21:09] akosiaris: Tried to do it, but feel free to fix ^^ :) [09:24:31] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:25:39] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 20.33 ms [09:35:34] <_joe_> kart_: that won't work [09:35:45] <_joe_> in labs you have a different hiera setup [09:36:40] (03Abandoned) 10Alexandros Kosiaris: WIP: cxserver: Use different URL for apertium in BetaLabs [puppet] - 10https://gerrit.wikimedia.org/r/185406 (owner: 10KartikMistry) [09:37:04] _joe_: Labs = Beta Cluster, you mean? [09:37:53] akosiaris: thanks. [09:37:57] _joe_: got it. [09:38:23] I should've read the discussion more carefully :) [09:38:53] I think it is already fixed kart_ [09:39:02] yesterday by YuviPanda [09:39:36] it is just that the old config.js was a link and not the config file itself [09:39:55] yup, I did that. [09:40:01] I also cleaned out the config.js file [09:40:06] so it should work? [09:40:37] YuviPanda: which one ? the cxserver/config.js or the cxserver/cxserver/config.js ? [09:40:44] I rm’d both [09:40:49] latter was a symlink to former [09:41:00] so the symlink was there just a few mins ago [09:41:07] I removed it and ran puppet [09:41:12] and now it is OK [09:41:26] hmm [09:41:30] now I wonder how that came back [09:41:33] so perhaps you were caught in a race ? [09:41:40] with? [09:42:03] between merging, puppet running and you rming the files ? [09:42:10] oh, that’s possible [09:42:18] not the most likely scenario but one that explains it [09:42:20] but puppet should’ve run again since?
hmm [09:42:24] at least if it does not happen again [09:42:39] puppet would have refused to remove the symlink [09:42:50] hmm [09:42:51] and populate it with a new file (unfortunately) [09:43:06] i think force => true would have done it though [09:43:09] hmm [09:43:17] or I wonder if jenkins or somesuch is making it a symlink [09:43:32] everything was owned by jenkins except config.js which puppet keeps setting to root [09:43:48] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [09:45:39] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 18.96 ms [09:48:29] akosiaris: sort of a security-related question, would any ops mind if i enable ptrace in labs ? [09:49:43] as in ? your VMs ? [09:49:46] or in general ? [09:49:59] and what do you mean by enable ptrace btw... [09:50:30] akosiaris: preferably, generally. I mean : sysctl sys.kernel.yama.ptrace_scope=0 [09:51:51] akosiaris: the reason behind it is i want to install reptyr labs-wide, it is very handy for senile people like me who forget to prefix commands with screen [09:52:00] and then log-out and cry [09:52:34] the problem is reptyr relies on ptrace, which is off by default in ubuntu [09:52:56] it is not off. It is disallowed only for non-child processes of the same uid [09:53:10] root bypasses anything by default anyway [09:53:15] same same :) [09:53:33] i'm not root in toollabs, only on some boxes [09:54:02] 3Wikimedia-Labs-wikitech-interface, operations: Interwiki map broken on wikitech - https://phabricator.wikimedia.org/T43786#981362 (10jayvdb) 5Open>3Resolved a:3jayvdb Appears the local interwikis are now working and in the sites interwikimap. [09:54:24] matanya: have you read this ? https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection [09:54:26] ? [09:55:02] (03PS1) 10Yuvipanda: beta: Ensure that mw related users are present in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/185409 (https://phabricator.wikimedia.org/T67591) [09:55:16] now I have [09:55:52] so you advise against akosiaris ? [09:55:56] so, I am leaning towards saying it is OK for labs, but I do have some reservations [09:56:06] (03CR) 10Yuvipanda: "Related patchset: https://gerrit.wikimedia.org/r/#/c/185409/" [puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [09:56:24] I'd advise against it [09:56:33] was waiting for this :D [09:56:34] it's an unnecessary deviation of prod from labs [09:57:08] and for toollabs, it's degraded security [09:57:51] paravoid: toollabs ? [09:58:12] I am unclear on the attack vectors for toollabs [09:58:29] then again I am always unclear on the entire environment for toollabs tbh [09:58:55] :) if you think betalabs is terrible... [09:58:55] not arguing with you though on the unnecessary deviation of prod from labs [09:59:39] YuviPanda: I dont. I am not overly familiar with the grid engine, that's all [10:00:16] akosiaris: yeah, grid engine has terrible documentation, and is just overall terrible for administering [10:00:29] unless you have spent ages doing it already, like Coren has. [10:00:51] I was afraid of that [10:01:32] outside the grid, it’s not so bad. it’s fairly well puppetized. [10:05:50] paravoid: so, you abandoned https://gerrit.wikimedia.org/r/#/c/134519/ a long time ago (from bd808) (for good reason). It lived on as a cherry-pick on beta-labs since, and I’ve fixed enough other things to be able to remove it now.
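For reference, matanya's request above in shell form. Note the canonical sysctl key is kernel.yama.ptrace_scope; the "sys." prefix in the quoted command belongs to the /proc path, not the sysctl name. The reptyr usage below is a generic sketch:

```bash
# Yama scope: 1 (Ubuntu default) limits ptrace to direct children;
# 0 restores classic same-uid attach, which reptyr needs
cat /proc/sys/kernel/yama/ptrace_scope
sudo sysctl kernel.yama.ptrace_scope=0   # note: no "sys." prefix in the key

# Typical rescue of a command started outside screen: open a new screen
# session, then steal the forgotten process by pid
screen
reptyr 12345   # pid is illustrative
```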
Just needs https://gerrit.wikimedia.org/r/#/c/185409/ which is okayish, I think (though I hope at some point the beta module [10:05:50] doesn’t exist). [10:06:01] I’ll wait for you or ori to +1 before merging. [10:07:09] and then I can get rid of that local cherrypick on beta [10:07:53] that's fine [10:08:33] alright then [10:08:36] I don't really care about what's under modules/beta [10:09:04] (03CR) 10Yuvipanda: [C: 032] beta: Ensure that mw related users are present in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/185409 (https://phabricator.wikimedia.org/T67591) (owner: 10Yuvipanda) [10:09:45] hmm, I hope we can kill the beta/ module at some point [10:10:07] I’m supposed to be working on making beta better this quarter [10:10:36] that's nice [10:11:04] yeah. monitoring + unification [10:11:06] as long as it's not if $::realm hacks all over the place, I'm onboard [10:11:22] yeah, no realm branches [10:11:26] hiera. hiera everywhere [10:11:39] careful about that too [10:12:07] if parameters can stay the same that's even better [10:12:15] sometimes they can, it's just that people took a shortcut [10:12:30] by the end of it, I’d ideally want an alert on betalabs puppetmaster that screamed if there was an unmerged cherry-picked patch on it for more than, say, a day [10:12:41] paravoid: true. I think *the* biggest culprit is /var being fucking 2G on labs [10:12:44] which is terrible and stupid [10:12:58] and a huge divergence from prod [10:13:03] oh god finally [10:13:11] a labs person that agrees with me on this [10:13:30] on the labs jessie thread I've basically said "get rid of the separate /var" very persistently [10:13:37] paravoid: sadly I’m also the only labs person who thinks this way rather than ‘it should be fixed by not polluting /var/log' [10:13:49] well, I really want us to just LVM / [10:13:54] I'm okay with that [10:14:00] (03PS2) 10Giuseppe Lavagetto: [WMF] New package with additional patches and fixes to the ini files and to the upstart/init scripts [debs/hhvm] - 10https://gerrit.wikimedia.org/r/185187 [10:14:00] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [10:14:17] it would be ok if prod also has tiny /vars, but no it does not [10:14:18] although really, a non-LVM / + reasonable base image space should be fine [10:14:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 15.40 ms [10:15:37] paravoid: yeah. let’s hope I can convince Coren and andrewbogott_afk over the next two weeks :) [10:15:58] and yay, local hacks on betalabs puppetmaster down to 1 [10:16:00] from 3 [10:16:41] you know, the image's size doesn't correspond to what's actually on disk [10:16:52] raw image files are sparse and qcow2 is even smarter than that [10:17:27] so even if the base image is 30G, unless something actually uses this, it won't take up 30G [10:18:20] if something uses 30G though, and removes it, I'm not sure if space is reclaimed [10:18:32] years ago there was work for this to happen using SCSI TRIM commands (discard) [10:18:33] yup, yup, yup.
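For reference, paravoid's point at the end of this exchange, apparent size versus space actually consumed, is visible directly on an image file. A sketch; the filename is illustrative:

```bash
# Sparse raw and qcow2 images only consume what the guest has written
qemu-img info base.qcow2            # virtual size vs "disk size" in use
du -h base.qcow2                    # blocks actually allocated on the host
du -h --apparent-size base.qcow2    # the nominal (e.g. 30G) size
```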
There’s literally 0 reasons for our current allocation [10:18:44] and most of our instances have lots of unallocated space that they never use [10:18:51] and if they do, it’s with the labs srv role, that just puts it all in /srv [10:19:02] ok [10:19:04] let's clean it up then [10:19:20] I have a very strong opinion against a separate /var [10:19:26] as it happens, it also fails with jessie right now [10:20:00] well, if we don’t have a separate /var [10:20:04] it would just use up space from / [10:20:05] but [10:20:10] /dev/vda1 9.3G 5.9G 3.0G 67% / [10:20:14] root itself is smallish [10:20:15] exactly [10:20:24] omg I've told the exact same things to Coren [10:20:30] so that’s terrible to [10:20:31] *too [10:21:01] so even if you merge 2G from /var and 2G from /var/log (the latter happened recently after lots of complaining), that’s still only 14G [10:21:04] not enough, I’d think [10:21:05] at all [10:21:25] well you get all the unused spare space from / too though [10:21:37] free space is currently fragmented [10:21:38] that's silly [10:21:56] so if I look at this instance [10:21:57] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000006a5.eqiad.wmflabs [10:22:02] it says 40G [10:22:03] <_joe_> I have a deja vu [10:22:06] but of course, it’s not used at all [10:22:17] <_joe_> like this discussion has been done 10 times at least :) [10:22:30] <_joe_> I do agree completely [10:22:43] <_joe_> lemme add that disk space is the cheapest commodity we have nowadays [10:22:46] how about we all corner Coren and andrewbogott_afk next week and make this happen? :) [10:23:11] not just for jessie (but that would be a start) but also rebuild trusty instances. [10:23:22] and then I can go around killing all the places I had to make log file paths configurable [10:23:26] that’s just a waste of time [10:23:40] although it’s been months since a /var/log alert now, which is nice [10:23:55] my proposal would just be [10:24:02] 30G / no LVM [10:24:13] and LVM on /srv [10:24:22] <_joe_> +1 [10:24:34] if people want to write a bunch of crazy logs, they can always symlink /var/log/crap to /srv/crap [10:24:41] or log to /srv/crap directly [10:24:44] yeah, unified / [10:24:50] plus LVM [10:24:52] seems nice. [10:25:00] <_joe_> "/srv/crap" is defined in FHS? :P [10:25:01] or mount vg-crap to /var/log/crap [10:25:05] <_joe_> it sounds nice [10:25:12] or bind-mount /var/log/crap to /srv/crap [10:25:15] or WHATEVER [10:25:19] it's not rocket science [10:25:25] <_joe_> "a general container for java applications" [10:25:51] yup, yup [10:25:52] also, if you need more than say, 10G for logs, you're doing something wrong [10:26:09] but needing more than 2G for /var is completely sane, I think [10:26:19] you either log for a very long time, which is bad from a privacy perspective [10:26:29] or you log in a very high rate, which is bad from an IOPS perspective [10:26:35] (for labs, that is) [10:26:39] yes, 2GB for /var is just bonkers [10:26:56] how about instead I just write to an sqlite file on NFS? :) [10:27:24] it’s not like that would saturate links or anything. 
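The layout proposed above (plain 30G root, LVM only where growth is expected) would look roughly like this at instance build time. A sketch under stated assumptions: the spare disk /dev/vdb and the volume group name are illustrative:

```bash
# Non-LVM / stays as built; put all flexible space in an LVM-backed /srv
pvcreate /dev/vdb
vgcreate vd /dev/vdb
lvcreate -n srv -l 100%FREE vd
mkfs.ext4 /dev/vd/srv
mkdir -p /srv
mount /dev/vd/srv /srv
```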
[10:27:54] labstore1001's ganglia is broken again I see [10:27:54] *sigh* [10:28:24] <_joe_> that may be my fault [10:29:16] I have filed https://phabricator.wikimedia.org/T87003 for this [10:29:52] :D [10:31:06] edited description to be better [10:32:08] hmm [10:32:14] I can’t join a channel [10:35:04] 3Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#981453 (10yuvipanda) 5Open>3Resolved a:3yuvipanda They have a sane shell now! \o/ [10:35:05] 3Beta-Cluster, operations: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#981456 (10yuvipanda) [10:35:53] 3Beta-Cluster, operations: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#981459 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This and the associated issues (different shell, etc) have been fixed. prod and beta are unified on mwdepl... [10:41:53] akosiaris: re: https://phabricator.wikimedia.org/T86143#981469 [10:42:01] akosiaris: yes! I opened up ferm rules for those, and it still doesn’t work! [10:42:15] akosiaris: port 22 is blocked from shinken-01 and shinken-server-01 to deployment-mediawiki02, for example [10:47:01] ok, I am having a look [10:48:03] (03PS2) 10Giuseppe Lavagetto: mediawiki: retab of the virtualhost for search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/185397 [10:48:05] (03PS1) 10Giuseppe Lavagetto: monitoring: allow host to check based on the fqdn of a host [puppet] - 10https://gerrit.wikimedia.org/r/185414 [10:48:52] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: retab of the virtualhost for search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/185397 (owner: 10Giuseppe Lavagetto) [10:56:11] (03PS3) 10Giuseppe Lavagetto: [WMF] New package with additional patches and fixes to the ini files and to the upstart/init scripts [debs/hhvm] - 10https://gerrit.wikimedia.org/r/185187 [11:02:54] YuviPanda: deployment-mediawiki02 does not include the ferm class (anymore?) [11:02:54] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#835083 (10yuvipanda) [11:03:07] that is why it is not updating the rules [11:03:25] now why did it used to include ferm and no longer does ? [11:03:27] searching [11:04:09] (03PS1) 10Springle: require at least one haproxy proxy, but allow multiple. [puppet] - 10https://gerrit.wikimedia.org/r/185416 [11:04:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [11:04:38] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 16.73 ms [11:05:31] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#981511 (10mark) >>! In T78076#976105, @bd808 wrote: >>>! In T78076#975352, @yuvipanda wrote: >> Why is this needed again? T76086 seems to have fixed T75206. And as @ori said, we should be agnostic about the... [11:05:53] (03CR) 10Springle: [C: 032] require at least one haproxy proxy, but allow multiple. [puppet] - 10https://gerrit.wikimedia.org/r/185416 (owner: 10Springle) [11:06:02] akosiaris: aha! that makes sense, yeah. [11:06:07] I’ll let you dig :) [11:12:21] * YuviPanda goes afk for food [11:18:49] 3Release-Engineering, operations, Continuous-Integration: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#981523 (10hashar) I have no spare cycles to implement the feature in Zuul.
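The blocked-port symptom YuviPanda reports just above is quick to confirm with a plain TCP probe from the Shinken hosts; a sketch:

```bash
# From shinken-01 / shinken-server-01: is TCP/22 on the target reachable?
nc -zvw2 deployment-mediawiki02 22
# or, closer to what an ssh service check exercises:
ssh -o ConnectTimeout=2 -o BatchMode=yes deployment-mediawiki02 true
```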
That is straight python, should not be too hard for anyone to realize it. [11:22:48] RECOVERY - haproxy process on dbproxy1002 is OK: PROCS OK: 2 processes with command name haproxy [11:23:32] _joe_: mw1244 is throwing 503s [11:23:44] and mw1017 has puppet disabled but I guess you know that [11:24:05] apergos: dataset1001 disk space alerts again [11:24:41] YuviPanda: updated https://phabricator.wikimedia.org/T86143, it seems the base::firewall was removed. It needs to be re-added from ferm to pick up changes [11:24:48] s/from/for/ [11:26:43] <_joe_> paravoid: yes mw1017 I know [11:27:17] <_joe_> (and I've put in a comment) [11:27:23] akosiaris: ah, hmm. [11:27:31] what’s the status of base::firewall in our prod hosts? [11:28:26] included selectively [11:28:52] do the mw hosts have them? [11:28:58] don't think so [11:29:37] hmm [11:29:48] <_joe_> I don't think so, no [11:33:18] akosiaris: so I’m not sure why / when it was added at all [11:33:24] oh [11:33:36] hmm [11:33:53] I suppose the way I’d want to fix this is by including base::firewall on mw* hosts in general [11:34:01] I suppose eventually we’d want base::firewall on all hosts [11:34:30] <_joe_> why? [11:34:33] no [11:34:42] <_joe_> we don't actually [11:34:47] not all hosts need it and it can impact performance [11:36:59] hmm [11:37:00] ok [11:37:10] * YuviPanda doesn’t know much about firewalls other than trivial iptables rules [11:37:23] that's all we have :) [11:37:25] so far [11:37:29] just trivial iptables rules [11:37:43] they're just encapsulated in a couple abstraction layers [11:38:08] oh sure, ferm I understand. But I suppose I don’t fully understand why we firewall off some internal-only services and not others. [11:38:16] or if we do that at all [11:38:17] actually [11:38:31] well we started with no firewalls [11:38:46] then there were some horrible iptables.pp scripts [11:38:59] since ferm, we've been trying to expand the usage of firewalls [11:39:21] public services (public IPs, that is) are obviously higher priority than private ones [11:39:29] akosiaris: so I see that the problem is in all hosts where natfix *was* applied at some point (deployment-prep and integration). And I don’t want to apply base::firewall in beta when prod won’t have it (at least for now). perhaps ‘solution’ is to hand drop the default drop policy? [11:39:48] right. I understand why we do that on hosts with a public IP [11:40:04] but not sure why we do it for private hosts. [11:41:19] well, there's no reason for one private host to be able to SSH to another, for example [11:41:39] there's not? [11:41:40] but that's a lower priority for sure, I'm not sure which private hosts you're referring to specifically [11:41:43] how do you copy stuff between hosts? :P [11:41:53] hmm, so just defence in depth? [11:42:02] paravoid: sca* hosts seem to have base::firewall applied. [11:42:10] well, at least they have ferm rules for opening up particular ports [11:42:29] that are served by cxserver, apertium, etc [11:42:33] mark: scp -3? :) [11:42:45] what's that? [11:43:04] copy from one remote to the other via the localhost, but it was mostly a joke [11:43:17] oh that [11:45:08] imho firewalling of purely internal hosts is a discussion we may want to have, but later [11:46:04] yeah, I think the solution to my current problem is to make beta more like prod by removing remnants of old base::firewall [11:46:26] what is beta-specific here? [11:46:40] or how does this relate to beta?
[11:46:50] paravoid: https://phabricator.wikimedia.org/T86143 [11:46:58] I want to have ssh checks on labs. [11:47:08] default security group already opens ssh to everywhere internally in labs [11:47:23] but beta hosts had base::firewall applied a long time ago [11:47:30] for a very, very kludgy NAT fix [11:47:36] ohgod [11:47:42] because labs instances can’t actually hit their own public IPs [11:48:06] akosiaris: removed the natfix and related kludginess (we moved the kludginess to the DNS server instead) [11:48:14] paravoid: but the remnants of base::firewall still exist [11:48:27] oh wait, *THAT* is why there are ferm rules in the sca* roles [11:48:36] the problem aiui is that there is an *unmanaged* firewall on those hosts [11:48:42] yup [11:48:44] i.e. there is ferm config lying around that isn't managed by puppet [11:48:48] oh [11:48:51] no, I think not [11:48:56] afaik, at least [11:49:05] it just wasn’t ensure => absent (or equivalent) when removed [11:49:12] well exactly [11:49:13] oh, I see. we might be saying the same things [11:49:30] if you include base::firewall, this installs ferm and realizes all the ferm::rules [11:49:34] right, I first read that as the firewall is being hand-managed [11:49:37] :) [11:49:38] = creates /etc/ferm/* [11:49:39] right [11:49:46] but assuming that we won’t be doing this in prod anytime soon [11:49:54] removing the class doesn't actually undo this (because... puppet) [11:49:57] my inclination is to just hand-remove the old rules. [11:49:58] yeah [11:50:06] yeah [11:50:08] just dpkg -P ferm [11:50:14] oh, that should do? [11:50:29] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [11:50:30] I don't remember if postrm restores the set but [11:50:45] dpkg -P ferm; iptables -P INPUT ACCEPT; iptables -F INPUT [11:50:51] should do it :) [11:50:51] sudo service ferm stop ; dpkg -P ferm should do it anyway [11:51:04] hmm [11:51:08] dpkg -P ferm errors out [11:51:14] with [11:51:14] no such variable: $BASTION_HOSTS [11:51:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 80%, RTA = 17.50 ms [11:51:33] sigh [11:51:36] rm /etc/ferm/* first? [11:51:39] yeah [11:52:05] hmm [11:52:08] there’s still [11:52:08] a [11:52:13] DROP tcp -- anywhere anywhere state NEW tcp flags:!FIN,SYN,RST,ACK/SYN [11:52:18] although I can’t actually read that rule properly yet [11:52:23] is that the default drop? 
[11:52:35] no [11:52:39] hmm [11:52:46] that’s the only DROP [11:52:53] just don't logout [11:52:58] :) [11:53:05] heh [11:53:07] it will drop packets in NEW state and without FIN,SYN,RST,ACK/SYN [11:53:14] TCP flags set [11:53:14] iptables -P INPUT ACCEPT; iptables -F INPUT [11:53:55] btw, this needs to be done on all beta hosts [11:54:04] so automate it as much as possible :-) [11:54:08] woo [11:54:10] that did it [11:54:15] akosiaris: yeah, can do a salt command [11:54:27] just be extra carefut [11:54:29] careful [11:54:34] I probably shouldn’t do it on a friday evening [11:54:37] with no other labsen around [11:54:44] or no other releng people around [11:55:06] let me just note this on the phab task [11:55:07] and let it be [11:55:32] I can do it if you want [11:55:42] akosiaris: that would be awesome :) [11:57:07] * YuviPanda adds ‘firewalls’ to list of opsy-fundamental-things he should fundamentally understand [11:57:59] akosiaris: do log in -releng [12:00:29] (03PS4) 10Yuvipanda: shinken: Add ssh checks for all monitored hosts [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) [12:01:55] YuviPanda: might wanna check now and perhaps close https://phabricator.wikimedia.org/T86143 [12:02:26] akosiaris: yup, I cherrypicked the patch to http://shinken-test.wmflabs.org/problems [12:02:35] and it’s churning through ssh checks now [12:02:51] http://shinken-test.wmflabs.org/all?global_search=ssh# [12:02:53] (guest/guest) [12:03:21] akosiaris: I’m also going to move shinken to jessie once we have that labs image available. jessie has 2.x and then I can work on getting the web ui fixed / a proper auth provider [12:04:14] yay! [12:04:22] sooo many things to do [12:04:52] there’s also labsdb-audit, but that’s blocked-ish on core. and me and springle are going to drop a lot of databases / tables during the next weeks from labsdb! [12:05:16] brb in about 15mins, fooood [12:05:19] and postgres :P [12:05:28] akosiaris: yup. I’ve the code written, but haven’t found time to test it. [12:05:33] have a nice dinner ? [12:05:39] akosiaris: also need to figure out where exactly it’ll run. probably the NFS machine [12:05:39] it is dinner time over there isn't it ? [12:05:40] I think YuviPanda needs more tasks [12:05:48] akosiaris: it’s 5:40PM, and it’s lunch... [12:06:04] 3:30 hours away only... [12:06:05] akosiaris: I’ve sortof standardized on an europeanish timezone now, where I wake up just about when _joe_ starts working :) [12:06:10] ahahahah [12:06:27] which weirdly is always earlier than me [12:06:29] much better than my previous mid-atlantic timezone where I’d wake up when _joe_ goes to lunch [12:06:39] and I am one tz before him [12:06:52] although my trick when going to SF has always been ‘switch to SF timezone for a week before' [12:06:56] you were in a midatlantic tz ? [12:07:03] and that usually implies a shift of about 2-3h [12:07:10] who were you working for ? iceland ? [12:07:11] :P [12:07:20] akosiaris: well, tz as calculated by ‘whenever you wake up, presume that is 8AM' [12:07:26] <_joe_> lol [12:07:27] ‘and wherever it is 8AM at that time, is your tz' [12:07:39] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:07:48] fixing this idiotic google thing ! [12:07:55] <_joe_> YuviPanda: you wake up at 7:00Z [12:07:59] YuviPanda: go to lunch ! 
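Returning to the ferm cleanup: the "automate it as much as possible" suggestion above would look something like a single salt cmd.run across the project (the deployment-prep target glob here is an assumption):

    # hypothetical one-shot cleanup across all beta (deployment-prep) hosts
    salt 'deployment-*' cmd.run \
        'rm -f /etc/ferm/*; service ferm stop; dpkg -P ferm; iptables -P INPUT ACCEPT; iptables -F INPUT'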
[12:08:00] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.47 ms [12:08:02] :P [12:08:03] <_joe_> akosiaris: there is a CR from me [12:08:07] :D [12:08:32] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/#/c/185414/ [12:08:40] it’s also partly the team change. Apps team, once we hired more engineers, closest I had was east coast US [12:08:46] before that everyone I worked with was in west coast. [12:09:03] but that meant I’ll get bored in the evenings and do analytics / labsy things, which turned out to be a good thing [12:09:18] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:09:43] anyway, foooood [12:10:04] hmpf, I was supposed to have added nodejs support in toollabs, and instead got caught up on beta again [12:10:28] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 19.99 ms [12:11:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] monitoring: allow host to check based on the fqdn of a host (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/185414 (owner: 10Giuseppe Lavagetto) [12:12:45] _joe_: all that if case there has me looking at it multiple times tbh [12:12:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [12:13:00] seems more complicated than it should be [12:13:29] <_joe_> akosiaris: maybe it is :) [12:14:16] <_joe_> akosiaris: I tried not to remove existing safeguards, but they are in fact useless [12:14:43] remove them, I won't mind [12:17:39] (03PS3) 10Yuvipanda: deployment: Unify salt_masters role for prod / labs [puppet] - 10https://gerrit.wikimedia.org/r/185137 (https://phabricator.wikimedia.org/T86885) [12:17:45] anyone wanna CR ^? [12:18:31] (note that the inclusion in virt1000 is wrong, and should be removed in a later patch. I’ve noted this in gerrit comments) [12:19:26] (03PS2) 10Giuseppe Lavagetto: monitoring: allow host to check based on the fqdn of a host [puppet] - 10https://gerrit.wikimedia.org/r/185414 [12:19:33] <_joe_> YuviPanda: I will [12:19:43] thanks [12:21:02] (03CR) 10Giuseppe Lavagetto: monitoring: allow host to check based on the fqdn of a host (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/185414 (owner: 10Giuseppe Lavagetto) [12:21:27] <_joe_> akosiaris: I hope the new patchset is ok [12:21:36] <_joe_> the former one was pretty wrong btw [12:22:23] <_joe_> I'm a bit tired today [12:24:03] (03CR) 10Yuvipanda: [C: 032] "Firewall issues fixed!!!1" [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) (owner: 10Yuvipanda) [12:25:54] akosiaris: so now this is fixed :) [12:26:02] akosiaris: I’m wondering if we should remove the ferm:: rules from *oid [12:26:37] akosiaris: because base::firewall isn’t included anywhere now that I checked, and I think the ferm rules were supposed to be there for beta since it included NAT rules in the past but somehow people got confused and put them in prod *and* beta? 
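The "iptables -L says there are no rules" observation above is the quick way to confirm the declared ferm::rules are no-ops on the sca* hosts: without base::firewall the rules exist in puppet but are never realized, so the kernel tables stay empty. For example:

    iptables -S        # ruleset in iptables-save form; nothing but the
                       # default '-P INPUT ACCEPT' etc. means no firewall
    iptables -L -n -v  # the verbose listing referenced in the discussion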
[12:26:47] indeed, sca* hosts have no iptables [12:28:01] (03PS1) 10Alexandros Kosiaris: txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 [12:28:43] (03CR) 10jenkins-bot: [V: 04-1] txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 (owner: 10Alexandros Kosiaris) [12:28:49] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:32:25] (03CR) 10Истенный: "https://git.wikimedia.org/git/operations/puppet.git" [puppet] - 10https://gerrit.wikimedia.org/r/185416 (owner: 10Springle) [12:32:58] (03PS1) 10Glaisher: Set wgCategoryCollation to 'uca-pl' on plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185426 (https://phabricator.wikimedia.org/T86821) [12:33:27] 3ops-core: Build a new HHVM package - https://phabricator.wikimedia.org/T86906#981636 (10Joe) Package built and manually deployed on beta [12:34:04] (03PS1) 10Yuvipanda: *oid: Remove useless ferm declarations [puppet] - 10https://gerrit.wikimedia.org/r/185428 [12:34:10] akosiaris: ^ more *oid cleanup [12:34:20] and removes some kludginess too [12:35:08] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:35:10] (03PS2) 10Alexandros Kosiaris: txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 [12:35:39] RECOVERY - Host google is UP: PING WARNING - Packet loss = 86%, RTA = 16.20 ms [12:36:36] (03CR) 10Alexandros Kosiaris: "On a side note, the provisioning of systemd unit files means a systemctl daemon-reload needs to be issued for systemd to wake up and get t" [puppet] - 10https://gerrit.wikimedia.org/r/185424 (owner: 10Alexandros Kosiaris) [12:36:59] YuviPanda: those are not useless [12:37:31] it might have been related to beta at some point but those rules are not useless at all [12:38:06] akosiaris: well, base::firewall isn’t applied in the sca* hosts [12:38:12] and iptables -L says there are no rules [12:38:20] akosiaris: of course, the alternative is we apply base::firewall to sca* hosts [12:38:29] which is another way of making these useful :) [12:38:32] but right now they’re noops [12:38:32] yeah, but when it is applied they will be realized [12:38:38] exactly [12:38:48] but should we apply base::firewall to sca* hosts? [12:38:50] so when the time comes, you just have to turn a switch [12:38:54] hmm [12:38:55] I feel that we should [12:38:59] mark is gonna say no :P [12:39:06] haha :) [12:39:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Feel free to kill the comments that mislead and talk about Beta and no longer relevant ticket but the rules themselves are not useless." [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [12:39:46] in fact... [12:39:48] akosiaris: yeah, the comments at least should be killed. and parsoid has no ferm rule [12:39:53] in ::production [12:40:01] the others, half did and half didn’t and I unified them to have it [12:40:14] yeah. I appreciated that :-) [12:40:53] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [12:41:04] hopefully we can get rid of our prod lucid instances at some point :) [12:41:04] lucid ???? 
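On the "we are like 99.999% free" of lucid point: a hypothetical way to double-check the stragglers from the salt master is a grain match, assuming the remaining boxes (sodium, ms1004, nescio) even run salt minions:

    # -G matches on grains; lsb_distrib_codename is a standard salt grain
    salt -G 'lsb_distrib_codename:lucid' test.ping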
[12:41:08] so I can kill that instance [12:41:12] akosiaris: yes, apergos uses it for salt testing [12:41:15] we have only 3 [12:41:19] it can’t die until we are lucid free in prod [12:41:19] actually 2 [12:41:29] we are like 99.999% free [12:41:40] its sodium, a dead box called ms1004 and nescio [12:41:42] * YuviPanda brings democracy to Lucid [12:42:43] akosiaris: right. but still - apergos wants to test the new packages there still :) so we have a lone lucid labs box, that you can only ssh into as root and where puppet doesn’t even run I think... [12:44:14] * akosiaris sigh [12:46:14] (03PS1) 10Alexandros Kosiaris: Add base::firewall to Service Cluster A [puppet] - 10https://gerrit.wikimedia.org/r/185429 [12:46:20] YuviPanda: see? ^ [12:46:24] :P [12:47:35] (03CR) 10Yuvipanda: [C: 04-1] "parsoid::production doesn't have ferm rules." [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [12:47:55] but parsoid is not in service cluster a [12:47:55] (03CR) 10Yuvipanda: "Should also remove the comments about beta from the ferm rules" [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [12:47:59] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [12:48:19] ok, I will do that but parsoid in not in sca [12:48:19] RECOVERY - Host google is UP: PING WARNING - Packet loss = 93%, RTA = 17.40 ms [12:48:44] (03CR) 10Yuvipanda: "I4e5f91eceba3d4894430ba5fbdb9f3945b99d2de is the other approach, which removes the ferm rules as noops :)" [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [12:48:46] akosiaris: oh [12:48:51] akosiaris: well, right. [12:49:01] I will merge both our patches [12:49:08] well.. the parts I like :-) [12:49:11] hehe :) [12:49:17] comment removal + base::firewall? [12:49:47] yeah [12:49:48] I’ll not comment on wether we should firewall or not, because I don’t know enough about that to have an informed opinion :) [12:53:43] anyway, am off to meet some friends. Might be back later tonight, hopefully working on toollabs stuff [12:53:45] * YuviPanda waves [12:54:02] have a nice time [12:55:18] (03CR) 10Alexandros Kosiaris: "I removed the various comments I was talking about in https://gerrit.wikimedia.org/r/#/c/185429/" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [12:55:30] (03PS2) 10Alexandros Kosiaris: Add base::firewall to Service Cluster A [puppet] - 10https://gerrit.wikimedia.org/r/185429 [13:06:57] (03CR) 10Alexandros Kosiaris: [C: 032] Add base::firewall to Service Cluster A [puppet] - 10https://gerrit.wikimedia.org/r/185429 (owner: 10Alexandros Kosiaris) [13:17:15] (03PS2) 10Alexandros Kosiaris: *oid: Remove useless ferm declarations [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [13:19:23] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. As a side note and for future reference, the host_fqdn should not be overused as it sets a dependency on the DNS service working cor" [puppet] - 10https://gerrit.wikimedia.org/r/185414 (owner: 10Giuseppe Lavagetto) [13:20:22] (03CR) 10Alexandros Kosiaris: [C: 032] "This ended up being the removal of parsoid ferm rule only. 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [13:31:15] (03PS2) 10JanZerebecki: Beta Features: Disable the Compact Personal Bar feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 (https://phabricator.wikimedia.org/T85541) (owner: 10Jforrester) [13:32:04] (03CR) 10JanZerebecki: [C: 031] Beta Features: Disable the Compact Personal Bar feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 (https://phabricator.wikimedia.org/T85541) (owner: 10Jforrester) [13:41:52] (03PS1) 10Alexandros Kosiaris: Setup holmium as a backup::host [puppet] - 10https://gerrit.wikimedia.org/r/185432 [13:45:33] (03CR) 10Alexandros Kosiaris: [C: 032] Setup holmium as a backup::host [puppet] - 10https://gerrit.wikimedia.org/r/185432 (owner: 10Alexandros Kosiaris) [13:46:52] 3ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#981801 (10akosiaris) https://gerrit.wikimedia.org/r/185432 has been merged. Starting a full backup of /srv/org/wikimedia/blog soon (waiting for puppet for complete on both hosts). @... [14:06:11] (03PS1) 10Alexandros Kosiaris: Use hiera to have ytterbium listen only on it's IP address [puppet] - 10https://gerrit.wikimedia.org/r/185434 [14:06:49] s/it's/its/ [14:06:57] but I don't like that much [14:08:03] (03CR) 10Faidon Liambotis: [C: 04-1] "I don't like this much. It seems a bit error-prone, e.g. what happens if we renumber the server, will we remember to change this? What abo" [puppet] - 10https://gerrit.wikimedia.org/r/185434 (owner: 10Alexandros Kosiaris) [14:08:34] I am not even sure it will work yet tbh [14:09:21] (03CR) 10JanZerebecki: [C: 031] "Yes, please, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [14:09:29] heh, it actually works [14:10:04] I have to say that in general, I don't see why we would change gerrit's port now [14:10:34] also if your plan is to make gerrit listen on port 22... [14:10:36] it won't work [14:10:42] or well, I doubt it will work [14:10:51] I doubt gerrit starts as root then drops privileges, it's a java app [14:11:33] test $UID = 0 && CH_USER="-c $GERRIT_USER" [14:11:33] if start-stop-daemon -S -b $CH_USER \ [14:11:35] there you go [14:12:05] heh, not surprised [14:12:14] so what's the point [14:13:43] people getting a connection refused instead of ytterbium's ssh server ? [14:13:57] just firewall that off if that's your goal [14:14:23] people shouldn't be getting ytterbium's ssh server even if they hit ytterbium's IP anyway :) [14:14:38] that's true [14:22:46] (03PS1) 10Alexandros Kosiaris: Followup commit for c914851 [puppet] - 10https://gerrit.wikimedia.org/r/185438 [14:23:05] 3operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#981861 (10faidon) My comment was/is that I have a slight preference against non-human users having a home directory under /home. I called those "system" users and mwdeploy does have system =>... [14:24:33] (03CR) 10Alexandros Kosiaris: [C: 032] Followup commit for c914851 [puppet] - 10https://gerrit.wikimedia.org/r/185438 (owner: 10Alexandros Kosiaris) [14:30:59] (03PS1) 10Aude: Update client lists for test / wikidata change dispatching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185439 [14:33:26] is there any lucid server I could test a ssh client config change against? 
(don't need any access onto that server as the cipher and so on is done before the public key is checked) [14:45:48] sodium ? [14:46:15] yeah [14:46:19] ms1004 or sodium [14:53:28] (03CR) 10JanZerebecki: "Also tested with sodium.wikimedia.org which is on Ubuntu lucid." [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [14:55:06] 3ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#981928 (10akosiaris) The backup has finished and is stored in the Archive pool that has a maximum lifetime of 5 years. @RobH the only question that remains is the database, otherwis... [14:55:21] (03PS1) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [14:56:05] (03CR) 10jenkins-bot: [V: 04-1] Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [14:57:46] (03CR) 10JanZerebecki: [C: 031] "Works fine when set on the client and connecting to sodium.wikimedia.org which is on Ubuntu lucid with sshd version OpenSSH_5.3p1 Debian-3" [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn) [14:58:08] yea sodium works fine for testing this. thx [15:01:44] (03CR) 10JanZerebecki: [C: 031] "Tested while being set on client with a server on Ubuntu lucid." [puppet] - 10https://gerrit.wikimedia.org/r/185329 (owner: 10Dzahn) [15:06:03] (03Abandoned) 10Milimetric: Adding 'research' read only user to wikimetrics db [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/180222 (https://phabricator.wikimedia.org/T76109) (owner: 10Nuria) [15:30:39] 3ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10Papaul) 5Open>3Resolved Complete [15:46:18] (03PS1) 10Ottomata: Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 [15:47:18] (03CR) 10jenkins-bot: [V: 04-1] Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 (owner: 10Ottomata) [15:48:02] (03CR) 10BBlack: [C: 04-1] "header.append() does not do what you would logically think it does. It creates a second header with the same name, rather than adding ",n" [puppet] - 10https://gerrit.wikimedia.org/r/184997 (owner: 10Ori.livneh) [15:48:53] (03PS2) 10Ottomata: Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 [15:50:15] (03CR) 10Ottomata: [C: 032] Prep for migrating Hadoop namenodes to analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/185443 (owner: 10Ottomata) [15:50:31] (03PS1) 10BBlack: Revert "VCL: Use header.append() in more places." [puppet] - 10https://gerrit.wikimedia.org/r/185444 [15:50:43] (03PS2) 10BBlack: Revert "VCL: Use header.append() in more places." [puppet] - 10https://gerrit.wikimedia.org/r/185444 [15:50:52] (03CR) 10BBlack: [C: 032 V: 032] Revert "VCL: Use header.append() in more places." [puppet] - 10https://gerrit.wikimedia.org/r/185444 (owner: 10BBlack) [15:51:13] 3ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#981984 (10Papaul) disks in slot 7 and slot 10 replaced.
Bad disk in shipping for return [15:53:16] RECOVERY - Hadoop Namenode - Stand By on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [15:55:06] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:03:10] bd808, traffic reduction to logstash looks great since i merged your udp2log change last night :) [16:03:35] sweet. Now if I could figure out how to get the index to catch up ... [16:03:50] yeah i was wondering about that [16:03:54] 4.5M events still waiting in redis to index [16:04:31] if we didn't have any events that needed to joined we could make it multithreaded [16:05:00] the only think we are stitching back together now is hhvm crash dumps I think [16:05:28] is there a plan for resolving that to not need stitching? [16:05:30] bd808: Make logstash faaaaaasterrrrrr [16:05:37] RECOVERY - Hadoop Namenode - Stand By on analytics1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [16:05:54] yeah we need to finalize requested harware specs so we can push forward procurement [16:06:02] well we could just ignore them I suppose [16:06:32] that would be good. Did we have an email thread talking about what we want/need? [16:06:36] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:06:40] * bd808 remembers vaguely [16:06:53] it was in https://phabricator.wikimedia.org/T84958 [16:07:08] splitting logstash from elasticsearch would make a big difference I think [16:07:42] !log stopping hadoop cluster [16:07:52] Logged the message, Master [16:08:17] i'll ask robh if we have a standard machine type which would we appropriate [16:08:50] I wonder if the boxes that were reclaimed from lsearchd are beefy enough? [16:09:05] they are out of warranty though [16:09:10] :( [16:09:26] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:09:35] if we are looking to make logging more reliable that's probably not a good direction to take [16:09:46] disk space? hmm. [16:09:48] que decis? [16:09:57] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:08] oh socket timeout? weird [16:10:21] do you speak spanish? [16:10:29] ? [16:10:33] jgage: how many systems? https://wikitech.wikimedia.org/wiki/Server_Spares has some available but if those specs arent good we can get something [16:10:40] OH hdfs mount is being stupid! [16:10:41] ha [16:10:41] ok. [16:10:47] because namenode is down. [16:10:49] interseting. [16:10:56] need to add umounting that to migration steps! [16:11:12] redis in the logging pipeline has always felt a little weird to me [16:11:29] is there any plan for removing it/simplifying the pipeline? [16:12:12] paravoid: we could replace it with kafka [16:12:30] the point of redis is to provide a buffer for incoming evnets [16:12:53] why do we need a buffer? [16:13:03] rather than the old method of flinging udp and hoping the packets get processed [16:13:35] robh, maybe 3-5. i'll get back to you after musing on requirements a bit, let's chat next week. [16:13:49] The thought was basically that having more robust/durable logging was a good thing™ [16:14:32] that sounds good indeed [16:14:39] can't you write to ES directly? 
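For the "4.5M events still waiting in redis to index" figure above: the MediaWiki events sit in a redis list (the key is literally named logstash, per the LTRIM run later in this log), so the indexing backlog is just the list length:

    redis-cli llen logstash    # events queued but not yet indexed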
[16:14:41] T.im and I talked about it a bit in October and he said that he'd never really liked the udp process because of the loss [16:14:44] redis helps not only with buffering but also by making a non-spof endpoint so that events can be ingested by any of the logstash hosts [16:15:01] that just makes redis a spof doesn't it [16:15:12] not if it's a cluster [16:15:49] which is it not at the moment, right now it is 3 separate instances that get randomly appended to [16:16:14] kafka would be the most awesome solution that we currently know how to run I think [16:16:22] just ask! :) [16:16:29] I know very little about how this all works [16:16:29] mmm kafka [16:16:39] but why do we need yet another redundant cluster in front of a redundant cluster? [16:16:42] although, i think you might need your own kafka cluster...not sure analytics would want production logging in the analytics kafka cluster, not sure [16:16:47] https://wikitech.wikimedia.org/wiki/Logstash [16:16:57] there's a pretty picture even [16:17:12] yeah i would argue for a separate kafka cluster, no reason to couple analytics [16:17:14] elasticsearch is supposed to be this redundant thing where any one node can fail, right? :) [16:17:19] jgage: cool [16:17:28] except it is of the older layout. let me find the newer picture [16:17:29] would make debugging the existing analytics kafka problems harder [16:17:45] just let me know as soon as you do (so if its a lot of machines and odd requirements we can get it on the projected purchases asap) [16:17:51] k [16:18:44] Here's the diagram that is more correct now -- https://commons.wikimedia.org/wiki/File:Elk-mw-ha-redis.svg [16:19:06] except the logstash services are only consuming from a single redis right now [16:19:21] 3 redis servers -> 3 logstash servers [16:19:38] each on the same host (which I think is fine actually) [16:20:26] paravoid: To the question of writing directly to elasticsearch, yes we can do that if we are not going to modify the log events at all on the way in [16:21:05] There's not much that we are doing to the Monolog generated json events now but there are a few fixups that logstash applies [16:21:17] (03PS1) 10Ottomata: analytics1001 and analytics1002 are now the hadoop namenodes [puppet] - 10https://gerrit.wikimedia.org/r/185445 [16:21:20] ok [16:21:35] (03PS2) 10Ottomata: analytics1001 and analytics1002 are now the hadoop namenodes [puppet] - 10https://gerrit.wikimedia.org/r/185445 [16:21:48] so how does the existing code handle the failure of one of the redis(es)? [16:21:55] fall back to the next one? [16:22:12] nope :( no logs to logstash for that request [16:22:18] heh [16:22:24] so couldn't you just write to logstash directly? [16:22:32] via e.g. TCP syslog? [16:22:40] I'm not sure, does logstash even listen to syslog? [16:22:45] it can [16:22:49] yes. we do that for apache2 + hhvm logs [16:23:00] (03CR) 10Ottomata: [C: 032] analytics1001 and analytics1002 are now the hadoop namenodes [puppet] - 10https://gerrit.wikimedia.org/r/185445 (owner: 10Ottomata) [16:23:23] but there are limits there. 
syslog lines have a length limit [16:23:44] 3ops-core: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#982002 (10Papaul) [16:23:45] 3ops-codfw: label & setup drac/basic setings for rbf2001 & rbf2002 - https://phabricator.wikimedia.org/T86940#982000 (10Papaul) 5Open>3Resolved Racktable update mgmt setup complete port # info rbf2001 = g-5/0/17 rbf2002 = g-5/0/22 Complete [16:23:49] so big json blobs (stacktraces) can get corrupted [16:24:31] we even see that a bit with the GELF chunked udp transport that node and java apps are using [16:24:50] i'm interested in trying this approach someday: http://untergeek.com/2012/10/11/using-rsyslog-to-send-pre-formatted-json-to-logstash/ [16:25:13] that seems orthogonal to this discussion, no? [16:25:37] fwiw, both syslog-ng and rsyslog are getting kafka modules [16:25:51] cool [16:26:00] So the growing pain we are seeing today is running 2 jvm based apps (logstash and elasticsearch) on nodes with 16G of ram and 6 cores [16:26:08] but while I like kafka and we're clearly investing in it [16:26:26] kafka -> logstash -> elasticsearch just feels... complicated [16:26:34] but maybe i'm just not accustomed to it, dunn [16:27:13] at the log event volume we have today elasticsearch looks to be the bottleneck. It is just not keeping up with the ingest rate [16:27:40] which leads to gc thrash and occasional OOM [16:27:51] input queue -> stream processing -> storage [16:27:54] seems like a good model to me [16:28:21] paravoid: embrace microservices :) [16:28:31] "micro" [16:28:59] 16GB RAM and 6 cores for handling text lines not being enough [16:29:05] * 3 machines [16:29:15] and that's just with 2 jvm apps, before adding the third one [16:29:37] which also needs a fourth set of jvm apps for brokers [16:30:33] not exactly what I call "micro" :) [16:30:34] yesterday we stored and indexed 118,833,106 events at 50.3G per elasticsearch node [16:31:09] so far today we have 128,560,743 events with 62.3G per node [16:31:17] what are all these? [16:32:09] All the things that mediawiki writes to fluorine + parsoid + hadoop + restbase + ... [16:32:23] so.. error logs? [16:32:51] yeah many of which are at much more than error/warn verbosity [16:33:22] it sounds a bit excessive [16:33:49] that's ~1400 events/sec [16:33:52] <^demon|away> I think we can get rid of some of the noise probably. [16:36:27] jgage: The elasticsearch cluster isn't even responsive to curl 'http://localhost:9200/_cat/master' right now :( [16:37:10] ha, i think the 'micro' in the services does not refer to the deployment size or complexity, but to the fact that each service does a tiny thing [16:37:14] tiny very specific [16:37:31] It looks like 1001 (master) OOMed at some point and got stuck. I'm going to bounce it [16:37:38] meh ok [16:37:53] bd808: I find myself wondering how many of those events are simply noise, of no use to anyone. (and if those could be designed out of the traffic somehow) [16:38:13] chrismcmahon: A ton of them are noise. [16:38:17] RECOVERY - RAID on es2010 is OK: OK: optimal, 1 logical, 2 physical [16:38:42] <^d> That I think is the most important thing. [16:38:54] <^d> It makes the service more useful /and/ lowers the ingest rate. 
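On "elasticsearch looks to be the bottleneck ... gc thrash and occasional OOM" above: the stock cat/stats APIs give a quick read on both symptoms, namely whether bulk indexing is queueing or rejecting work, and how hard the JVM is collecting. A sketch using standard endpoints of that elasticsearch generation:

    curl -s 'http://localhost:9200/_cat/thread_pool?v'       # bulk active/queue/rejected
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'  # heap used and gc times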
[16:39:19] !log restarted elasticsearch on logstash1001 [16:39:24] Logged the message, Master [16:39:32] ironically, the best way to determine what's noise seems to be to pump it all into logstash so that we can look through them [16:39:42] <^d> (also, 16gb sounds like barely enough for ES alone, much less logstash + etc etc etc) [16:39:51] ^d: yeah [16:40:09] <^d> We have a bunch of search servers with like 48GB that could be salvaged possibly :p [16:40:10] If the ES layer was robust I think we'd be much happier [16:40:36] [logstash1001] no known master node, scheduling a retry [16:40:40] not goodly [16:41:26] All 3 have oomed [16:41:36] and are basically stuck [16:41:41] gah [16:41:52] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [16:42:00] time for a cold restart :( [16:42:02] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [16:42:02] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [16:42:14] <^d> Can we throw some of the now-decom'd search boxes at it? Even if just as a stopgap? [16:42:16] 1003 was the last known master for today's shard [16:42:38] so I'm going to shutdown all 3, and then start 1003 first followed by 1001 and 1002 [16:43:57] !log shutdown whole elasticsearch cluster for logstash [16:44:00] Logged the message, Master [16:44:45] ^d that sounds great to me, though i don't have time to work on them today. bd808? [16:45:17] I have time but not root powers to change things in puppet [16:45:24] <^d> likewise :) [16:45:27] 3x ram for ES sounds like a very good thing if they've got enough disk [16:45:55] 3ops-core: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) 3NEW a:3RobH [16:46:16] (03PS1) 10RobH: setting mgmt for server procyon [dns] - 10https://gerrit.wikimedia.org/r/185447 [16:46:30] <^d> I'll file a Phab task to reclaim a few of those boxes. [16:48:04] !log Upgraded elasticsearch and restarted on all logstash nodes [16:48:10] Logged the message, Master [16:48:11] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) 3NEW a:3Papaul [16:48:26] !log finished hadoop namenode migration. 
Hadoop cluster is back online [16:48:29] Logged the message, Master [16:48:46] yay ottomata :D [16:48:52] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3 [16:49:01] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3 [16:49:02] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3 [16:49:18] 3ops-core: set server procyon's asset tag mgmt ip info - https://phabricator.wikimedia.org/T87030#982048 (10RobH) 3NEW a:3RobH [16:49:28] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) [16:49:30] 3ops-core: set server procyon's asset tag mgmt ip info - https://phabricator.wikimedia.org/T87030#982056 (10RobH) [16:50:49] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982062 (10Chad) 3NEW [16:50:59] <^d> jgage, bd808 ^ [16:51:23] thanks ^d [16:51:39] 3Analytics, operations: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#982081 (10Ottomata) [16:52:05] 3operations: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#982083 (10Chad) [16:52:06] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982082 (10Chad) [16:52:57] 3Analytics, operations: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#20804 (10Ottomata) analytics1001 and analytics1002 have been provisioned, and the Hadoop NameNode and YARN master services have been migrated off of analytics1010 and analytics1004 (ciscos).... [16:56:25] 3ops-core: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) [16:57:10] hm those lsearchd machines have only 300gb. elasticsearch on logstash100x are currently using 822gb. [16:57:57] we could drop the replica count. Right now we have full data on all boxes [16:58:21] or spread out over more? How many hosts are there? [16:59:27] (03PS1) 10Glaisher: Enable VisualEditor on 'Draft' (118) namespace at hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185449 (https://phabricator.wikimedia.org/T87027) [16:59:44] having trouble finding details because they've been removed from puppet. i'm not even sure what the hostnames were. 
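The "we could drop the replica count" idea above trades redundancy for disk: the logstash indices are daily (logstash-YYYY.MM.DD), and replicas can be lowered on existing indices with a settings update. A sketch, where the wildcard and the count of 1 are illustrative:

    curl -XPUT 'http://localhost:9200/logstash-*/_settings' \
        -d '{"index": {"number_of_replicas": 1}}'
    # with 3 data nodes, going from full data on every box (2 replicas)
    # to 1 replica cuts stored data by a third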
[17:00:08] (03CR) 10Jforrester: [C: 031] Enable VisualEditor on 'Draft' (118) namespace at hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185449 (https://phabricator.wikimedia.org/T87027) (owner: 10Glaisher) [17:01:12] search10xx [17:01:22] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) [17:01:51] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10RobH) Please note initial task said to rack in b8-codfw, and I've changed it to a8-codfw, after Mark pointed out that there was only one sandbox vlan so far in codfw. [17:01:57] https://gerrit.wikimedia.org/r/#/c/184620/3/manifests/site.pp,unified [17:03:12] looks like there's 24 of em unless some have already been reused [17:07:56] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982135 (10mark) So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed? [17:11:46] 3ops-requests, WMF-Design: optoutresearch@ list, add recipient - https://phabricator.wikimedia.org/T86551#982148 (10Jgreen) p:5Triage>3Normal [17:12:25] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982150 (10Gage) The current nodes have insufficient RAM and Elasticsearch keeps OOMing. (Details: {T84958}) [17:12:43] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982153 (10bd808) >>! In T87031#982135, @mark wrote: > So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed? I broke it. I got all the changes merge... [17:13:05] (03PS1) 10Chad: Uninstall TitleKey. Cirrus has taken over. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185450 [17:14:41] 3ops-requests, WMF-Design: optoutresearch@ list, add recipient - https://phabricator.wikimedia.org/T86551#982156 (10Jgreen) 5Open>3Resolved [17:15:42] (03PS1) 10Chad: Remove useless profiling of Cirrus inclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185451 [17:16:02] 3operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#982159 (10bd808) We really won't know what the new log volume looks like until 2015-01-18T00:00Z. 2015-01-17 will be the first day that we have all redis MW traffic and no extra log2udp relay traffic. If we can l... [17:16:23] (03CR) 10Chad: [C: 032] Uninstall TitleKey. Cirrus has taken over. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185450 (owner: 10Chad) [17:16:29] (03Merged) 10jenkins-bot: Uninstall TitleKey. Cirrus has taken over. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/185450 (owner: 10Chad) [17:16:31] (03CR) 10Chad: [C: 032] Remove useless profiling of Cirrus inclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185451 (owner: 10Chad) [17:16:35] (03Merged) 10jenkins-bot: Remove useless profiling of Cirrus inclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185451 (owner: 10Chad) [17:17:02] !log demon Synchronized wmf-config/: (no message) (duration: 00m 06s) [17:17:08] Logged the message, Master [17:17:38] delete delete delete [17:23:54] (03PS1) 10Chad: Undeploy MarkAsHelpful, has been disabled since like 2012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185456 [17:24:18] (03CR) 10Chad: [C: 032] Undeploy MarkAsHelpful, has been disabled since like 2012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185456 (owner: 10Chad) [17:24:22] (03Merged) 10jenkins-bot: Undeploy MarkAsHelpful, has been disabled since like 2012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185456 (owner: 10Chad) [17:24:45] !log demon Synchronized wmf-config/: (no message) (duration: 00m 05s) [17:24:49] Logged the message, Master [17:27:48] I feel lighter already [17:29:08] <^d> greg-g: It's like how you feel like a new person after your first day at the gym after making your resolution :) [17:29:28] oh, that day feels like crap and then I stop going [17:29:31] this is soooo much better [17:31:10] <^d> I'm sure we've got a few others that are only on like test2?wiki [17:37:55] 3ops-core: set server procyon's asset tag mgmt ip info - https://phabricator.wikimedia.org/T87030#982208 (10Papaul) [17:37:56] 3ops-core: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10Papaul) [17:37:57] 3ops-codfw: rack server procyon - oit backup server - https://phabricator.wikimedia.org/T87029 (10Papaul) 5Open>3Resolved Racktable update mgmt set-up complete asset tag info; wmf6161 port # ge-8/0/2 [17:39:05] (03PS1) 10Glaisher: Add apache config for m.{project}.org (-wikipedia) [puppet] - 10https://gerrit.wikimedia.org/r/185461 (https://phabricator.wikimedia.org/T78421) [17:40:40] (03PS1) 10BryanDavis: beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 [17:40:42] (03PS1) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [17:41:17] jgage: We have some ideas on how to trim down the load on logstash :) [17:42:31] (03CR) 10Glaisher: "I think this is the easiest way to do this or should we use separate VHosts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/185461 (https://phabricator.wikimedia.org/T78421) (owner: 10Glaisher) [17:44:27] yay [17:44:36] (03CR) 10RobH: [C: 032] setting mgmt for server procyon [dns] - 10https://gerrit.wikimedia.org/r/185447 (owner: 10RobH) [17:45:11] PROBLEM - Hadoop Namenode - Stand By on analytics1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:46:12] shhh [17:48:02] (03CR) 10Legoktm: [C: 031] beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 (owner: 10BryanDavis) [17:50:05] !log depooled amssq42 text cache in esams [17:50:09] Logged the message, Master [17:51:10] (03PS1) 10Ottomata: Removing hadoop::standby role from analytics1004 [puppet] - 10https://gerrit.wikimedia.org/r/185468 [17:51:47] (03CR) 10Ottomata: [C: 032 V: 032] Removing hadoop::standby role from analytics1004 [puppet] - 10https://gerrit.wikimedia.org/r/185468 (owner: 10Ottomata) [17:52:16] 3Wikimedia-Apache-configuration, operations: wikibooks.org redirects to en.wikibooks.org - https://phabricator.wikimedia.org/T87039#982251 (10Glaisher) 3NEW [17:53:02] (03PS2) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [17:53:40] (03PS1) 10BBlack: disable amssq42 esams text cache backend [puppet] - 10https://gerrit.wikimedia.org/r/185469 [17:53:42] (03PS1) 10BBlack: amssq42 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/185470 [17:53:50] (03CR) 10jenkins-bot: [V: 04-1] Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [17:54:00] (03PS2) 10BBlack: disable amssq42 esams text cache backend [puppet] - 10https://gerrit.wikimedia.org/r/185469 [17:54:11] (03PS2) 10BryanDavis: beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 [17:54:14] (03CR) 10BBlack: [C: 032 V: 032] disable amssq42 esams text cache backend [puppet] - 10https://gerrit.wikimedia.org/r/185469 (owner: 10BBlack) [17:54:27] (03CR) 10BryanDavis: [C: 032] beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 (owner: 10BryanDavis) [17:54:47] (03PS2) 10BBlack: amssq42 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/185470 [17:55:03] (03CR) 10BBlack: [C: 032 V: 032] amssq42 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/185470 (owner: 10BBlack) [17:55:18] (03Merged) 10jenkins-bot: beta: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185462 (owner: 10BryanDavis) [17:55:38] The authenticity of host '[gerrit.wikimedia.org]:29418 ([2620:0:861:3:208:80:154:81]:29418)' can't be established. [17:55:45] RSA key fingerprint is dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1. [17:55:58] is that just ipv6 wackyness? [17:56:28] it shouldn't be [17:56:37] ssh is smart about that [17:56:59] where do I check the rsa fingerprint vs gerrit? 
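To check a fingerprint like the one above without blindly accepting the prompt, the key can be fetched out-of-band and fingerprinted locally, which is the same verification done just below with a verbose manual connect:

    ssh-keyscan -t rsa -p 29418 gerrit.wikimedia.org > /tmp/gerrit.pub
    ssh-keygen -lf /tmp/gerrit.pub   # prints the MD5 hex fingerprint on
                                     # OpenSSH of that era; compare it to
                                     # dc:e9:68:7b:99:1b:27:d0:...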
[17:57:09] This is on tin btw [17:57:58] bd808: debug1: Server host key: RSA dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1 [17:58:17] ^ is what I get when I manually connect verbosely to gerrit on 29418 from home, and no key mismatch warning on my known_hosts stuff [17:58:21] so that's the right fingerprint [17:58:24] thanks bblack [17:58:28] 3Wikimedia-Apache-configuration, operations: wikibooks.org redirects to en.wikibooks.org - https://phabricator.wikimedia.org/T87039#982263 (10Glaisher) .com as well [17:58:49] (03Abandoned) 10Jgreen: dmarc_parser added (redo) [puppet] - 10https://gerrit.wikimedia.org/r/163881 (owner: 10Jgreen) [17:59:47] !log bd808 Synchronized wmf-config/logging-labs.php: beta: Allow wgDebugLogGroups to exclude logstash append (03c3ab27) (duration: 00m 06s) [17:59:52] Logged the message, Master [18:00:55] (03PS1) 10Jgreen: dmarc parser and database injector [puppet] - 10https://gerrit.wikimedia.org/r/185472 [18:09:26] (03PS1) 10Glaisher: Redirect wikibooks.(org|com) to www.wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) [18:10:47] (03CR) 10John F. Lewis: [C: 04-1] "Please use redirect.dat. (under the redirects directory)" [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [18:11:18] (03CR) 10Glaisher: Redirect wikibooks.(org|com) to www.wikibooks.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [18:13:45] (03CR) 10Glaisher: "From redirects.dat:" [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [18:13:55] !log document count not changing for logstash-2015.01.16 index [18:14:00] Logged the message, Master [18:35:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:38:22] any idea what this appserver network jump was? http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [18:38:30] <^d> Someone able to merge a simple puppet patch? I'm adding my .bashrc [18:38:43] yeah [18:39:06] (03PS2) 10BBlack: Add my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/184783 (owner: 10Chad) [18:39:12] (03CR) 10BBlack: [C: 032 V: 032] Add my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/184783 (owner: 10Chad) [18:39:57] <^d> ty bblack [18:41:36] (03PS3) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [18:45:58] (03PS1) 10RobH: setting rbf2001/2002 base isntall params [puppet] - 10https://gerrit.wikimedia.org/r/185478 [18:46:47] (03CR) 10Ottomata: Mail webrequest partition status summaries to analytics ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [18:46:58] (03CR) 10RobH: [C: 032] setting rbf2001/2002 base isntall params [puppet] - 10https://gerrit.wikimedia.org/r/185478 (owner: 10RobH) [18:48:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:50:15] (03CR) 10CSteipp: "Jeff, what's this being used for?" 
[puppet] - 10https://gerrit.wikimedia.org/r/185472 (owner: 10Jgreen) [19:14:20] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 3 failures [19:14:20] PROBLEM - HTTPS on amssq42 is CRITICAL: Return code of 255 is out of bounds [19:15:19] PROBLEM - Varnish HTTP text-backend on amssq42 is CRITICAL: Connection refused [19:16:30] RECOVERY - Varnish HTTP text-backend on amssq42 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.194 second response time [19:17:50] I hate that. it was in downtime, but reinstall -> remove/re-add from monitoring :p [19:19:20] RECOVERY - HTTPS on amssq42 is OK: SSLXNN OK - 36 OK [19:27:21] amssq42 is? jessie as well? [19:33:00] think its trusty? [19:33:05] i was on it yesterday fixing logster [19:33:12] did I bork it? it seemed fine yesterday [19:34:46] (03PS2) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [19:35:46] 3ops-codfw, ops-eqiad: ship blanking panels from eqiad to codfw - https://phabricator.wikimedia.org/T86082#982687 (10Cmjohnson) Papaul looks like these arrived. Please verify and close ticket. [19:35:57] 3ops-codfw, ops-eqiad: ship blanking panels from eqiad to codfw - https://phabricator.wikimedia.org/T86082#982689 (10Cmjohnson) a:5Christopher>3Papaul [19:36:59] (03PS4) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [19:39:55] (03CR) 10QChris: Mail webrequest partition status summaries to analytics ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [19:40:05] (03CR) 10QChris: Mail webrequest partition status summaries to analytics ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [19:42:23] Hey greg-g, we have a regression preventing people from using links with pipes in UploadWizard descriptions. This caused more ire than I thought it would. I'd appreciate being able to sync a revert. [19:47:03] paravoid: yeah it's jessie now as well [19:47:41] it didn't go as smoothly as I hoped, though. Apparently I failed to account for all my manual hacks on cp1008, so I have to iterate back on varnish + varnishkafka packaging/init stuff a little [19:48:20] ottomata: and no, you didn't bork it, I did :) [19:48:54] ahha [19:49:05] bblack, i *just* packaged logster for trusty to fix that single host! :p [19:52:58] lol [19:53:18] I guess jessie already had logster? I didn't run into any issue on that part [19:56:20] it does [19:56:29] (03PS1) 10BBlack: jessie fixup + bump to 1.0.6-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/185490 [19:56:38] uhhh, but [19:56:45] we have afork of logster that has extra stuff, i thikn... [19:56:49] (03CR) 10BBlack: [C: 032 V: 032] jessie fixup + bump to 1.0.6-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/185490 (owner: 10BBlack) [19:56:56] that's what you get when you fork :) [19:58:56] i ahve upstreamed some of it... [19:59:01] dunno what is in jessie though [20:00:06] also, apparently even our old varnish-3plus stuff will install systemd service unit files when built for jessie, oddly enough [20:00:32] I thought in my initial experiments that those only came from the varnish4 package, and that the old initscripts worked fine, etc. but no :p [20:00:42] oh hah [20:00:44] nice [20:01:01] does it work for -n frontend too? 
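Background on the packaged unit files just mentioned (and the shadowing problem described below): a native unit under /lib/systemd/system always wins over the /etc/init.d script that puppet manages. A sketch of checking which definition is live, plus the "ugly" removal option; the unit name is illustrative:

    systemctl status varnish.service   # the Loaded: line shows the unit file in use
    rm /lib/systemd/system/varnish.service && systemctl daemon-reload
    # systemd's sysv generator then falls back to /etc/init.d/varnish;
    # "ugly" because the package restores the unit file on upgrade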
[20:01:02] well it's problematic because we override the initscript in puppet, but systemd believes the installed systemd unit file over our initscript [20:01:07] and no, not currently [20:01:10] right... [20:01:46] so I'll either have to make puppet remove the package's systemd unit file (ugly), or go ahead and port our puppet initscript templates to systemd service templates (probably better) [20:02:26] probably [20:03:08] fun :) [20:05:03] (03PS3) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [20:07:00] marktraceur: /me nods [20:07:08] KK [20:07:17] Hopefully Jenkins starts cooperating [20:10:41] um, bblack, what is the timeline for upgrading varnishes to jessie? [20:11:26] ottomata: https://phabricator.wikimedia.org/T86648 [20:12:23] (this quarter) [20:12:27] (03CR) 10Legoktm: [C: 031] Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 (owner: 10BryanDavis) [20:13:59] marktraceur: let me know when you're done. I've got a config change to take some of the load off of the logastash servers [20:14:51] bd808: You could do that now, I haven't started [20:14:59] coolio [20:15:00] Still doing backport patches, tgr is still writing a fix for one thing [20:15:20] (03PS4) 10BryanDavis: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 [20:15:52] aye, ok. [20:15:58] good to know, thanks [20:15:58] (03CR) 10BryanDavis: [C: 032] Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 (owner: 10BryanDavis) [20:16:00] (03Merged) 10jenkins-bot: Allow wgDebugLogGroups to exclude logstash append [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185463 (owner: 10BryanDavis) [20:17:15] !log bd808 Synchronized wmf-config/logging.php: Allow wgDebugLogGroups to exclude logstash append (e808e690) (duration: 00m 07s) [20:17:24] Logged the message, Master [20:17:51] !log bd808 Synchronized wmf-config/InitialiseSettings.php: Allow wgDebugLogGroups to exclude logstash append (e808e690) (duration: 00m 05s) [20:17:54] Logged the message, Master [20:19:08] marktraceur: all clear [20:19:47] Thanks [20:19:49] We're getting there [20:20:12] If tgr isn't ready by the time I'm thinking "where's my beer?" I'll probably start going with the four patches we have. [20:23:16] marktraceur: Smart. No reason to delay beer:30 on a Friday [20:23:37] (03PS1) 10Chad: Add subversion to Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/185535 [20:24:38] (03CR) 10Rush: [C: 032] Add subversion to Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/185535 (owner: 10Chad) [20:26:40] how come there's no private ip here: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000006d8.eqiad.wmflabs [20:27:04] should that be 10.68.16.120? [20:31:55] Jeff_Green: ^ [20:33:41] maybe ask in the labs channel, I have no idea [20:36:16] ok, thanks [20:36:33] OK we're good to go now [20:36:40] I think I'll scap, because I can, and because it lessens complications somewhat [20:37:14] arlolra: sorry I don't have a better answer [20:38:05] np. I'll get it sorted [20:46:39] greg-g: Mind if I scap rather than worry about details? 
[20:46:50] Not sure if you have an opinion [20:47:06] should be fine but beware it will probably take ~20-30m [20:47:11] I'm fine with that [20:47:16] The other Friday deployers will have to wait [20:47:25] :) [20:47:28] !log marktraceur Started scap: Fix UploadWizard regression and EventLogging errors [20:47:33] Logged the message, Master [20:47:37] :) [20:47:41] I need to stop deploying on Fridays though [20:47:45] Next week I'm off the stuff [20:58:23] 3Beta-Cluster, MediaWiki-Core-Team, operations: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982926 (10hashar) Seems this should go to #operations , #hhvm and #mediawiki-core-team and be rephrased to: "convert work machine (tin, terbium) to Trusty and hhvm usage" + me... [21:03:10] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [21:03:39] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [21:03:49] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [21:04:40] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 1, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 1, number_of_data_nodes: 2 [21:08:10] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 2, u'unassigned_shards': 42, u'timed_out': False, u'active_primary_shards': 40, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 80, u'initializing_shards': 1, u'number_of_data_nodes': 2} [21:08:24] ffs logstash [21:10:54] GWT got in some sort of infinite loop and is creating ~10 log entries per second: https://phabricator.wikimedia.org/T87040 [21:11:20] is that a "meh" volume, or should it be stopped until the bug is fixed?
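The icinga checks flapping above poll elasticsearch's standard cluster-health endpoint; reading it by hand makes the states easier to interpret:

    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # red    = at least one primary shard unassigned (some data unavailable)
    # yellow = all primaries allocated, but some replicas still unassigned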
[21:12:30] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 122, initializing_shards: 0, number_of_data_nodes: 3 [21:13:09] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 1, number_of_data_nodes: 3 [21:14:29] * bd808 scratches head over latest elasticsearch freakout on logstash cluster [21:15:15] [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[logstash1001][inet[/10.64.32.138:9300]][bulk/shard/replica] disconnected]]] [21:17:03] !log OOM for elasticsearch on logstash1001 caused a dropped shard and icinga alerts [21:17:09] Logged the message, Master [21:17:38] jgage: can we just pry those boxes open and stick more ram in them? [21:18:35] !log marktraceur Finished scap: Fix UploadWizard regression and EventLogging errors (duration: 31m 06s) [21:18:39] Logged the message, Master [21:19:08] (03CR) 10Arlolra: "This change seemingly is the cause of https://phabricator.wikimedia.org/T86951" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [21:20:20] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 3, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 118, initializing_shards: 2, number_of_data_nodes: 3 [21:21:19] !log restarted elasticsearch on logstash1001 [21:21:24] Logged the message, Master [21:22:51] (03CR) 10Hashar: "This commit has broken Parsoid on the beta cluster. The deployment-parsoidcache05 has ferm installed for some reason and thus needs the ht" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [21:23:39] (03CR) 10Hashar: "https://phabricator.wikimedia.org/%5486951#982981" [puppet] - 10https://gerrit.wikimedia.org/r/185428 (owner: 10Yuvipanda) [21:24:07] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982985 (10greg) [21:32:59] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#983029 (10hashar) [21:33:00] (03PS1) 10Ottomata: Point hadoop resoucemanager and hadoop namenode CNAMES to new master namenode. [dns] - 10https://gerrit.wikimedia.org/r/185546 [21:33:17] (03CR) 10Ottomata: [C: 032] Point hadoop resoucemanager and hadoop namenode CNAMES to new master namenode. [dns] - 10https://gerrit.wikimedia.org/r/185546 (owner: 10Ottomata) [21:34:21] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (10hashar) [21:34:55] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (10hashar) Thanks Greg. I have added some steps to the task description. 
[21:46:18] (03CR) 10Jdlrobson: [C: 031] Hygiene: Change wgMFAnonymousEditing to wgMFEditorOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182177 (owner: 10Florianschmidtwelzow) [21:53:55] h [21:54:08] i [22:00:01] (03PS1) 10Ottomata: Temporarily override DNS CNAME entries for hadoop masters [puppet] - 10https://gerrit.wikimedia.org/r/185552 [22:01:36] (03CR) 10Ottomata: [C: 032] Temporarily override DNS CNAME entries for hadoop masters [puppet] - 10https://gerrit.wikimedia.org/r/185552 (owner: 10Ottomata) [22:05:14] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#983124 (10EBernhardson) [22:05:59] 3Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (10EBernhardson) updated description again, to clarify that the scripts don't have any dependency on hhvm, it is being used for its gdb like debug consol... [22:10:45] (03PS2) 10Florianschmidtwelzow: Hygiene: Change wgMFAnonymousEditing to wgMFEditorOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182177 [22:12:49] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [22:14:32] !log restarted elasticsearch on logstash1001; OOM errors [22:14:40] Logged the message, Master [22:15:21] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 38, u'timed_out': False, u'active_primary_shards': 40, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 80, u'initializing_shards': 5, u'number_of_data_nodes': 3} [22:15:44] this is getting really old :( [22:16:10] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 2, number_of_data_nodes: 3 [22:16:39] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 2, number_of_data_nodes: 3 [22:21:22] bd808: you need help? [22:21:38] greg-g: I need ram :/ [22:21:52] can't help there [22:22:06] these boxes and my laptop are roughly equivalent [22:22:14] ...... [22:22:39] it was an experiment [22:22:46] it just worked too well :) [22:22:49] exactly! [22:23:24] * bd808 is trying to figure out if there are some more stop gap fixes that can be made [22:23:57] how was it able to handle udp2log but not monolog? [22:24:02] delete it all? :) [22:24:57] legoktm: 2 things -- we only added some logs via udp2log to the index; and I think we lost a lot via udp errors [22:25:29] :/ [22:25:46] For a couple of days we have been duping all group1 log traffic; then we started duping all wikipedia traffic as well [22:25:51] that is off now [22:26:17] but we have 3 huge indices that are making things sad [22:26:37] if we limp past 00:00UTC I hope things will get better [22:26:59] but we need more ram. we were running at the ragged edge before
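[Editor's note: "running at the ragged edge" here means JVM heap pressure; the repeated OOMs above are the symptom. One way to watch how close each node sits to its heap limit is the standard Elasticsearch node-stats API; the jq formatting below is just one convenient way to read it (it assumes jq is available, and localhost:9200 assumes you are on one of the logstash100x hosts):

    # Per-node JVM heap usage for the cluster; heap_used_percent near 100
    # means the next indexing burst is likely to trigger another OOM.
    curl -s 'http://localhost:9200/_nodes/stats/jvm' |
      jq -r '.nodes[] | "\(.name): \(.jvm.mem.heap_used_percent)% heap used"'
]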
[22:31:54] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#983208 (10Cmcmahon) Jeff, yes please update my key. I'm pretty sure the one in place right now is from a machine that has been wiped. [22:45:26] (03PS2) 10Jgreen: dmarc parser and database injector [puppet] - 10https://gerrit.wikimedia.org/r/185472 [22:45:28] (03PS1) 10Jgreen: update cmcmahon's key, and add him to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/185567 [22:46:47] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#983240 (10Jgreen) Cmcmahon please confirm +1 https://gerrit.wikimedia.org/r/#/c/185567/ if I've got your pub key correct! [22:48:55] (03CR) 10Cmcmahon: [C: 031] "That's my key all right :-)" [puppet] - 10https://gerrit.wikimedia.org/r/185567 (owner: 10Jgreen) [23:23:24] (03PS1) 10BryanDavis: beta: Remove dup of /home/mwdeploy/.ssh [puppet] - 10https://gerrit.wikimedia.org/r/185570 [23:25:37] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt to solve:" [puppet] - 10https://gerrit.wikimedia.org/r/185570 (owner: 10BryanDavis) [23:30:07] (03Draft1) 10BryanDavis: logstash: remove support for most udp2log events [puppet] - 10https://gerrit.wikimedia.org/r/185482 [23:33:06] !log ran `LTRIM logstash -50000 9999999` on redis queues to drop ~4M events in backlog [23:33:13] Logged the message, Master [23:34:39] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [23:35:48] (03PS2) 10BryanDavis: logstash: remove support for most udp2log events [puppet] - 10https://gerrit.wikimedia.org/r/185482 [23:39:32] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#983320 (10greg) a:5greg>3None
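[Editor's note: the `LTRIM logstash -50000 9999999` logged at 23:33 works like this: LTRIM key start stop keeps only the list elements in that index range, a negative start counts back from the tail, and a stop past the end is clamped. Assuming the shippers RPUSH new events onto the tail of the list (the usual logstash-over-redis arrangement; an assumption, not shown in the log), this keeps roughly the newest 50,000 events and discards the ~4M older ones:

    redis-cli LLEN logstash                    # backlog depth before the trim (~4M here)
    redis-cli LTRIM logstash -50000 9999999    # keep indices -50000..end, i.e. the last ~50k
    redis-cli LLEN logstash                    # should now report <= 50000
]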