[00:01:08] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1267576 (10Dzahn) a:5Aklapper>3Multichill [00:04:35] RoanKattouw: everything good ? [00:07:15] Yup [00:07:18] VE things are working [00:08:34] RoanKattouw: You still planning to deploy the MF submodule update for wmf4, or should we? [00:10:04] Oh crap [00:10:08] Ahm yeah let me do that [00:10:12] Sorry, I'd totally forgotten [00:11:12] !log deployed RESTBase 8865b9c48 [00:11:16] NP [00:11:19] Logged the message, Master [00:11:27] !log catrope Synchronized php-1.26wmf4/extensions/MobileFrontend/: SWAT (duration: 00m 42s) [00:11:30] 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1267599 (10Dzahn) ``` [terbium:/var/log/mediawiki/refreshLinks] $ ls total 748K 4.0K drwxrwxr-x 2 www-data mwdeploy 4.0K Jun 7 2014 . 4.0K drwxr-xr-x 5 www-data w... [00:11:32] Logged the message, Master [00:13:02] !log running refreshLinks.php for s2 [00:13:07] Logged the message, Master [00:16:46] 7Blocked-on-Operations, 6Commons, 10Wikimedia-Site-requests: Add *.wmflabs.org to `wgCopyUploadsDomains` - https://phabricator.wikimedia.org/T78167#1267610 (10Dzahn) 5Open>3stalled [00:19:05] 6operations, 10Wikimedia-Language-setup, 7Tracking: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) (tracking) - https://phabricator.wikimedia.org/T10217#1267616 (10Dzahn) >>! In T10217#135716, @deryckchan wrote: > This is getting ridiculous.... [00:20:08] 6operations, 10Wikimedia-Hackathon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1267619 (10Dzahn) [00:20:09] 6operations, 10Wikimedia-Site-requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1267618 (10Dzahn) 5Open>3stalled [00:21:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1267623 (10Dzahn) Technically this was set to be blocked by another task but it was possible to resolve it without the blocking task being resolved. Bug? [00:23:55] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1267630 (10Dzahn) [00:29:08] 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1267634 (10Dzahn) I ran the update command for s2 manually and it appeared to run just normal. [00:31:43] 6operations, 7Monitoring: Monitor the up-to-date status of wikitech-static - https://phabricator.wikimedia.org/T89323#1267636 (10Dzahn) https://wikitech-static.wikimedia.org/w/api.php?action=ask&query=[[Modification%20date::%2B]]|%3FModification%20date|sort%3DModification%20date|order%3Ddesc https://wikitech.... [00:41:47] Hm.. that's weird. I just got a random 503 error when viewing a Wikipedia article. Got it open in a separate tab, though can't reproduce it. 
[00:41:55] Request: GET http://en.wikipedia.org/wiki/CO2, from 10.20.0.176 via cp3014 cp3014 ([10.20.0.114]:3128), Varnish XID 2601291803 [00:41:55] Forwarded for: 94.3.126.202, 10.20.0.176, 10.20.0.176 [00:41:55] Error: 503, Service Unavailable at Thu, 07 May 2015 00:40:56 GMT [00:42:23] When viewing https://en.wikipedia.org/wiki/CO2 [00:42:24] (https) [00:43:31] Krinkle: https://gdash.wikimedia.org/dashboards/reqerror/ [00:43:48] there are some moderate 5xx spikes [00:44:17] Wow, that's not good. [00:44:27] overloaded backend? Faulty frontend? [00:44:48] it doesn't look life-threatening [00:45:16] Serving 503 to random readers is not acceptable afaik. [00:45:50] tell that to the pile of 10,000 bugs in the bug database, and the other 10 million unknown ones [00:45:58] in https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor the top term is "Error connecting to 10.64.48.22: Can't connect to MySQL server on '10.64.48.22'" [00:46:11] Yeah, but having a spike of 1000s of 503 errors every few hours sounds like a more significant pattern. [00:46:28] yeah, it is [00:46:32] followed by one about "Lock wait timeout exceeded", same DB [00:46:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [00:48:23] there's a whole list of DB connect errors [00:49:58] looks like icinga has some related bits about s5 slave, related? [00:50:19] note warn/crit here: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=dbstore1002&nostatusheader [00:51:00] looks like it hit a replication-halting error 5.5 hrs ago [00:51:16] yeah [00:52:31] springle: ping? [00:53:21] the connect errors in fatalmonitor are fairly rare [00:54:04] the MW db load balancer retrying s5 occasionally? [00:54:13] maybe [00:54:21] I know very little about our db layer [00:54:39] it's eight distinct IPs [00:55:01] bblack: pong [00:55:09] springle: ^ [00:55:11] :) [00:55:19] possibly more [00:55:54] couple alerts on s5 slave about replication death to a failed query, some 503s that might be related to mysql connection errors from hhvm [00:56:20] the replication error is a dbstore, not a production slave [00:56:28] unlikely to be related [00:56:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:56:30] ok [00:56:48] dbstore is just for backups and such? [00:56:57] springle: there are a low number of mysql connection errors in https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [00:56:58] correct, and analytics [00:57:26] well "low" [00:57:46] I mean, we have to get from 1000s of 503 to 1000s of fatal error or death somewhere [00:57:51] those are rarely zero [00:58:04] but checking... [00:58:09] hundreds really [00:58:19] gwicke: which db was the lock wait timeout you saw? [00:59:11] commons IIRC [00:59:17] can search for it [00:59:24] ah makes sense. always commons [00:59:25] yep [01:00:31] springle: if you search for "Lock wait timeout exceeded" in logstash, there is a good list [01:01:06] root@oxygen:/srv/log/webrequest# grep 2015-05-07T00:40:48 5xx.json |grep -w 503|wc -l [01:01:09] 498 [01:01:10] aaron's been playing with innodb_lock_wait_timeout recently. 
it may well have changed behavior [01:01:19] somewhat spiky [01:01:21] ^ there's at least one spot during the 503 burst, with 498 events logged in one second [01:01:32] yet: [01:01:33] root@oxygen:/srv/log/webrequest# grep 2015-05-07T00:40:48 5xx.json |grep -w 503|grep commons|wc -l [01:01:36] 84 [01:01:43] only a bit under 1/5 were from commons [01:02:04] 458 "Lock wait timeout exceeded" in the last 24 hours [01:02:37] it was related to translations [01:03:02] metawiki [01:03:07] yeah, meta [01:03:12] DELETE FROM `translate_groupstats` WHERE tgs_group = 'page-Wikimedia Foundation elections/Board elections/2015/Candidates' [01:03:21] most of the 498x 503's from that 1s sample have no clear pattern. some are api, some are articles, random wikis, etc [01:04:12] basically all related to MessageUpdateJob or MessageGroupStats::clear [01:05:03] (so probably the actual 503s are not directly related to the real problem, just an indirect effect from appservers being funky) [01:07:11] should we open a task for the "Lock wait timeout exceeded" errors? [01:09:15] seems reasonable! [01:09:43] typing one up.. who owns that code? [01:10:07] which code? [01:10:13] commons database code/ [01:10:20] MessageUpdateJob or MessageGroupStats::clear [01:10:54] let me start with reading-infrastructure-team ;) [01:11:29] https://phabricator.wikimedia.org/T98427 [01:12:18] oh ok [01:12:19] https://phabricator.wikimedia.org/T90704 [01:12:58] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [01:13:29] ah, maybe repon that instead, then [01:13:33] *re-open [01:13:40] or link to it [01:14:14] (03PS3) 10Gage: IPsec: Icinga monitor for Strongswan connections [puppet] - 10https://gerrit.wikimedia.org/r/199787 [01:14:23] adding link [01:14:56] can merge the two [01:15:23] well theirs is closed, and maybe somehow it is different [01:15:24] {{done}} [01:15:29] who knows! :) [01:15:57] no, it's the same issue, exposed by recent changes to innodb_lock_wait_timeout [01:16:11] but MW client code. looking for aaron's commit now... [01:16:13] reopened the original issue [01:17:42] haha [01:17:43] self merge [01:17:44] https://gerrit.wikimedia.org/r/#/c/206442/ [01:17:58] ffs [01:18:51] presumably that applies to both wikiuser and wikiadmin [01:19:12] jgage: I assume you've tested the thing and it basically works right? I'm mostly just reviewing with human eyeballs here [01:19:16] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1267716 (10Ijon) I have encourage the community to continue localization of messages to reach at least the threshold of 18% of MW core. I... [01:19:37] AaronSchulz: ^ we may need to raise that again, or make it only apply to wikiuser perhaps? (does it?) [01:20:09] PROBLEM - puppet last run on mw2137 is CRITICAL Puppet has 1 failures [01:21:51] (03CR) 10BBlack: [C: 031] "LGTM on a human reading of it. It looks like it was tested-while-developing, so I don't expect silly syntax errors." [puppet] - 10https://gerrit.wikimedia.org/r/199787 (owner: 10Gage) [01:22:25] woo, thanks for the reviews bblack! yes, i've been testing it :) [01:22:43] springle: this is 15 seconds? [01:23:21] (03PS4) 10Gage: IPsec: Icinga monitor for Strongswan connections [puppet] - 10https://gerrit.wikimedia.org/r/199787 [01:23:39] ^ rebase [01:23:54] jgage: I've been thinking about the problem of non-cache hosts (e.g. kafka brokers). 
Basically, the manifests/role/ipsec.pp stuff that uses $cluster_nodes and such. [01:24:05] gwicke: correct. lowered from default 50. reasonable for wikiuser, but possibly not wikiadmin [01:24:22] (03CR) 10Gage: [C: 032] IPsec: Icinga monitor for Strongswan connections [puppet] - 10https://gerrit.wikimedia.org/r/199787 (owner: 10Gage) [01:24:48] idk if serverTemplate applies to wikiadmin though, tbh. might be coincidence [01:24:50] we probably want to turn that around somehow to where we can flag arbitrary nodes of any time to put them in the set of "hosts that should use ipsec", and then have that logic create associations between all pairs of nodes in that set which are in t2 datacenters with all of those that are in t1 datacenters, essentially. [01:25:03] collections would be the "obvious" pattern, but could kinda suck, too [01:25:08] hmm, yeah [01:25:19] s/of any time/of any type [01:25:20] i talked to otto about upgrading one of the kafka brokers to trusty to test with [01:25:27] er i mean jessie [01:25:42] worst case we can deploy the caches first and do that after [01:25:45] yeah [01:25:56] but still, it would be good to structure things in a way that it's possible to add them easily after without rewriting all that magic [01:27:47] so, in terms of a randomly-invented pseudo language: @ipsec_hosts = <@all_hosts filtered-by ipsec-flag-of-some-kind>. @associations_for_this_host = . [01:29:09] I'm really not sure how codfw fits into all of that yet. I'm kind of ignoring it for now as nobody has clearly elucidated what we're going to do with cache tiering there. [01:29:45] I guess I should be the one elucidating that, but in order to do so, I need answers to lots of known unknowns, and some unknown unknowns. [01:30:25] i started out thinking along those lines "every ipsec host in t2 creates connections to every ipsec host in t1", but diverged from that path to restrict associations by cache type [01:31:00] yeah... [01:31:25] how expensive is maintaining a crapload of extra associations that see little-to-no traffic? [01:31:34] (03CR) 10Mjbmr: [C: 031] Enable ShortUrl on es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206737 (https://phabricator.wikimedia.org/T96668) (owner: 10Dereckson) [01:31:39] (03CR) 10Mjbmr: [C: 031] Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) (owner: 10Dereckson) [01:31:43] (03CR) 10Mjbmr: [C: 031] Enable Extension:Shorturl on sa. projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) (owner: 10Shanmugamp7) [01:32:05] i don't have numbers on that yet. my expectation is that the overhead is quite small, in terms of cpu, mem, and net traffic [01:32:15] but with only two test nodes i don't have a good way to simulate it [01:32:28] yeah [01:32:42] by default strongswan creates 16 worker threads, so i'm interested in whether that will need to be bumped up [01:32:50] well that's basically the tradeoff: wasted resources on that, vs wasted complexity trying to pare it down by only the machines that need to talk to each other [01:32:58] but that only affects IKE, not ESP or kernel memory usage [01:33:03] yeah [01:33:18] which won't just be cache roles of course :) [01:33:27] i'll ask the dudes in #strongswan about operating at that scale [01:33:28] there won't be a truly-clean mapping of any kind [01:34:45] for e.g. 
a random text node in esams, it means 32 associations back to eqiad vs just 8 with role filtering. [01:34:58] or double all of that for a t1 codfw [01:35:06] + several for kafka [01:36:35] 2x again for v4 + v6 [01:36:39] I guess in whatever magical schema that we're tagging things as ipsec nodes, we can take them with some kind of "group" attribute that represents their cache role, and have a special group that all groups associate with cross-tier, to put kafka in [01:37:48] RECOVERY - puppet last run on mw2137 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [01:37:49] yeah so we should expect something more like a diff between 40 assoc vs 136 assoc [01:37:54] (03CR) 10Springle: "Does this apply to wikiadmin?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 (owner: 10Aaron Schulz) [01:38:02] assuming +v6 +codfw +kafka [01:40:34] that's not a terribly large number.. vs default of re-keying each connection every 60 mins that means someting like two re-keys per minute [01:41:37] i guess one thing i can try is just defining a bunch of bogus connections on a single node to make sure the kernel doesn't explode with that many policies loaded [01:42:56] btw without specifying non-blocking, ipsec statusall output hangs waiting for hosts in CONNECTING state to establish or fail. with retries that can take a few minutes. that's why i had to use statusall-nb :P [01:43:52] oh, so statusall doesn't wait to fetch status, it waits for status to go all-green heh [01:44:04] well at least mostly-green, but still not "installed"? [01:44:17] nothing about ipsec/strongswan ever makes sense :) [01:44:27] yeah, it waits for IKE sessions to establish or fail [01:44:36] yeah, that utility is pretty wack [01:44:59] and the terminology about established ike sessions vs installed esp sessions is terrible [01:45:05] as is the output. totally unclear. [01:45:08] can someone please inform me who is now responsible for interwiki.cdb update (post Reedy) [01:45:37] er, established ike sessions vs installed esp transports i suppose [01:47:12] my takeaway is "installed" == "OK", everything else -> something broken or in-progress [01:47:23] yep [01:48:15] sDrewth: probably #wikimedia-releng [01:48:25] k [01:49:02] (03PS1) 10BBlack: move Cali to eqiad (incl L3 DNS) [dns] - 10https://gerrit.wikimedia.org/r/209413 [01:49:23] (03CR) 10BBlack: [C: 032 V: 032] move Cali to eqiad (incl L3 DNS) [dns] - 10https://gerrit.wikimedia.org/r/209413 (owner: 10BBlack) [01:50:14] Going Back to Cali [01:50:30] !log we're still hitting cap on Zayo as of shortly-ago in graphs and seeing smokeping loss, moved california to eqiad [01:50:42] Logged the message, Master [01:50:44] i wish we could just call someone at zayo [01:50:48] but surely that is verboten [01:51:08] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [01:54:46] I don't really expect CA to be enough, but it was worth trying [01:55:49] next up will probably be AK+HI+JP, for whom the latency hit isn't as bad as some of the other alternatives. JP might move a good chunk of users. [02:01:39] bblack: Arizona and Canada? [02:01:53] gone already [02:01:59] ah [02:02:20] we did a bigger move earlier, basically everything in the US/Canada that was on ulsfo went to eqiad, except for AK, HI, and CA [02:02:52] gotcha, i hadn't updated my local repo [02:03:30] but a lot of ulsfo's traffic is asian of course, since we have no asia pop (yet! ... ... ..........) [02:06:04] nevermind! I've got a few post-move traffic samples now. 
California was worth a pretty big chunk. [02:06:13] (03PS1) 10Springle: return db1054 to s2 [puppet] - 10https://gerrit.wikimedia.org/r/209415 (https://phabricator.wikimedia.org/T89801) [02:06:20] buncha wikijunkies there heh :) [02:07:09] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [02:07:39] (03CR) 10Springle: [C: 032] return db1054 to s2 [puppet] - 10https://gerrit.wikimedia.org/r/209415 (https://phabricator.wikimedia.org/T89801) (owner: 10Springle) [02:14:01] (03PS1) 10Springle: repool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209417 [02:14:27] !log krenair Synchronized wmf-config: update interwiki.cdb, T98429 (duration: 00m 24s) [02:14:36] Logged the message, Master [02:16:16] (03CR) 10Springle: [C: 032] repool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209417 (owner: 10Springle) [02:16:28] one sec springle [02:16:54] Krenair: ok [02:16:56] (03PS1) 10Alex Monk: Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209418 (https://phabricator.wikimedia.org/T98429) [02:17:29] (03CR) 10Alex Monk: [C: 032] "file generated by updateinterwikicache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209418 (https://phabricator.wikimedia.org/T98429) (owner: 10Alex Monk) [02:18:08] sorry about that [02:20:09] np, sorry to get in your way. and jenkins seems ot be slacking off anyway [02:20:46] Well, I was also doing something off schedule to be fair :) [02:21:05] It'll probably be fine anyway [02:21:44] Ran into some unexpected problems with sync-file [02:24:27] (03Merged) 10jenkins-bot: repool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209417 (owner: 10Springle) [02:24:29] (03Merged) 10jenkins-bot: Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209418 (https://phabricator.wikimedia.org/T98429) (owner: 10Alex Monk) [02:29:23] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 09m 27s) [02:29:34] Logged the message, Master [02:29:34] springle, ^ [02:29:40] tnx [02:33:15] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [02:34:47] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [02:35:18] !log springle Synchronized wmf-config/db-eqiad.php: repool db1054 in s2, warm up (duration: 01m 09s) [02:35:29] Logged the message, Master [02:35:46] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:47] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-07 02:35:43+00:00 [02:36:54] Logged the message, Master [02:37:15] RECOVERY - RAID on snapshot1004 is OK no RAID installed [02:45:29] 6operations, 7Monitoring: Monitor the up-to-date status of wikitech-static - https://phabricator.wikimedia.org/T89323#1267822 (10Dzahn) >>! In T89323#1057032, @Andrew wrote: > compare the most recent edit date > And alerts if they are more than 25 hours different ``` #!/bin/bash API_QUERY="action=query&tit... [02:59:36] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 08m 35s) [02:59:45] Logged the message, Master [03:03:54] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-07 03:02:50+00:00 [03:04:01] Logged the message, Master [03:18:10] (03CR) 10Aaron Schulz: "Yes, which should be fine (e.g. the QueryPage scripts can insert 5-10k rows without much contention)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 (owner: 10Aaron Schulz) [03:19:16] (03PS2) 10KartikMistry: Added initial Debian package for apertium-oc-es [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/207131 (https://phabricator.wikimedia.org/T96655) [03:24:41] (03PS3) 10KartikMistry: Added initial Debian package for apertium-oc-es [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/207131 (https://phabricator.wikimedia.org/T96655) [03:55:01] kart_: thanks for your replies [03:55:10] kart_: I was just doing a sweep of the #roadmap projects [03:55:14] project* [04:07:56] (03PS1) 10BryanDavis: Better handling for php lint checks [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 [04:10:00] (03CR) 10Legoktm: Better handling for php lint checks (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 (owner: 10BryanDavis) [04:15:23] (03PS2) 10BryanDavis: Better handling for php lint checks [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 [04:15:43] (03CR) 10BryanDavis: Better handling for php lint checks (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 (owner: 10BryanDavis) [04:19:40] (03CR) 10Legoktm: [C: 031] Better handling for php lint checks [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 (owner: 10BryanDavis) [04:19:53] (03PS11) 10KartikMistry: Added initial Debian packaging for apertium-fr-es [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) [04:20:38] (03CR) 10KartikMistry: Added initial Debian packaging for apertium-fr-es (031 comment) [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [05:06:41] (03CR) 10Mjbmr: "wmgNewUserMessageOnAutoCreate must also set to true for this wiki, see the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209146 (https://phabricator.wikimedia.org/T97920) (owner: 10Dereckson) [05:18:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [05:28:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [05:34:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [05:40:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 7 05:39:36 UTC 2015 (duration 39m 35s) [05:40:47] Logged the message, Master [05:46:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [05:50:08] (03CR) 10Ori.livneh: [C: 032] Better handling for php lint checks [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 (owner: 10BryanDavis) [05:50:24] (03Merged) 10jenkins-bot: Better handling for php lint checks [tools/scap] - 10https://gerrit.wikimedia.org/r/209425 (owner: 10BryanDavis) [05:51:49] (03CR) 10Yuvipanda: [C: 04-1] "+1 what Filippo said." [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121) (owner: 10Hashar) [06:18:14] (03PS2) 10KartikMistry: Added initial Debian package for apertium-oc-ca [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/207130 (https://phabricator.wikimedia.org/T96655) [06:21:04] (03CR) 10KartikMistry: "Tag pushed." 
[debs/contenttranslation/apertium-es-ast] - 10https://gerrit.wikimedia.org/r/207046 (https://phabricator.wikimedia.org/T96652) (owner: 10KartikMistry) [06:21:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [06:21:18] (03CR) 10KartikMistry: "Tag pushed." [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/207038 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [06:22:14] (03PS2) 10KartikMistry: Added initial Debian packaging for apertium-eu-es [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/207038 (https://phabricator.wikimedia.org/T96653) [06:23:24] (03PS2) 10KartikMistry: Added initial Debian packaging for apertium-es-gl [debs/contenttranslation/apertium-es-gl] - 10https://gerrit.wikimedia.org/r/206805 (https://phabricator.wikimedia.org/T96654) [06:24:06] (03PS2) 10KartikMistry: Added initial Debian package for apertium-pt-gl [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/206806 (https://phabricator.wikimedia.org/T96654) [06:24:08] (03CR) 10KartikMistry: "Tag pushed." [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/207031 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [06:27:35] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60607 bytes in 0.388 second response time [06:30:07] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [06:32:46] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures [06:33:16] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:33:36] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [06:33:36] PROBLEM - puppet last run on mw2095 is CRITICAL Puppet has 1 failures [06:34:47] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 2 failures [06:34:47] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [06:46:25] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:55] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:56] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:48:16] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:16] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:56:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [07:05:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "thanks Bryan!" [puppet] - 10https://gerrit.wikimedia.org/r/208085 (https://phabricator.wikimedia.org/T64667) (owner: 10Filippo Giunchedi) [07:07:45] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [07:26:31] (03CR) 10Alexandros Kosiaris: "I am fine with that as well. 
The comment there made me assume debug output was not wanted most of the times hence the if" [puppet] - 10https://gerrit.wikimedia.org/r/204155 (owner: 10Alexandros Kosiaris) [07:33:46] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 20.00% of data above the critical threshold [500.0] [07:34:20] (03CR) 10Ori.livneh: "+1 to that, then" [puppet] - 10https://gerrit.wikimedia.org/r/204155 (owner: 10Alexandros Kosiaris) [07:34:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [07:35:33] (03CR) 10Giuseppe Lavagetto: [C: 031] "I agree, when I added the noop logger I thought people would love not to see all the junk hiera logs, but that was wrong." [puppet] - 10https://gerrit.wikimedia.org/r/204155 (owner: 10Alexandros Kosiaris) [07:40:20] (03CR) 10Alexandros Kosiaris: [C: 032] Enable debug output in hiera_lookup [puppet] - 10https://gerrit.wikimedia.org/r/204155 (owner: 10Alexandros Kosiaris) [07:48:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [07:57:41] akosiaris: morning, one more video question: https://commons.wikimedia.org/wiki/File:Stockholm_Old_Town_Gamla_Stan.webm transcoding issues ? [08:00:01] matanya: I am really not sure. The video plays, doesn't it ? [08:00:19] I see the grey-white checkerboard though [08:00:21] yes, but only the orig size, no transcoding [08:00:37] see at the end, the transcoding erorrs [08:01:04] Invalid pixel format string '-1' [08:01:06] interesting [08:01:38] (03PS5) 10Giuseppe Lavagetto: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [08:01:57] akosiaris: figured, https://phabricator.wikimedia.org/T55863 [08:02:28] sorry for the noise. seems like i go into any possible roadblock :) [08:02:36] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:03:08] matanya: oh, you do know how this circles back to trusty and the discussion we had the other day ? [08:03:11] ;-) [08:03:43] heh, i wasn't aware it was vp9 but good point! [08:05:39] seems like it is time for a wikitech-l / wikimedia-l post. [08:14:38] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1268079 (10daniel) 5Open>3Invalid The grep run did not turn up any old style serialization in the dump, so I'm cl... [08:16:24] (03PS1) 10Alexandros Kosiaris: package_builder: Add lintian [puppet] - 10https://gerrit.wikimedia.org/r/209434 [08:21:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [08:21:25] PROBLEM - puppet last run on cp3036 is CRITICAL puppet fail [08:24:12] (03CR) 10KartikMistry: "Sigh. 
Still not reproducible for me :/" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [08:30:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging for apertium-fr-es [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [08:34:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [08:39:15] RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:42:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-eu-en [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/207031 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [08:46:40] 6operations, 10ops-codfw: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1268119 (10fgiunchedi) looks like the problem is changed version identification on (newer I think?) power strips: `Sentry Switched CDU Version 7.0k` vs `Sentry Smart CDU Version 7.0k` [08:46:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-es-ast [debs/contenttranslation/apertium-es-ast] - 10https://gerrit.wikimedia.org/r/207046 (https://phabricator.wikimedia.org/T96652) (owner: 10KartikMistry) [08:48:28] (03PS1) 10Filippo Giunchedi: sentry3: add Sentry Smart CDU detection [software/librenms] - 10https://gerrit.wikimedia.org/r/209443 (https://phabricator.wikimedia.org/T84416) [08:49:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-pt-gl [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/206806 (https://phabricator.wikimedia.org/T96654) (owner: 10KartikMistry) [08:49:48] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging for apertium-es-gl [debs/contenttranslation/apertium-es-gl] - 10https://gerrit.wikimedia.org/r/206805 (https://phabricator.wikimedia.org/T96654) (owner: 10KartikMistry) [08:50:55] PROBLEM - High load average on labstore1001 is CRITICAL 77.78% of data above the critical threshold [24.0] [08:51:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging for apertium-eu-es [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/207038 (https://phabricator.wikimedia.org/T96653) (owner: 10KartikMistry) [08:54:54] Hi matanya [08:55:01] hi yuvipanda [08:55:04] Can you stop using NFS to store encoded images? [08:55:08] err [08:55:11] Videos? [08:55:24] It is killing NFS for everyone [08:55:27] yuvipanda: yes, i guess. [08:55:36] Thank you [08:55:40] Right now? [08:55:41] sorry, i was using it to allow server side uploads [08:56:07] matanya: yeah should put it on /srv or something [08:56:08] <_joe_> matanya: don't use NFS though [08:56:18] yuvipanda: i don't see any active sessions by me now [08:56:51] matanya: _joe_ saw a huge spike of NFS traffic from encoding [08:57:00] now? or yesterday ? [08:57:26] yuvipanda: https://phabricator.wikimedia.org/T98159 ? [08:57:45] matanya: _joe_ reported it a few secs before I pinged you [08:58:03] not doing any NFS afaik [08:58:25] yuvipanda: is /home on NFS ? [08:58:45] matanya: yes [08:58:50] oh, ok then [08:58:53] n [08:59:04] copied a 13GB movie [08:59:22] which was done some time ago. 
sorry for that [09:00:50] (03PS1) 10Merlijn van Deen: Add priority keywords/labels for !priority email command [puppet] - 10https://gerrit.wikimedia.org/r/209445 (https://phabricator.wikimedia.org/T98356) [09:01:11] matanya: cool :) thanks for killing [09:01:53] yuvipanda: any other method i can use ? [09:02:22] matanya: you can get a xlarge instance and use /srv [09:02:27] Which will be about 140gigs [09:02:37] You need to enable labs srv role for that [09:03:00] yuvipanda: i'm on an xlarge, mostly running out of space :) [09:03:41] matanya: you are mostly screwed I would say, sadly. Two xlarges maybe :) [09:04:08] yuvipanda: then i need nfs to share the movies, over /home ... [09:04:38] matanya: you will kill it for everyone else as just happened [09:04:48] And wake up opsen at 2am [09:04:52] hence i am not doing that [09:04:53] <_joe_> matanya: do that with something that can limit bandwith [09:04:54] matanya: you can scp [09:05:00] <_joe_> like scp or rsync [09:05:02] Or that [09:05:08] good idea [09:05:19] will modify the scripts [09:05:45] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [09:06:08] matanya: ^ that would be you killing the cp. thanks btw [09:06:15] _joe_: yuvipanda: we need to fix this [09:06:23] Coren: ^ [09:06:37] akosiaris: its been ubn priority for two days now [09:06:40] set limits over the switch perhaps ? [09:06:53] an ACL or such [09:07:01] matanya: hardware switch? that would not help [09:07:07] <_joe_> matanya: we know the solution, we just need to implement it [09:07:10] but yes, rate limiting VM traffic would do it [09:07:21] _joe_: i figured ... :) [09:08:06] 6operations, 7Monitoring: improve reqstats error alerts - https://phabricator.wikimedia.org/T98450#1268171 (10fgiunchedi) 3NEW a:3fgiunchedi [09:08:25] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#915930 (10fgiunchedi) [09:08:26] 6operations, 7Monitoring: improve reqstats error alerts - https://phabricator.wikimedia.org/T98450#1268181 (10fgiunchedi) [09:12:24] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1268183 (10Joe) oxygen has a stream of all 5xx responses (but I guess it also includes 404 from commons, counted as 503s, so beware) in a json format. Maybe we could use whatever produces that stream to... [09:13:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-oc-ca [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/207130 (https://phabricator.wikimedia.org/T96655) (owner: 10KartikMistry) [09:13:24] 6operations, 10Traffic, 7Mobile: Mobile site broken - https://phabricator.wikimedia.org/T98309#1268186 (10Thgoiter) >>! In T98309#1264465, @faidon wrote: >This should be working now, please confirm. Working again. Thanks for fixing. [09:13:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian package for apertium-oc-es [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/207131 (https://phabricator.wikimedia.org/T96655) (owner: 10KartikMistry) [09:15:18] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1268188 (10MoritzMuehlenhoff) I've added the meta package to operations/debs/linux-meta.git. It has been built on copper and is available on apt.wikimedia.org "apt-get install linux... 
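A minimal example of the bandwidth-limited copy suggested to matanya above, so moving large video files between labs instances doesn't saturate NFS or the instance's network. Only the general "use scp or rsync with a rate limit" advice comes from the discussion; the source path, destination host, and the ~20 MB/s cap are made-up values for illustration.
```
# rsync --bwlimit takes KiB/s; 20000 is roughly a 20 MB/s cap (hypothetical)
rsync --bwlimit=20000 --partial --progress \
  /srv/encoding/Stockholm_Old_Town_Gamla_Stan.webm \
  encoding-02.eqiad.wmflabs:/srv/incoming/

# scp equivalent: -l is in Kbit/s, so ~160000 is about the same 20 MB/s cap
scp -l 160000 /srv/encoding/Stockholm_Old_Town_Gamla_Stan.webm \
  encoding-02.eqiad.wmflabs:/srv/incoming/
```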
[09:15:51] (03CR) 10Alexandros Kosiaris: "I am reproducing it very easily in a trusty pbuilder environment. The series of commands is:" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [09:29:03] !log Increased /sys/block/md125/md/sync_speed_min from 1000 to 4000 [09:29:15] Logged the message, Master [09:30:44] yuvipanda, around? [09:33:36] yurik: yes but only on a phone and very grumpy :p [09:33:48] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1268231 (10mark) p:5Normal>3High [09:34:25] yuvipanda, grumpy is good )) osm db has called, said it loves you, and asked when it can get exposed to the good people who may use it for creative querying [09:34:52] Lalalala I can't hear you and you want Alex and not me [09:34:58] !log Increased /sys/block/md125/md/sync_speed_min from 4000 to 40000 [09:35:04] Logged the message, Master [09:35:26] yurik: but: file a bug :) shouldn't be too hard but no promises [09:36:24] yuvipanda, already did https://phabricator.wikimedia.org/T98382 [09:36:51] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1268232 (10mark) >>! In T96063#1207581, @coren wrote: > Raid 6 is a performance bottleneck but gives us 66% more effective storage than raid10 would in the current configuration. It doesn't mean... [09:36:57] Cool. I'll take a look tomorrow [09:37:28] yuvipanda, we have been torturing akosiaris for a while ;) [09:37:46] thanks!!! [09:38:01] yurik: yes do continue. But remember you were told ops has no resources for maps this quarter so it will be best effort and no guarantees. [09:38:18] (At least I think you were told that) [09:38:27] we haven't decided on that actually [09:38:33] but we certainly don't have a lot of resources [09:38:35] will get better soon [09:39:10] mark: yeah but right now I guess it definitely is best effort no guarantees (at least from me) [09:39:13] :D [09:40:05] yuvipanda, no worries, no pressure :))) [09:40:10] but tomorrow would be good [09:40:13] ;-P [09:40:48] \o/ more resources, yei!!! :) [09:52:05] Labstore1001 RAID6 resync will now complete in about 6 hours instead of a year [09:52:42] (03PS6) 10Giuseppe Lavagetto: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [09:54:15] <_joe_> mark: well load is up a lot now [09:54:31] <_joe_> but I guess we must endure some pain for now [09:54:53] <_joe_> a non-synced raid is more harmful than anything else [09:55:41] <_joe_> ttmserver-mediawiki01.ttmserver.eqiad.wmflabs seems to be writing a lot in burst [09:55:44] <_joe_> *bursts [09:56:44] i can lower it a bit again if needed [09:57:02] <_joe_> no it seems to oscillate [09:57:03] disk load looks acceptable to me atm [09:57:06] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [09:57:28] <_joe_> yes [09:57:50] (03CR) 10Muehlenhoff: "I can also reproduce it. 
I think the reason is the missing m4 macro for AP_MKINCLUDE: The apertium 3.3 source package ships an apertium.m4" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [09:58:45] i've also increased the stripe cache size for that array a bit more [09:58:48] might help as well [09:59:35] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [10:00:09] <_joe_> mh labs is suffering a bit now [10:00:27] ok i'll lower [10:00:29] to 40 [10:00:53] done [10:01:35] !log Decreased labstore1001 md125 sync_speed_min from 80000 to 40000 [10:01:42] Logged the message, Master [10:05:26] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [10:10:55] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [10:11:15] (03PS7) 10Giuseppe Lavagetto: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [10:16:18] <_joe_> andrewbogott_afk: you will want to test the transition well ^^ [10:16:37] <_joe_> I just managed to screw a self-hosted puppetmaster up beyond repair [10:17:41] (03PS6) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) [10:18:11] moritzm: ^^ Thanks for pointer! [10:18:27] moritzm: no jmm here? :) [10:19:19] kart_: someone claimed jmm on freenode/nickserv before, so I have to stick with moritzm :-) [10:20:58] moritzm: :) [10:21:01] <_joe_> moritzm: aww that sucks [10:21:15] Does freenode allow reclaiming nicks? Think they do [10:21:17] <_joe_> but well 3-letters nicks are hard to grasp here [10:21:22] +1 [10:21:25] <_joe_> I know something about that [10:23:57] hoo: not sure, but by now I'm so used to being IRC-bipolar, that I'll stick with it :-) [10:25:02] IRC-bipolar... that would make awesome t-shirts :'D [10:26:07] !log bounce uwsgi on graphite1001 [10:26:14] Logged the message, Master [10:32:05] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [10:39:54] hoo: jmm or hoo tshirts are nice :) [10:42:37] (03PS1) 10Filippo Giunchedi: gdash: fix deploy addon urls [puppet] - 10https://gerrit.wikimedia.org/r/209462 (https://phabricator.wikimedia.org/T64667) [10:42:45] _joe_: can I toss the salt keys (in labs) etcd-master.eqiad.wmflabs and etcd01.eqiad.wmflabs? I'm assuming you can't actually use those [10:43:11] <_joe_> apergos: they've been freshly created I guess [10:43:22] <_joe_> I just re-created the whole project from scratch [10:43:34] well the salt keys would have names like [10:43:51] i-00000186.eqiad.wmflabs or whatever [10:44:12] this is the whole fqdn vs ec2i thing [10:45:15] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [10:45:19] so do you mind if I throw them? 
I'm starting the upgrade of salt in labs now, makes it easier without a lot of dead or invalid keys around [10:47:09] _joe_: [10:53:11] <_joe_> apergos: do it [10:53:29] ok thanks [11:30:55] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [11:33:45] (03PS1) 10KartikMistry: Added initial Debian package for apertium-kaz [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/209463 (https://phabricator.wikimedia.org/T95876) [11:45:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I fixed one error, but when applying it to a newly-created self-hosted puppetmaster in labs I get a weird behaviour, this should be tested" [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [11:56:09] akosiaris: will you be around to merge two patches at 9 AM PST? [11:56:38] akosiaris: need co-ordinatation with cx/cxserver deployment. [11:58:39] (03CR) 10Santhosh: [C: 031] CX: Use RESTBase API for page fetch [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [12:01:21] kart_: hmm, no [12:01:42] kart_: could we do them beforehand ? like now ? [12:02:26] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [12:02:47] (03CR) 10Alexandros Kosiaris: [C: 031] WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [12:08:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: fix deploy addon urls [puppet] - 10https://gerrit.wikimedia.org/r/209462 (https://phabricator.wikimedia.org/T64667) (owner: 10Filippo Giunchedi) [12:11:15] akosiaris: :/ that has to go with CX [12:12:09] akosiaris: if you can +1 to, https://gerrit.wikimedia.org/r/#/c/207378/ that is fine too.. [12:13:04] (03PS3) 10KartikMistry: Added initial Debian packaging for apertium-dan [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493) [12:13:45] akosiaris: and this, https://gerrit.wikimedia.org/r/#/c/195905 :) [12:14:45] kart_: $restbase = 'https://$lang.wikipedia.org/api/rest_v1/page/html/$title' ? [12:14:55] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [12:14:56] what's with the variables ? [12:15:19] (03PS4) 10Alexandros Kosiaris: Added initial Debian packaging for apertium-dan-nor [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [12:17:16] akosiaris: is it wrong one? [12:17:44] kart_: you tell me. variables inside a constant ? It sure is confusing [12:17:54] kart_: not sure how you use it though [12:18:00] kart_: which is why I am asking [12:18:08] akosiaris: need something like https://gerrit.wikimedia.org/r/#/c/207039/5/config.defaults.js [12:19:03] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1268485 (10Ottomata) Totally could, but that would be the > varnish -> varnishkafka -> kafka brokers -> kafkatee -> statsd -> carbon option that Faidon doesn't like so much. 
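A rough sketch of the kafkatee-to-statsd option referenced in the T83580 comment above: consume the 5xx JSON stream that kafkatee already writes on oxygen (/srv/log/webrequest/5xx.json, as used in the greps earlier in this log) and flush a counter to statsd every few seconds. The statsd endpoint, metric name, and flush interval are assumptions for illustration, not anything settled in the discussion.
```
#!/bin/bash
# consume the kafkatee-produced 5xx stream and emit a statsd counter
STATSD_HOST="statsd.eqiad.wmnet"   # assumed statsd endpoint
STATSD_PORT=8125
METRIC="reqerror.503"              # assumed metric name
INTERVAL=10                        # seconds between counter flushes

tail -F /srv/log/webrequest/5xx.json \
  | grep --line-buffered -w 503 \
  | while true; do
      # count whatever arrives within one interval, then flush
      count=$(timeout "$INTERVAL" cat | wc -l)
      # statsd counter wire format: <name>:<value>|c
      echo "${METRIC}:${count}|c" | nc -u -w1 "${STATSD_HOST}" "${STATSD_PORT}"
    done
```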
[12:19:31] (03CR) 10Giuseppe Lavagetto: "So, what I tried:" [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [12:19:46] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [12:19:48] akosiaris: any suggestion on that? [12:20:01] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1268494 (10Ottomata) Joe, FYI, the puppet that implements that is here: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L391 [12:20:52] <_joe_> ottomata: that is awesome btw [12:20:56] <_joe_> kudos for that [12:22:05] kart_: I am still not sure how you construct the query string [12:22:11] _joe_: the kafkatee thing? [12:22:14] <_joe_> yep [12:22:18] cool! glad you like [12:22:20] kart_: It just might be better to use the rest.wikimedia.org entrypoint [12:22:29] its pretty trivial to add more outputs too it [12:22:31] to* [12:23:09] akosiaris: earlier one in PS there? https://gerrit.wikimedia.org/r/#/c/207378/4..5/modules/cxserver/manifests/init.pp [12:23:09] <_joe_> ottomata: yeah I was thinking of sending the output of that grep to a python script that can process it and send it to statsd/graphite [12:23:44] aye, totally possible, but yeah, that is what paravoid didn't like so much [12:24:00] as it does add a lot of indirection into the monitoring pipe [12:24:03] <_joe_> me neither, but do we have a better alternative? [12:24:06] kart_: kind of messy indeed [12:24:15] :/ [12:24:17] what about the one i suggested in the phab? [12:24:17] <_joe_> I'm not sure about that [12:24:20] kart_: question: that url is used by the browser ? [12:24:32] so it needs to be the public entrypoint, right ? [12:24:34] <_joe_> but bbl, lunch [12:24:38] varnishncsa -m filters | count lines every n seconds | send to statsd [12:25:04] _joe_: would love your htoughts on the ticket, the most annoying thing about my suggestion is so many varnishncsa instances running on all caches :/ [12:25:04] !log bounce uwsgi on graphite1001 [12:25:28] but, it is much simpler and more direct [12:25:52] morebots: :( [12:25:53] I am a logbot running on tools-exec-1203. [12:25:53] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [12:25:53] To log a message, type !log . [12:25:54] and probably easier to scale, since then all of the monitoring is distributed across the whole cluster of caches [12:26:05] rather than a single or few nodes filtering the full log of all webrequests [12:26:07] !log bounce uwsgi on graphite1001 [12:26:26] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [12:27:16] kart_: I suppose $lang or $title is being text replaced with something else in cxserver ? [12:27:26] akosiaris: yes. [12:28:27] kart_: does cxserver rely on evaluating the string with variable interpolation ? or does it do regexp replace? it does seem like the former [12:28:39] which is not very safe [12:29:11] for example $title might be change by someone in the future to be in double quotes in that string [12:29:35] then puppet will interpolate the $title variable replacing it with something and cxserver will break [12:29:43] akosiaris: handled by, https://gerrit.wikimedia.org/r/#/c/207039/5/pageloader/PageLoader.js [12:29:55] akosiaris: oh. I see. [12:30:03] akosiaris: any suggestion? [12:30:37] !log rebooting cp1070 [12:30:42] Logged the message, Master [12:31:01] kart_: a better. 
OK so let's change the strings to something that is not a variable in puppet [12:31:10] like _LANG_ and _TITLE_ or something ? [12:31:24] (03CR) 10Ottomata: [C: 031] role::cache: decommission statsite [puppet] - 10https://gerrit.wikimedia.org/r/209188 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi) [12:33:27] (03PS1) 10BBlack: update late_command for new cache kernel pkgs [puppet] - 10https://gerrit.wikimedia.org/r/209466 [12:33:31] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to 2.8.6.1 - https://phabricator.wikimedia.org/T65847#1268528 (10Florian) Gerrit [[ https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.11.html | release 2.11 ]] adds another really really useful feature, [[ https://gerrit-docum... [12:33:33] 6operations, 10Deployment-Systems, 7Graphite, 5Patch-For-Review: [scap] Deploy events aren't showing up in graphite/gdash - https://phabricator.wikimedia.org/T64667#1268531 (10fgiunchedi) 5Open>3Resolved so two additional issues identified, one fixed in https://gerrit.wikimedia.org/r/209462 the other w... [12:33:44] (03CR) 10BBlack: [C: 032 V: 032] update late_command for new cache kernel pkgs [puppet] - 10https://gerrit.wikimedia.org/r/209466 (owner: 10BBlack) [12:34:19] akosiaris: let me look at again. As we need $lang, $title and how it can be replaced :) [12:34:38] kart_: ok [12:41:17] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [12:43:52] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1268580 (10BBlack) I've updated cp1070 via the meta package, everything worked out there with automatic initramfs/grub/etc. I also updated our hacky late_command stuff for the cache... [12:44:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] CX: Use RESTBase API for page fetch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [12:46:07] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [12:47:18] (03CR) 10Sbisson: [C: 031] "Works fine for me. Why does it take so long for VE to load on mw-vagrant?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: 10Mattflaschen) [12:48:58] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1268582 (10BBlack) We're now basically ready to make progress on this. Needs some coordination on reboots. The necessary command to upgrade the kernel prior to reboot is `apt-get -y install linux-meta` [12:49:33] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to 2.8.6.1 - https://phabricator.wikimedia.org/T65847#1268584 (10JanZerebecki) I'd like the upgrade to 2.11 to be reconsidered. But it is also a question of finding someone who does the work. So lets first find someone to do the work of upgrading to 2.8.6.1. [12:50:06] 6operations, 10Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1268585 (10BBlack) Turns out everything works fine on our newer kernel with cpufrequtils installed. Still blocking on cache reboots first to eliminate the hacky trunk kernel before apply @ori's cpufrequtils cl... 
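On the reqstats side, a sketch of the distributed alternative floated above ("varnishncsa -m filters | count lines every n seconds | send to statsd"): each cache host counts its own 5xx responses and pushes increments to statsd, instead of one central box filtering the full webrequest stream. The TxStatus tag and -F '%s' format assume the Varnish 3 tools in use at the time; the statsd endpoint and metric naming are assumptions.
```
#!/bin/bash
# run on each cache host: match 5xx responses and push per-status counters
STATSD="statsd.eqiad.wmnet/8125"   # assumed statsd endpoint (bash /dev/udp path)
HOST=$(hostname -s)

varnishncsa -m 'TxStatus:^5' -F '%s' | while read -r status; do
  # one counter increment per 5xx response, broken out by status code
  echo "varnish.${HOST}.status.${status}:1|c" > "/dev/udp/${STATSD}"
done
```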
[12:50:18] (03CR) 10Alexandros Kosiaris: [V: 04-1] "Fails with ./configure: line 2954: AP_MKINCLUDE: command not found" [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/209463 (https://phabricator.wikimedia.org/T95876) (owner: 10KartikMistry) [12:51:33] (03CR) 10Alex Monk: "I don't know how Vagrant is set up - does it load from Parsoid on the server, Restbase on the server, or Restbase on the client? IIRC you " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: 10Mattflaschen) [12:52:44] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to 2.8.6.1 - https://phabricator.wikimedia.org/T65847#1268589 (10matmarex) > But there is a (maybe) bad thing with Gerrit 2.11: The old (actual default) change screen was removed iirc, so the new change screen is the default Good, it's already way better... [12:58:44] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to 2.8.6.1 - https://phabricator.wikimedia.org/T65847#1268606 (10Florian) >>! In T65847#1268584, @JanZerebecki wrote: > I'd like the upgrade to 2.11 to be reconsidered. But it is also a question of finding someone who does the work. So lets first find some... [13:01:16] PROBLEM - puppet last run on mw2178 is CRITICAL puppet fail [13:01:33] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-dan_0.1.0-1 [13:01:34] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-en-gl_0.5.2~r57551-1 [13:01:34] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-an_0.3.0~r60158-1 [13:01:34] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-ast_1.1.0~r60158-1 [13:01:34] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-gl_1.0.8~r57542-1 [13:01:35] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eu-en_0.3.1~r60155-1 [13:01:36] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eu-es_0.3.3~r56159-1 [13:01:37] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eus_0.1.0-1 [13:01:38] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-fr-es_0.9.2~r27040-1 [13:01:39] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-oc-ca_1.0.6~r60158-1 [13:01:40] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-oc-es_1.0.6~r60161-1 [13:01:41] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-pt-gl_0.9.2~r57551-1 [13:01:41] Logged the message, Master [13:01:42] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-tat_0.1.0~r57462-1 [13:01:47] Logged the message, Master [13:01:51] Logged the message, Master [13:01:56] Logged the message, Master [13:02:01] Logged the message, Master [13:02:07] Logged the message, Master [13:02:13] Logged the message, Master [13:02:18] Logged the message, Master [13:02:24] Logged the message, Master [13:02:29] Logged the message, Master [13:02:34] Logged the message, Master [13:02:39] Logged the message, Master [13:02:44] Logged the message, Master [13:03:16] kart_: ^ [13:03:48] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1268611 (10MoritzMuehlenhoff) >>! In T97411#1268580, @BBlack wrote: > > Personally, I think we probably should just go ahead and have the jessie installer d-i use our kernel for all... [13:05:34] akosiaris: thanks! [13:06:07] (03CR) 10Sbisson: "I don't exactly know but there is a parsoid service that the mediawiki process (server-side) is making requests to. 
We had issues with thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: 10Mattflaschen) [13:08:13] (03CR) 10Alex Monk: "In production it depends which wiki you're on... We don't send private wikis to restbase, for example." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: 10Mattflaschen) [13:08:37] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1268619 (10BBlack) yeah sounds great. [13:10:10] akosiaris: tell me this was scripted :P [13:11:32] (03PS2) 10KartikMistry: Added initial Debian package for apertium-kaz [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/209463 (https://phabricator.wikimedia.org/T95876) [13:13:12] paravoid: isn't it obvious ? [13:19:15] RECOVERY - puppet last run on mw2178 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:19:50] (03PS1) 10KartikMistry: Added initial Debian package for apertium-kaz-tat [debs/contenttranslation/apertium-kaz-tat] - 10https://gerrit.wikimedia.org/r/209468 (https://phabricator.wikimedia.org/T95876) [13:19:55] More scriptkiddo at work :) ^^ [13:22:32] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit (to 2.8.6.1? to 2.11?) - https://phabricator.wikimedia.org/T65847#1268661 (10matmarex) [13:23:59] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit (to 2.8.6.1? to 2.11?) - https://phabricator.wikimedia.org/T65847#1268664 (10Nemo_bis) The request for 2.11 should be moved to T70271. This report is for requesting a minor version upgrade within 2.8.x. [13:24:28] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to 2.8.6.1 - https://phabricator.wikimedia.org/T65847#1268667 (10Nemo_bis) [13:25:12] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#699035 (10Nemo_bis) [13:26:05] more trancoding issues :/ https://commons.wikimedia.org/wiki/File:Pla%C5%BEa_Bele%C4%8Dica,_Trpanj_122905.webm [13:28:55] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [13:32:10] jouncebot: prev [13:32:17] ^ it totally needs that command [13:32:36] jouncebot: next [13:32:36] In 1 hour(s) and 27 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150507T1500) [13:33:07] was there some kind of big scap in the past few hours? [13:33:23] well ok 5 hours ago [13:35:01] I see salt job history of scape fetch/deploy, seems to be all one things, but the hosts it hits are spread out over a several-hours period ending ~5 hours ago [13:35:18] is that remotely normal? 
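For the salt job history being eyeballed above, a minimal sketch of how it can be listed on the salt master. The runner functions are standard salt; the jid shown is a placeholder, and filtering on trebuchet's fetch/checkout calls is an assumption about how those jobs are named:
```
# List recent salt jobs, then pull the detail for one of them.
sudo salt-run jobs.list_jobs | less
sudo salt-run jobs.lookup_jid 20150507053000123456   # placeholder jid
```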
[13:35:18] s/scape/scap/ [13:36:56] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [13:42:54] <_joe_> bblack: scap doesn't use salt [13:42:57] <_joe_> trebuchet does [13:43:04] <_joe_> which is used to deploy scap [13:43:13] <_joe_> yeah I know, that's kinda funny [13:44:17] seems like there's all kinds of salt scape (trebuchet) traffic over several hours last night [13:44:26] seems like something that was probably meant to happen all at once, but didn't [13:46:51] (03PS1) 10Muehlenhoff: Add versioned dependency on updated firmware-bnx2 firmware (needed by Linux 3.18 and later (see Debian #779128) (Bug: T97411) [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/209480 [13:47:03] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1268771 (10JanZerebecki) Even if we upgrade straight to 2.11 we still would need to find someone who has the time and will to do it. [13:48:05] (03CR) 10BBlack: [C: 031] Add versioned dependency on updated firmware-bnx2 firmware (needed by Linux 3.18 and later (see Debian #779128) (Bug: T97411) [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/209480 (owner: 10Muehlenhoff) [13:48:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [13:49:30] 6operations, 10Traffic: Update prod custom varnish package for upstream 3.0.7 + deploy - https://phabricator.wikimedia.org/T96846#1268775 (10BBlack) [13:49:32] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1268774 (10BBlack) [13:54:29] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1268789 (10BBlack) ^ Unblocked from varnish update, since that's going to take longer to investigate. Also already apt-updated the kernel everywhere, just the reboots themselves remain now. [13:54:34] (03CR) 10Alexandros Kosiaris: [C: 031 V: 031] "We can merge that and see if it works. Should it work, we should push it upstream. And now that I 've said upstream, we should upgrade as " [software/librenms] - 10https://gerrit.wikimedia.org/r/209443 (https://phabricator.wikimedia.org/T84416) (owner: 10Filippo Giunchedi) [13:55:43] !log uploaded to apt.wikimedia.org jessie-wikimedia: linux-meta_1.1 [13:55:50] Logged the message, Master [14:03:25] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [14:08:51] (03CR) 10GWicke: CX: Use RESTBase API for page fetch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [14:10:43] 6operations, 6Analytics-Kanban: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1268864 (10fgiunchedi) thanks for the report @milimetric! indeed the culprit was due to multiple interactions/renaming with: timers, simple counters and extended c... [14:11:20] 6operations, 6Analytics-Kanban: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1268875 (10fgiunchedi) p:5Triage>3Normal a:3fgiunchedi [14:17:55] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [14:21:34] bblack: good news, the strongswan folks say they've seen boxes with 10k SAs defined. 
they suggested i look at pcrypt and tcrypt: http://www.spinics.net/lists/linux-crypto/msg07040.html [14:22:36] PROBLEM - puppet last run on analytics1026 is CRITICAL Puppet last ran 19 hours ago [14:29:06] RECOVERY - puppet last run on analytics1026 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:30:22] (03CR) 10Santhosh: CX: Use RESTBase API for page fetch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [14:31:24] (03CR) 10GWicke: CX: Use RESTBase API for page fetch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [14:33:38] (03PS1) 10Springle: depool db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209488 [14:34:27] (03CR) 10Springle: [C: 032] depool db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209488 (owner: 10Springle) [14:34:27] jgage: wait, what? we need other software just to get ipsec to use multiple cores? I kind of assumed it would already do that, at least each association as a separate scheduler tasks for ESP? [14:34:50] or whatever, down in the network stack. [14:34:55] it's not really limited to one CPU is it? [14:36:29] moritzm: hello and nice to meet you [14:37:29] bblack: not sure about strongswan, but it doesn't handle ESP traffic itself. it just sets up the connection. [14:38:30] (03Merged) 10jenkins-bot: depool db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209488 (owner: 10Springle) [14:38:31] matanya: hi [14:39:44] !log springle Synchronized wmf-config/db-eqiad.php: depool db1019 (duration: 00m 14s) [14:39:50] Logged the message, Master [14:40:09] jgage: right, so the kernel should be fine at scaling the crypto for multiple associations, right? [14:40:10] 6operations, 6Analytics-Kanban: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1268990 (10Milimetric) 5Open>3Resolved thanks very much @fgiunchedi, I'm not sure what that means but I'll try to get Andrew to translate :) [14:40:32] jgage: I'm not getting why someone has developed separate software to "parallelize" ipsec, and why we might need it... [14:41:49] 6operations, 7Icinga, 7Monitoring: check_puppetrun: print "agent disabled" reason - https://phabricator.wikimedia.org/T98481#1268992 (10Gage) 3NEW [14:42:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [14:44:17] bblack: this thread seems to explain it: http://comments.gmane.org/gmane.linux.network/279262 [14:44:18] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1269020 (10Ottomata) Bump, @AndyRussG have you had a chance to look at these? [14:45:59] 6operations, 10Analytics-Cluster: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1269031 (10Ottomata) [14:46:04] jgage: ok if it's just a matter of where IRQs land, we've already solved that problem and they'll be distributed over cores evenly [14:46:33] (as the next responder in the thread mentions, we use RPS(/RSS/etc...)) [14:46:48] 6operations, 7Icinga, 7Monitoring: check_puppetrun: print "agent disabled" reason - https://phabricator.wikimedia.org/T98481#1269034 (10Gage) Relatedly, I have learned that the reason must be quoted or only get the first word is stored: gage@curium:~$ sudo puppet agent --disable gage ipsec gage@curium:~... 
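On the IRQ/RPS point above, a hedged sketch of checking how NIC interrupts and receive packet steering are spread over cores; eth0, the queue name and the CPU mask are illustrative, not the production values:
```
grep eth0 /proc/interrupts                      # which CPUs take the NIC's IRQs
cat /sys/class/net/eth0/queues/rx-0/rps_cpus    # current RPS mask for receive queue 0
# example: steer queue 0's packet processing across the first 32 CPUs
echo ffffffff | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
```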
[14:47:20] bblack: cool [14:48:02] I guess we'll see as we scale out the deploy whether there are any pragmatic issues there, but hopefully not [14:49:32] bblack: i also showed them check_strongswan, and they said "why aren't you using vici?" https://wiki.strongswan.org/projects/strongswan/wiki/VICI [14:49:43] however that swanctl util is not in the deb [14:50:17] at least there's a potential alternative to parsing ipsec statusall [14:50:42] eh [14:51:02] manybubbles, ^d, thcipriani, marktraceur: Who wants to SWAT this morning? [14:51:10] I'm gonna stop looking at strongswan links before I convince myself there's no way this is ready for prime time at our scale. [14:51:14] kart_, Mjbmr: Ping for SWAT in about 9 minutes. [14:51:15] we need to test and find out :P [14:51:22] heh, yeah [14:51:35] the only viable-seeming alternative is libreswan, and their docs suck [14:51:45] well the whole ecosystem is a mess tbh [14:51:47] doesn't matter which one [14:52:03] arguably that comes from the standards being a mess to begin with [14:52:40] anyways, what's remaining before we can start trying limited deployment, aside from kernel/ipv6 issues? puppet cert? [14:54:31] well, and I guess puppet refactoring for how associations are mapped out [14:55:15] anomie: I also just added some stuff for SWAT [14:55:29] anomie: ack [14:55:30] anomie: ...or I would have except for loss of session data. [14:55:42] argh phab is being weird, when i click on the project tag and then "open tasks" i no longer get what i expect [14:55:51] anomie: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=158001&oldid=157877 [14:55:58] basically i think the only other thing is detemrinign some optimal config values [14:56:19] for rekey times and margin? [14:57:04] bblack: yeah [14:57:13] I think it would be good to leave a large margin gap, in case that gives us a chance to correct a problem showing up in the logs before rotation or something [14:57:43] for default phab shows open tasks, for the default no more closed task. [14:57:51] speaking of logs, i wonder why we use RSYSLOG_TraditionalFileFormat [14:57:56] maybe something like keying every 4 hours and margin of 1? [14:58:02] i disabled it to get highrez timestamps on my ipsec test boxes [14:58:08] <^d> anomie: I'd rather not, got a meeting in a bit I'm prepping for [14:58:22] (also, is there any thought in the stack to dithering/randomizing the rekey times, or are they all gonna hit at the same time?) [14:58:42] * anomie would rather not too, but everyone else seems to be hiding... [14:58:43] there is, that setting is called rekeyfuzz [14:58:48] ok [14:59:01] anomie: can swat [14:59:05] thcipriani: ok! [14:59:52] Mjbmr: it looks like a lot of your patches list 206736 as a dependency, but that needs a rebase. [14:59:59] so I guess set up a rekey interval that's long enough to not be wasteful, but short enough to be reasonable on security. set margin to like 1/4 of rekey time, and try to fuzz it widely. [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, Mjbmr, legoktm: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150507T1500). [15:01:43] thcipriani: it's not gonna make any problem, it's already merged, right? 
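The rekey numbers floated in the exchange above ("keying every 4 hours and margin of 1", plus rekeyfuzz for dithering) would look roughly like this in strongswan's ipsec.conf. This is only a sketch of those suggested values, not the configuration that was deployed:
```
conn %default
    lifetime=4h        # rekey the child SA roughly every 4 hours
    margintime=1h      # start rekeying 1 hour before expiry
    rekeyfuzz=100%     # randomize the margin so peers don't rekey in lockstep
```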
[15:02:12] (03CR) 10Alexandros Kosiaris: CX: Use RESTBase API for page fetch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [15:02:53] Mjbmr: doesn't seem to be: https://gerrit.wikimedia.org/r/#/c/206736/ [15:03:43] (03CR) 10Alexandros Kosiaris: [V: 04-1] "configure: error: Package requirements (lttoolbox-3.2 >= 3.3.0) were not met:" [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [15:03:47] thcipriani: oh, I never submit a patch on top of other patches, Dereckson did. [15:03:47] (03CR) 10KartikMistry: [C: 031] "OK now!" [puppet] - 10https://gerrit.wikimedia.org/r/209202 (https://phabricator.wikimedia.org/T97888) (owner: 10KartikMistry) [15:03:51] (03PS9) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [15:03:55] bblack: sounds reasonable [15:04:10] thcipriani: do 206736 then 206737 [15:04:52] Mjbmr: yup, looks like 206736 needs a rebase before it can merge, can you do that? [15:05:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209207 (https://phabricator.wikimedia.org/T97888) (owner: 10KartikMistry) [15:05:19] kart_: you're up first, FYI [15:05:25] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [15:05:26] ok, but it's gonna take a little time for me to do it. [15:05:33] thcipriani: yep [15:06:18] Mjbmr: that's fine. I've got stuff to deploy in the interim and the window is open until 9am. [15:06:42] 6operations: rsyslog: use high precision timestamps or explain why not - https://phabricator.wikimedia.org/T98488#1269095 (10Gage) 3NEW [15:10:00] (03PS2) 10Mjbmr: Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) (owner: 10Dereckson) [15:10:06] (03CR) 10jenkins-bot: [V: 04-1] Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) (owner: 10Dereckson) [15:10:07] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [15:10:32] (03PS5) 10KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493) [15:12:44] kart_: you inserted a syntax error ^ [15:12:55] missing a command at the very first Build-Depends lin [15:12:57] line [15:13:03] (03Merged) 10jenkins-bot: CX: Enable ContentTranslation for Wikis scheduled on 20150507 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209207 (https://phabricator.wikimedia.org/T97888) (owner: 10KartikMistry) [15:13:53] akosiaris: blah [15:14:13] 6operations, 7HHVM: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1269125 (10Springle) 3NEW [15:14:35] (03PS3) 10Mjbmr: Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) (owner: 10Dereckson) [15:14:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add versioned dependency on updated firmware-bnx2 firmware (needed by Linux 3.18 and later (see Debian #779128) (Bug: T97411) [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/209480 (owner: 10Muehlenhoff) [15:15:16] thcipriani: done [15:16:07] !log thcipriani Synchronized 
wmf-config/InitialiseSettings.php: CX enable content translations [[gerrit:209207]] (duration: 00m 12s) [15:16:16] ^ kart_ test please [15:16:18] Logged the message, Master [15:16:27] Mjbmr: kk, ty [15:16:33] yw [15:16:36] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [15:17:01] akosiaris: can you merge, https://gerrit.wikimedia.org/r/#/c/209202/ [15:18:13] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Add languages for CX deployment on 20150507 [puppet] - 10https://gerrit.wikimedia.org/r/209202 (https://phabricator.wikimedia.org/T97888) (owner: 10KartikMistry) [15:19:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:19:47] thcipriani: all good. [15:19:57] kart_: thanks [15:20:20] legoktm: centralauth on wmf5 next [15:20:26] o/ [15:24:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:24:33] There's concerns that the enwp job queue is stuck since it's growing so much and pushing 20 million. Can someone peek and poke at it as needed? [15:24:50] 6operations, 10Wikimedia-Git-or-Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1269168 (10Paladox) Hi why not ask the person who did the upgrade on gerrit on Wikimedia to see if he/she will do it again but to 2.11. [15:25:03] T13|mobile: i'd wager this is fallout from last saturday, when someone accidentally disabled the job queue [15:25:53] Can someone here re-enable that, please? [15:26:44] 6operations, 10SUL-Finalization: centralauth database on dbstore1002 is out of date, replication stuck? - https://phabricator.wikimedia.org/T95927#1269171 (10Legoktm) 5Resolved>3Open Happening again? ``` mysql:sul@dbstore1002 [centralauth]> select max(gu_id) from globaluser; +------------+ | max(gu_id) |... [15:27:28] !log db connection EINTR noise in logs, see T98489 [15:27:35] legoktm: are there localization updates in CentralAuth? [15:27:36] Logged the message, Master [15:27:49] thcipriani: no, should just be a sync-file [15:28:14] * thcipriani wipes brow [15:28:26] :P [15:28:30] the actual patch was https://gerrit.wikimedia.org/r/#/c/209316/ [15:32:03] "jobs": 19977207 [15:32:09] And growing... [15:32:50] T13|mobile: just to reassure you, the job queue is (probably) working again, it was broken only for a short while [15:33:35] PROBLEM - puppet last run on mw2094 is CRITICAL puppet fail [15:34:19] I've watched it go from 19.82 to 19.98 in the last hour... [15:34:42] 19.99 [15:34:47] !log thcipriani Synchronized php-1.26wmf5/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php: Update CentralAuth [[gerrit:209492]] (duration: 00m 17s) [15:34:54] Logged the message, Master [15:34:57] ^ legoktm [15:35:14] T13|mobile: looks like they're all refreshLinks jobs [15:36:17] thcipriani: ok, this isn't something I can really test [15:36:44] ok, well, file synced then, continuing with wmf4 [15:37:00] T13|mobile: some jobs actually generate more jobs when executed :D [15:37:47] (03PS6) 10KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/195905 (https://phabricator.wikimedia.org/T91493) [15:38:05] hmm [15:38:17] I understand, it's just concerning that sitting here pressing [ctrl]+[shift]+[r] it's growing 50-220 per second and never goes down. Maybe it needs a little more resources to chew it down a bit?
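The job-queue figures quoted above ("jobs": 19977207 and the 19.82M to 19.99M climb) are the kind of estimate MediaWiki exposes in its siteinfo statistics. A quick hedged way to watch the same number, treating it as a rough estimate rather than an exact count:
```
curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' \
  | grep -o '"jobs":[0-9]*'
```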
well, it's executing jobs [15:39:25] RECOVERY - Host labvirt1005 is UPING OK - Packet loss = 0%, RTA = 3.88 ms [15:40:36] T13|mobile: for example: (simplifying, since i don't know exactly how it works) say you edit a template used on 200 000 pages. rather than generate 200 000 jobs to update the pages immediately, which itself would take a long time, MediaWiki instead generates (say) 100 jobs, each of which generates 2000 jobs, each of which actually updates a page. [15:40:54] this causes the queue length to fluctuate in fascinating, unpredictable and entirely meaningless ways [15:41:23] (03PS1) 10Andrew Bogott: Set force_snat_range for labs floating ips. [puppet] - 10https://gerrit.wikimedia.org/r/209506 [15:41:43] MatmaRex: I get that. [15:41:51] okay :) [15:42:05] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1269198 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson The system board has been changed, everything posted as it should. Updated the bios and iLom settings. Verified MAC address... [15:42:21] ^d: wmf4 messed up like last time? [15:42:43] <^d> I'm really busy. [15:44:07] I'm sure the template I made two edits to in the last couple days with 126K+ transclusions helped bump that number. [15:44:08] manybubbles: SMalyshev: call in 15 mins. from my side we can skip it. do you have something important? [15:44:19] thanks cmjohnson1 [15:44:32] yw [15:44:35] T13|mobile: wat >.> [15:44:56] Lydia_WMDE: we can skip. I want to have the meeting again when robla comes back but we can wait [15:44:57] ? [15:45:06] manybubbles: ack [15:45:22] T13|mobile: don't complain about job queue length when you're the one who made it so long! :P [15:45:25] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:45:38] 250K != 20M [15:47:48] hmn, it *just* hit 20000096 [15:48:54] !log thcipriani Synchronized php-1.26wmf4/extensions/CentralAuth/includes/LocalRenameJob/LocalRenameUserJob.php: Update CentralAuth [[gerrit:209493]] (duration: 00m 21s) [15:49:01] Logged the message, Master [15:49:53] so basically all of mine are setting up the ShortUrl extension for these wikis: newiki eswiki sawiki sawikisource sawikiquote sawiktionary sawikibooks, so running update.php and populateShortUrlTable.php are required for them. [15:49:58] legoktm: that's really weird, centralauth on dbstore1002... [15:50:43] springle: yeah...I have no idea :/ [15:51:08] the replication rules exist, the records are streaming into the binlog, the tables are intact [15:51:25] RECOVERY - puppet last run on mw2094 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:52:23] Mjbmr: this is new deploy territory for me, especially in light of https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change I may need some help on this one. [15:52:50] anomie: do you have a second for a deploy assist? [15:52:58] thcipriani: Sure, what's up? [15:53:33] so Mjbmr's patches evidently require schema changes, which I have never done during a swat. [15:54:14] anomie: what's SOP in this instance? [15:55:03] thcipriani: I believe mwscript maintenance/patchSql.php --wiki=foowiki path/to/file.sql will do it. I haven't done it myself though, it doesn't come up much.
[15:55:40] Oh, according to the page you linked just sql.php will do it too [15:55:43] 7Puppet, 6Reading-Infrastructure-Team, 6Release-Engineering, 5Patch-For-Review: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1269268 (10Jdforrester-WMF) [15:55:50] * anomie didn't know that [15:57:17] RoanKattouw did one ShortUrl setup last evening for knwiki. [15:59:51] 6operations, 10SUL-Finalization: centralauth database on dbstore1002 is out of date, replication stuck? - https://phabricator.wikimedia.org/T95927#1269302 (10Springle) Something weird is happening. Replication for s7 is running, the replication rules (added earlier in this bug) still exist, the statements are... [16:00:09] kart_: Respected human, time to deploy Content Translation deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150507T1600). Please do the needful. [16:00:35] yes. jouncebot [16:01:23] oh my that is a lot of jobs [16:01:56] kart_: I'm still trying to get Mjbmr's shorturl stuff out the door for SWAT [16:03:04] thcipriani: no issue. this is code deployment, shouldn't affect config. [16:03:09] but correct me :) [16:05:03] kart_: blerg, go for it, still trying to find the sql to run :( [16:05:45] Mjbmr: is there consensus? [16:06:06] yeah, check those. [16:07:47] thcipriani: isn't that just update.php ? [16:07:54] haha [16:07:54] no [16:08:14] Mjbmr: checkout the How to do a schema change page [16:08:33] https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change [16:08:37] RoanKattouw_away: And if anyone runs update.php I'll be on the first flight to SFO to slap them in the face [16:08:56] I saw it Krenair at the first time, thanks. [16:09:56] hmm, I guess that "to" should be changed to "from" :P [16:10:31] Mjbmr: looks like it'll be: mwscript sql.php --wiki=newiki /srv/mediawiki-staging/php-1.26wmf4/extensions/ShortUrl/schemas/shorturls.sql [16:10:42] (03CR) 10Ottomata: "Hmm, we shouldn't make the varnishkafka and varnishncsa instances use the same endpoints for logging. This would cause duplicate events " [puppet] - 10https://gerrit.wikimedia.org/r/209175 (owner: 10Ori.livneh) [16:10:49] ori: ^ [16:11:44] godog: \o/ there are deployment lines in gdash again. Thanks! [16:13:06] (03CR) 10Thcipriani: [C: 032] Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) (owner: 10Dereckson) [16:13:22] Mjbmr: ok, going on this one. [16:14:51] (03PS10) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [16:15:16] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [16:16:47] 6operations, 10SUL-Finalization: centralauth database on dbstore1002 is out of date, replication stuck? - https://phabricator.wikimedia.org/T95927#1269372 (10Springle) It's all of S7 affected: arwiki cawiki eswiki fawiki frwiktionary hewiki huwiki kowiki metawiki rowiki ukwiki viwiki ...and centralauth. The... 
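For the stuck-replication investigation in T95927 above, a minimal sketch of the checks involved. The dbstore1002 hostname comes from the discussion, but the FQDN, credentials and the exact syntax are assumptions (on a multi-source MariaDB box it may need `SHOW ALL SLAVES STATUS` instead):
```
mysql -h dbstore1002.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' \
  | grep -E 'Seconds_Behind_Master|Last_SQL_Error'
mysql -h dbstore1002.eqiad.wmnet -e 'SELECT max(gu_id) FROM centralauth.globaluser'
```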
[16:18:03] (03Merged) 10jenkins-bot: Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) (owner: 10Dereckson) [16:18:36] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [16:19:51] !log kartik Started scap: Update ContentTranslation [16:20:01] Logged the message, Master [16:20:17] !log creating newiki shorturl table [16:20:25] Logged the message, Master [16:20:46] (03PS2) 10Andrew Bogott: Set force_snat_range for labs floating ips. [puppet] - 10https://gerrit.wikimedia.org/r/209506 [16:23:09] Mjbmr: any special arguments to populateShortUrlTable aside from wiki? [16:23:24] no [16:23:36] PROBLEM - High load average on labstore1001 is CRITICAL 57.14% of data above the critical threshold [24.0] [16:23:49] !log populateShortUrlTable on newiki [16:23:55] Logged the message, Master [16:24:10] let me know how long does it take! [16:24:16] kk [16:24:26] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 57.14% of data above the critical threshold [35.0] [16:24:37] Mjbmr: 17s [16:24:44] great! [16:25:26] PROBLEM - Disk space on labvirt1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 87528 MB (3% inode=99%) [16:25:43] kart_: bah, can I sync config during scap? I'd guess not: there's a lock file right? [16:26:06] overlapping syncs should not be possible [16:26:10] for lots of reasons [16:26:38] (03PS11) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [16:26:44] bd808: darn good reasons. [16:26:45] whatever was staged in /srv/mediawiki-staging when the scap started is getting shipped to the cluster [16:26:50] <^d> bd808: Does sync-file/dir support bash file expansion? [16:27:05] RECOVERY - Disk space on labvirt1001 is OK: DISK OK [16:27:10] <^d> eg: sync-file {Foo,Bar}.php [16:27:12] ^d: not in the code but in your shell when you run it [16:27:32] kk: Mjbmr I'm going to get the rest of the ShortURL schema changes done in the interim [16:27:33] <^d> Ah gotcha ok. Never tried and was worried it might do something weird so I asked [16:27:35] That won't work though [16:27:42] there is a way... [16:27:46] it's dark magic [16:28:39] <^d> I mean it's nbd, I just use sync-dir if sync-file can't do it. it's fast enough [16:28:40] thcipriani: alright, thank you [16:28:51] <^d> bd808: and when in doubt, scap :D [16:29:19] ^d: there is code inside scap that can sync multiple files but no cli interface to talk to it [16:29:45] waiting for sync-common 1% left :) [16:30:04] sync-file (and sync-dir) treat the first arg as the thing to sync and everything after that as the log message [16:30:50] kart_: The last one left is snapshot1004.eqiad.wmnet which will probably hang indefinately [16:31:10] <^d> Can it skip syncing for now? [16:31:16] <^d> Or does it /have/ to be up to date? [16:31:25] kart_: you can get it unstuck by opening a second ssh session to tin and killing the ssh command that you own connecting to snapshot1004.eqiad.wmnet [16:32:01] ^d: I think we need to figure out a way to add a watchdog timer in scap for this step [16:32:13] bd808, can we get a cli for that multiple-file code? 
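A hedged sketch of the "get it unstuck" step described above: from a second shell on tin, find your own hung ssh to snapshot1004 and kill it so the sync can finish. The pattern match is an assumption about how the process shows up in the process list:
```
pgrep -af 'ssh.*snapshot1004.eqiad.wmnet'            # confirm which process it is
pkill -u "$USER" -f 'ssh.*snapshot1004.eqiad.wmnet'  # kill only your own ssh
```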
[16:32:26] Krenair: patches welcome ;) [16:33:35] kart_: you really should kill that snapshot1004.eqiad.wmnet ssh process or you will be here all day [16:33:50] I can't do it for you because I don't have root [16:34:15] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:34:33] bd808: doing. [16:35:01] 6operations, 6Search-Team, 7Elasticsearch: Backup elasticsearch indicies - https://phabricator.wikimedia.org/T91404#1269466 (10demon) [16:35:30] bd808: done [16:36:07] bd808: and do I need to do that again? I forgot what was the command to rerun? [16:36:13] !log kartik Finished scap: Update ContentTranslation (duration: 16m 21s) [16:36:20] Logged the message, Master [16:36:44] !log create shorturl table in sawiki, sawikisource, sawikiquote, sawiktionary, sawikibooks [16:36:49] Logged the message, Master [16:37:20] kart_: I'm running it [16:37:34] bd808: thanks! [16:37:35] !log Running sync-common manually on snapshot1004.eqiad.wmnet [16:37:40] Logged the message, Master [16:39:45] <^d> aude: About? [16:39:46] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [16:40:30] ^d: yes [16:40:47] bd808: lemme know when sync-common is done, got a few config changes left for SWAT :( [16:41:12] what's the problem? [16:41:16] thcipriani: I'm running directly on snapshot1004. Do what you need to do [16:41:24] kk [16:42:04] <^d> aude: Not a problem. Just could use some input on https://gerrit.wikimedia.org/r/#/c/208168/ (getting addWiki to run populateSitesTable automatically) if you've got a few mins. [16:42:07] thcipriani: snapshot1004 continues to be sick. I doubt that my sync will actually work there but I'm trying just for fun [16:43:01] ah, that... [16:43:08] it's a nuisance [16:43:47] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Enable ShortUrl on newiki [[gerrit:206736]] (duration: 00m 21s) [16:43:53] Logged the message, Master [16:44:02] ^ Mjbmr [16:44:05] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [16:44:51] ^d: when do you want this merged? [16:45:07] we have special cased some wikis (e.g. ruwiki) to have https [16:45:10] <^d> no real timeline, just scratching an itch [16:45:22] the script doesn't handle that at the moment but we really need a solution for that [16:45:28] otherwise, strip-protocols [16:45:30] (03CR) 10Thcipriani: [C: 032] Enable Extension:Shorturl on sa. projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) (owner: 10Shanmugamp7) [16:45:35] (03Merged) 10jenkins-bot: Enable Extension:Shorturl on sa. 
projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) (owner: 10Shanmugamp7) [16:45:55] and then an entry for the new site is needed on all other wikis with the sites table [16:46:38] Mjbmr: I'm going to do sa wiki since I've added the tables, then push es and alphabetical changes to another swat window since it seems like they've become outdated, somehow (Can Merge: No) [16:48:24] ^d: commented and i am back at the office next week to look at it again [16:49:31] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Enable shortURL on saprojects [[gerrit:201216]] (duration: 00m 14s) [16:49:37] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [16:49:38] Logged the message, Master [16:49:46] fwiw, Coren is investigating ^ [16:50:08] SWAT extended remix has completed [16:50:27] Yeah, I think I can ack it for now. It looks like a combination of higher-resync-speed + network choking causing a bit of pileup. [16:51:05] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [16:51:26] RECOVERY - Host curium is UPING OK - Packet loss = 0%, RTA = 3.31 ms [16:52:30] ACKNOWLEDGEMENT - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0] Coren Predictable side effect of the increased resync speed. System is sluggish, but remains usable - keeping a close eye on it. [16:53:56] <^d> aude: thx! :) [16:55:15] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [16:55:41] ^ me [16:55:50] i guess my maintenence config in icinga expired [16:55:53] booting the new kernel? [16:55:55] yeah [16:56:07] !log sync-common on snapshot1004 finished in 12:36 [16:56:10] via linux-meta install right? [16:56:13] Logged the message, Master [16:56:14] technically older, i'm going from 4.0 to 3.19.6 [16:56:27] what's linux-meta install? i just apt-get installed it. [16:56:39] "apt-get install linux-meta" is what you want [16:56:45] cool ok [16:56:48] to get initramfs and other associated bits [16:57:01] maybe delete the 4.0 packages first, I donno [17:01:06] RECOVERY - Host curium is UPING OK - Packet loss = 0%, RTA = 2.99 ms [17:02:01] (03CR) 10Andrew Bogott: [C: 032] Set force_snat_range for labs floating ips. [puppet] - 10https://gerrit.wikimedia.org/r/209506 (owner: 10Andrew Bogott) [17:03:37] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1269523 (10coren) @mark suggest it might be worthwhile to ensure that the labstores and their shelves are all on the same phase to avoid the possibility of an electr... [17:04:52] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1269532 (10Dzahn) @Joe so re: our IRC talk. So to try the switch of a cluster i picked 'PDF' andi made [[ https://gerrit.wikimedia.org/r/#/c/209388/1/hieradata/common.yaml |... [17:15:52] legoktm: would my guess that part of the reason the jobqueue is still ever expanding might be related to SULF? [17:16:01] doubt it [17:17:22] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1269542 (10Cmjohnson) I verified that both labstores and their shelves are on the same phase. -CJ [17:51:10] (03PS1) 10Andrew Bogott: Revert "Set force_snat_range for labs floating ips." 
[puppet] - 10https://gerrit.wikimedia.org/r/209537 [17:54:43] (03CR) 10jenkins-bot: [V: 04-1] Revert "Set force_snat_range for labs floating ips." [puppet] - 10https://gerrit.wikimedia.org/r/209537 (owner: 10Andrew Bogott) [17:55:48] Hm, I think Jenkins is ailing, is someone working on that? https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27/1066/console [17:56:37] andrewbogott: that's weird...the slave can't connect to pypi.python.org [17:57:00] legoktm: Oh! That might be my fault, hang on… [17:57:00] (03CR) 10Legoktm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/209537 (owner: 10Andrew Bogott) [17:57:12] :P [17:57:38] Nope, I take it back — I don’t think it’s my fault. [17:57:41] I can ping, certainly. [17:59:06] https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27/1067/console I think it's going to timeout again [17:59:58] --- python.map.fastly.net ping statistics --- [17:59:59] 13 packets transmitted, 0 received, 100% packet loss, time 11999ms [18:00:50] oh fastly [18:01:17] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/209537 (owner: 10Andrew Bogott) [18:01:25] hmm [18:01:28] I think it's just that slave [18:01:45] nope...now it's working fine??? [18:02:42] (03CR) 10Andrew Bogott: [C: 032] Revert "Set force_snat_range for labs floating ips." [puppet] - 10https://gerrit.wikimedia.org/r/209537 (owner: 10Andrew Bogott) [18:03:32] greg-g: i need a deploy slot on monday for enabling arbitrary access (on the initial set of wikis) [18:03:42] 6operations, 10Traffic, 5Patch-For-Review: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1269630 (10Gage) This kernel is now installed on berkelium & curium. * IPsec ESNs work (fixed in 3.19.3) * Aesni security patch for CVE-2015-3331 is included (fixed in 3.19.3) * Aes... [18:03:56] aude: take it :) [18:04:32] ok :) [18:04:55] there is no schedule for next week, but how about 13:00 UTC monday and wednesday (for next usage tracking) [18:05:10] i can add this on the deploymetns page [18:07:14] aude: yeah, add it in the upcoming section, I'll add the weekly outline soon [18:07:43] done [18:08:01] * aude off for the rest of the day [18:26:40] (03PS1) 10Manybubbles: Upgrade Elasticsearch plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/209541 [18:27:02] (03CR) 10Manybubbles: [C: 04-1] Upgrade Elasticsearch plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/209541 (owner: 10Manybubbles) [18:27:25] (03CR) 10Manybubbles: "No merging until we're ready to deploy. We can cherry pick into beta." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/209541 (owner: 10Manybubbles) [18:37:31] (03PS2) 10Mjbmr: Enable ShortUrl on es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206737 (https://phabricator.wikimedia.org/T96668) (owner: 10Dereckson) [18:41:45] Really :/ [18:45:11] (03PS1) 10Mjbmr: Merge "Enable Extension:Shorturl on sa. projects" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209550 [18:45:56] (03Abandoned) 10Mjbmr: Merge "Enable Extension:Shorturl on sa. 
projects" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209550 (owner: 10Mjbmr) [18:54:22] (03PS2) 10Aaron Schulz: Set $wgActivityUpdatesUseJobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206862 (https://phabricator.wikimedia.org/T91284) [18:54:25] (03CR) 10jenkins-bot: [V: 04-1] Set $wgActivityUpdatesUseJobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206862 (https://phabricator.wikimedia.org/T91284) (owner: 10Aaron Schulz) [18:59:02] (03PS1) 10coren: Creat tc class analogous to ferm for traffic control [puppet] - 10https://gerrit.wikimedia.org/r/209558 [18:59:49] (03CR) 10jenkins-bot: [V: 04-1] Creat tc class analogous to ferm for traffic control [puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren) [19:01:22] (03PS2) 10coren: Creaet tc class analogous to ferm for traffic control [puppet] - 10https://gerrit.wikimedia.org/r/209558 [19:02:00] bblack: I'd appreciate a quick once-over to this if you have a few minutes ^^ [19:04:36] greg-g: can I deploy two CA patches so I can start renaming invalid usernames? patches are https://gerrit.wikimedia.org/r/209538 and https://gerrit.wikimedia.org/r/209539 [19:05:51] legoktm: sure thing [19:06:14] thanks [19:09:50] (03PS1) 10Andrew Bogott: Add labvirt1007, 1008, 1009 to the scheduler pool. [puppet] - 10https://gerrit.wikimedia.org/r/209568 [19:10:31] (03CR) 10jenkins-bot: [V: 04-1] Add labvirt1007, 1008, 1009 to the scheduler pool. [puppet] - 10https://gerrit.wikimedia.org/r/209568 (owner: 10Andrew Bogott) [19:16:54] !log legoktm Synchronized php-1.26wmf5/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/209538 and https://gerrit.wikimedia.org/r/209539 (duration: 00m 16s) [19:17:03] Logged the message, Master [19:17:50] !log legoktm Synchronized php-1.26wmf4/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/209538 and https://gerrit.wikimedia.org/r/209539 (duration: 00m 16s) [19:17:56] Logged the message, Master [19:59:56] (03PS1) 10Ori.livneh: Use optipng to shave a few bytes from chrome images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209613 [19:59:58] (03PS1) 10Ori.livneh: Move images/ to w/static/images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209614 [20:00:00] (03PS1) 10Ori.livneh: Add optimized project logos to static/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209615 [20:00:02] (03PS1) 10Ori.livneh: Set project logos to logos added in I8c9a6a567 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209616 [20:01:20] (03CR) 10Ori.livneh: [C: 032] Use optipng to shave a few bytes from chrome images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209613 (owner: 10Ori.livneh) [20:01:55] (03CR) 10Ori.livneh: [C: 032] Move images/ to w/static/images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209614 (owner: 10Ori.livneh) [20:03:14] (03CR) 10Ori.livneh: [V: 032] Use optipng to shave a few bytes from chrome images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209613 (owner: 10Ori.livneh) [20:03:33] (03CR) 10Ori.livneh: [V: 032] Move images/ to w/static/images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209614 (owner: 10Ori.livneh) [20:03:38] (03CR) 10Ori.livneh: [C: 032 V: 032] Add optimized project logos to static/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209615 (owner: 10Ori.livneh) [20:08:42] (03CR) 10Ori.livneh: [C: 032 V: 032] Set project logos to logos added in I8c9a6a567 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209616 (owner: 10Ori.livneh) [20:12:21] (03PS1) 10Ori.livneh: Fix-up for 
I1fcb3f17d: correct logo path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209617 [20:12:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I1fcb3f17d: correct logo path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209617 (owner: 10Ori.livneh) [20:13:43] !log Trebuchet fetch for scap/scap failed on mw1222.eqiad.wmnet [20:13:51] Logged the message, Master [20:14:11] !log Trebuchet checkout failed for scap/scap on mw1222.eqiad.wmnet, mw1113.eqiad.wmnet, mw1104.eqiad.wmnet [20:14:17] Logged the message, Master [20:14:33] !log updated scap to 5d681af (Better handling for php lint checks) [20:14:42] Logged the message, Master [20:17:58] !log ori Synchronized wmf-config: I3846e34ed, I1fcb3f17d, I8c9a6a567, I1a73c83f7, and Iacbd92931: serve optimized, cacheable logos from /static (duration: 00m 19s) [20:18:12] Logged the message, Master [20:27:07] 10Ops-Access-Requests, 6operations: Grant deployment access for beta cluster - https://phabricator.wikimedia.org/T98523#1270389 (10Jdouglas) 3NEW [20:32:22] Hi ops, stat1002 - to which I login everyday - is kicking me out Permission denied (publickey) now. This happened once before and wizard ottomata not doing anything but stare at logs on his end fixed it. Help? [20:34:51] * yuvipanda puts on his robe and wizard hat [20:35:01] I kid I kid. I'm putting on socks to go to the office [20:35:17] In case this helps: I use stat1003 and never had a problem [20:35:27] madhuvishy: try again. i'm watching the logfile [20:35:51] mutante: just did [20:36:04] RoanKattouw: that kicks me out too [20:36:09] madhuvishy: it looks like you are not getting to stat1002 [20:36:16] RoanKattouw: stat1003 doesn't have all the hadoop access I think [20:36:33] mutante: oh? [20:36:35] !log renaming users with invalid usernames (https://phabricator.wikimedia.org/T5507) [20:36:47] Oh, I see [20:36:51] morebots: ? [20:36:52] Logged the message, Master [20:36:54] I am a logbot running on tools-exec-1203. [20:36:54] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [20:36:54] To log a message, type !log . [20:36:56] ok :P [20:37:07] Yeah I was just using it to get to the EventLogging DB tables [20:37:11] madhuvishy: how do you connect to it? what's your commandline [20:37:31] mutante: ssh stat1002.eqiad.wmnet [20:37:40] mutante: also tried with madhuvishy@ [20:37:50] madhuvishy: so the "failed public key" is already at the bastion host [20:38:36] * madhuvishy pretends to understand what that means [20:39:21] mutante: ummmm [20:39:25] madhuvishy: so it's a 2 step connection right. first it connects to bast1001.wikimedia.org which has a public IP and from there it connect to stat1002. the problem is already at step 1 [20:39:37] mutante: oh right [20:39:42] madhuvishy: did you load your ssh key into an agent? like ssh-add ? [20:40:52] mutante: I don't think I did [20:41:26] mutante: i did for my normal one. but not the one i use for prod. [20:41:50] madhuvishy: that would explain it, do you have the "ssh-add" command? load mviswanathan@mviswanthansMBP.corp.wikimedia.org [20:42:33] madhuvishy: or alternatively you can try: ssh -i /path/to/that/key stat1002.eqiad.wmnet [20:42:38] mutante: yay that helped [20:42:42] madhuvishy: :) [20:42:52] mutante: so why was it working all this while [20:44:13] Ubuntu has a magical key agent that is supposed to auto-detect and auto-add keys, but one in every ~30 times it mysteriously doesn't work [20:44:16] madhuvishy: it was still loaded in the agent probably [20:44:50] RoanKattouw: mutante aah. alright. 
thanks much! [20:45:21] ah:) yea, i don't even have auto things, i manually add it (once per day) [20:46:17] I have a passphrase on my key so it's easy to tell when the agent is auto-adding: it opens a dialog asking me for the passphrase [20:47:16] My keys time out of the agent every 30 minutes so I get to type my passphrases a lot :) [20:47:22] I've never had issues with Ubuntu's ssh keys [20:53:53] !log Updated kibana to bb9fcf6 (Merge remote-tracking branch 'upstream/kibana3') [20:53:54] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/209568 (owner: 10Andrew Bogott) [20:54:02] Logged the message, Master [20:55:11] (03CR) 10Andrew Bogott: [C: 032] Add labvirt1007, 1008, 1009 to the scheduler pool. [puppet] - 10https://gerrit.wikimedia.org/r/209568 (owner: 10Andrew Bogott) [21:01:24] !log dumps are interrupted on snapshot1004 while I do a manual run for testing/debugging purposes. please let it run and don't start any other processes on the box, thanks [21:01:30] Logged the message, Master [21:02:26] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [21:02:50] (Cannot access the database: Can't connect to MySQL server on '10.64.16.29' (4) (10.64.16.29)) [21:02:53] on https://meta.wikimedia.org/w/index.php?title=User:Ori_Livneh/global.js&action=submit [21:02:59] springle: ^ [21:03:16] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [21:07:25] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1270522 (10Multichill) I probably signed >>! In T87097#1262387, @Dzahn wrote: > @multichill > > please try viewing that document again. after talking with chasemp i added you to the follo... [21:11:16] <_joe_> ori: I think sean has a bug for that [21:11:30] <_joe_> ori: HHVM sets the connection timeout for mysql to 1 second [21:12:07] <_joe_> lemme find it, I had no time to work on it because I was busy hating puppet [21:12:17] <_joe_> you may have time to take a look [21:13:03] <_joe_> https://phabricator.wikimedia.org/T98489 [21:25:55] !log deployed RESTBase 6043e3ada (v0.6.2) [21:26:02] Logged the message, Master [21:26:21] (03PS1) 10Yuvipanda: tools: Don't spam toolschecker log with qstat output [puppet] - 10https://gerrit.wikimedia.org/r/209638 [21:26:26] (03CR) 10jenkins-bot: [V: 04-1] tools: Don't spam toolschecker log with qstat output [puppet] - 10https://gerrit.wikimedia.org/r/209638 (owner: 10Yuvipanda) [21:26:59] _joe_: aha, thanks [21:27:14] _joe_: any news re: hhvm upgrade? [21:27:16] (03PS2) 10Yuvipanda: tools: Don't spam toolschecker log with qstat output [puppet] - 10https://gerrit.wikimedia.org/r/209638 [21:27:55] <_joe_> ori: it's on apt.w.o, tried on the imagescaler which has memory consumption issues in convert(1) for some images [21:28:18] <_joe_> probably we should raise the memory limit in limits.sh a bit [21:28:23] * ori nods [21:28:29] what about the other apaches? [21:28:43] <_joe_> still on 3.3.1 [21:28:56] <_joe_> tomorrow I was planning to upgrade the canaries, but [21:29:15] <_joe_> a lot of instability and strange errors lately [21:29:30] <_joe_> better not to add another variable to the equation [21:34:49] ok [21:39:25] PROBLEM - salt-minion processes on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:35] PROBLEM - dhclient process on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
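Going back to the Permission denied (publickey) troubleshooting above, a minimal sketch of the two fixes mutante suggested: load the production key into the agent, or point ssh at it explicitly. The key path is an assumption:
```
ssh-add ~/.ssh/id_rsa_wmf_prod        # hypothetical key file
ssh-add -l                            # list what the agent currently holds
ssh -i ~/.ssh/id_rsa_wmf_prod stat1002.eqiad.wmnet
```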
[21:39:36] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:33] what the heck? [21:41:20] huh [21:41:34] (03CR) 10Yuvipanda: [C: 032] tools: Don't spam toolschecker log with qstat output [puppet] - 10https://gerrit.wikimedia.org/r/209638 (owner: 10Yuvipanda) [21:41:36] Right now I’m rsyncing a giant file to labvirt1008. [21:41:41] But that shouldn’t be causing… that. [21:41:54] any chnce you're hogging all available bandwidth on that host? :-P [21:42:01] sure looks that way [21:42:04] although the rsync is throttled [21:42:04] :-D [21:42:15] maybe the throttle isn'tdoing what you expected [21:43:43] this does not look to me like a network bandwidth thing: https://dpaste.de/V4oG [21:44:09] andrewbogott: What is your source and destination? [21:44:23] virt1008 -> labvirt1008 [21:44:26] ruh roh [21:44:41] Coren, see the link I just pasted? [21:45:10] wait times on labvirt1008 are through the damn roof [21:46:20] that could be helped along by the rsync all right [21:46:46] It could just be the copy, but if so I don’t know why I haven’t been seeing things like this all week [21:46:55] This this is like the 100th instance I’ve rsynced this way [21:48:06] andrewbogott: Lemme check, there may be something fishy on the host. [21:48:17] I killed the rsync, we’ll see if it recovers [21:48:23] certain amountof swapping it looks like [21:48:57] [194941.766087] kvm: zapping shadow pages for mmio generation wraparound [21:49:16] I definitely don’t know what that is [21:50:17] First time I see it, but it preceeds all the kernel hung tasks [21:51:53] It might be recovering, hard to tell [21:52:08] I guess icinga doesn’t think so [21:53:26] I don't think so either [21:53:55] 6operations, 10Wikimedia-Mailing-lists: Create an alias for mailman list - https://phabricator.wikimedia.org/T98415#1270648 (10Dzahn) @Tfinc @JGulingan It's possible and i added the alias on the mail server. +# T98415 +search: wikimedia-search@lists.wikimedia.org The second step was having to add it as an... [21:54:03] 6operations, 10Wikimedia-Mailing-lists: Create an alias for mailman list - https://phabricator.wikimedia.org/T98415#1270649 (10Dzahn) a:3Dzahn [21:55:05] 6operations, 10Wikimedia-Mailing-lists: Create an alias for mailman list - https://phabricator.wikimedia.org/T98415#1270655 (10Dzahn) 5Open>3Resolved Without also changing the list config the messages would be held for moderation with reason "message has implicit destination" docs about that are here: ht... [21:55:36] RECOVERY - salt-minion processes on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:56:23] huh [21:57:17] apergos, Coren, I restarted the instance migration but to labvirt1003. If it freaks out in the same way then I guess we’ll learn… something? [21:57:23] But I don’t really suspect the migration. [21:57:59] it just might be pushing something unappy over the edge is all [21:58:19] maybe. That server should be happy, though, everything’s green in ganglia [21:58:20] Or, was. [21:59:06] RECOVERY - dhclient process on labvirt1008 is OK: PROCS OK: 0 processes with command name dhclient [21:59:18] well I'm camped on syslog though I won't be around that much longer [21:59:22] it is 1 am here [21:59:44] wait is still super high [22:00:20] so how is swap use? still high? [22:01:31] 43G 39G 3.9G [22:01:42] Is that what you were seeing before? 
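A few hedged ways to answer "who's gobbling all the memory" and to check the swap/io picture on a host like this; ps_mem.py is the script linked just above, the rest is stock procps/sysstat tooling:
```
ps -eo pid,rss,comm --sort=-rss | head -20   # biggest resident-set consumers
free -g                                      # the 43G/39G/3.9G style summary quoted earlier
vmstat 5 3                                   # watch the si/so (swap) and wa (iowait) columns
sudo python ps_mem.py                        # per-program totals, after fetching the linked script
```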
[22:01:48] yep [22:02:06] I don't know what it is on a good day of course [22:02:10] but jesus that's a lot [22:02:16] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [22:02:26] How can I tell who’s gobbling all the memory? Everything looks normal in ‘top' [22:02:44] I tried M and I saw a whole lot o nothin [22:02:50] labvirt1001 which is (in theory) busier is 43G 31G 12G [22:03:18] huh [22:03:20] andrewbogott: https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py [22:03:53] thx [22:03:55] trying... [22:04:00] syslog is still quiet, it's not bad off enough to fall over [22:07:15] PROBLEM - salt-minion processes on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:16] PROBLEM - dhclient process on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:25] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:08:19] andrewbogott: afaict, you may just have been really badlucky - I see a lot of kvms on 1008 that are pegging their cpus. [22:08:53] Two of them are at 100^ pretty much permanently. [22:08:56] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [22:09:02] 100% isn’t so much though… [22:09:51] Sure, but right now there are easily a half dozen working hard. If you had 2-3 times that for a while that might have made things hairy for a while. [22:10:14] But it looks okay since you stopped the rsync so it's hard to tell what went on before. [22:10:15] RECOVERY - salt-minion processes on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:10:27] maybe… I’m trying to understand the icinga graph though. I’ve often seen hosts with cpu usage up around 80% at length, no problems. [22:10:30] This one is running more like 20% [22:10:40] but then there’s all this additional ‘wait' [22:12:06] RECOVERY - dhclient process on labvirt1008 is OK: PROCS OK: 0 processes with command name dhclient [22:12:34] in the meantime I don't see anything more odd from kernel or syslog [22:12:52] an I'm not sure those mmio messages are really an indicator, we see them sporaically for several days before as well [22:13:07] sorry for the typos, really oughta think about bed soon [22:13:15] This box has 48 cores. A few things running at 100% really shouldn’t matter. [22:13:40] And it doesn’t look ok to me… ganglia is still ugly and my shell there hesitates a lot [22:13:48] yeah my shell as well [22:13:55] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:13:58] which says to me swap and/or io [22:14:01] meh [22:14:23] lemme suspend a few instances and see if things mellow out [22:14:33] andrewbogott: can i take a peek too? [22:14:39] ori: yes please [22:14:54] great, you tke my spot, I wasn't doing much with it anyways :-) [22:15:17] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [22:15:29] andrewbogott: which host? 
[22:15:35] ori: labvirt1008 [22:17:06] (03PS1) 10Ottomata: Adjust Hadoop memory settings [puppet] - 10https://gerrit.wikimedia.org/r/209642 [22:19:57] (03CR) 10Kaldari: [C: 04-1] Import lists for the Browse experiment on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: 10Phuedx) [22:21:56] PROBLEM - salt-minion processes on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:09] PHP Notice: Undefined index: SERVER_NAME in /srv/mediawiki/wmf-config/CommonSettings.php on line 199 [22:23:22] why does that come up in eval.php labswiki but not other wikis? [22:23:31] ori: finding anything interesting? [22:23:46] PROBLEM - dhclient process on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:49] andrewbogott: i can't get anything to run, the system is in overload [22:23:59] yeah [22:24:29] it has no free memory [22:25:40] maybe I made it worse by trying to suspend instances... [22:26:13] how many kvm instances is it supposed to have? [22:26:32] maybe… 80? [22:27:11] 10Ops-Access-Requests, 6operations: Grant access to stat1002 and stat1003 - https://phabricator.wikimedia.org/T98536#1270765 (10Jdouglas) 3NEW [22:27:37] 10Ops-Access-Requests, 6operations: Grant access to stat1002 and stat1003 - https://phabricator.wikimedia.org/T98536#1270775 (10Jdouglas) p:5Triage>3Normal [22:27:55] ori: I’m noting that due to the magic of ‘dist-upgrade’ that box and labvirt1009 have a slightly different kernel than the other labvirts. [22:28:01] Hard to know if that matters. [22:28:26] RECOVERY - dhclient process on labvirt1008 is OK: PROCS OK: 0 processes with command name dhclient [22:28:57] andrewbogott: sorry was in a meeting [22:29:01] * yuvipanda reads backlog [22:29:55] RECOVERY - salt-minion processes on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:30:01] ori: I’m about to reboot it. [22:30:09] andrewbogott: hang on [22:30:28] ori: There were patches available to the kernel it’s running now, so I’m doing a dist-upgrade again. 
[22:30:30] But, hanging on :) [22:31:16] andrewbogott: kk, go for it [22:31:20] thanks [22:31:32] waiting for super-slow dist-upgrade to run first… [22:32:12] (03PS1) 10Dzahn: datasets: create directory for dataset rsyncing [puppet] - 10https://gerrit.wikimedia.org/r/209644 [22:32:57] (03CR) 10jenkins-bot: [V: 04-1] datasets: create directory for dataset rsyncing [puppet] - 10https://gerrit.wikimedia.org/r/209644 (owner: 10Dzahn) [22:33:34] 22:32:16 stderr: error: object file .git/objects/eb/2474568ba4663702ed2927a4ba715f61bedd1f is empty [22:33:45] 22:32:16 fatal: loose object eb2474568ba4663702ed2927a4ba715f61bedd1f (stored in .git/objects/eb/2474568ba4663702ed2927a4ba715f61bedd1f) is corrupt [22:34:21] 22:32:16 java.io.IOException: remote file operation failed: /mnt/jenkins-workspace/workspace/operations-puppet-tox-py27 at hudson.remoting.Channel@14bf2cd3:integration-slave-precise-1012: java.io.IOException: Could not fetch from any repository [22:35:10] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/209644 (owner: 10Dzahn) [22:36:03] and then it's ok again [22:36:26] (03PS2) 10Dzahn: datasets: create directory for dataset rsyncing [puppet] - 10https://gerrit.wikimedia.org/r/209644 [22:37:03] (03CR) 10Dzahn: [C: 032] datasets: create directory for dataset rsyncing [puppet] - 10https://gerrit.wikimedia.org/r/209644 (owner: 10Dzahn) [22:37:27] PROBLEM - puppet last run on mw1201 is CRITICAL Puppet has 1 failures [22:38:21] !log rebooting labvirt1008, running dist-upgrade, rebooting again [22:38:29] Logged the message, Master [22:40:12] (03CR) 10BBlack: "Seems like a basically-sane approach in general. Some thoughts:" [puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren) [22:41:06] heh, labvirt1008 says ‘The system is going down for reboot NOW!’ [22:41:10] and then… doesn’t reboot [22:41:13] time for mgmt I guess [22:41:26] PROBLEM - salt-minion processes on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:06] PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100% [22:47:26] RECOVERY - Host labvirt1008 is UPING OK - Packet loss = 0%, RTA = 5.05 ms [22:47:46] RECOVERY - salt-minion processes on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:48:46] (03CR) 10Ottomata: "Hm, I need to think about this a little more. The current patch will cause more containers to be allocated to the newer DataNodes that th" [puppet] - 10https://gerrit.wikimedia.org/r/209642 (owner: 10Ottomata) [22:51:55] PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100% [22:53:36] RECOVERY - puppet last run on mw1201 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:53:47] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1270861 (10bd808) a:3fgiunchedi @akosiaris and/or @fgiunchedi: would either of you be interested in helping me with this? I'd love to move T90892 forward. [22:55:06] RECOVERY - Host labvirt1008 is UPING OK - Packet loss = 0%, RTA = 1.56 ms [22:56:58] PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail [22:57:01] I'm going to put up a Flow patch for SWAT. 
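When labvirt1008 printed "The system is going down for reboot NOW!" and then sat there, the fallback was the out-of-band management interface. A hedged sketch of that path, assuming the management controller speaks IPMI and follows a "<host>.mgmt" naming convention; in practice it may instead be a DRAC/iLO serial console:

```bash
# Power-cycle a wedged host via its management controller.
# The hostname pattern and credentials below are placeholders, not confirmed values.
MGMT=labvirt1008.mgmt.eqiad.wmnet   # assumed naming convention

ipmitool -I lanplus -H "$MGMT" -U root -P "$IPMI_PASSWORD" chassis power status
ipmitool -I lanplus -H "$MGMT" -U root -P "$IPMI_PASSWORD" chassis power cycle

# Once the host is back, finish the interrupted kernel update and reboot again.
ssh labvirt1008 'sudo apt-get update && sudo apt-get -y dist-upgrade && sudo reboot'
```

The corrupt loose object on integration-slave-precise-1012, for comparison, is usually cured by deleting the empty file under .git/objects (or wiping the workspace clone) and letting the next fetch re-download it, which is consistent with the job recovering on "recheck".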
I was trying to merge it first so I could put the bump on the deployment page, but Jenkins is not cooperating: [22:57:02] Building remotely on integration-slave-trusty-1011 (phpflavor-hhvm contintLabsSlave UbuntuTrusty) in workspace /mnt/jenkins-workspace/workspace/mwext-Flow-qunit [22:57:06] https://integration.wikimedia.org/ci/job/mwext-Flow-qunit/5878/console [22:57:17] It's been hung on that line for almost 20 minutes. [22:57:40] matt_flaschen: ;/ [22:58:47] !log restarting all instances on labvirt1008, crossing fingers [22:58:55] Logged the message, Master [22:59:21] Do we list the public entrance IPs for the sites anywhere? (if someone connected to the site they would connect to 'x' ip ) [22:59:44] Jamesofur|cloud: zero used to maintain such a list. they still probably do [23:00:04] RoanKattouw, ^d, rmoen: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150507T2300). Please do the needful. [23:00:21] Ok. I'm going for the Gather updates [23:00:31] thanks yuvipanda [23:01:15] RoanKattouw: I got this [23:01:30] rmoen: Thanks man [23:08:47] First Flow: https://gerrit.wikimedia.org/r/209654 [23:10:10] Second: https://gerrit.wikimedia.org/r/209655 [23:12:18] Added to Deployments page [23:14:37] Coren, yuvipanda, labvirt1008 is coming back up and looks ok for the moment. I need to go right now… I’ll try to do a bit more balancing later on this evening. [23:14:50] Hopefully things will hold together in the meantime :) [23:14:56] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:15:01] ok [23:17:15] can you guys do https://gerrit.wikimedia.org/r/206737/ [23:18:39] Added to Deployments page [23:19:29] 20.6M [23:20:51] !log rmoen Synchronized php-1.26wmf5/extensions/Gather/: Update Gather with Cherry-picks (duration: 00m 15s) [23:21:02] Logged the message, Master [23:22:06] (03CR) 10Mattflaschen: "MediaWiki-Vagrant uses Parsoid without RESTBase. I don't think that's the reason VE is slow to load, though." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: 10Mattflaschen) [23:22:33] (03CR) 10Mattflaschen: "AT least, that's apparently part of it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: 10Mattflaschen) [23:28:17] rmoen: Argh, can I get a last minute SWAT slot? [23:28:28] lol, maybe [23:28:39] Really it depends on jenkins [23:28:50] ;/ [23:29:14] rmoen: Yeah, I know. :-( [23:29:23] * James_F will make the MW-core patches for you. [23:29:31] James_F: ty [23:31:34] !log rmoen Synchronized php-1.26wmf4/extensions/Gather: Update Gather with cherry-picks (duration: 00m 14s) [23:31:36] rmoen: We'll see where we are. [23:31:41] Logged the message, Master [23:32:24] ok now Flow [23:32:37] * James_F glares at Jenkins. [23:33:02] matt_flaschen: merging wmf5 submodule update.. waits for jenkins [23:35:48] so submodule updates are ok to verify correct? [23:37:18] rmoen: https://gerrit.wikimedia.org/r/209661 (wmf5) and https://gerrit.wikimedia.org/r/209660 (wmf4) [23:37:23] (adding to calendar now) [23:37:57] matt_flaschen: should be on test wiki [23:38:16] James_F: ok [23:38:46] rmoen, okay. It's a script that we're going to run on all wikis. I'm probably not going to test it separately on testwiki. [23:39:00] matt_flaschen: ok so go for wmf4 ? 
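The "Synchronized php-1.26wmf5/extensions/Gather/ ... (duration ...)" entries above are the visible end of the SWAT routine: merge the submodule bump, pull it on the deployment host, then push the directory out. A sketch of the middle steps, assuming sync-dir is the scap command producing those !log lines:

```bash
# On the deployment host, from the staging copy of the branch.
cd /srv/mediawiki-staging/php-1.26wmf5
git pull
git submodule update --init extensions/Gather

# Push just the changed directory to the cluster and log it.
sync-dir php-1.26wmf5/extensions/Gather 'Update Gather with cherry-picks'
```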
[23:39:05] Yeah [23:39:46] !log rmoen Synchronized php-1.26wmf5/extensions/Flow: Bump Flow with cherry-picks (duration: 00m 14s) [23:39:52] Logged the message, Master [23:41:44] !log rmoen Synchronized php-1.26wmf4/extensions/Flow/: Bump flow with cherry-picks (duration: 00m 13s) [23:41:51] Logged the message, Master [23:43:14] Thanks. [23:44:31] (03CR) 10Robmoen: [C: 032] Enable ShortUrl on es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206737 (https://phabricator.wikimedia.org/T96668) (owner: 10Dereckson) [23:45:36] Mjbmr: waiting for gerrit [23:45:47] ok [23:46:00] moving on to ve changes while i wait [23:46:16] Whee. [23:46:41] (03PS1) 10Yuvipanda: tools: Fix broken continuous job checks [puppet] - 10https://gerrit.wikimedia.org/r/209663 [23:46:46] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix broken continuous job checks [puppet] - 10https://gerrit.wikimedia.org/r/209663 (owner: 10Yuvipanda) [23:46:54] (03PS2) 10Yuvipanda: tools: Fix broken continuous job checks [puppet] - 10https://gerrit.wikimedia.org/r/209663 [23:47:02] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix broken continuous job checks [puppet] - 10https://gerrit.wikimedia.org/r/209663 (owner: 10Yuvipanda) [23:48:46] James_F: Should be on test wiki [23:49:17] (03Merged) 10jenkins-bot: Enable ShortUrl on es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206737 (https://phabricator.wikimedia.org/T96668) (owner: 10Dereckson) [23:50:11] rmoen: Testing now [23:50:16] ok [23:51:00] rmoen: Yup, looks good. [23:51:01] ok [23:51:04] (03PS1) 10Yuvipanda: tools: Fix redirecting stderr properly [puppet] - 10https://gerrit.wikimedia.org/r/209665 [23:51:09] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix redirecting stderr properly [puppet] - 10https://gerrit.wikimedia.org/r/209665 (owner: 10Yuvipanda) [23:51:13] (03PS2) 10Yuvipanda: tools: Fix redirecting stderr properly [puppet] - 10https://gerrit.wikimedia.org/r/209665 [23:51:14] !log rmoen Synchronized php-1.26wmf5/extensions/VisualEditor/: Update VE for cherry-picks (duration: 00m 11s) [23:51:21] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix redirecting stderr properly [puppet] - 10https://gerrit.wikimedia.org/r/209665 (owner: 10Yuvipanda) [23:51:26] Logged the message, Master [23:51:45] rmoen: are you going to do the schema change? [23:51:53] ... [23:51:59] what schema change ? [23:52:08] the ShortUrl [23:52:14] ;/ [23:52:19] uh... [23:52:23] mwscript sql.php --wiki=eswikibooks /srv/mediawiki-staging/php-1.26wmf4/extensions/ShortUrl/schemas/shorturls.sql [23:52:24] probably not [23:52:31] 1 sec [23:53:29] also must run populateShortUrlTable.php [23:53:52] Mjbmr: I'm not sure i feel up to it [23:53:57] never done those things [23:54:06] Might have to call for reinforcements [23:54:24] well, RoanKattouw was here. [23:54:55] !log rmoen Synchronized php-1.26wmf4/extensions/VisualEditor/: Update VE with Cherry-picks (duration: 00m 12s) [23:55:01] Logged the message, Master [23:55:02] RoanKattouw: Would you mind helping this kind sir? I'm not sure I'm the best for this [23:55:40] I'd like to save my first schema change for a more open window [23:56:12] James_F: all done [23:56:12] Sorry [23:56:21] I wasn't paying attention to IRC [23:56:22] rmoen: Thanks. Tested in prod. Looks good. 
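The schema step being asked about above is just applying the extension's table definition to the one wiki before the config change makes the code look for it. A sketch of that step plus a quick check, assuming the cluster's `sql` wrapper around the mysql client is available on the maintenance host:

```bash
# Create the shorturls table on es.wikibooks (command as quoted above).
mwscript sql.php --wiki=eswikibooks \
    /srv/mediawiki-staging/php-1.26wmf4/extensions/ShortUrl/schemas/shorturls.sql

# Sanity check that the table now exists (the `sql` helper is an assumption).
echo 'SHOW TABLES LIKE "shorturls";' | sql eswikibooks
```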
[23:56:22] Catching up [23:56:28] thanks RoanKattouw [23:56:31] Oh yes [23:56:40] Yeah do what Mjbmr says [23:56:58] I did the exact same thing yesterday on a different wiki, it's super quick [23:57:25] [16:52] Mjbmr mwscript sql.php --wiki=eswikibooks /srv/mediawiki-staging/php-1.26wmf4/extensions/ShortUrl/schemas/shorturls.sql [23:57:31] That's harmless for sure, just creates a table [23:57:34] So i had no idea about schema changes [23:57:42] just saw the config change [23:57:53] ok [23:58:17] Then deploy the config change [23:58:26] ok [23:58:36] Then do mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=eswikibooks [23:58:46] run this from mediawiki-staging ? [23:58:50] Which will not be instantaneous but in my experience it's pretty quick [23:58:52] Doesn't matter where from [23:59:07] ok [23:59:14] mwscript knows where to look, it's magic
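Putting RoanKattouw's ordering together — table first (previous sketch), then the config change, then the backfill — the remaining two steps look roughly like this. sync-file and the exact wmf-config file carrying the ShortUrl switch are assumptions; populateShortUrlTable.php is quoted from the log:

```bash
# 2) Push the merged config change that enables ShortUrl on es.wikibooks.
#    (sync-file and the file name are assumptions; use whatever file the
#    merged Gerrit change actually touched.)
cd /srv/mediawiki-staging
sync-file wmf-config/InitialiseSettings.php 'Enable ShortUrl on es.wikibooks (T96668)'

# 3) Backfill short URLs for existing pages; mwscript resolves the right
#    MediaWiki version for the wiki, so a relative extension path is enough.
mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=eswikibooks
```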