[00:00:03] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1846491 (10BBlack) [00:00:03] 7Blocked-on-Operations, 7Varnish: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1846492 (10BBlack) [00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151203T0000). [00:34:09] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1846621 (10BBlack) [01:09:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [01:15:25] (03PS1) 10Ori.livneh: Unify evaluate_cookie and evaluate_cookie_mobile [puppet] - 10https://gerrit.wikimedia.org/r/256617 [01:16:12] (03PS1) 10BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - 10https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040) [01:21:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [01:29:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [01:30:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [01:35:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [01:40:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [01:41:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [02:25:06] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 09m 54s) [02:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:36] (03PS4) 10Yuvipanda: k8s: switch to using systems' CA [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [02:51:19] that's a new message for the l10nupdate bot [03:16:12] 6operations, 6Labs: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1846901 (10yuvipanda) 3NEW [03:24:34] (03PS3) 10Bmansurov: Enable RelatedArticles and Cards on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) [03:24:37] (03CR) 10Bmansurov: Enable RelatedArticles and Cards on the Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [03:32:22] (03CR) 10BBlack: [C: 04-1] "It's actually loading fine in production, and I don't think we want to make this declarative yet anyways, as there's more cookie-code chan" [puppet] - 10https://gerrit.wikimedia.org/r/256617 (owner: 10Ori.livneh) [03:46:05] (03PS1) 10Yuvipanda: labs: Use the https origin instead of ssh [puppet] - 10https://gerrit.wikimedia.org/r/256623 [03:59:16] (03PS1) 10Yuvipanda: puppet: Kill puppet::self::geoip [puppet] - 10https://gerrit.wikimedia.org/r/256624 (https://phabricator.wikimedia.org/T120159) [03:59:39] (03PS2) 10Yuvipanda: puppet: Kill 
puppet::self::geoip [puppet] - 10https://gerrit.wikimedia.org/r/256624 (https://phabricator.wikimedia.org/T120159) [04:03:35] (03CR) 10Yuvipanda: [C: 032] labs: Use the https origin instead of ssh [puppet] - 10https://gerrit.wikimedia.org/r/256623 (owner: 10Yuvipanda) [04:05:04] greg-g: yeah. I rewrote the sync process over the holiday weekend. https://phabricator.wikimedia.org/T119746 [04:05:18] (03CR) 10Yuvipanda: "I also used the now-working salt to run:" [puppet] - 10https://gerrit.wikimedia.org/r/256623 (owner: 10Yuvipanda) [04:07:01] yuvipanda: \o/ I always wondered what the point of that using ssh transport was [04:10:20] (03PS3) 10Yuvipanda: puppet: Kill puppet::self::geoip [puppet] - 10https://gerrit.wikimedia.org/r/256624 (https://phabricator.wikimedia.org/T120159) [04:10:28] bd808: yeah... [04:10:35] bd808: it's there to annoy me, it turns out [04:10:39] so one manifest down, 4 to go [04:10:45] well, kindof. let me actually kill the one [04:10:49] I think 'config' goes next [04:13:32] (03CR) 10Yuvipanda: [C: 032] puppet: Kill puppet::self::geoip [puppet] - 10https://gerrit.wikimedia.org/r/256624 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [04:15:16] that went well [04:15:21] so one down, 4 to go! [04:15:54] I also hate config classes [04:17:00] yuvipanda: me too! [04:17:12] "parameters, have you heard of them?" [04:22:23] ori: yeah [04:23:25] also the fact that both the server *and* the client are the same puppet role irritates me and causes a fuckton of problems too [04:24:23] * yuvipanda decrees that 1. self hosted puppetmaster conversion is one way (you can not go back) and 2. if you want to point your instance to another puppetmaster you will use a different role [04:24:27] so there should be three: [04:24:36] 1. self-hosted puppetmaster (master and client) [04:24:58] 2. a role for puppet master (that should basically be the same as prod puppetmasters with some variables) [04:25:07] 3. a hiera variable to figure out where you want to point your puppet to [04:30:29] looks like the 'client' role might be easier to kill [04:33:05] wow @ base::puppet::conf [04:38:41] base::puppet::params [04:42:39] (03PS1) 10Yuvipanda: puppet: Kill the client role [puppet] - 10https://gerrit.wikimedia.org/r/256625 [04:48:26] > [04:48:28] This cannot be Class, and there is no way to collect classes. 
[04:48:30] well done, puppet [04:48:32] hmm [04:48:43] now I either need to implement role based hiera lookup for labs [04:48:49] or hack around this [05:00:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [05:02:08] ugh [05:02:12] rabbit hole too deep [05:02:18] * yuvipanda ponders [05:02:56] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [150.0] [05:04:48] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [05:06:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [05:08:31] 7Puppet, 6operations, 6Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10yuvipanda) 3NEW [05:08:40] nope, can't kill that without ^ [05:32:20] don't do ittt [05:50:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Dec 3 05:50:08 UTC 2015 (duration 50m 7s) [05:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:56:27] PROBLEM - puppet last run on mw2032 is CRITICAL: CRITICAL: puppet fail [06:00:57] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1847113 (10Andrew) Yes, correct. https://gerrit.wikimedia.org/r/#/c/253366/ [06:16:34] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1847133 (10GWicke) The decommission is now done. It might be worth going with XFS for the reimage, as discussed in T120004. [06:23:37] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 1 failures [06:23:58] RECOVERY - puppet last run on mw2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:24:43] ori: if you make a patch to get rid of the graphite alerts I'll +1 [06:29:05] _joe_ just added them; submitting a revert would be a bit underhanded [06:29:09] let's just try to convince him [06:29:57] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [06:30:03] ori: yeah that's why I said +1 rather than +2 :) [06:31:07] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:48] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:07] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures [06:34:36] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:06] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures [06:56:37] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:36] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:09:22] (03PS1) 10Giuseppe Lavagetto: puppet_ssldir: make the override logic more explicit [puppet] - 10https://gerrit.wikimedia.org/r/256636 [07:09:32] 
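
[editor's note: the three-role split yuvipanda proposes above (self-hosted master, standalone labs master, plain client pointed at a master via a hiera key), together with the role-based hiera lookup filed as T120165, could look roughly like the sketch below. This is an illustration written for this log, not code that was merged; every class name and the 'puppetmaster' hiera key are assumptions.]

    # Hypothetical sketch of the proposed split; names are invented for illustration.
    class role::puppet::self_master {
        # 1. self-hosted puppetmaster: master and client on the same instance
        include ::role::puppet::master
        class { '::base::puppet':
            server => $::fqdn,
        }
    }

    class role::puppet::master {
        # 2. a labs puppetmaster role, reusing the same module as production masters
        include ::puppetmaster
    }

    class role::puppet::client {
        # 3. a plain client; which master it talks to comes from a hiera key
        class { '::base::puppet':
            server => hiera('puppetmaster', 'puppet'),
        }
    }
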
<_joe_> yuvipanda: ^^ [07:09:36] <_joe_> try it on toollabs [07:09:52] <_joe_> and beware, you should first fix the puppet config [07:12:58] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1847214 (10RobLa-WMF) @ottomata: I don't know all of the details, but I think the ID idea is a good one. One //possible hitch//: according to the W3... [07:14:04] <_joe_> yuvipanda: I'll test it [07:15:34] ok thanks [07:15:45] I / you can modify the tools.pp to be not-that too [07:16:44] <_joe_> yuvipanda: do you have a salt master in toollabs? [07:16:48] nope [07:18:37] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:20:42] <_joe_> uhm still not good enough it appears, wtf [07:21:12] <_joe_> now it unfucks puppet but can't install the cert globally, ofc [07:21:55] _joe_: so my end goal with whatever I'm doing is that puppet certs are always in /var/lib/puppet/ssl [07:22:04] no variance based on self hosted puppetmaster or not [07:22:15] <_joe_> yuvipanda: anyways, leave it live for now, the way to unfuck puppet in tools is a) for i in /etc/puppet.conf /etc/puppet/puppet.conf.d/10-self.conf; do perl -i"" -pe 's/\/puppet\/server\/ssl/\/puppet\/client\/ssl/' $i; done [07:22:26] <_joe_> yuvipanda: that is not possible [07:22:29] why not [07:22:32] <_joe_> so just give up [07:22:39] <_joe_> it's a race conditon [07:22:51] <_joe_> think of when you create a machine in labs [07:23:01] <_joe_> it runs puppet against the central puppetmaster [07:23:16] <_joe_> and stores the cert and csr in /var/lib/puppet/ssl [07:23:28] <_joe_> then you transform your machine to a server [07:23:34] <_joe_> a puppetmaster [07:23:40] <_joe_> or you want to change puppetmaster [07:23:53] <_joe_> you cannot remove your own cert mid-flight through puppet [07:24:13] <_joe_> anyways, read what I wrote there ^^ [07:24:25] right, so one possible solution there is to require a manual step (akin to puppet cert signing) when you switch. [07:24:28] <_joe_> I guess you have a dsh group for toolabs? [07:24:43] I think the inconvenience of that one time thing is a possibly worthy tradeoff for the headache-fixing [07:24:48] <_joe_> yuvipanda: eh? so a solution to your difficulty is making user's experience worse? [07:24:50] _joe_: yaeh, but I tried salt today and it actually worked! [07:25:23] I was amazed, etc [07:25:31] _joe_: well, I'll keep trying until I find a way :) [07:25:37] <_joe_> did you remember not to do that on the puppetmaster? :P [07:25:47] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:26:07] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:26:37] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:26:46] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:27:12] <_joe_> yuvipanda: do you care much if I just trick toollabs into working instead of adding another workaround? [07:27:23] <_joe_> for the 4th different logic that is used [07:27:27] _joe_: nope, go ahead. 
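
[editor's note: _joe_'s recovery one-liner above, reflowed for readability. The quoted /etc/puppet.conf is presumably a chat typo for /etc/puppet/puppet.conf; the substitution just points ssldir references back at the client tree. Treat this as a sketch of the step being discussed, not a vetted runbook command.]

    # rewrite ssldir references from .../puppet/server/ssl to .../puppet/client/ssl
    for f in /etc/puppet/puppet.conf /etc/puppet/puppet.conf.d/10-self.conf; do
        perl -i -pe 's{/puppet/server/ssl}{/puppet/client/ssl}g' "$f"
    done
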
[07:27:28] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:36] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:27:41] <_joe_> I'll reason on how to standardize that [07:27:49] <_joe_> at least let's be wrong in a consistent way! [07:28:07] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:28:17] well, I *am* going to kill this path variance [07:28:28] in some form or other [07:28:37] let me see where this rabbit hole leads :) [07:28:55] <_joe_> I think you're going in the wrong hole [07:29:05] <_joe_> please ask & wait for my review [07:29:41] I will once I have patches up. I will not merge anything without review. [07:29:44] <_joe_> I think the right way to go is pretty different, but I have to do other things [07:29:46] well, anything related to this thing [07:30:01] <_joe_> yuvipanda: study puppet environments please [07:30:06] <_joe_> they could be of help [07:31:02] how is that related to removing module/puppet? [07:31:04] I just want base::puppet and modules/puppetmaster to exist [07:31:09] I will also read up on environments [07:31:24] <_joe_> that's a way to change things in an orderly manner [07:32:46] <_joe_> and to allow people for example to add their own classes from their own repo to their project without needing a self-hosted puppetmaster ;) [07:34:13] we'd still want to get rid of modules/puppet, right? :) and there are about 48 current instances (probably more) that use self hosted puppetmaster that we need to find some solution for [07:34:35] so all I'm trying to do is a noop refactor that provides the same functionality just without the puppet module :) [07:34:53] <_joe_> yuvipanda: read above, I think we could do things so that people don't need a self-hosted puppetmaster at all [07:35:22] _joe_: I agree, but what do you do with the instances that currently exist? [07:36:05] <_joe_> you tell people it's gonna get unmaintained on day X, and to move to environments all of their customizations [07:36:11] hahahahahahaha :) [07:36:17] if only... [07:36:31] <_joe_> you're seeing this the wrong way [07:36:39] <_joe_> you give people reasonable time [07:36:51] mediawiki_singlenode was deprecated about 2y ago [07:36:55] still has 45 instances lfet [07:36:57] *left [07:37:12] <_joe_> if they don't comply, you simply remove the classes and they will ahve broken instances that eventually will stop working [07:37:25] <_joe_> I think you guys should be more strict about this [07:37:41] <_joe_> anyways, off to my morning routine [07:38:23] ok, so the patch basically works, except I need to somehow 'rm -rf /var/lib/puppet/ssl'. now to figure out how to do that automatically [07:38:27] _joe_: cya! enjoy your fish! 
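
[editor's note: one plausible way to "do that automatically", as yuvipanda wonders just above, is to wipe the agent ssldir only when the managed puppet.conf changes, for example with a refreshonly exec. A minimal sketch, assuming puppet.conf is a File resource in the same catalog; it is illustrative and not what was eventually merged. In practice it would need an extra guard (for instance an unless check on the certificate issuer) so it does not fire on unrelated puppet.conf edits.]

    # purge stale agent certs only when the managed puppet.conf changes
    exec { 'purge-stale-puppet-ssldir':
        command     => '/bin/rm -rf /var/lib/puppet/ssl',
        refreshonly => true,
        subscribe   => File['/etc/puppet/puppet.conf'],
    }
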
[07:39:31] <_joe_> (that's friday) [07:39:39] bah, thought it was tuesday and thursday [07:41:21] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1847232 (10Smalyshev) a:5Smalyshev>3mark [07:54:11] (03PS2) 10Giuseppe Lavagetto: puppet_ssldir: make the override logic more explicit [puppet] - 10https://gerrit.wikimedia.org/r/256636 [07:57:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [150.0] [07:58:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [150.0] [07:58:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [07:58:17] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet has 1 failures [07:59:33] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1847265 (10Joe) btw, it should be noted that the LVM on those disks does not cover the whole free space as of now, so you still have ro... [08:02:43] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet_ssldir: make the override logic more explicit [puppet] - 10https://gerrit.wikimedia.org/r/256636 (owner: 10Giuseppe Lavagetto) [08:06:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [08:06:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [08:08:07] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0] [08:12:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [08:19:32] (03PS3) 10Hashar: Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [08:20:21] (03CR) 10Hashar: "I have reopenened T110607 and tweaked the commit message. "Revert "Revert Foo"" is messy :D" [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [08:25:57] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:27:17] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [08:29:41] 7Puppet, 6operations, 6Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847316 (10yuvipanda) p:5High>3Triage [08:29:44] (03PS1) 10Giuseppe Lavagetto: graphite::alerts::reqstats: improve alerts [puppet] - 10https://gerrit.wikimedia.org/r/256640 [08:32:22] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1847327 (10yuvipanda) Ok, so right now `role::puppet::self` can do one of 3 things: # Be 'self hosted puppetmaster' - you have a p... 
[08:32:32] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1847328 (10yuvipanda) a:3yuvipanda Ok, so right now `role::puppet::self` can do one of 3 things: # Be 'self hosted puppetmaster'... [08:33:39] _joe_: good morning. Whenever you have time, I could use a recommendation to vary base::service_unit depending on the target OS Ubuntu (upstart) vs Debian (systemd). [08:34:16] _joe_: I thought about varying with if os_version() else .. . Might not be up to our standards :D [08:34:32] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1847332 (10yuvipanda) Doing all these in a nice way should also allow us to unify the location of the SSL stuff for puppet. Right n... [08:34:59] hashar: the docstring for base::service_unit (in modules/base/manifests/service_unit.pp) explain how to do this [08:35:20] <_joe_> hashar: base::Service_unit does that for you [08:35:32] <_joe_> you just need to define all the initscripts [08:35:35] <_joe_> look at the code [08:37:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [08:37:38] * hashar is ashamed [08:37:45] I can't even read the doc nowadays :(( [08:37:50] thank you both yuvipanda and _joe_ ! [08:38:00] (03CR) 10Giuseppe Lavagetto: [C: 032] graphite::alerts::reqstats: improve alerts [puppet] - 10https://gerrit.wikimedia.org/r/256640 (owner: 10Giuseppe Lavagetto) [08:38:28] <_joe_> nice how I assumed my docs were shit [08:38:29] <_joe_> :P [08:41:53] euh... [08:42:05] i just got an exception when trying to log in on wikitech... [08:42:16] MediaWiki internal error. [08:42:16] Exception caught inside exception handler. [08:43:05] seems that I did succeed, i opened the site in a new tab, and i'm logged in, but still.. [08:44:43] (03CR) 10Alexandros Kosiaris: "Yup. Looks like that will work" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [08:46:02] thedj: https://dpaste.de/hFWH/raw [08:46:32] ori: i'll file a ticket [08:46:36] thanks [08:49:31] (03PS2) 10Yuvipanda: puppet: Use puppetmaster hiera variable directly [puppet] - 10https://gerrit.wikimedia.org/r/256625 [08:49:33] (03PS1) 10Yuvipanda: [WIP] / HACK: Enforce single ssldir for puppet [puppet] - 10https://gerrit.wikimedia.org/r/256642 [08:50:13] (03PS1) 10Hashar: xvfb: switch to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) [08:54:14] thedj: it's a dupe, waiting on ori or aaron to respond :/ https://phabricator.wikimedia.org/T117553#1819736 [08:56:48] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [08:57:28] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [08:58:14] legoktm: why is the underlying exception thrown, though? [08:58:46] something deep in LDAPAuthentication is passing an invalid parameter [08:59:03] but there are so many wfSuppressWarnings() calls, there's a good chance it's intentional? 
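
[editor's note: the pattern _joe_ and yuvipanda point hashar to above, a single base::service_unit resource plus one init file per init system, looks roughly like the sketch below. The authoritative reference is the docstring in modules/base/manifests/service_unit.pp; the parameter names and template paths here are recalled from memory and may differ from the real define. hashar's actual changes are Gerrit 256643 (switch xvfb to base::service_unit) and 256659 (systemd support), which appear later in this log.]

    # sketch: one resource, with per-init-system templates shipped alongside it
    base::service_unit { 'xvfb':
        ensure  => present,
        # expected at modules/<module>/templates/initscripts/xvfb.systemd.erb (jessie)
        systemd => true,
        # expected at modules/<module>/templates/initscripts/xvfb.upstart.erb (precise/trusty)
        upstart => true,
    }
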
[08:59:22] But the MW core change makes it so that warning anywhere in a hook function will throw an exception [08:59:45] yeah, that probably needs to be reconsidered [08:59:52] thanks for flagging it, i'll talk about it with aaron tomorrow [09:00:46] can we just revert it for now? [09:01:31] I'm not sure [09:01:46] i'm too sleepy to think it through [09:02:03] i created the revert commit on gerrit to remind myself to do something about it tomrorow [09:02:17] ok, thanks [09:02:18] off for now [09:03:46] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [09:05:06] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [09:15:21] (03CR) 10Faidon Liambotis: [C: 031] Add BGP MED support [debs/pybal] (bgp-med) - 10https://gerrit.wikimedia.org/r/255544 (owner: 10Mark Bergsma) [09:20:57] (03PS3) 10Muehlenhoff: Assign per-datacentre Salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/256462 (https://phabricator.wikimedia.org/T111006) [09:25:37] (03PS1) 10Jcrespo: Enable TLS by default on all core mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/256648 (https://phabricator.wikimedia.org/T111654) [09:26:14] (03CR) 10Phuedx: Enable RelatedArticles and Cards on the Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [09:26:38] 6operations, 10Traffic, 7Mobile, 5Patch-For-Review: ml.wikipedia.org not redirecting to mobile site while accessing from a mobile device; many "Error: Module not found" errors - https://phabricator.wikimedia.org/T115191#1847448 (10Praveenp) [09:38:51] (03PS1) 10Muehlenhoff: Uninstall ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/256650 [09:39:25] (03PS1) 10ArielGlenn: dumps: add stages lists that don't contain directory creation [puppet] - 10https://gerrit.wikimedia.org/r/256651 [09:40:53] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1847481 (10yuvipanda) Also the different reasons instances have role::puppet::self applied: # For testing puppet changes # To act... [09:40:55] (03CR) 10ArielGlenn: [C: 032] dumps: add stages lists that don't contain directory creation [puppet] - 10https://gerrit.wikimedia.org/r/256651 (owner: 10ArielGlenn) [09:47:58] (03CR) 10Filippo Giunchedi: [C: 04-1] "salt already has a 'site' grain with the DC, could we use that instead?" [puppet] - 10https://gerrit.wikimedia.org/r/256462 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [10:02:45] yesterday I couldn't make all dbs fail its puppet with critical [10:02:57] I will try again now [10:03:17] (03PS2) 10Jcrespo: Enable TLS by default on all core mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/256648 (https://phabricator.wikimedia.org/T111654) [10:04:25] (03CR) 10Jcrespo: [C: 032] Enable TLS by default on all core mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/256648 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:05:14] <_joe_> \o/ [10:05:56] my puppet critical intent failed again :-) [10:07:02] BTW Enable TLS by default means by puppet default, not connection default [10:07:23] _joe_: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Tools&diff=next&oldid=189752 ? 
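
[editor's note: the core change legoktm describes above effectively promotes PHP warnings raised inside hook handlers to exceptions. The snippet below is a stand-alone illustration of that general mechanism, not the actual MediaWiki implementation, which per the discussion interacts with the wfSuppressWarnings() calls in LDAPAuthentication in ways that were still being worked out.]

    // Illustration only: run a callback with warnings promoted to exceptions.
    function runWithWarningsAsExceptions( callable $hook ) {
        set_error_handler( function ( $errno, $errstr, $errfile, $errline ) {
            throw new ErrorException( $errstr, 0, $errno, $errfile, $errline );
        }, E_WARNING | E_USER_WARNING );
        try {
            return $hook();
        } finally {
            restore_error_handler();
        }
    }
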
[10:07:42] I'm not sure if that's correct, as most tools hosts should not use the tools-puppetmaster [10:08:16] <_joe_> valhallasw`cloud: uhm, ffs [10:08:25] <_joe_> valhallasw`cloud: sorry, yuvi forgot to mention that [10:08:56] <_joe_> so now base::certificates is failing? [10:08:59] <_joe_> it should not [10:09:10] I don't know -- I just saw the edit and was confused [10:09:17] <_joe_> valhallasw`cloud: can you give me an example of a node _not_ using the puppetmaster? [10:09:24] _joe_: tools-bastion-01 [10:09:30] _joe_: everything not k8s, basically [10:09:33] <_joe_> valhallasw`cloud: agreed, it's an hack I can remove shortly I think [10:09:38] <_joe_> valhallasw`cloud: meh, thanks, and sorry [10:09:49] if I understood correctly, at least [10:10:23] <_joe_> valhallasw`cloud: ahha ok [10:10:26] <_joe_> lemme look [10:12:14] <_joe_> valhallasw`cloud: good news: it's ok [10:13:24] _joe_: ok, great :-) I have no clue how the tools-puppetmaster is set up (whether it auto-updates etc), to be honest. [10:14:44] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1847538 (10fgiunchedi) thanks @andrew, would that make this fixed @muehlenhoff ? [10:15:24] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1847541 (10yuvipanda) Hah, it looks like someone else already did most of the work! I can actually just set the puppetmaster hiera... [10:16:12] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1847543 (10yuvipanda) (I have tested both 1 and 2 and they both work) [10:18:34] yuvipanda: apparently you're still awake ;-) so maybe you can clarify the situation [10:19:18] valhallasw`cloud: what did I do? [10:19:23] am also pretty sleepy :| [10:19:30] yuvipanda: what tools-puppetmaster does and doesn't do [10:19:48] see ^ up to 11:06 (2:06 your time) [10:20:03] whatttt [10:20:19] err [10:20:21] (03PS1) 10Jcrespo: Depool es1013 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256654 [10:20:23] let me read it fully [10:20:29] but that diff shouldn't affect anything [10:20:56] valhallasw`cloud: so the tools puppetmaster is just for k8s stuff [10:21:12] valhallasw`cloud: and _joe_'s edit would be noop everywhere else since they don't have that role applied [10:21:23] I see [10:21:36] is there a reason not to use tools-puppetmaster also for every other host? 
[10:22:23] valhallasw`cloud: yes, it complicates things (at least right now, see that 'kill puppet module' task :)) [10:22:33] ok [10:22:39] valhallasw`cloud: if you use tools-puppetmaster, everything comes from module/puppet, and if you do not, it comes from module/base/puppet [10:22:45] which are almost but not quite identical copies [10:22:53] I'm trying to fix that by killing modules/puppet [10:23:02] (03CR) 10Jcrespo: [C: 032] Depool es1013 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256654 (owner: 10Jcrespo) [10:23:09] aaah [10:24:42] (03PS1) 10Faidon Liambotis: Fix a few deprecated erb variable accesses [puppet] - 10https://gerrit.wikimedia.org/r/256655 [10:25:51] (03PS1) 10ArielGlenn: dumps monitor no longer writes to stdout [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256656 (https://phabricator.wikimedia.org/T110888) [10:26:06] valhallasw`cloud: anyway, am going to bed now. night [10:26:15] yuvipanda: ok! sleep tight [10:26:33] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1013 for maintenance (duration: 00m 30s) [10:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:41] (03CR) 10Faidon Liambotis: [C: 032] Fix a few deprecated erb variable accesses [puppet] - 10https://gerrit.wikimedia.org/r/256655 (owner: 10Faidon Liambotis) [10:28:06] (03PS2) 10ArielGlenn: dumps monitor no longer writes to stdout [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256656 (https://phabricator.wikimedia.org/T110888) [10:28:50] (03PS2) 10Hashar: xvfb: switch to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) [10:28:56] 6operations: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1847554 (10fgiunchedi) p:5Triage>3High same here from esams. did it use to work? 
it might be due to misc varnish refactoring, and/or the fact that it is a 13G file [10:29:06] Oh, I cannot depool es1013 because of dumps [10:29:27] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps monitor no longer writes to stdout [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256656 (https://phabricator.wikimedia.org/T110888) (owner: 10ArielGlenn) [10:29:52] (03Abandoned) 10Muehlenhoff: Assign per-datacentre Salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/256462 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [10:31:56] 6operations, 10Traffic: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1847560 (10fgiunchedi) [10:35:45] 6operations: Grant tomasz access to Google Web Master Tools for top 10 languages across desktop and mobile plus wikipedia.org portal - https://phabricator.wikimedia.org/T120136#1847565 (10fgiunchedi) p:5Triage>3Normal [10:36:19] <_joe_> !log imported dh-python into precise/universe from the ubuntu cloud archive [10:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:18] 6operations, 7Mobile, 5Patch-For-Review: Investiage if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1847567 (10fgiunchedi) p:5Triage>3Normal [10:38:12] 6operations, 10Datasets-General-or-Unknown, 10Traffic: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1847569 (10ArielGlenn) [10:38:21] (03PS3) 10Hashar: xvfb: switch to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) [10:40:41] (03CR) 10Hashar: [C: 031 V: 032] "I have cherry picked it on CI puppetmaster and it is a noop for Ubuntu hosts :-)" [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [10:42:02] 6operations, 7Mobile, 5Patch-For-Review: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1847570 (10fgiunchedi) [10:47:00] (03PS1) 10Jcrespo: Reconfiguring es1013 (performance_schema, ferm) [puppet] - 10https://gerrit.wikimedia.org/r/256657 [10:47:43] (03PS2) 10Jcrespo: Reconfiguring es1013 (performance_schema, ferm) [puppet] - 10https://gerrit.wikimedia.org/r/256657 [10:50:48] (03PS1) 10Muehlenhoff: Additional server groups per datacentre [puppet] - 10https://gerrit.wikimedia.org/r/256658 [10:51:19] !log restarting, upgrading and general maintenance for es1013 (depooled) [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:54:08] (03PS1) 10Hashar: xvfb: systemd support for Debian [puppet] - 10https://gerrit.wikimedia.org/r/256659 (https://phabricator.wikimedia.org/T95003) [10:55:19] * hashar whistles [10:57:40] (03PS2) 10Hashar: xvfb: systemd support for Debian [puppet] - 10https://gerrit.wikimedia.org/r/256659 (https://phabricator.wikimedia.org/T95003) [11:00:23] (03CR) 10Hashar: [C: 031 V: 032] "Had to use absolute path for Xvfb and:" [puppet] - 10https://gerrit.wikimedia.org/r/256659 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [11:02:22] (03PS3) 10Jcrespo: New variable binlog_format and reconfiguring es1013 [puppet] - 10https://gerrit.wikimedia.org/r/256657 [11:03:05] 7Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Use systemd for xvfb service on Debian/Jessie - https://phabricator.wikimedia.org/T95003#1177720 (10hashar) We need puppet changes to be merged: * 
https://gerrit.wikimedia.org/r/#/c/256643/ * https://gerrit.wikimedia.org/r/#... [11:04:27] 7Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Use systemd for xvfb service on Debian/Jessie - https://phabricator.wikimedia.org/T95003#1847633 (10hashar) a:5hashar>3None [11:06:16] (03PS1) 10Jcrespo: Add additional variable binlog_format that can be used on templates [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256660 (https://phabricator.wikimedia.org/T109179) [11:06:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Additional server groups per datacentre [puppet] - 10https://gerrit.wikimedia.org/r/256658 (owner: 10Muehlenhoff) [11:09:30] (03PS4) 10Jcrespo: New variable binlog_format and reconfiguring es1013 [puppet] - 10https://gerrit.wikimedia.org/r/256657 [11:10:22] (03CR) 10jenkins-bot: [V: 04-1] New variable binlog_format and reconfiguring es1013 [puppet] - 10https://gerrit.wikimedia.org/r/256657 (owner: 10Jcrespo) [11:14:58] (03PS1) 10Giuseppe Lavagetto: debian/control: add build-dep from dns-python [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256661 [11:35:58] !log restarting cassandra on aqs cluster (subsequently) to effect openjdk security update [11:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:19] (03PS1) 10Giuseppe Lavagetto: Re-introduce the missing changelog entry [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256662 [11:39:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Re-introduce the missing changelog entry [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256662 (owner: 10Giuseppe Lavagetto) [11:41:56] It seems link updates are broken on wikitech still? [11:42:01] Can't remove or add anything to a category [11:42:15] AaronSchulz: Maybe the job runner failures from wikitech are related? [11:55:48] (03CR) 10Jcrespo: [C: 032] Add additional variable binlog_format that can be used on templates [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256660 (https://phabricator.wikimedia.org/T109179) (owner: 10Jcrespo) [11:56:02] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures [11:56:15] (03PS5) 10Jcrespo: New variable binlog_format and reconfiguring es1013 [puppet] - 10https://gerrit.wikimedia.org/r/256657 [12:02:53] (03PS1) 10Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/256663 [12:04:53] (03CR) 10Thiemo Mättig (WMDE): "OMFG. You have to look at my commit message here to be able to read it: https://gerrit.wikimedia.org/r/#/c/256663/1//COMMIT_MSG" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (owner: 10Thiemo Mättig (WMDE)) [12:07:19] (03PS2) 10Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) [12:21:39] 6operations, 10hardware-requests: eqiad: (2) spare servers request for ORES - https://phabricator.wikimedia.org/T119598#1847846 (10akosiaris) I 've talked about this with @mark. He's against using those server spares and with good reason. It was suggested to wait for a batch of new misc servers to arrive in... 
[12:22:16] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1847847 (10akosiaris) [12:22:33] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:23:02] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1830629 (10akosiaris) [12:30:29] (03PS1) 10Filippo Giunchedi: redis: fix config file world-readability [puppet] - 10https://gerrit.wikimedia.org/r/256666 [12:55:19] (03PS8) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [12:56:32] (03CR) 10Phuedx: [C: 04-1] Enable RelatedArticles and Cards on the Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [12:59:17] (03PS7) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [13:38:01] (03CR) 10JanZerebecki: "Who is "we" in each of your usages?" [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [13:38:29] (03CR) 10Bmansurov: Enable RelatedArticles and Cards on the Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [13:38:40] (03PS4) 10Bmansurov: Enable RelatedArticles and Cards on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) [13:39:04] (03CR) 10JanZerebecki: [C: 04-1] "Probably same problem as https://gerrit.wikimedia.org/r/#/c/255149/ ." [puppet] - 10https://gerrit.wikimedia.org/r/255150 (owner: 10JanZerebecki) [14:02:43] 6operations: Grant tomasz access to Google Web Master Tools for top 10 languages across desktop and mobile plus wikipedia.org portal - https://phabricator.wikimedia.org/T120136#1847990 (10fgiunchedi) a:3Deskana looks like the related tickets are blocked on OIT? also is there anything for #operations here? [14:06:01] !log installed dpkg updates across the cluster [14:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:52] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1848004 (10chasemp) a:5chasemp>3None [14:15:21] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:59] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1848023 (10Chase) >>! In T118176#1843290, @RobH wrote: > Ok, I've had an IRC discussion with @chase and @dzahn about this workflow No you haven't, different person. I'm not the Chase you're lookin... [14:20:22] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:22] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [14:20:22] PROBLEM - check_apache2 on payments1002 is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [14:20:22] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [14:20:56] Jeff_Green: ^ known? 
[14:21:27] godog: yes, thank yo [14:22:14] 7Puppet, 6operations, 6Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1848050 (10chasemp) p:5Triage>3Normal [14:22:33] 6operations, 6Labs, 5Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1848051 (10chasemp) p:5Triage>3Normal [14:22:47] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: icinga config broken due to duplicate labs-ns1 / labcontrol2001 - https://phabricator.wikimedia.org/T120050#1848053 (10chasemp) p:5Triage>3Normal [14:22:58] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: icinga config broken due to duplicate labs-ns1 / labcontrol2001 - https://phabricator.wikimedia.org/T120050#1843587 (10chasemp) @andrew, I think you know the deal here? [14:23:08] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1848056 (10chasemp) p:5Triage>3Normal [14:23:16] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1843529 (10chasemp) all silenced for now but yes agreed. [14:25:05] 6operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#1848066 (10chasemp) p:5Triage>3High [14:25:21] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:22] (03CR) 10Alexandros Kosiaris: [C: 031] "Yup, that is looks probably true. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/256508 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [14:30:21] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:43] (03CR) 10Alexandros Kosiaris: "General approach looks good, got an inline question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256666 (owner: 10Filippo Giunchedi) [14:30:47] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1848074 (10chasemp) a:3coren [14:31:03] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1848075 (10chasemp) a:3coren [14:35:11] RECOVERY - check_apache2 on payments1002 is OK: PROCS OK: 150 processes with command name apache2 [14:35:11] RECOVERY - check_apache2 on payments1001 is OK: PROCS OK: 130 processes with command name apache2 [14:35:12] RECOVERY - check_apache2 on payments1003 is OK: PROCS OK: 145 processes with command name apache2 [14:35:12] RECOVERY - check_listener_ipn on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 3.036 second response time [14:37:12] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail [14:38:18] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1848085 (10Ottomata) Hm, not sure I follow. We are proposing that a schema be ID-able via a URI, and also remotely locatable if that URI happens to... 
[14:39:44] 7Puppet, 6Labs, 10Tool-Labs: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#1848087 (10chasemp) [14:39:56] (03CR) 10Filippo Giunchedi: redis: fix config file world-readability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256666 (owner: 10Filippo Giunchedi) [14:40:08] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1848089 (10Krinkle) [14:41:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "not against the cleanup, but you need to find a class that is used on the icinga server and this class is not" [puppet] - 10https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [14:41:20] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1772839 (10Krinkle) This is causing problems on wikitech since link updates are not running. E.g. pages added or r... [14:42:07] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Document setup of cn=repluser and cn=admin [puppet] - 10https://gerrit.wikimedia.org/r/256206 (owner: 10Muehlenhoff) [14:42:12] (03PS3) 10Alexandros Kosiaris: openldap: Document setup of cn=repluser and cn=admin [puppet] - 10https://gerrit.wikimedia.org/r/256206 (owner: 10Muehlenhoff) [14:42:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] openldap: Document setup of cn=repluser and cn=admin [puppet] - 10https://gerrit.wikimedia.org/r/256206 (owner: 10Muehlenhoff) [14:44:26] (03PS2) 10Filippo Giunchedi: redis: fix config file world-readability [puppet] - 10https://gerrit.wikimedia.org/r/256666 [14:46:47] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1848113 (10coren) [14:48:18] !log restbase start deployment of 262da91a [14:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:38] (03PS1) 10Filippo Giunchedi: cassandra: provision restbase1009 with 128 tokens [puppet] - 10https://gerrit.wikimedia.org/r/256690 (https://phabricator.wikimedia.org/T95253) [14:48:54] (03CR) 10Alexandros Kosiaris: [C: 031] redis: fix config file world-readability [puppet] - 10https://gerrit.wikimedia.org/r/256666 (owner: 10Filippo Giunchedi) [14:54:23] 6operations, 6Labs, 10hardware-requests: Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1848131 (10chasemp) [14:54:31] 6operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766457 (10chasemp) [14:54:38] (03PS1) 10Filippo Giunchedi: cassandra: provision restbase100[789] with /srv for multi instance [puppet] - 10https://gerrit.wikimedia.org/r/256691 [14:55:14] (03PS2) 10Filippo Giunchedi: cassandra: provision restbase100[789] with /srv for multi instance [puppet] - 10https://gerrit.wikimedia.org/r/256691 [14:55:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: provision restbase100[789] with /srv for multi instance [puppet] - 10https://gerrit.wikimedia.org/r/256691 (owner: 10Filippo Giunchedi) [14:57:17] !log restbase end of deployment of 262da91a [14:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:04] 6operations, 6Labs, 10hardware-requests: 
Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1848152 (10chasemp) [15:01:07] 6operations: Implement MOTD warning for handling private data for shell users on (all?) systems - https://phabricator.wikimedia.org/T83527#1848156 (10fgiunchedi) p:5Normal>3Low [15:01:17] godog: AaronSchulz: Coren: Job queue has been broken on wikitech for the past 7 days per https://phabricator.wikimedia.org/T117394 - probably just an issue with visibility of the IP between servers. Would require it to be made visible to job runners, and/or reverse the change so that wikitech does not use the main job runners but uses its own or something [15:01:17] like that. [15:01:22] it used to work fine [15:01:39] 6operations, 7Pybal: Make PyBal respect advertised BGP capabilities - https://phabricator.wikimedia.org/T81305#1848160 (10fgiunchedi) [15:02:02] 6operations: Admin module should allow group management of system users - https://phabricator.wikimedia.org/T84279#1848163 (10fgiunchedi) [15:02:56] <_joe_> Krinkle: 7 days? [15:03:02] <_joe_> Krinkle: blame your manager! [15:03:39] probably more like a month, but at least it's affecting my productivity since a week [15:03:58] <_joe_> uhm a month is pretty different [15:04:03] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:04] because I can't find anything I'm looking for, and stuff I document doesn't stick or remains undiscoverable [15:04:07] <_joe_> if it was a week, I would've had an explanation [15:04:12] Right [15:04:17] It predates the freeze [15:04:25] but it is loss of data and user impact [15:04:32] <_joe_> let me verify something [15:04:34] <_joe_> yup [15:04:44] unless we like re-run a maintenance script to rebuild the link tables, which we'll probbaly need ot do anyway at this point [15:06:02] RECOVERY - pybal on lvs1007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:06:28] <_joe_> ok, it's definitely not that it can't connect to redis [15:07:30] <_joe_> so the problem is that the jobrunners should connect to silver? [15:08:20] <_joe_> Krinkle: as far as I can see, there is no way this ever worked [15:08:47] _joe_: Because of ferm you mean? [15:08:51] <_joe_> yes [15:09:04] _joe_: I imagine something happened last month that made wikitech no longer have its own job runner cron but instead use the main pool [15:09:07] This most definitely did work [15:09:33] <_joe_> I don't think it ever had its own jobrunner [15:09:53] <_joe_> maybe something changed when we moved to terbium? or someone changed the config [15:09:54] then maybe it wasn't using jobrunner but the in-request sampled job runner? seems odd, but possible [15:10:20] But I imagine it would have a job runner, similar to how we run ohter maintenance scripts locally there [15:11:04] !log restarting cassandra on restbase100[56] (subsequently) to effect openjdk security update [15:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:13] <_joe_> so maybe something changed in the config (I mean mediawiki-config) and we might need to revert that? [15:11:29] Yeah.. [15:11:29] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1848209 (10JanZerebecki) I think that only means that a client that gets a URL ending in '/' for an API should not assume it can extr... [15:13:12] Krinkle: what's an easy way to reproduce btw? 
[15:13:27] godog: Add or remove a category to a page and observe that it does not happen [15:13:35] https://wikitech.wikimedia.org/wiki/Nutcracker [15:13:41] https://wikitech.wikimedia.org/wiki/Category:MediaWiki_production [15:13:42] is empty [15:13:49] I added 5 pages to it over the past few hours [15:14:03] and logstash is full of exceptions from RunJobs.php [15:14:06] for labswiki [15:14:17] all db connect failure unable to access labswiki db [15:15:19] The strange thing is, afaik, mediawiki-config only controls where jobs get pushed to. the redis queue is afaik shared (and can be) between labswiki and rest of prod. It's just the jobrunners that are different. But maybe it used to have its own queue as well? (Or use mysql?) [15:15:37] <_joe_> 'wmgUseClusterJobqueue' => array( [15:15:41] <_joe_> 'default' => true, 'labswiki' => false, [15:15:47] <_joe_> [15:15:56] <_joe_> so in theory labswiki is not using it [15:16:05] <_joe_> UHM [15:16:18] <_joe_> maybe I should pull the last updates [15:17:23] <_joe_> but no, that's still it [15:17:34] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1848214 (10Ottomata) Hm, I think I see. We are coupling the URI to the ID, which according to the W3C should not be relied upon. Ok, noted. [15:18:57] _joe_: labswiki is not in default? [15:19:10] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1848216 (10mobrovac) From my POV, the URL **is** the ID. [15:20:11] labswiki' => false [15:20:12] Ah, right [15:20:23] it was off-screen :D [15:20:40] (03PS2) 10Muehlenhoff: Add comment on server_id parameter in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255115 [15:21:26] so assuming it isn't using that, what is it using instead? Somehow wikitech is queuing its jobs somewhere, and a job runner tries to run it [15:21:27] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1848232 (10Joe) FWIW, labswiki is not supposed to use the cluster's jobqueue at all: https://github.com/wikimedia... [15:21:47] <_joe_> they are running on the standard jobrunners [15:21:54] <_joe_> which they shouldn't do [15:21:57] <_joe_> let's see [15:22:18] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1848240 (10RobH) I meant to tag @chasemp not @chase, my bad. 
[15:23:54] !log stopping pdns on labcontrol2001 [15:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:21] <_joe_> Krinkle: according to the configuration, silver shouldn't even know where to look [15:25:00] <_joe_> let me check if it is actually what's enqueueing jobs [15:25:43] <_joe_> I can definitely confirm there is traffic to the redises [15:26:14] (03PS1) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [15:27:03] (03CR) 10jenkins-bot: [V: 04-1] Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [15:28:45] (03PS1) 10Filippo Giunchedi: cassandra: add restbase100[789] instances [dns] - 10https://gerrit.wikimedia.org/r/256694 [15:28:49] <_joe_> Krinkle: ok so I can confirm, even if it has that config, labswiki is now using the main jobqueue [15:29:02] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase100[789] instances [dns] - 10https://gerrit.wikimedia.org/r/256694 (owner: 10Filippo Giunchedi) [15:30:18] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1848260 (10Joe) I just confirmed with tcpdump: silver (wikitech) is submitting jobs to the jobqueue even if it sho... [15:30:54] (03PS1) 10Alexandros Kosiaris: Specific size_limit specifically for repluser [puppet] - 10https://gerrit.wikimedia.org/r/256696 [15:31:04] <_joe_> Krinkle: so we have two ways to try to solve this: 1) we fix the jobrunners communication with the db 2) we realize what the heck is going on there [15:31:34] (03PS4) 10JanZerebecki: Fix wikidata redirect that come in via https to target https [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) [15:31:58] Would 1) mean letting the jobs continue to run somewhere besides wikitech? [15:32:26] I’m pretty sure that for a lot of jobs that just won’t work at all. some of those jobs need ldap access, they may even make nova calls (not sure about that last) [15:32:50] (03PS1) 10Filippo Giunchedi: cassandra: add restbase100[789] instances [puppet] - 10https://gerrit.wikimedia.org/r/256697 [15:33:12] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [15:33:25] (03PS2) 10Filippo Giunchedi: cassandra: add restbase100[789] instances [puppet] - 10https://gerrit.wikimedia.org/r/256697 [15:33:31] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase100[789] instances [puppet] - 10https://gerrit.wikimedia.org/r/256697 (owner: 10Filippo Giunchedi) [15:33:42] (03CR) 10Alexandros Kosiaris: [C: 031] Add comment on server_id parameter in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255115 (owner: 10Muehlenhoff) [15:33:50] (03CR) 10JanZerebecki: "Thank you for testing before deploying!" 
[puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: 10JanZerebecki) [15:34:15] <_joe_> andrewbogott: ok we do have a big problem then [15:34:54] andrewbogott: Good point [15:35:00] (03CR) 10Filippo Giunchedi: "just to make sure, this means than when the additional ssd come in, we will need to decomission these nodes again to bump num_tokens" [puppet] - 10https://gerrit.wikimedia.org/r/256690 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [15:35:02] andrewbogott: dbis not the only thing, just the first thing that fails [15:35:12] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [15:35:21] 7Puppet, 6Labs, 10Tool-Labs: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#1848306 (10chasemp) [15:35:25] andrewbogott: Do you know where we run maintenance scripts for wikitech? [15:35:37] I assume on silver, but I mean location in puppet [15:35:53] Krinkle: do you mean, e.g. draining the jobqueue? [15:36:00] <_joe_> ok the only commit that I can think of is https://gerrit.wikimedia.org/r/#/c/250170/ [15:36:39] <_joe_> here it is, line 193 of https://gerrit.wikimedia.org/r/#/c/250170/5/wmf-config/CommonSettings.php,cm [15:36:48] <_joe_> Krinkle: I told you, blame your manager!! [15:37:04] Krinkle: puppet/modules/openstack/manifests/openstack-manager.pp has some of the special-case stuff. [15:37:30] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1848313 (10Joe) a:3Joe [15:38:05] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1772839 (10Joe) Found the problem - the jobqueue file gets included disregarding the fact that we're on labswiki... [15:39:53] _joe_: Hm.. what is the realm/site for labswiki, then? [15:40:11] that's realm=production dc=eqiad just the same, no? [15:40:30] <_joe_> yeah look at line 183 in the old case [15:40:33] (03CR) 10Phuedx: "Sorry I didn't spot this sooner." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [15:40:39] <_joe_> you had logging mc db and redis [15:40:55] <_joe_> they added the jobqueue [15:41:05] <_joe_> but the jobqueue should not be required in most cases [15:41:12] <_joe_> err, in special cases [15:41:12] switching on realm is for detecting whether or not something is running /in/ labs, e.g. beta [15:41:13] Ah, it made it unconditional [15:41:19] <_joe_> yup [15:41:27] <_joe_> and kept the conditional later in the file [15:41:29] OK. Got it [15:41:30] Yeah [15:41:35] <_joe_> which was what I found first [15:41:38] <_joe_> I'm fixing it [15:41:39] Well spotted [15:41:41] thanks [15:43:22] _joe_: openstack-manager.pp contains a job runner [15:43:28] so there is a job runner from cron on silver [15:43:29] (03PS1) 10Giuseppe Lavagetto: Inclusion of jobqueue files is not unconditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256698 (https://phabricator.wikimedia.org/T117394) [15:43:30] <_joe_> Krinkle: heh, exactly [15:43:52] 6operations, 10Datasets-General-or-Unknown, 10Traffic: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1848344 (10Halfak) Yeah. I'm pretty sure that this worked at one point. 
Further, I remember that downloading big files was a problem once before and was (I thought) resolv... [15:44:13] _joe_: You deploying? [15:44:36] <_joe_> Krinkle: if needed, yes [15:45:00] <_joe_> Krinkle: give me another pair of eyes, though [15:45:13] I'm working a lot on documentation today, it'd help me if I can finish that sprint today. [15:45:15] Sure. [15:45:23] Perhaps in the evenign swat in a few hours [15:45:46] <_joe_> this is a bug, I can ask to deploy this in SWAT like in 15 minutes [15:46:12] <_joe_> but I'd prefer us to be sure this is legitimate to deploy according to releng [15:46:31] Where is it required now? [15:46:49] If we do it before SWAT, I can also run the maintenance script to rebuild links and get some doc work done today. [15:47:03] oh, /me reads COMMITMSG [15:47:15] Reedy: It is in the cluster job queue config variable already [15:47:20] (03CR) 10Reedy: [C: 031] Inclusion of jobqueue files is not unconditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256698 (https://phabricator.wikimedia.org/T117394) (owner: 10Giuseppe Lavagetto) [15:47:22] it ended up being included twice in the refactor commit [15:47:24] Yup, already found it [15:47:33] <_joe_> Reedy: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L843-L848 [15:47:41] _joe_: I already +1'd it :P [15:47:55] (03PS2) 10Reedy: Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 [15:48:07] We should push that out too cause it's confusing not being in the list [15:48:16] (03CR) 10Krinkle: [C: 031] Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 (owner: 10Reedy) [15:52:24] (03PS2) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [15:52:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:53:12] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
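A minimal sketch of the behaviour being restored in the "Inclusion of jobqueue files is not unconditional" change discussed above. The real logic is PHP in wmf-config/CommonSettings.php; the sketch below only shows the shape of the guard, assuming labswiki falls back to MediaWiki's default DB-backed queue (JobQueueDB) while ordinary wikis load the shared Redis queue settings. Wiki names and the settings contents are illustrative, not the production values.

    import copy

    # Illustrative stand-ins for the shared vs. local jobqueue settings;
    # labswiki must not pick up the shared Redis config because silver
    # cannot reach the production Redis hosts, so enqueued jobs just strand.
    SHARED_JOBQUEUE_SETTINGS = {'wgJobTypeConf': {'default': {'class': 'JobQueueRedis'}}}
    LOCAL_JOBQUEUE_SETTINGS = {'wgJobTypeConf': {'default': {'class': 'JobQueueDB'}}}

    def jobqueue_settings(wiki):
        # The regression was making the shared include unconditional while
        # the wiki-specific conditional further down in the file stayed put.
        if wiki == 'labswiki':
            return copy.deepcopy(LOCAL_JOBQUEUE_SETTINGS)
        return copy.deepcopy(SHARED_JOBQUEUE_SETTINGS)

    print(jobqueue_settings('labswiki')['wgJobTypeConf']['default']['class'])  # JobQueueDB
    print(jobqueue_settings('enwiki')['wgJobTypeConf']['default']['class'])    # JobQueueRedis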
[15:53:21] (03CR) 10jenkins-bot: [V: 04-1] Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [15:53:32] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: puppet fail [15:54:19] (03PS3) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [15:55:13] (03CR) 10jenkins-bot: [V: 04-1] Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [15:57:20] 6operations, 10Datasets-General-or-Unknown, 10Traffic: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1848407 (10BBlack) [15:57:21] 6operations, 10Analytics-Cluster: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1848408 (10BBlack) [15:57:49] (03PS4) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [15:59:01] (03PS3) 10Giuseppe Lavagetto: eventlogging: use the system-wide puppet CA [puppet] - 10https://gerrit.wikimedia.org/r/243665 (https://phabricator.wikimedia.org/T114638) [15:59:14] 6operations, 10Analytics-Cluster: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1848418 (10BBlack) a:3BBlack Yeah this is all the same issue and still present. I think @fgiunchedi is on the right track here about streaming, I'm going to write up a gene... [16:00:04] kart_: Dear anthropoid, the time has come. Please deploy Content Translation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151203T1600). [16:00:04] kart_: A patch you scheduled for Content Translation is about to be deployed. Please be available during the process. [16:00:20] (03CR) 10Luke081515: [C: 031] Inclusion of jobqueue files is not unconditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256698 (https://phabricator.wikimedia.org/T117394) (owner: 10Giuseppe Lavagetto) [16:00:21] ok. That's me. [16:00:54] (03PS1) 10Andrew Bogott: Turn off services on labcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/256701 (https://phabricator.wikimedia.org/T118591) [16:02:18] (03CR) 10Andrew Bogott: [C: 032] Turn off services on labcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/256701 (https://phabricator.wikimedia.org/T118591) (owner: 10Andrew Bogott) [16:02:43] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:03:24] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:03:37] (03PS2) 10Krinkle: Inclusion of jobqueue files is not unconditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256698 (https://phabricator.wikimedia.org/T117394) (owner: 10Giuseppe Lavagetto) [16:03:45] (03CR) 10Krinkle: [C: 031] Inclusion of jobqueue files is not unconditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256698 (https://phabricator.wikimedia.org/T117394) (owner: 10Giuseppe Lavagetto) [16:05:52] <_joe_> anyone SWATting today? [16:06:59] _joe_: no normal SWAT. There is CX update on. 
[16:07:23] <_joe_> Krinkle: I'd say we deploy [16:07:28] <_joe_> greg gave me a green light [16:08:03] Well, wait for CX ;) [16:08:14] /ask [16:08:15] If they're deploying PHP code, it may conflict :) [16:08:55] kart_: What are you deploying? :) [16:09:25] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Puppet last ran 2 days ago [16:10:06] <_joe_> Reedy: I am waiting for them to finish anyways [16:10:27] godog: there are a bunch of warnings about certs expiring in a few weeks on icinga… do you know if there’s a ticket for that already, or if someone is working on it? [16:10:55] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:05] andrewbogott: no ticket afaik, robh might know perhaps? [16:11:46] <_joe_> kart_: let me know when you are done deploying [16:12:18] <_joe_> Reedy: is there any way to sync a file just to an host? run sync common on it? [16:12:26] _joe_: sure [16:12:35] yeah, if it's staged on tin [16:15:28] (03PS1) 10BBlack: cache_misc: move pass-blocks to layer-common code [puppet] - 10https://gerrit.wikimedia.org/r/256704 (https://phabricator.wikimedia.org/T119394) [16:15:30] (03PS1) 10BBlack: cache_misc: stream and hit-for-pass for large objects [puppet] - 10https://gerrit.wikimedia.org/r/256705 (https://phabricator.wikimedia.org/T104004) [16:16:04] (03CR) 10BBlack: [C: 04-2] "We're actually consuming and using this XFF data in the varnishes to do things like detect trusted proxies (operamini, nokia), and possibl" [puppet] - 10https://gerrit.wikimedia.org/r/255539 (owner: 10Alexandros Kosiaris) [16:16:48] godog: they are all on the tracking calendar, but yes we have like 5 or 6 coming up in january [16:16:53] andrewbogott: ^ [16:17:14] they are all old rapidssl certs for the most part as well, so we'll be migrating them to globalsign [16:17:47] 6operations, 10Analytics-Cluster, 10Traffic, 5Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1848484 (10BBlack) [16:17:47] robh: should I acknowledge in icinga? [16:17:49] andrewbogott: though you mentioning it does point out i should likely create all the tasks for purchase approval sooner than later [16:17:57] uhh, let me ack them later today when i tie them to tickets [16:18:00] so each ack will have a task # [16:18:06] robh: ok, great! [16:18:13] I try not to ack things without referencing a task [16:18:19] ack/silence/whatevs [16:18:39] yeah, that would’ve been my next question [16:19:03] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [16:20:41] 6operations, 6Commons, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1848487 (10matmarex) 5stalled>3Open 78 days passed, has that happened? [16:21:03] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [16:21:44] 6operations, 6Commons, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1848490 (10Harej) The upload hasn't happened yet, no. 
[16:22:34] (03PS4) 10Giuseppe Lavagetto: eventlogging: use the system-wide puppet CA [puppet] - 10https://gerrit.wikimedia.org/r/243665 (https://phabricator.wikimedia.org/T114638) [16:22:36] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:22:44] godog: ^^ Did we add to the swift cluster yet? [16:24:26] Reedy: yup [16:24:31] sweet [16:25:22] 6operations, 6Commons, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1848496 (10Reedy) @fgiunchedi just confirmed that the swift cluster has been expanded, so you should be good to do... [16:26:06] Reedy: I don't think swift was the holdup though [16:26:13] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] eventlogging: use the system-wide puppet CA [puppet] - 10https://gerrit.wikimedia.org/r/243665 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [16:26:38] harej said it was "delayed indefinitely" but then MatmaRex un-stalled it [16:27:57] oh, it seemed stalled on swift. feel free to re-stall. :P [16:29:11] I think it was sort of [16:29:20] But not a hard blocker [16:29:21] 6operations, 6Labs, 10hardware-requests: Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1848510 (10chasemp) Desired recommendation: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS [16:29:45] true, yeah I was referring to https://phabricator.wikimedia.org/T88758#1156490 [16:30:13] (03PS1) 10Aude: Enable data access for beta meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256707 [16:33:24] <_joe_> kart_: are you gusy done? [16:33:29] <_joe_> *guys [16:33:46] _joe_: no. it will take time. Sorry :/ [16:34:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] debian/control: add build-dep from dns-python [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256661 (owner: 10Giuseppe Lavagetto) [16:34:50] <_joe_> kart_: is that php too? [16:35:00] <_joe_> if it's not, I'd deploy my thing [16:35:36] _joe_: I'm updating CX extension. [16:35:49] and stuck with submodule part. bd808/Reedy helping me. [16:36:01] <_joe_> ook, sorry [16:36:05] _joe_: TBH, you could just go [16:36:18] Waiting for jenkins atm [16:36:35] 6operations, 6Labs, 10hardware-requests: Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1848542 (10mark) As this is a test server with limited life time, we can use an out of warranty spare for this. (Approved) [16:37:03] <_joe_> Reedy: uhm, ok [16:37:34] _joe_: yes. config change? Finish it :) [16:38:20] (03PS2) 10BBlack: cache_misc: move pass-blocks to layer-common code [puppet] - 10https://gerrit.wikimedia.org/r/256704 (https://phabricator.wikimedia.org/T119394) [16:38:34] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: move pass-blocks to layer-common code [puppet] - 10https://gerrit.wikimedia.org/r/256704 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [16:38:42] 6operations: Grant tomasz access to Google Web Master Tools for top 10 languages across desktop and mobile plus wikipedia.org portal - https://phabricator.wikimedia.org/T120136#1848550 (10Deskana) >>! In T120136#1847990, @fgiunchedi wrote: > looks like the related tickets are blocked on OIT? also is there anythi... [16:40:55] _joe_: I'm ready to deploy now. my change is merged. 
[16:40:57] (03PS2) 10BBlack: cache_misc: stream and hit-for-pass for large objects [puppet] - 10https://gerrit.wikimedia.org/r/256705 (https://phabricator.wikimedia.org/T104004) [16:41:23] _joe_: your PS is config or code to deploy? [16:41:36] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: stream and hit-for-pass for large objects [puppet] - 10https://gerrit.wikimedia.org/r/256705 (https://phabricator.wikimedia.org/T104004) (owner: 10BBlack) [16:42:03] kart_: config [16:43:12] _joe_: Please go ahead. [16:44:55] <_joe_> ok [16:45:11] <_joe_> sorry I was just doing something else :P [16:45:24] (03CR) 10Giuseppe Lavagetto: [C: 032] Inclusion of jobqueue files is not unconditional [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256698 (https://phabricator.wikimedia.org/T117394) (owner: 10Giuseppe Lavagetto) [16:45:29] I'm sitting on my 'tin' can :) [16:46:10] (03PS1) 1020after4: Install arc on the jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/256712 (https://phabricator.wikimedia.org/T103127) [16:46:49] (03PS2) 1020after4: Install arc on the jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/256712 (https://phabricator.wikimedia.org/T103127) [16:48:17] (03CR) 1020after4: "This should work once we get arc on the slaves:" [puppet] - 10https://gerrit.wikimedia.org/r/256712 (https://phabricator.wikimedia.org/T103127) (owner: 1020after4) [16:48:39] _joe_: ping when done please. [16:48:47] <_joe_> kart_: ok will do [16:50:37] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: Fix the jobqueue on wikitech (duration: 00m 28s) [16:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:46] <_joe_> kart_: let me verify [16:50:55] 6operations, 10Analytics-Cluster, 10Traffic, 5Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1848582 (10BBlack) 5Open>3Resolved This should be resolved now with the change above applied. I've tested the files from this ticket and... [16:52:21] okay! [16:52:59] <_joe_> kart_: green light from me I'd say [16:53:12] <_joe_> Krinkle: no more packets to the redises from silver [16:53:17] <_joe_> it should be solved now [16:53:20] cool [16:53:27] * _joe_ takes a break [16:53:31] I've run refreshLinks.php on silver for my recent edits [16:53:36] we may want to schedule a complete run at some point [16:53:43] <_joe_> yup [16:53:54] <_joe_> I think terbium does that ATM [16:53:59] oh? [16:54:03] _joe_: thanks! [16:54:08] <_joe_> but not often enough [16:54:17] <_joe_> Krinkle: I'll look in a few :) [16:54:23] It does it on a schedule? [16:54:26] Why would it do that [16:54:33] (also, it wouldn't work for labs wiki from terbium I guess?) [16:54:40] <_joe_> no it works [16:54:43] at least I noticed that mwscript doesn't work on tin [16:54:43] <_joe_> terbium can connect [16:54:48] but maybe that was fixed for tin/terbium [16:54:49] coo [16:54:50] l [16:54:55] so I didn't have to ssh to silver to run it [16:54:58] :D [16:55:22] <_joe_> eheh [16:55:25] <_joe_> ok, bbiab [16:55:30] thx _joe_ [16:57:14] !log kartik@tin Started scap: Update ContentTranslation [16:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:04] andrewbogott akosiaris: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151203T1700). [17:00:25] patchless! [17:01:10] andrewbogott: no patch. good. 
my scap is on :) [17:02:03] (03PS5) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [17:03:06] (03CR) 10jenkins-bot: [V: 04-1] Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [17:03:07] !log kartik@tin Finished scap: Update ContentTranslation (duration: 05m 52s) [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:13] ... [17:03:20] A scap in 6 minutes!? [17:03:55] that does seem a bit fast [17:04:13] kart_: Did it succeed? [17:05:07] I see cdb file updates in the scap logs. COuld be legit [17:05:37] I'd presume there were l10n updates with the change to master [17:06:15] the rsyncs were fast. Looks like ~15s per host [17:06:21] which is awesome [17:07:00] niice [17:07:04] Reedy: yep [17:07:31] Who added speed booster :) [17:07:42] I did scap after many months :) [17:07:48] There's been a lot of work recently [17:07:59] Great [17:08:09] bd808: localisationupdate is taking 10 minutes ish [17:08:34] So I guess it doesn't sound too farfetched [17:08:50] Logging looks a little lapse now though [17:08:51] 02:25 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 09m 54s) [17:08:55] 05:50 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Thu Dec 3 05:50:08 UTC 2015 (duration 50m 7s) [17:09:03] this change only touched 43 l210n caches which would help [17:09:16] wait, l10nupdate took 3 hours and 25 minutes? [17:09:20] well, over 3.5 hours [17:09:37] That.. [17:09:49] 6operations, 6Labs, 10hardware-requests: Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1848673 (10chasemp) >>! In T118588#1848510, @chasemp wrote: > Desired recommendation: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS >>! In... [17:10:03] the RL purge takes a long time, but what was it doing in between? [17:10:16] yeah [17:10:24] 2 hours 15 doing ???? [17:10:31] * bd808 reads the script [17:10:38] (03PS6) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [17:10:40] https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1 [17:11:13] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1848677 (10RobH) [17:12:31] it's doing rl clears that whole time between the irc messages. the computation of the duration must be really broken [17:13:13] heh `$(date -ud @"$LENGTH" +'%-Mm %-Ss'` [17:13:16] it drops the hours [17:13:52] lol [17:14:03] bd808: This sounds like "It shouldn't take more than an hour" [17:14:07] Why has it got so slow? 
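The duration formula bd808 quotes above prints only the minute and second fields, so a run of several hours is reported as a few minutes, which is exactly how a roughly 3h50m ResourceLoader refresh shows up as "duration 50m 7s". A small Python rendering of the same pitfall and an hour-aware alternative (the real script is shell in modules/scap/files/l10nupdate-1):

    import datetime

    def fmt_buggy(length_s):
        # mirrors `date -ud @"$LENGTH" +'%-Mm %-Ss'`: only the minute and
        # second fields of the wall-clock rendering are printed, so whole
        # hours silently vanish
        t = datetime.datetime.utcfromtimestamp(length_s)
        return '%dm %ds' % (t.minute, t.second)

    def fmt_fixed(length_s):
        h, rem = divmod(int(length_s), 3600)
        m, s = divmod(rem, 60)
        return '%dh %dm %ds' % (h, m, s)

    length = 3 * 3600 + 50 * 60 + 7   # the ~3h50m run discussed above
    print(fmt_buggy(length))          # '50m 7s'  -- looks deceptively short
    print(fmt_fixed(length))          # '3h 50m 7s'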
[17:14:31] I think that's worth filing a perf task for [17:14:52] I don't remember it ever being fast, but yeah it might be work looking into what extensions/WikimediaMaintenance/refreshMessageBlobs.php is up to [17:15:00] also, it isn't the time for the whole thing [17:15:03] not just refreshBlobs [17:15:26] yes, its the time from start to finish [17:15:44] So the comment isn't quite right being on the cache refresh [17:16:00] that belongs on a log localisationupdate finished (time taken) [17:16:09] vs the time give on the sync-l10n message which is only the time for the rsync and rebuild (not the time used to build the cdbs in the first place) [17:16:45] https://www.mediawiki.org/wiki/Extension_talk:LocalisationUpdate#Help.21 mentions that LU itself takes 20 min to spit a "no updates" [17:16:51] Might be related or not. [17:17:21] I don't remember it taking so long on WMF [17:17:27] Filed https://phabricator.wikimedia.org/T120240 [17:17:30] LU does a ton of file stat calls. It is very much bound by the local disk speed [17:17:39] Hmm [17:17:57] I wonder if it's worth tmpfs? And/or hhvm (which will come when tin upgraded) [17:18:20] we've talked about tmpfs before but never tried it as far as I know [17:18:43] hhvm actually makes the stats worse rather than better as hhvm doesn't really have a stat cache [17:18:57] I tried moving it to tmpfs, there wasn't enough space:P [17:19:01] ssd would make a huge differnce [17:19:03] heh [17:19:18] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1848731 (10BBlack) So, I did some tracing this morning using one of the files mentioned earlier in this ticket. The commons f... [17:19:31] in the glorious future all spinning rust will die [17:19:33] (not specifically LU but i18n cache rebuilds in general) [17:19:41] hmm [17:19:41] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/refreshMessageBlobs.php [17:19:54] foreach project in wmf { foreach lang in project {} } [17:20:26] and a waitforslaves after each message? [17:20:44] That's my fault [17:20:57] yeah, that's too frequent [17:21:06] what isn't your fault Reedy? ;) [17:21:12] waitforslaves [17:21:24] I wonder if it should be every 50 and one at the end or something [17:21:34] how many rows? [17:21:35] * Reedy looks [17:21:56] 6operations, 10Gerrit: Upgrade gerrit to latest 2.8.x (minor version upgrade) - https://phabricator.wikimedia.org/T65847#1848757 (10demon) 5Open>3declined a:3demon Upgrading to a later 2.8 wouldn't be useful at all. [17:21:57] feck [17:22:05] 210003 on enwiki [17:22:27] That's 10 for starters [17:23:02] 6operations, 6Labs, 10hardware-requests: Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1848764 (10RobH) If approved, please remove promethium from the spares page, and in the edit summary, please list this task #. Additionally, I typically resolve the #hardware-request once its... [17:23:41] It probably wouldn't hurt to turn it into a foreach ( $wgConf->getLocalDatabases() as $wiki ) { } type scritp either [17:24:18] Krinkle: is https://phabricator.wikimedia.org/T120218 a consequence of the busted job queue? [17:24:30] andrewbogott: No [17:24:36] Even more so with hhvm... 
Cache the filemtimes [17:24:44] that task is because we use an ancient unmaintained version of SMW [17:24:48] oh, it’s swm being broken [17:24:51] well, that doesn’t surprise me [17:24:56] And something in core changed [17:25:06] um… smw [17:25:21] and we probably updated SMW in a sweep across the code bases, but we don't deploy SMW from master [17:25:39] bd808: I'll get my poking stick out when I get home [17:25:41] usually when we update hooks, we update all callers in git [17:25:43] anyway [17:25:51] I guess SMW is still blocked? [17:26:06] (03PS7) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [17:26:07] I don’t know that it’s blocked, i just think no one wants to own it [17:27:11] We can probably fix those easily enough [17:27:29] Reedy and I never got around to figuring out a way to deploy modern versions of SMW moslty because of recurring promises to drop the need for it on wikitech [17:27:30] last I checked (which was years ago) the smw deployment model wasn’t compatible with our deployment model [17:27:35] hence never upgrading [17:27:43] Might have a look later too [17:27:44] hah, yes, as bd808 says [17:27:49] That is, teh errors, not upgrading it [17:27:55] Back later [17:28:15] we'd need to build it into a big pile of stuff using composer similar to the process used for wikibase [17:29:16] There's something on the horizon [17:29:29] * bd808 slow claps [17:32:25] (03PS2) 10Dzahn: icinga: remove user from dialout group [puppet] - 10https://gerrit.wikimedia.org/r/256508 (https://phabricator.wikimedia.org/T110893) [17:34:31] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1848796 (10Smalyshev) I don't have enough expertise to decide which way is better, but I'd of course prefer one that does not require r... [17:38:04] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1848810 (10BBlack) Testing further, there's currently a complaint on commons about: https://commons.wikimedia.org/wiki/File:Ro... [17:39:52] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1848831 (10mark) a:5mark>3RobH Adding 2 drives (or SSDs if we have them) and extending the LVM VG seems easy and cheap to do, let's... [17:42:12] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1848871 (10BBlack) Thinking this through logically - either there's a code/data bug in how we're purging thumbnails in the gen... 
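Picking up the refreshMessageBlobs.php thread a little further up: waiting for replication after every single message blob across roughly 210k rows on enwiki alone adds up. A sketch of the "every 50 and one at the end" throttling Reedy suggests, with wait_for_replicas() standing in for MediaWiki's waitForSlaves(); the real script is PHP, so names here are only illustrative.

    def refresh_blobs(items, write, wait_for_replicas, batch=50):
        for i, item in enumerate(items, start=1):
            write(item)
            if i % batch == 0:
                wait_for_replicas()     # throttle every N writes, not every write
        wait_for_replicas()             # final wait covers the last partial batch

    # toy usage: 120 writes trigger 3 replication waits (after 50, 100, and at the end)
    written = []
    refresh_blobs(range(120), written.append, lambda: written.append('wait'))
    print(written.count('wait'))  # 3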
[17:46:58] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1848923 (10RobH) [17:47:33] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1848931 (10RobH) [17:51:03] (03CR) 10JanZerebecki: [C: 031] Enable data access for beta meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256707 (owner: 10Aude) [17:55:21] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1849037 (10RobH) The order was placed for the hardware to fulfill this request, but the estimated deliv... [17:55:53] (03CR) 10Dzahn: [C: 032] icinga: remove user from dialout group [puppet] - 10https://gerrit.wikimedia.org/r/256508 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [17:56:42] ^ that must have been from when the icinga/nagios server was actually .. dialing out [17:56:48] with that USB dongle to send SMD [17:56:50] SMS [17:57:01] robh: ^ [17:57:58] that seems entirely reasonable and correct to me [17:58:10] but im not 100% certain about it either, so seeing alex agree is nice ;] [17:59:01] i'm watching neon closely, but yea, given that it is next to "nagios" [17:59:14] and that is the only explanation to me for someting dialing out [17:59:53] jynus: hola! [17:59:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [18:00:01] hola, nuria [18:00:10] jynus: do you have time for 1 question? [18:00:14] sure [18:00:37] jynus: we are seeing slow inserts on db logging machine [18:00:45] m4? [18:00:49] jynus: is there any db maintenance going on? [18:00:49] eventlogging? [18:00:53] jynus: yes [18:01:14] there was yesterday a short time with a problem, but on the slave [18:01:17] not on the master [18:01:40] there is no maintenance right now, that I know of [18:02:15] I've commented a problem to otto some minutes ago [18:03:28] there is low cpu right now [18:04:27] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1849070 (10BBlack) Digging through some of our repos looking for a possible related code regression as a shot in the dark, I d... [18:04:33] jynus: k [18:05:00] I see your inserts going on [18:05:13] do you want me to investigate more? [18:05:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [5000000.0] [18:06:38] jynus: not sure yet, we are seeing the consumer that is consuming from kafka and inserting into mysql lag a lot [18:06:41] and we aren't sure why [18:06:50] busy DB and slow inserts would explain that [18:06:52] but could be somehting else [18:06:58] <_joe_> ottomata: how often does it read from etcd? [18:07:14] <_joe_> can that be an issue? 
[18:07:36] there are only 2 active threads there [18:07:41] insert [18:07:49] and delete, from the purging [18:07:58] and delete, from the purging [18:08:12] _joe_: not often, but its been lagging since yesterday, so I don't think your change is related [18:08:16] (03PS1) 10Ottomata: Make burrow to monitor the proper eventlogging mysql kafka consumer group [puppet] - 10https://gerrit.wikimedia.org/r/256728 [18:08:18] <_joe_> (and sorry, I'm out) [18:08:29] <_joe_> ottomata: yeah not related to the change, but to etcd itself [18:08:40] looking [18:08:42] <_joe_> ottomata: try using strace on the process that is lagging [18:08:57] <_joe_> ttyl, if you need me, send me a page [18:08:59] do you want me to try disabling the purging to check it is not that? [18:09:09] no _joe_ it should only read from etc every 90 daysish [18:09:10] if not self._token or time.time() > self.expiry_timestamp: [18:09:20] so on startup, and then whenever the key expires [18:09:39] purging? hm, hold off jynus am still poking around [18:09:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:09:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [18:10:22] I am assuming you are not inserting data from before 45 days, aint you? [18:10:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:11:33] ori was doing some queries on the master yesterday [18:12:01] hm, no we are about 24 hours behind [18:12:10] maybe 25 [18:12:29] (03CR) 10Faidon Liambotis: [C: 04-1] "Pretty solid work. See inline for a few comments." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [18:15:35] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1849149 (10RobH) [18:15:45] jynus: can I look at processes on db1046 (that is the master, ja)? [18:15:49] not sure how to log in there [18:15:52] (mysql procs) [18:16:10] there is an increas in swapping since Dec 1, 21 UTC [18:16:18] let me disable the purge, just to discard that [18:16:29] yes, that is what I mentioned [18:17:20] (03CR) 10Ottomata: [C: 032] Make burrow to monitor the proper eventlogging mysql kafka consumer group [puppet] - 10https://gerrit.wikimedia.org/r/256728 (owner: 10Ottomata) [18:17:29] ok [18:17:35] jynus: go ahead [18:18:34] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:18:57] !log disabling event scheduler on db1046 (m4-master) [18:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:55] mforns: jynus is temporarily disabling the eventlogging purge process (is that right jynus)? [18:20:01] ori: is brrd a thing you're using? 
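For context on the etcd question above: the consumer only talks to etcd when its cached token is missing or past its expiry, so it is hit on startup and then roughly once per expiry window, which is why a day of insert lag is unlikely to come from there. A toy version of that guard; the class, fetcher and TTL are made up, and only the quoted `if` condition comes from the actual code.

    import time

    class TokenCache:
        def __init__(self, fetch, ttl=90 * 24 * 3600):  # "every 90 daysish" per the log
            self._fetch = fetch            # callable that actually talks to etcd
            self._ttl = ttl
            self._token = None
            self.expiry_timestamp = 0.0

        def get(self):
            # the guard quoted above: refresh only when missing or expired
            if not self._token or time.time() > self.expiry_timestamp:
                self._token = self._fetch()
                self.expiry_timestamp = time.time() + self._ttl
            return self._token

    tokens = TokenCache(fetch=lambda: 'secret')  # stand-in fetcher
    print(tokens.get())   # first call fetches; later calls reuse the cached token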
[18:20:03] the only threads running right now is the eventlog insert [18:20:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:20:12] I see [18:20:20] my connection, the replication, and 2 monitoring processes [18:20:31] only one "running" [18:20:34] your thread [18:20:50] let's wait to see if it improves, then we can try something else [18:20:55] ok [18:21:25] (03PS3) 10Dzahn: icinga cleanup: move gsb monitoring to ./monitor/ [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) [18:21:39] jynus: ishmael isn't a thing anymore, right? [18:22:15] no, but you should have access to slow query loggin on tendril [18:22:33] (03CR) 10coren: "Notes inline, new changeset incoming shortly." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [18:22:41] jynus: Ok I thought so, I'm looking at some old deployment refs to it on tin that can likely be cleaned up then. [18:23:47] mforns: still seems pretty slow [18:24:04] (03CR) 10Dzahn: [C: 032] "after changing in labs/private and ops/private now compiles fine: http://puppet-compiler.wmflabs.org/1423/" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [18:24:06] (03PS1) 10BBlack: Revert "wgHTCPRouting: use separate address for upload" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256730 [18:24:11] (03PS2) 10BBlack: Revert "wgHTCPRouting: use separate address for upload" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256730 [18:24:20] https://tendril.wikimedia.org/report/slow_queries?host=^db1046&user=eventlog&schema=log&qmode=eq&query=&hours=12 [18:24:25] it might be catching up though [18:24:34] ottomata, yes, a bit better, but still slower than normal [18:24:38] this is the report of slow queries on db1046 in the last 12 hours [18:24:47] when we started looking at this it was about 25 hours behind, now its aonly about 20 [18:25:10] there is someone doing selects there, that I would discourage [18:25:30] oh, it the application itself [18:25:35] :-) [18:25:39] (03CR) 10Jdlrobson: [C: 031] Enable RelatedArticles and Cards on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [18:25:47] let me see if there are other users [18:25:57] I can also disable replication for a while [18:26:41] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1849215 (10BBlack) I've also staged a revert of the mediawiki-config side of the multicast split at https://gerrit.wikimedia.o... [18:26:48] let me disable replication too on the slaves, it will not hurt [18:27:00] ok [18:27:42] !log disabling eventlogging_sync process on dbstore1002 and db1047 and replication on the other m4 slaves [18:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:27:55] ok... [18:28:05] I need to disable some alerts, it will take me a while [18:30:59] (03CR) 10Chad: "Yuvi mind removing your -2 and having another look?" 
[puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [18:31:41] seems to be better now [18:32:44] I think all slaves have been stopped now [18:33:24] !log neon - remove icinga user from "dialout" group [18:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:03] (03CR) 10Dzahn: "removed user from the group manually on neon" [puppet] - 10https://gerrit.wikimedia.org/r/256508 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [18:34:31] there is some issue with dbstore2002 now [18:35:26] independently of this, have you thought of importing on several threads? [18:35:54] one thread is very inefficient, mysql can handle more than that, but obviouly it will take more resources [18:37:12] yes we can easily do more threads [18:37:16] just haen't done it :) [18:37:21] haven't needed to [18:37:23] but we should be able to do so [18:37:28] so this is unrelated [18:37:44] actually it is ok, it means the slaves can keep up [18:38:07] (03PS8) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [18:38:09] mforns: should I just try to run a second in a screen and see what happens [18:38:10] ? [18:38:15] it should auto rebalance [18:38:22] (03CR) 10Dzahn: "yea, who is "we", that's a legit question. WLM started in NL and is community-organized with help from the WMNL and WMDE chapters afaik" [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [18:38:55] the number if inserts have gone up [18:39:05] https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306 [18:39:09] (03PS9) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [18:39:12] if you have access [18:39:20] check the com_insert graph [18:40:10] ja looking [18:40:25] i don't see them much up? am i looking at the wrong one? [18:40:32] i see the7d one, and it looks level relatively [18:40:42] last 5 mins spikey, but only a ilttle bit [18:40:43] ? [18:40:43] (03PS2) 10Dzahn: ferm: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256489 [18:41:18] yep, for some moment I thought I saw it going up [18:41:35] ok, jynus i'm going to try to run a second process and see what happens [18:42:01] does is know not to insert the same tables twice? 
:-) [18:42:07] *it [18:42:15] (03CR) 10Dzahn: [C: 032] "just cause it's the last warning with our common lint config" [puppet] - 10https://gerrit.wikimedia.org/r/256489 (owner: 10Dzahn) [18:42:28] yes [18:42:36] ok, then [18:42:45] the 2nd process will auto balance with the first and consume different messages from kafka to insert [18:42:54] I would expect it to scale quasi-linearly up to 4 threads [18:43:29] ottomata, cool [18:43:46] one thing that I have not commented is that db1046 was running out of space not a long time ago [18:44:07] (03PS2) 10Dzahn: wikilabels: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/256491 [18:44:17] (03CR) 10Dzahn: [C: 032] wikilabels: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/256491 (owner: 10Dzahn) [18:44:49] what I do not see is a huge slowdown on inserts [18:45:00] in the last days [18:45:07] jynus, aha [18:45:17] normal seems 3000/hour [18:45:24] now runnign 2 procs [18:45:25] (03PS2) 10Dzahn: role: fix "ensure found on line but not the first" [puppet] - 10https://gerrit.wikimedia.org/r/256493 [18:45:36] right now it seems to be doing 2500/hour [18:45:44] would that explain the issue? [18:45:52] !log disabling puppet on labcontrol1002 to test openldap with pdns [18:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:12] jynus, do you know which whas the insertion rate a couple hours ago? [18:46:13] only 3000 / hour? [18:46:17] that seems slow? [18:46:20] inserts [18:46:22] not rows [18:46:30] oh ok [18:46:36] 300/hour makes sense [18:46:37] there are many rows inserted on a single insert [18:46:46] *3000/hour [18:47:09] my question is, would a slowdown of 500 insert/s explain the delay? [18:47:44] I can give you an aproximation of how many rows that is [18:47:47] i guess it depends on how many rows per insert [18:47:47] yeah [18:47:53] (03PS3) 10Dzahn: role: fix "ensure found on line but not the first" [puppet] - 10https://gerrit.wikimedia.org/r/256493 [18:47:55] we do an average of 180 events per sec it looks like [18:49:01] (03CR) 10Dzahn: [C: 032] role: fix "ensure found on line but not the first" [puppet] - 10https://gerrit.wikimedia.org/r/256493 (owner: 10Dzahn) [18:49:08] mforns: based on what madhuvishy just said in analytics, i'm going to restart the 1st eventlogign mysql consumer too [18:49:20] jynus, I don't know if 500/hour drop is enough to cause the lag [18:50:12] ottomata, I dln't get what madhu said... [18:50:25] (03PS1) 10BBlack: interface: use allow-hotplug for ::manual and ::tagged [puppet] - 10https://gerrit.wikimedia.org/r/256734 (https://phabricator.wikimedia.org/T110530) [18:50:45] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:51:22] (03CR) 10jenkins-bot: [V: 04-1] interface: use allow-hotplug for ::manual and ::tagged [puppet] - 10https://gerrit.wikimedia.org/r/256734 (https://phabricator.wikimedia.org/T110530) (owner: 10BBlack) [18:52:03] mforns: so, i'm fixing burrow notification emails from this [18:52:08] (03PS1) 10Dzahn: varnish:misc: add smokeping on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/256736 [18:52:11] i just fix it (so it noticed the consumer lag) [18:52:16] the first email said 12 partitoins were behind [18:52:19] then i started the other consumer [18:52:25] then another email said only 6 were behind... [18:52:31] I see [18:52:35] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.024 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [18:52:38] not sure if this is what happened [18:52:44] could it be that the other process got "stuck"? [18:52:48] but maybe the 2nd one grabbed 6 of the partitions, and somehow powered through? [18:52:53] not yet sure [18:53:30] ottomata, so why did you want to restart the first one? [18:53:57] maybe in a while I can start what I stopped to see if it affects negatively the whole system? [18:54:28] mforns: if it got stuck, maybe a restart would help :/ [18:54:41] if I can tell you, the inserts are a bit large for my taste [18:54:56] but I cannot compare with 1 week ago [18:54:59] ottomata, aha [18:55:03] or at least, not easily [18:55:25] jynus, we insert up to 4000 events per insert statement [18:55:33] jynus, is that too much? [18:55:36] 6operations: move smokeping behind misc-web varnish - https://phabricator.wikimedia.org/T120258#1849395 (10Dzahn) 3NEW a:3Dzahn [18:55:43] ah! so there you have the number or rows per insert :-) [18:55:44] hey folks. I've got a beta only config change to merge. [18:55:59] ha, jynus that is the max [18:56:06] ah, ok [18:56:10] i can see the inserts in the logs too, but they vary, somteimes 10, sometimes 4000 [18:56:13] its size and time based [18:56:37] the thing is, larger ones are faster [18:57:05] yes, that's why we grouped them by schema and batched insertion [18:57:10] but they are also difficult to debug, larger to rollback if there are errors [18:57:18] (03PS5) 10BryanDavis: Enable RelatedArticles and Cards on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [18:57:21] so it is the operational things that make more complex [18:57:25] (03CR) 10BryanDavis: [C: 032] Enable RelatedArticles and Cards on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [18:57:39] hmm,well, mforns, jynus it does seem to be catching up, i think.. [18:57:40] but again, not somethin to discuss right now [18:57:49] (03Merged) 10jenkins-bot: Enable RelatedArticles and Cards on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [18:57:50] but slowly [18:58:00] ok, ottomata let me enable the thigs I have disable one by one [18:58:07] k [18:58:07] jynus, how long should it take to insert 4000 events to the same table? [18:58:09] to discare those as the issue [18:59:05] mforns, a few seconds, the actua time was on a previous link [18:59:42] ok [18:59:44] mforns: jynus, if i am looking at the logs correctly, it just took 21 seconds to insert 3882 rows [18:59:49] 2015-12-03 18:58:31,664 (MainThread) SaveTiming_12236257 queue is large or old, flushing [18:59:54] 2015-12-03 18:58:52,039 (Thread-15 ) Data inserted 3882 [19:00:12] there is only one insert thread in this process, and, i believe it only inserts after a flush, is that correct mforns? [19:00:15] (03PS2) 10BBlack: interface: use allow-hotplug for ::manual and ::tagged [puppet] - 10https://gerrit.wikimedia.org/r/256734 (https://phabricator.wikimedia.org/T110530) [19:00:17] ottomata, yes. jynus is that something you would expect? 
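On the second consumer process "auto balancing" with the first: both copies join the same Kafka consumer group, so the brokers rebalance the topic's partitions between them and each process inserts a disjoint slice of events, which is why roughly linear scaling up to a few processes is plausible. A rough sketch using kafka-python; the topic, group and broker names are assumptions, and the production consumer's Kafka client may differ.

    from kafka import KafkaConsumer

    def run_consumer(handle):
        consumer = KafkaConsumer(
            'eventlogging-valid-mixed',                 # assumed topic name
            group_id='eventlogging_consumer_mysql_00',  # same group in every copy
            bootstrap_servers=['kafka1012.eqiad.wmnet:9092'],  # assumed broker
        )
        # Starting a second identical process triggers a group rebalance:
        # each member ends up owning roughly half of the partitions.
        for message in consumer:
            handle(message.value)

    if __name__ == '__main__':
        run_consumer(handle=lambda raw: print(len(raw)))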
[19:00:49] mforns: we should add info to that Dat inserted log message [19:00:54] about which schema was just inserted [19:00:57] would be very helpful i think [19:01:02] and also one of the timestamps from the batch [19:01:04] if possible [19:01:05] between 34-20 seconds [19:01:11] ottomata, the flushes appear to occur simultaneously with the insertions, but this is only because of the throttling [19:01:27] less for smaller batches [19:01:37] oh, mforns i assumed the flush message would show first, and then the inseted message [19:01:38] no? [19:01:54] ottomata, totally +1 to adding that to the log [19:02:11] ottomata, no, first it inserts something that is in the queue [19:02:20] then the queue has space to receive another flush [19:02:37] mforns: prepping patch.. [19:02:47] also tokudb has some issues [19:02:56] ah [19:02:57] with furious flishing [19:03:06] 6operations, 7HTTPS, 5Patch-For-Review: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1849439 (10Dzahn) [19:03:12] that are like "and now I stall for a few seconds" [19:03:18] 6operations, 7HTTPS: move smokeping behind misc-web varnish - https://phabricator.wikimedia.org/T120258#1849441 (10Dzahn) [19:03:35] (03CR) 10BBlack: [C: 031] "Compiler confirmed no-op on labnet1002 and labvirt1009 (examples of the two affected nova roles)" [puppet] - 10https://gerrit.wikimedia.org/r/256734 (https://phabricator.wikimedia.org/T110530) (owner: 10BBlack) [19:03:47] Innodb is more stable, but we do not have enough disk to contrain that, uncompressed [19:04:01] aha [19:04:56] !log starting m4 slave again on dbstore2002 [19:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:43] mforns: huh? flush log line happens after insert? that doesn't sound right [19:06:19] the consumer main thread reads from kafka, and enqueus, and sleeps while the queue is too large [19:06:28] ottomata, maybe the logs have a confusing language [19:06:47] mforns: i'm saying, the flushing message comes when [19:06:47] events_batch.append((scid, scid_events)) [19:06:57] ottomata, yes [19:06:57] the reduction on inserts seems to correlate with the purging restarting [19:07:01] and the inserts for a given batch dotn' happen unless that happens [19:07:10] so [19:07:11] i would thikn [19:07:20] flushing [19:07:20] ... [19:07:21] inserted %d events [19:07:28] so maybe I could leave that disabled for a while [19:07:52] mforns: [19:07:52] https://gerrit.wikimedia.org/r/#/c/256738/ [19:08:17] ottomata, but the events that get flushed (queued for insertion) will take a while to get inserted, other batches of events may precede in the queue [19:08:26] ah right [19:08:39] ok, so the order in the logs isn't necessarily indicitive [19:08:45] ok well, that patch should help us correlate :) [19:09:44] (03CR) 10BBlack: [C: 032] interface: use allow-hotplug for ::manual and ::tagged [puppet] - 10https://gerrit.wikimedia.org/r/256734 (https://phabricator.wikimedia.org/T110530) (owner: 10BBlack) [19:10:41] !log restarting eventlogging_sync on db1047 and dbstore1002 [19:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:56] Imagine the queue is full, so the main thread is sleeping. At some point, the worker thread manages to insert an event batch to the db (-> Data inserted xxx), and the queue has now one free slot. So now the main thread has space to continue consuming from kafka and at some point push an event batch to the queue (xxxx is big or old. flushing...). 
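A toy rendering of the queueing mforns describes above, with made-up thresholds: events accumulate per schema, a batch is flushed onto a bounded queue when it gets large or old, and a separate worker thread drains that queue into MySQL. Because of that hand-off, the "flushing" and "Data inserted" log lines need not pair up in order.

    import time
    from queue import Queue

    MAX_BATCH = 4000      # "up to 4000 events per insert statement" (from the log)
    MAX_AGE_S = 300       # assumed age threshold, illustrative only

    insert_queue = Queue(maxsize=8)   # bounded: the Kafka reader blocks when it is full

    class SchemaBuffer:
        def __init__(self, scid):
            self.scid = scid
            self.events = []
            self.started = time.time()

        def add(self, event):
            self.events.append(event)
            too_big = len(self.events) >= MAX_BATCH
            too_old = time.time() - self.started > MAX_AGE_S
            if too_big or too_old:
                # corresponds to the "<scid> queue is large or old, flushing" line
                insert_queue.put((self.scid, self.events))
                self.events, self.started = [], time.time()

    buf = SchemaBuffer('SaveTiming_12236257')
    for i in range(MAX_BATCH):
        buf.add({'n': i})
    print(insert_queue.qsize())   # 1 batch now waiting for the insert thread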
[19:10:57] please tell me if you see things getting degraded again [19:11:02] ottomata, ^ [19:11:20] ja [19:11:25] but both logs are not the same events [19:11:46] ok jynus [19:12:14] Last modification on Edit_13457736 on m4-master.eqiad.wmnet: 20151203005617 [19:12:43] that is 8 hours more than a few minutes ago [19:13:01] cool [19:13:15] mforns: this look ok to you? [19:13:15] https://gerrit.wikimedia.org/r/#/c/256738/ [19:13:28] * mforns looks [19:13:35] (03Abandoned) 10Alexandros Kosiaris: tlsproxy::localssl: Force X-Forwarded-For to $remote_addr [puppet] - 10https://gerrit.wikimedia.org/r/255539 (owner: 10Alexandros Kosiaris) [19:13:55] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: puppet fail [19:14:54] look at the spike at 19UTC on inserts: https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306 [19:14:57] ottomata, you must know that many times the event batch that is pushed to the queue does not correspond to just one insert [19:15:22] the event batch is split by set of non-null fields [19:15:25] (as something good) [19:15:57] HuH/ [19:15:58] ? [19:16:02] ottomata, so even if you see "Edit_133345456 is old, flushing 4000 events" [19:16:26] it doesn't mean you'll see: "Data inserted: 400, Edit_133345456" [19:16:35] jynus: that could be me turning on the 2nd process [19:16:37] you can see: [19:16:54] ottomata, I assumed so [19:16:54] "Data inserted: 1200, Edit_133345456" [19:17:05] "Data inserted: 800, Edit_133345456" [19:17:08] "Data inserted: 200, Edit_133345456" [19:17:14] "Data inserted: 76, Edit_133345456" [19:17:15] really? huh. [19:17:16] etc. [19:17:24] ok, well maybe we can' tcorrelate the batch to the inserts then, but that's ok [19:17:30] but that means that multi-thread is interesting [19:17:30] more info is better anyway? [19:17:43] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:17:48] yes, ottomata, but you're right, it's better nevertheless [19:18:22] one thing I definitelly could help you with is with the importing process [19:18:40] importing? 
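On "the event batch is split by set of non-null fields": one flushed batch for a schema can therefore turn into several INSERT statements, each with a single column list, which is why one "flushing" line may be followed by multiple "Data inserted" counts that sum to the batch size. A small sketch of that grouping; the field names are invented.

    from collections import defaultdict

    def split_by_columns(events):
        groups = defaultdict(list)
        for e in events:
            cols = tuple(sorted(k for k, v in e.items() if v is not None))
            groups[cols].append(e)
        return groups

    batch = [
        {'event_editCount': 5, 'event_page': 'X', 'event_comment': None},
        {'event_editCount': 2, 'event_page': 'Y', 'event_comment': 'fix'},
    ]
    for cols, rows in split_by_columns(batch).items():
        print(cols, len(rows))   # two column sets -> two separate INSERTs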
[19:18:44] INSERT is slow, compared with LOAD DATA [19:19:02] I wrote about it some time ago: http://dbahire.com/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/ [19:19:18] heh [19:20:21] so I offer in the future try to optimize that [19:21:04] 6operations, 10Traffic, 5Patch-For-Review: Fix ethernet startup race on HP LVS w/ jessie - https://phabricator.wikimedia.org/T110530#1849518 (10BBlack) [19:21:06] 6operations, 10Traffic: upgrade lvs1001-3 to jessie - https://phabricator.wikimedia.org/T119517#1849517 (10BBlack) [19:21:12] although, to be fair, probably mysql is not the right tool [19:21:38] indeed, jynus we are importing this data into hadoop now too [19:21:45] and plan to support more throughput with that, but not with mysql [19:22:08] something more analytics-based would be ideal [19:22:08] if we had a way to join mediawiki data in hadoop, we could move away from mysql i think [19:22:17] there is [19:22:24] not "join" :-) [19:22:29] yeah [19:22:31] but there is replication to hadoop [19:22:33] (03PS2) 10Alexandros Kosiaris: Trust the upstream proxy to have the correct client IP [puppet] - 10https://gerrit.wikimedia.org/r/255404 [19:22:34] from mysql [19:22:36] jynus: it is harder than you think, due to no good way of detecting changes in mw dbs [19:22:49] would ahve to do full import every time [19:22:53] ottomata, I am not assuming it is easy! [19:22:55] heheh [19:22:55] :) [19:22:56] (03CR) 10jenkins-bot: [V: 04-1] Trust the upstream proxy to have the correct client IP [puppet] - 10https://gerrit.wikimedia.org/r/255404 (owner: 10Alexandros Kosiaris) [19:23:08] i'm sure there are ways [19:23:13] why a full importy? [19:23:25] i think 2 reasons [19:23:28] hadoop is not good at updates [19:23:41] ah! to haddoop you mean [19:23:45] and, how to you tell what has been changed in mw dbs in the last say, day, or 1 hour, aross all tables? [19:23:46] yeah [19:23:59] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1849540 (10BBlack) [19:24:30] well, that would be it, just replicating the tables as is ( was merely answering madhuvishy ) [19:24:40] I am not saying eventlogging should disappear [19:24:41] aye [19:24:45] oh ja [19:24:59] ANyaywayyy, so jynus purge is back on now? [19:25:03] no [19:25:07] not yet [19:25:25] ok [19:25:27] I was waiting to see if there was strange patters now [19:25:42] mforns: can I merge and deploy this change to logging? [19:25:44] it can wait anyway afull day [19:26:10] or if you want to test it right no, I can enable it at 100% speed [19:26:13] ottomata, looking at it, I'm not sure if this parameter passing is ok [19:26:24] but only if we can tell the difference [19:26:25] have you tested it? [19:26:29] mforns: nope :) [19:26:36] just a sec [19:26:42] jynus: let's wait [19:27:00] as in, a few minutes or 1 day ? [19:27:06] a day [19:27:08] til we are caught up [19:27:10] ok, then [19:27:28] mforns: i need food anyway, and to leave this cafe [19:27:33] will be back in an hourish to check up on things [19:27:35] ok ottomata [19:27:36] maybe we can meet tomorrow for a test? 
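For the INSERT-versus-LOAD DATA point jynus links above, here is the shape of the two statements a bulk loader could emit. The table and column names are made up, quoting via repr() is for illustration only (not injection-safe), and executing either statement would need a real DB-API cursor with LOCAL INFILE enabled.

    def multi_row_insert(table, cols, rows):
        # one round trip carrying many rows: already far better than row-at-a-time
        values = ', '.join(
            '(' + ', '.join(repr(v) for v in row) + ')' for row in rows)
        return 'INSERT INTO %s (%s) VALUES %s;' % (table, ', '.join(cols), values)

    def load_data_stmt(table, cols, tsv_path):
        # the server (or client, with LOCAL) streams a flat file straight in,
        # skipping per-statement parse overhead, which is why it benchmarks faster
        return ("LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
                "FIELDS TERMINATED BY '\\t' (%s);" % (tsv_path, table, ', '.join(cols)))

    print(multi_row_insert('MobileWebEditing_1234', ['uuid', 'ts'],
                           [('abc', 20151203000000), ('def', 20151203000001)]))
    print(load_data_stmt('MobileWebEditing_1234', ['uuid', 'ts'], '/tmp/batch.tsv'))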
[19:27:43] i'm leaving my 2nd consumer proc up and running for now [19:27:48] ok [19:27:50] it sjust running in a screen on eventlog1001 [19:27:52] in case you need to kil it [19:27:59] ok got it [19:28:27] mforns: am logging output from it to /tmp/otto-eventlogging-mysql-consumer2.log [19:28:35] ok [19:28:44] thx [19:28:52] 6operations, 10Traffic, 7Browser-Support-Internet-Explorer, 7HTTPS: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#1849564 (10Dzahn) friendly bump [19:29:04] jynus: yeah, if we do it'll have to be in my morning i think i'm only working hafl day tomorro [19:29:15] ok ,back in a little while... [19:29:28] just ping me [19:30:28] (03CR) 10Alexandros Kosiaris: [C: 032] "TL;DR this is going to work, I 'll merge it" [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [19:30:47] (03PS4) 10Alexandros Kosiaris: xvfb: switch to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [19:30:53] (03CR) 10Alexandros Kosiaris: [V: 032] xvfb: switch to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/256643 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [19:36:37] (03Abandoned) 10Alexandros Kosiaris: Trust the upstream proxy to have the correct client IP [puppet] - 10https://gerrit.wikimedia.org/r/255404 (owner: 10Alexandros Kosiaris) [19:38:19] 6operations, 10ops-eqiad: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1849615 (10chasemp) 3NEW a:3Cmjohnson [19:39:01] 6operations, 10ops-eqiad: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1849632 (10chasemp) [19:39:14] 6operations, 6Labs, 10hardware-requests: Get Ops bare metal test server - https://phabricator.wikimedia.org/T118588#1849633 (10chasemp) 5Open>3Resolved a:3chasemp [19:39:41] (03PS2) 10Alexandros Kosiaris: misc-web: Force HTTPS for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/255405 [19:40:01] 6operations, 10ops-eqiad: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1849615 (10chasemp) [19:40:04] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:40:09] (03PS3) 10Alexandros Kosiaris: misc-web: Force HTTPS for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/255405 [19:40:16] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] misc-web: Force HTTPS for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/255405 (owner: 10Alexandros Kosiaris) [19:41:35] !log disabling pybal on lvs100[123] over the next few minutes (for reinstall to jessie later after confirmation everything is still ok on [456]) [19:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:42] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1849646 (10RobH) 3NEW a:3RobH [19:44:44] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1849646 (10RobH) [19:44:52] (03PS3) 10Ori.livneh: redis: fix config file world-readability [puppet] - 10https://gerrit.wikimedia.org/r/256666 (owner: 10Filippo Giunchedi) [19:45:09] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: fix config file world-readability [puppet] - 10https://gerrit.wikimedia.org/r/256666 (owner: 10Filippo Giunchedi) 
[19:45:14] 6operations, 10ops-eqiad: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1849662 (10chasemp) chatted with chris a bit and he is out atm we will catch up on this early next week [19:46:15] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [19:48:18] (03CR) 10Eevans: [C: 031] cassandra: provision restbase1009 with 128 tokens [puppet] - 10https://gerrit.wikimedia.org/r/256690 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [19:48:46] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1849671 (10Dzahn) 5Open>3Resolved a:3Dzahn the linked RT ticket is closed as Robh pointed out and certs are valid until 2016-10-20 [19:49:43] 6operations, 10Traffic, 7HTTPS: Samsung GT-S3650 can't connect to Wikipedia - https://phabricator.wikimedia.org/T108298#1849683 (10Dzahn) 5Open>3declined a:3Dzahn declining because we really can't do much about it [19:53:22] Hey bblack! Just wondering if you could give a guess of how long you reckon it would take netops to get around to https://phabricator.wikimedia.org/T120010 ? :) [19:53:47] (03PS2) 10Alexandros Kosiaris: Have misc-web talk directly to etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/255406 [19:53:49] (03PS1) 10Alexandros Kosiaris: etherpad: Have nodejs listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/256743 [19:53:53] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [19:53:59] (03CR) 10GWicke: [C: 031] "@Filippo: We can also leave those instances smaller until we are fully converted to several instances, at which point we can convert those" [puppet] - 10https://gerrit.wikimedia.org/r/256690 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [19:54:04] PROBLEM - pybal on lvs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [19:54:54] ACKNOWLEDGEMENT - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black preparing for https://phabricator.wikimedia.org/T119517 [19:54:54] ACKNOWLEDGEMENT - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black preparing for https://phabricator.wikimedia.org/T119517 [19:54:54] ACKNOWLEDGEMENT - pybal on lvs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black preparing for https://phabricator.wikimedia.org/T119517 [19:56:21] (03CR) 10Yuvipanda: k8s: switch to using systems' CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [19:56:30] (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: Have nodejs listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/256743 (owner: 10Alexandros Kosiaris) [19:56:51] akosiaris: \o/ is this going to mean no apache? [19:57:47] (03CR) 10Yuvipanda: toolschecker: switch to using base::puppet::ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243666 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [19:58:44] yuvipanda: yes [19:58:50] but I have one problem... [19:59:09] so damn express.js has this setting called "trust proxy" [19:59:32] yuvipanda: https://gerrit.wikimedia.org/r/#/c/255404/ [19:59:46] so I got 2 choices.. 
either merge that and hope nobody abuses it [19:59:51] or not merge it [19:59:52] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1849717 (10aaron) The list of thumbnails to purge comes from the list of thumbnails in swift. Since these are also deleted on... [20:00:08] yuvipanda: I 've already abandoned it, but reluctantly [20:00:12] what do you think ? [20:00:26] akosiaris: how did we deal with this with apache? [20:01:10] yuvipanda: https://phabricator.wikimedia.org/rOPUPc8449d423f2cf10141627df15ac11ab3842e8754 [20:01:15] aka, use X-Real-IP [20:01:28] as in it's bblack's problem ;-) [20:01:49] (03PS3) 10Giuseppe Lavagetto: toolschecker: use system's CA certs [puppet] - 10https://gerrit.wikimedia.org/r/243666 (https://phabricator.wikimedia.org/T114638) [20:01:54] haha [20:01:56] :D [20:02:04] but seriously, X-Real-IP due to all the logic we got in varnish seems to me more trustable [20:02:11] I might be wrong on that btw [20:02:29] <_joe_> yuvipanda: ^^ is reviewable now [20:02:40] <_joe_> fancy merging it? [20:02:51] <_joe_> I'll amend the k8s one too [20:02:53] _joe_: yeah looking now [20:03:32] (03CR) 10Yuvipanda: [C: 032] toolschecker: use system's CA certs [puppet] - 10https://gerrit.wikimedia.org/r/243666 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [20:03:41] are there known issues with the job queue right now? [20:04:24] oh, hm: 16:50 logmsgbot: oblivian@tin Synchronized wmf-config/CommonSettings.php: Fix the jobqueue on wikitech (duration: 00m 28s) [20:04:44] that's when it started to behave oddly [20:05:45] not sure who initiated this sync (oblivian?); ori, do you know anything about this? [20:05:59] not sure what oddly means... I can see queue size increases [20:06:00] _joe_: mediawiki::maintenance::refreshlinks is about --dfn-only [20:06:09] gwicke: that's _joe_ [20:06:14] _joe_: that runs on terbium indeed, but only deals with cleaning up garbage that may be left behind. [20:06:21] https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=8&fullscreen [20:06:32] <_joe_> gwicke: uhm that is actually strange [20:06:40] ah, that's more clear, thanks [20:06:40] _joe_: It does not deal with updateLinks jobs being broken and then forgotten. Those do not repair themselves over time. [20:06:42] it's not sending as many requests as usual, and it's pulsating in a strange way [20:07:03] <_joe_> gwicke: https://grafana.wikimedia.org/dashboard/db/job-queue-health [20:07:09] it's the heartbeat... next you are going to see a straight line [20:07:12] :-D [20:07:35] <_joe_> gwicke: how can that have that effect? [20:07:45] I don't know anything about the change [20:07:50] <_joe_> I mean if it has happened at the same time [20:08:02] <_joe_> gwicke: lemme find it for you [20:08:04] the times seem to line up exactly [20:08:18] <_joe_> yeah that's why I'm not dismossing your concern [20:09:44] _joe_: (that patch works fine!) [20:11:03] <_joe_> gwicke: so that patch partially reverted something that ori did, because it was sending wikitech jobs to the jobqueue, where they do not belong [20:11:26] I might be missing a piece of the puzzle here, how is the request rate to the REST API connected to the joqueue ? [20:11:48] akosiaris: many of those requests are re-render traffic from the job queue [20:12:08] ah, the old re-render parsoid jobs ? 
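The "trust proxy" question above boils down to a single rule: only believe a forwarded-IP header when the TCP peer is one of your own proxies. A minimal Python sketch of that rule, purely illustrative (this is not the etherpad/express.js or varnish code; the networks are placeholders, and only the X-Real-IP header name is taken from the discussion above):

    from ipaddress import ip_address, ip_network

    # Placeholder proxy networks; in production this would be the cache layer.
    TRUSTED_PROXIES = [ip_network("10.0.0.0/8"), ip_network("192.0.2.0/24")]

    def client_ip(peer_addr, headers):
        """Honour X-Real-IP only when the request arrived via a trusted proxy."""
        forwarded = headers.get("X-Real-IP")
        if forwarded and any(ip_address(peer_addr) in net for net in TRUSTED_PROXIES):
            return forwarded
        # Direct hit (or untrusted peer): the TCP peer itself is the client.
        return peer_addr

    print(client_ip("10.64.0.5", {"X-Real-IP": "198.51.100.7"}))    # -> 198.51.100.7
    print(client_ip("203.0.113.9", {"X-Real-IP": "198.51.100.7"}))  # -> 203.0.113.9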
[20:12:09] the equivalent of linkupdates [20:12:12] ok [20:12:25] now they happen via the REST API, yeah makes sense, thanks [20:12:43] <_joe_> gwicke: I can't find either a reason or a smoking gun [20:12:54] <_joe_> but if you feel like trying, you can revert that change [20:13:13] _joe_: could also be some weird failure in how config changes are applied to job queue runners [20:13:19] _joe_: https://logstash.wikimedia.org/#dashboard/temp/AVFpejDbptxhN1Xao0Gy [20:13:35] seems it got worse? [20:13:55] I guess that may be the job runners [20:14:09] previously the ones already queued ran, but failed [20:14:18] maybe now it can't even run because the config is missing, but they're still in the queue? [20:14:19] Not sure.. [20:14:28] <_joe_> Krinkle: I'm pretty sure labswiki is not sending anything to the jobrunners anymore [20:14:32] AaronSchulz: hjelp! [20:14:33] looks like the same error though [20:14:35] <_joe_> they're still in the queue, yes [20:14:59] > /rpc/RunJobs.php?wiki=labswiki&type=RestbaseUpdateJobOnDependencyChange&maxtime=30&maxmem=300M [20:15:07] They are rerender jobs for labs wiki [20:15:15] * Krinkle has been doing a lot of editing on wikitech today [20:15:33] <_joe_> Krinkle: I don't dare clean our queue [20:15:42] gwicke: can you confirm it is only affecting labswiki? [20:16:02] it's affecting all jobs from what I can tell [20:16:58] <_joe_> Krinkle: so my hypothesis is that we have a month's worth of wikitech jobs that run and are retried [20:17:05] <_joe_> so maybe they now fail fast? [20:17:09] Right [20:17:36] previously they failed on [20:17:37] > Can't connect to MySQL server on '208.80.154.136' (4) [20:17:40] after our change they fail on [20:17:56] the distribution of projects that are being re-rendered is still fairly nominal: https://grafana-admin.wikimedia.org/dashboard/db/restbase-cassandra-cf-performance?panelId=25&fullscreen [20:17:59] Error connecting to 208.80.154.136: Can't connect to MySQL server on '208.80.154.136' (4) [20:18:04] So the same error [20:18:05] just more of them [20:18:10] in logstash [20:18:12] about 1000x as much [20:18:25] but yeah, I think the removal of that config makes them fail in a way it doesn't handle as well [20:18:28] <_joe_> I have honestly no idea why [20:18:37] <_joe_> so ok, back to the code [20:18:56] I'll grab some food, bbiam [20:18:57] The past 5 hours after our change, the error is triggered 1303457x [20:18:58] <_joe_> Krinkle: let's revert that patch? [20:19:09] can we hack them to succeed for now? [20:19:18] In the 16 hours before it, 24718x [20:19:22] <_joe_> ori: I am unsure what is going on [20:19:27] yeah, maybe revert for now is best [20:19:38] Afaik it is only affecting wikitech and mostly because I did a lot of edits to templates [20:19:45] <_joe_> oh [20:19:53] so it's more volume because there is more input, not because it regressed or anything [20:20:03] but I'm fine with a revert to see if it changes anything [20:20:09] <_joe_> ori: anyways, including the jobqueue files independently of the wiki was wrong [20:20:25] <_joe_> Krinkle: labswiki is not sending jobs to the jobqueue at all [20:20:28] <_joe_> since the change [20:20:33] Yeah, which is intended.
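For scale, the logstash counts Krinkle quotes above work out to roughly the following failure rates (plain arithmetic on the quoted numbers, nothing else assumed):

    before = 24718 / 16      # ~1,500 failures/hour in the 16 h before the change
    after = 1303457 / 5      # ~260,000 failures/hour in the 5 h after it
    print(f"before: {before:,.0f}/h, after: {after:,.0f}/h, ratio: {after / before:.0f}x")
    # The ratio is roughly 170x -- somewhat below the eyeballed "1000x as much",
    # but still the same jobs failing and being retried in a far tighter loop.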
[20:20:35] <_joe_> I had a tcpdump opened [20:20:37] So these failures are from old jobs [20:20:41] 6operations: mw1259 does not have hyperthreading enabled - https://phabricator.wikimedia.org/T120270#1849809 (10Southparkfan) 3NEW [20:20:50] which previously were successfully acknowledged and then fail [20:20:56] and now it can't acknowledge them or something like that [20:21:07] SouthparkfanZNC: hah, awesome. thanks [20:21:30] check the error logs on the job runners [20:21:32] see where it fails [20:21:41] and just have it silently discard messages for unknown wikis [20:21:45] <_joe_> I am trying to see it now [20:21:51] I don't use SouthparkfanZNC but mkay [20:22:47] There is no change in the errors in logstash from job runners. Same error, from the same kind of mw* servers. Just more of the same [20:22:51] https://logstash.wikimedia.org/#/dashboard/temp/AVFpejDbptxhN1Xao0Gy [20:22:53] <_joe_> ori: there is nothing in the local hhvm log [20:22:58] The same as for the past month [20:23:14] (main job runners can't connect to labswiki db, which is by design) [20:23:37] <_joe_> where should I expect to see those errors? fluorine? [20:23:38] It's just that now that we removed this bad config, the existing queued jobs are failing in a loop or something like that [20:23:47] fluorine or logstsash [20:23:55] <_joe_> ori: is there a way to remove jobs from the queue? [20:24:07] Is there a manageJobs maintenance script or something so we can drop the old wikitech jobs that are in teh wrong pool? [20:25:35] let's try to srem labswiki from `jobqueue:aggregator:s-wikis:v2` on the aggregator redises [20:25:51] <_joe_> ori: should I do that? [20:25:59] let me look first [20:27:08] <_joe_> Krinkle: so your idea is that now the jobrunners can't ack that the jobs are failing [20:27:11] <_joe_> that makes sense [20:27:37] _joe_: Yeah, I haven't verified the job ids, but I suspect it is running the same ones many times or in a loop [20:27:42] whereas previously they would be marked as failure [20:27:51] ori: makes sense to remove those [20:27:53] somehow the exception is happening in a worse way now because the config is no longer there [20:28:05] !log ran srem jobqueue:aggregator:s-wikis:v2 labswiki on rdb1001 aggr [20:28:09] now restarting jobrunners [20:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:29] !log on palladium: salt -G 'cluster:jobrunner' cmd.run 'service jobrunner status | grep running && service jobrunner restart' ; salt -G 'cluster:jobrunner' cmd.run 'service jobchron status | grep running && service jobchron restart' [20:29:44] gotta go now, team meeting, bbiab [20:31:58] <_joe_> didn't seem to have any effect [20:33:09] <_joe_> gwicke: so yes, this is the result of the jobs failing faster somehow [20:33:46] which jobs? [20:33:46] (03PS1) 10Ottomata: Fix burrow email template [puppet] - 10https://gerrit.wikimedia.org/r/256749 [20:33:51] all of them? [20:33:55] <_joe_> the jobs on labswiki [20:33:58] <_joe_> no just labswiki [20:34:10] <_joe_> which apparently makes them fail in a loop [20:34:12] <_joe_> uhm [20:34:21] how would this affect other jobs? 
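The aggregator cleanup ori logs above ("ran srem jobqueue:aggregator:s-wikis:v2 labswiki on rdb1001 aggr"), expressed as a redis-py sketch. Host, port and the missing authentication are simplifications, so treat this as an illustration rather than the exact command path:

    import redis

    # rdb1001 is the aggregator host named in the log; auth/port details omitted.
    r = redis.Redis(host="rdb1001.eqiad.wmnet", port=6379)

    # Remove labswiki from the set of wikis the aggregator advertises to the
    # job runners, so they stop trying to pop (and fail) its jobs.
    removed = r.srem("jobqueue:aggregator:s-wikis:v2", "labswiki")
    print("removed" if removed else "was not in the set")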
[20:34:32] <_joe_> just the enqueuing of them [20:35:06] <_joe_> so let me try something [20:35:23] I don't understand how jobs for a small wiki that shouldn't be there in the first place failing faster would lead to the pattern we are seeing [20:36:07] <_joe_> if you have a better explanation, please share [20:36:52] Coren, as per a brief conversation I just had with Chase, we’re shelving tomorrow’s ironic meeting. The upshot is that we’re waiting for Chris to install some hardware next week and then I’m going to work on the fallback ‘labs puppet on physical hardware’ plan [20:36:58] _joe_: which change is this again? [20:37:01] since Ironic seems… more than three weeks away [20:37:03] (03CR) 10Ottomata: [C: 032] Fix burrow email template [puppet] - 10https://gerrit.wikimedia.org/r/256749 (owner: 10Ottomata) [20:37:28] andrewbogott: Yeah, given all the delays before we even had the hardware happy. [20:37:36] <_joe_> gwicke: https://gerrit.wikimedia.org/r/#/c/256698/ [20:37:41] yup yup Coren thanks andrewbogott [20:39:26] _joe_: I'd have to look more deeply into this, but the removal in line 193 looks a bit odd [20:39:35] require "$wmfConfigDir/jobqueue-{$wmfDatacenter}.php"; [20:39:53] <_joe_> gwicke: if you look at commonsettings [20:39:56] <_joe_> let me find it [20:40:42] <_joe_> gwicke: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L841 [20:41:23] <_joe_> that variable is defined here: [20:41:25] <_joe_> https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L15903-L15906 [20:41:49] (03PS1) 10Ottomata: Properly qualify variable in erb template [puppet] - 10https://gerrit.wikimedia.org/r/256750 [20:42:25] <_joe_> see the change that introduced the problem: https://gerrit.wikimedia.org/r/#/c/250170/5/wmf-config/CommonSettings.php,cm [20:43:03] (03CR) 10Ottomata: [C: 032] Properly qualify variable in erb template [puppet] - 10https://gerrit.wikimedia.org/r/256750 (owner: 10Ottomata) [20:44:51] 6operations, 10ops-eqiad, 6Labs: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1849886 (10chasemp) [20:45:23] gwicke: It could be exausting slots if it fails in a loop the way ai described earlier [20:45:24] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail [20:45:34] <_joe_> !log opening connection from mw1001 to silver, mysql [20:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:41] <_joe_> uhm this doesn't work [20:45:52] <_joe_> still don't have grants [20:46:13] <_joe_> Krinkle: anyways, this is much worse than the preceding situation [20:46:21] * Krinkle is in a meeting [20:46:33] (03PS1) 10Ottomata: to_emails should be a list in burrow role [puppet] - 10https://gerrit.wikimedia.org/r/256751 [20:46:40] did the drop of labswiki fro the queue and reboot not fix it? 
[20:46:49] <_joe_> not at all [20:47:35] Hey guys, an IP address block I did isn't showing up in the logs [20:47:44] <_joe_> 10% of all submitted jobs on mw1001 from the last 100K is to labswiki [20:48:28] and irc.wikimedia.org put a 1 in front of the 0 of the range I blocked for the on-wiki block [20:48:49] <_joe_> gwicke: so while I am not sure what can cause this problem, I'm ok with rolling back and letting or.i and aaron figure out what is happening to the queue [20:48:52] (03CR) 10Ottomata: [C: 032 V: 032] to_emails should be a list in burrow role [puppet] - 10https://gerrit.wikimedia.org/r/256751 (owner: 10Ottomata) [20:48:54] (03PS1) 10Alexandros Kosiaris: trebuchet: change ferm rule to deployable networks [puppet] - 10https://gerrit.wikimedia.org/r/256752 [20:49:57] _joe_: yeah, that might be the best way to move forward [20:50:18] <_joe_> sigh, ok [20:50:41] (03PS1) 10Giuseppe Lavagetto: Revert "Inclusion of jobqueue files is not unconditional" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256753 [20:51:02] <_joe_> I'm unconvinced this will fix anything [20:51:16] (03PS2) 10Giuseppe Lavagetto: Revert "Inclusion of jobqueue files is not unconditional" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256753 [20:51:30] _joe_: yeah, but in any case it'll give us one more bit of information [20:51:47] nvm on those.. [20:51:51] <_joe_> gwicke: did you look at the code? [20:51:51] and it sounds like the consequences either way aren't catastrophic [20:52:02] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Inclusion of jobqueue files is not unconditional" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256753 (owner: 10Giuseppe Lavagetto) [20:52:06] _joe_: not yet in depth as I was in another parallel discussion [20:52:14] <_joe_> ah, ok :) [20:52:22] Bsadowski1: did the block show up? [20:52:23] I'll have a look, though [20:52:32] <_joe_> anyways, 4 people looked and it seemed ok to all of us [20:52:35] (03Merged) 10jenkins-bot: Revert "Inclusion of jobqueue files is not unconditional" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256753 (owner: 10Giuseppe Lavagetto) [20:52:46] <_joe_> gwicke: rolling back now [20:52:57] _joe_: the only other idea I have would be a restart of runners [20:53:06] <_joe_> ori did that already [20:53:14] okay [20:53:25] jynus: Inserted 4855 MobileWebUIClickTracking_10742159 events in 43.245376 seconds [20:53:33] 43 seconds is kinda long.. [20:53:34] eh? [20:54:12] unless there is some long bottleneck in constucting the query in sqlalchemy, that should be the db query time [20:54:58] _joe_: labs jobs reaching restbase should fail quickly, as labs isn't configured there [20:55:11] so they shouldn't hold off anything else [20:55:16] *hold up [20:55:21] <_joe_> gwicke: they fail before reaching restbase [20:55:26] <_joe_> now [20:55:33] <_joe_> but well, I'm reverting [20:55:48] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: Fix the jobqueue on wikitech (duration: 00m 47s) [20:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:59] _joe_: thanks! [20:56:05] we'll know a tiny bit more in a minute [20:56:26] <_joe_> gwicke: I'm sure there is some weird 3-rd order effect there [20:56:36] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
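The consumer timing quoted above ("Inserted 4855 ... events in 43.245376 seconds") implies roughly the following throughput; simple division only, and the four-consumer figure mentioned later in the log is the attempted remedy, assuming inserts parallelise cleanly:

    events, seconds = 4855, 43.245376
    print(f"{events / seconds:.0f} events/sec with a single MySQL consumer")  # ~112/sec
    # Four parallel consumer processes would lift that ceiling to ~450/sec at
    # best, and only if the batches are independent and the database keeps up.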
[20:56:37] <_joe_> because the rest of the jobs were not fauiling [20:57:01] (03PS1) 10BryanDavis: l10nupdate: Fix duration reported for ResourceLoader purge [puppet] - 10https://gerrit.wikimedia.org/r/256754 [20:57:21] <_joe_> they are in fact slowing down [20:57:28] <_joe_> not disappearing, mind it [20:58:00] <_joe_> ok, they are now reduced to the previous numbers [20:58:19] (03PS2) 10BryanDavis: l10nupdate: Fix duration reported for ResourceLoader purge [puppet] - 10https://gerrit.wikimedia.org/r/256754 [20:58:25] <_joe_> I am pretty sure we should retry after we eliminate silver's jobs from the queue [20:58:42] yeah, to me the theory of labs job failures in the runners affecting other jobs by tying up runner resources sounds plausible [20:59:00] especially if the connection is hitting a firewall blackhole [20:59:05] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1849958 (10Smalyshev) 3NEW [20:59:08] that often leads to long hangs [20:59:18] <_joe_> well, that or connection failures too [20:59:20] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1849968 (10Smalyshev) [20:59:50] <_joe_> gwicke: can you lead on this a bit? it's 10 PM, I'm around since 7 AM or so... [21:00:02] <_joe_> I'll be here-ish [21:00:04] bd808: Dear anthropoid, the time has come. Please deploy Wikimania Scholarships 2016 update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151203T2100). [21:00:05] _joe_: I'll keep nagging, yeah [21:00:15] and help as much as I can [21:00:18] <_joe_> I think we need aaron :) [21:00:39] <_joe_> as soon as they're done with their team meeting it might become easy :P [21:00:41] yeah, I cried for help earlier, but my cries were not heard (yet) [21:01:00] <_joe_> gwicke: at what time? probably caught me at dinner [21:01:14] oh, towards AaronSchulz [21:01:37] <_joe_> and this shows why managing large systems is interesting - we fixed a bug, we caused an effect worse than what we were trying to cure [21:01:40] <_joe_> :P [21:01:56] yup, and another puzzle falls out of it [21:02:18] request rates seem to recover [21:03:11] getting the job queue to forget about a class of jobs is tricky as I recall. [21:03:16] yeah our team meeting will be over in 15-20 [21:03:32] ori figured it out in January... it's in phab somewhere [21:03:48] bd808: yeah, it's even hard to convince it not to retry jobs forever [21:04:04] <_joe_> bd808: yeah definitely beyond my mental capacities right now [21:04:53] <_joe_> btw, once you figured this out, I'd advise to re-revert and then clean the queue [21:05:20] * gwicke keeps refreshing grafana [21:05:49] <_joe_> gwicke: it's fixed and I'm pretty sure rb is ok too [21:06:15] yeah, it looks like it; it's just that one graph hasn't updated yet [21:06:26] ahhh, there it is [21:06:31] <_joe_> AaronSchulz: I need a thorough lesson on how the jobqueue works :) [21:06:32] found the magic ticket -- https://phabricator.wikimedia.org/T87360#991136 [21:06:32] all good it seems [21:07:47] <_joe_> gwicke: not all good, we're losing the wikitech jobs this way [21:07:50] 6operations, 10Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1850027 (10Vituzzu) Worksforme! 
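On the "firewall blackhole" theory above: a dropped (blackholed) connection attempt sits in the kernel's TCP retry window unless the caller sets its own timeout, while a rejected one fails almost instantly. A small illustrative Python check (192.0.2.1 is a documentation address standing in for an unreachable host):

    import socket
    import time

    def try_connect(host, port, timeout=None):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "connected"
        except OSError as exc:
            return f"failed after {time.monotonic() - start:.1f}s ({exc})"

    # Without an explicit timeout this can hang for minutes per attempt, which
    # is how a handful of doomed jobs can tie up a whole pool of runner slots.
    print(try_connect("192.0.2.1", 3306, timeout=3))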
[21:07:53] <_joe_> but better, yes [21:07:57] <_joe_> bd808: ah, great [21:08:26] is it getting better now? [21:08:30] _joe_: yeah, still need to address that [21:08:37] i ran the following on all the redis instances: [21:08:41] EVAL "local keys = redis.call('keys', ARGV[1]) \n for i=1,#keys,5000 do \n redis.call('del', unpack(keys, i, math.min(i+4999, #keys))) \n end \n return keys" 0 labswiki:* [21:08:50] (03PS1) 10Ottomata: Puppetize 4 eventlogging mysql consumer processes to speed up db insertion [puppet] - 10https://gerrit.wikimedia.org/r/256755 [21:08:52] via http://stackoverflow.com/questions/4006324/how-to-atomically-delete-keys-matching-a-pattern-using-redis [21:09:14] ori: when did you run this? [21:09:22] over the past minute [21:09:27] ori: you probably need to clean up jobqueue:aggregator:h-ready-queues:v2 for */labswiki too [21:09:28] <_joe_> ori: we reverted that change, shit [21:09:44] <_joe_> ori: should we try to re-deploy it? [21:09:46] the recovery started before the last minute [21:10:00] <_joe_> I can deploy HEAD~1 and we re-clean the queues [21:10:09] <_joe_> because right now silver is sending jobs there [21:10:29] +1 [21:10:31] no objections from me, as long as we make sure that it works afterwards [21:10:50] <_joe_> I'll deploy HEAD~1 from tin [21:11:09] <_joe_> ori when I'm done, we can clean up redis again? [21:11:13] yes [21:11:22] (03PS2) 10Ottomata: Puppetize 4 eventlogging mysql consumer processes to speed up db insertion [puppet] - 10https://gerrit.wikimedia.org/r/256755 [21:12:41] <_joe_> syncing [21:13:01] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: Re-fix the jobqueue on wikitech after redis cleanup (duration: 00m 26s) [21:13:03] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:06] (03CR) 10Mobrovac: "For services based on service-runner, we have a special service::node Puppet define which greatly simplifies the integration of a new serv" [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [21:13:16] (03CR) 10Mobrovac: [C: 04-1] WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [21:13:24] <_joe_> ori: kill! kill! kill! [21:13:47] * gwicke copies line to quote file [21:14:04] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 70327 MB (3% inode=99%) [21:14:24] what's that thingy's url to immortalise _joe_'s words? [21:14:25] got this one ^^ [21:14:32] bd808 ought to know [21:14:35] :D [21:14:36] <_joe_> gwicke: keep an eye on restbase [21:14:48] that node is actually decommissioned already [21:15:04] gwicke: still undergoing compaction though? [21:15:12] is that possible? 
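The EVAL one-liner above deletes labswiki:* keys with a blocking KEYS call inside a Lua script. Here is a gentler equivalent sketch using redis-py and SCAN (a deliberate substitution for KEYS, along the lines of the linked Stack Overflow answer); host and auth details are simplified, and the jobqueue:aggregator:h-ready-queues:v2 fields AaronSchulz mentions still need separate cleanup:

    import redis

    r = redis.Redis(host="rdb1001.eqiad.wmnet", port=6379)  # illustrative host, no auth shown

    # Walk the keyspace incrementally instead of blocking the server with KEYS,
    # deleting labswiki job keys in batches.
    batch = []
    for key in r.scan_iter(match="labswiki:*", count=1000):
        batch.append(key)
        if len(batch) >= 1000:
            r.delete(*batch)
            batch.clear()
    if batch:
        r.delete(*batch)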
[21:15:18] !log stopped cassandra on 1009 as it's decommissioned & will be reimaged [21:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:31] mobrovac: yes, decommission is only a network-level operation [21:15:42] it moves all the data off & gives up ownership of it [21:15:55] RECOVERY - Disk space on restbase1009 is OK: DISK OK [21:16:01] it leaves the ring, but the process is still running [21:16:05] <_joe_> ok no more jobs to labswiki [21:16:13] mobrovac: https://tools.wmflabs.org/bash or "!bash your quote here" in this channel or others with stashbot present :) [21:16:56] https://tools.wmflabs.org/bash/help#add [21:17:24] !bash <_joe_> ori: kill! kill! kill! [21:17:42] how does it look now? [21:18:05] gwicke: https://tools.wmflabs.org/bash/quip/AVFptO_y1oXzWjit6stI [21:18:16] <_joe_> whoah https://grafana.wikimedia.org/dashboard/db/job-queue-health [21:18:20] no drop yet, but the metrics aren't fully updated yet [21:18:31] <_joe_> ori: that is... geez [21:18:32] looking good so far [21:18:46] <_joe_> gwicke: when you confirm it, I'll re-revert [21:18:51] !log Cleaning up msg_resource rows with bogus language codes [21:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:18:56] <_joe_> (right now we're one commit behind master [21:19:37] bd808: nice ;) [21:19:59] still looking good [21:20:37] * gwicke activates auto-refresh every 10s [21:21:21] !log rebooting lvs100[123] for reinstall to jessie [21:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:33] <_joe_> ori,gwicke https://gerrit.wikimedia.org/r/#/c/256757/ [21:21:55] !log restarted eventlogging with 4 mysql consumer processes running in parallel [21:21:55] <_joe_> whenever you feel it's ok, +2 it [21:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:31] +2ed [21:23:20] <_joe_> gwicke: what graph are you looking at? [21:23:33] https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=8&fullscreen primarily [21:24:06] <_joe_> because it doesn't look good at all :/ [21:24:12] <_joe_> wtf is going on here [21:24:19] refresh ;) [21:24:25] <_joe_> I did [21:24:42] <_joe_> ah, I hate graphite [21:24:43] it's often looking a bit down for the last time unit [21:24:48] <_joe_> yup [21:24:52] <_joe_> GRRRR [21:25:22] if I have learned anything about graphite, it is that confidence increases slightly from right to left [21:25:31] <_joe_> eheh [21:25:52] !bash if I have learned anything about graphite, it is that confidence increases slightly from right to left [21:26:02] is there an actual bot or is bd808 adding them manually? 
[21:26:17] there's a bot [21:26:18] I think it's bd808 doing it manually [21:26:21] and responding with a link [21:26:26] logstash is watching [21:26:26] lets see if it still works ;) [21:26:48] It will show at the top of the search view [21:27:28] https://tools.wmflabs.org/bash/search [21:27:34] !log Applied database migrations and purged last year's data from Wikimania Scholarships db [21:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:04] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: re-sync (re-merged the change) (duration: 00m 29s) [21:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:36] ok so [21:29:12] it would definitely be useful to have a maintenance script that knows about $wgJobQueueAggregator and $wgJobTypeConf['default'] [21:29:18] and lets you run a redis command against all backends [21:29:20] grrit-wm? [21:29:31] 21:14 -!- grrrit-wm [~lolrrit@208.80.155.255] has quit [Remote host closed the connection] [21:29:32] <_joe_> it's gone it seems [21:29:37] <_joe_> yuvipanda: ^^ [21:29:37] bblack: broken in a reboot, yuvipanda is working on it [21:29:46] yeah [21:29:47] <_joe_> oh ok [21:29:48] am working on it [21:29:53] <_joe_> sorry :) [21:30:01] ori: so what's going on now? [21:30:13] AaronSchulz: nicely played ;) [21:30:21] <_joe_> ahahahah [21:30:28] beats me, i just did my usual thing of running a bunch of things without thinking them through [21:30:45] <_joe_> so my hypothesis is what follows: [21:30:45] i'll take credit if it helped [21:30:45] :) [21:30:50] * AaronSchulz meant high level [21:31:07] ori: I assume the goal is for wikitech to only push/pop jobs from it's own DB queue right? [21:31:08] <_joe_> 1) we were submitting jobs for labswiki to the jobrunners, which was wrong [21:31:14] AaronSchulz: yes [21:31:15] <_joe_> 2) I fixed that [21:31:23] and before it was pushing them to it's DB but also updating the redis aggregator [21:31:36] causing the main runners to launch for wikitech and get db conn errors? [21:31:39] <_joe_> AaronSchulz: exactly [21:32:10] <_joe_> AaronSchulz: when I fixed that, what happened is that for some reason the jobqueue was processing obsessively jobs for labswiki [21:32:24] <_joe_> also, when I rolled back and ori did his trick on redis [21:32:27] maybe because they had been sitting in the queue longest? [21:32:31] I'd really like to just not have MW update the aggregator (I already started the patches for that a while back) [21:32:41] <_joe_> the job queue somehow became 1 mil objects long [21:32:56] <_joe_> which worries me [21:32:57] Reedy: that's over 9,000, if you're wondering. [21:33:09] it is dropping [21:33:11] ori: ty [21:33:26] Reedy: :) [21:33:26] <_joe_> ori: heh, graphite again I guess [21:33:40] ori: there's over 9000 bogus msg_resource rows on commonswiki :( [21:33:55] https://www.youtube.com/watch?v=SiMHTK15Pik [21:33:56] <_joe_> ori: it's going down very slow [21:34:02] <_joe_> I'm actually worried for the redises [21:34:03] https://grafana.wikimedia.org/dashboard/db/job-queue-health?from=1449176633144&to=1449178133144 [21:34:05] _joe_: was it actually putting jobs in redis too? not just the aggregator? 
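A rough sketch of the helper ori wishes for above: read the queue and aggregator backends from configuration and run one redis command against each. Everything here (host list, config shape, function name) is hypothetical; in MediaWiki the real source of truth would be $wgJobQueueAggregator and $wgJobTypeConf['default']:

    import redis

    # Hypothetical stand-in for what wmf-config would provide.
    BACKENDS = {
        "aggregator": [("rdb1001.eqiad.wmnet", 6379)],
        "queues": [("rdb1001.eqiad.wmnet", 6379)],  # the real list would differ
    }

    def run_everywhere(command, *args):
        """Run a single redis command against every configured backend."""
        results = {}
        for role, hosts in BACKENDS.items():
            for host, port in hosts:
                client = redis.Redis(host=host, port=port)
                results[(role, host)] = client.execute_command(command, *args)
        return results

    # e.g. how many wikis does each aggregator currently advertise?
    print(run_everywhere("SCARD", "jobqueue:aggregator:s-wikis:v2"))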
I wonder how that happened [21:34:28] <_joe_> AaronSchulz: in the aggregator, sorry [21:35:01] <_joe_> ori: the queue is reducing because we're consuming a lot of jobs [21:35:33] <_joe_> I guess these are all the jobs that were blocked on the aggregator somehow while the jobrunners were busy trying to process labswiki jobs [21:35:53] <_joe_> and not knowing how to, given they weren't set up to deal with them on redis... [21:37:06] give it a few minutes to settle [21:40:21] still looking good on the RB side [21:40:55] the queue is almost entirely restbase update jobs [21:41:07] it usually is [21:41:21] hello template updates [21:42:15] <_joe_> so maybe that specific queue was clogged [21:44:08] ori: is this labs jobs that built up over a long time? [21:46:00] <_joe_> gwicke: a month I guess? [21:46:23] <_joe_> the change was made one month ago [21:47:16] that's a lot of jobs for a small wiki [21:48:23] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: puppet fail [21:48:29] especially considering that the link update jobs are usually stored as one entry until they are actually processed [21:48:32] !log finished removing bogus msg_resource rows [21:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:26] the grafana page for job queue is real sweet [21:50:41] <_joe_> Nemo_bis: again, thank ori [21:50:41] _joe_, ori: why would the number of jobs spike up so quickly? [21:50:58] <_joe_> gwicke: I have literally zero idea [21:51:01] were they somehow held elsewhere & now released into the queue? [21:51:06] _joe_: ori andrewbogott added a cronjob yesterday that does SMW things [21:51:10] maybe that spawned jobs? [21:51:21] <_joe_> yuvipanda: no it's not that [21:51:34] <_joe_> yuvipanda: how's grrrit-wm doing? [21:51:48] <_joe_> gwicke: in the aggregator, probably [21:51:52] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1850268 (10RobH) So I need to re-open this, as we are still getting some bounced email content. Rather than force the entire task private, I... [21:51:57] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1850269 (10RobH) 5Resolved>3Open [21:52:23] <_joe_> I don't know enough about the architecture of the jobqueue [21:52:27] * _joe_ guilty [21:52:43] _joe_: should be back soon [21:52:56] <_joe_> what happened? [21:53:22] _joe_: so the container for it was built with nodejs:4.0.0 [21:53:25] which expects to run as root :) [21:53:31] except we stopped doing that a while ago [21:53:32] <_joe_> ahahah [21:53:35] <_joe_> ok [21:53:43] and in the rebuild and redeploy process with valhallasw`cloud discovered I had missed this [21:53:51] so I'm fixing it by not using nodejs FROM [21:54:30] some of it is also me paying penance for using nodejs on it [21:54:48] Right after "Fix the jobqueue on wikitech", lots of jobs where undelayed. Mostly cirrusSearchIncomingLinkCount, but the other queues that use delayed had bumps too. 
[21:55:29] http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1449179638.506&from=-1days&target=movingAverage(jobrunner.job-undelay.*.count%2C9) [21:57:32] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [22:04:48] !log Updated scholarships.wikimedia.org to cb94319 plus local i18n filtering [22:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:00] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1850398 (10chasemp) [22:05:15] PROBLEM - salt-minion processes on lvs1003 is CRITICAL: Connection refused by host [22:07:07] RECOVERY - salt-minion processes on lvs1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:08:59] !log Removed zirconium.wikimedia.org from Trebuchet minions list for scholarships/scholarships [22:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:10:15] valhallasw`cloud: aand we're back [22:12:15] PROBLEM - Host lvs1003 is DOWN: CRITICAL - Host Unreachable (208.80.154.57) [22:12:36] RECOVERY - Host lvs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [22:14:05] (03PS1) 10Merlijn van Deen: aptly: install graphviz for aptly graph [puppet] - 10https://gerrit.wikimedia.org/r/256806 [22:14:21] _joe_: grrrit-wm is back btw [22:14:53] valhallasw`cloud: shall I merge your two patches [22:15:00] (03PS28) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [22:15:03] yes please [22:15:10] (03PS3) 10Yuvipanda: toollabs: install ruby dev tools on dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/256760 (https://phabricator.wikimedia.org/T120287) (owner: 10Merlijn van Deen) [22:15:18] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: install ruby dev tools on dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/256760 (https://phabricator.wikimedia.org/T120287) (owner: 10Merlijn van Deen) [22:15:20] <_joe_> yuvipanda: kkk i feared a k8s fail [22:15:27] (03PS2) 10Yuvipanda: aptly: install graphviz for aptly graph [puppet] - 10https://gerrit.wikimedia.org/r/256806 (owner: 10Merlijn van Deen) [22:15:34] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: install graphviz for aptly graph [puppet] - 10https://gerrit.wikimedia.org/r/256806 (owner: 10Merlijn van Deen) [22:15:49] _joe_: puppet's failing on the k8s master too but haven't investigated that. [22:15:53] I'll do so shortly [22:15:55] valhallasw`cloud: done [22:16:02] grazie [22:17:41] (03PS29) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [22:18:45] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:00] AaronSchulz: hrm, the number of jobs seems to be going up [22:19:51] another spike of undelayed jobs at https://grafana.wikimedia.org/dashboard/db/job-queue-rate [22:21:20] what does that mean? 
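The render URLs being passed around return PNGs; the same graphite endpoint can also be queried as JSON, which is handy for quick checks from a terminal. A sketch using the job-undelay target from the URL above (standard graphite render API parameters; reachability of the endpoint is assumed):

    import requests

    resp = requests.get(
        "http://graphite.wikimedia.org/render/",
        params={
            "target": "movingAverage(jobrunner.job-undelay.*.count,9)",
            "from": "-1days",
            "format": "json",
        },
        timeout=30,
    )
    for series in resp.json():
        points = [v for v, _ts in series["datapoints"] if v is not None]
        if points:
            print(series["target"], "max:", max(points))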
[22:21:21] http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1449181225.85&from=-1days&target=movingAverage(jobrunner.job-undelay.*.count%2C9) cirrusSearchIncoming again [22:21:37] 7Puppet, 6Phabricator, 6Release-Engineering-Team: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1850619 (10Luke081515) I tried it, phabricator was updated, but the design was broken, he tried to load this from http, not https. [22:23:00] (03PS1) 10BBlack: interface::tagged - do not use hotplug [puppet] - 10https://gerrit.wikimedia.org/r/256842 (https://phabricator.wikimedia.org/T110530) [22:23:02] _joe_: is the jobqueue stuff sorted out? showJobs.php shows an empty queue but that could be good news or bad news [22:23:08] it's not sorted out [22:23:36] AaronSchulz: I understand what you're saying, but not what it means. Is further manual intervention required? How long do you think the queue will take to clear? [22:23:44] it almost seems familiar [22:23:48] * AaronSchulz wonders if ebernhardson is around [22:24:15] PROBLEM - git_daemon_running on gallium is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/lib/git-core/git-daemon [22:24:19] like the jobs would all re-enqueue in some cirrus read-only mode [22:25:05] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:25:09] ori: I was trying to do a Nuke, but I’ll cancel that for now [22:25:38] (03PS2) 10Dzahn: dataset,ores: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256494 [22:26:06] RECOVERY - git_daemon_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [22:27:25] (03CR) 10BBlack: [C: 032] interface::tagged - do not use hotplug [puppet] - 10https://gerrit.wikimedia.org/r/256842 (https://phabricator.wikimedia.org/T110530) (owner: 10BBlack) [22:29:23] (03CR) 10Dzahn: "also see https://phabricator.wikimedia.org/T118388#1799533" [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [22:31:57] (03PS1) 10BBlack: Revert "disable BGP on lvs100[123] until finished with reinstall confirmation" [puppet] - 10https://gerrit.wikimedia.org/r/256843 [22:32:36] 6operations: mw1259 does not have hyperthreading enabled - https://phabricator.wikimedia.org/T120270#1850696 (10Dzahn) Thanks for pointing this out @Southparkfan adding ops-eqiad [22:32:49] 6operations, 10ops-eqiad: mw1259 does not have hyperthreading enabled - https://phabricator.wikimedia.org/T120270#1850697 (10Dzahn) [22:33:20] AaronSchulz: should we page someone from search? what is wrong, exactly? [22:33:48] ori: probably [22:34:16] (03PS2) 10BBlack: Revert "disable BGP on lvs100[123] until finished with reinstall confirmation" [puppet] - 10https://gerrit.wikimedia.org/r/256843 [22:34:24] (03CR) 10BBlack: [C: 032 V: 032] Revert "disable BGP on lvs100[123] until finished with reinstall confirmation" [puppet] - 10https://gerrit.wikimedia.org/r/256843 (owner: 10BBlack) [22:35:33] (03PS3) 10Dzahn: dataset,ores: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256494 [22:35:50] (03CR) 10Dzahn: [C: 032] dataset,ores: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256494 (owner: 10Dzahn) [22:36:02] ebernhardson coming [22:36:11] AaronSchulz: which? [22:36:19] job queue issue? 
[22:37:12] (03PS2) 10Dzahn: varnish:misc: add smokeping on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/256736 [22:37:19] (03PS1) 10Andrew Bogott: Open openldap servers to all wikimedia hosts. [puppet] - 10https://gerrit.wikimedia.org/r/256844 [22:38:23] ori: still can't say it's ES specific [22:38:25] http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1449182277.53&target=highestMax(jobrunner.job-recycle.*.count%2C6) is interesting [22:38:42] (03CR) 10Dzahn: [C: 032] varnish:misc: add smokeping on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/256736 (owner: 10Dzahn) [22:38:59] AaronSchulz: what does it mean for the jobs to be in "backoff"? I've looked through the code a few times before but never quite figured it out [22:39:05] it looks like recycle and undelay are more active today and for all applicable queues (ES might just be the biggest user of delays...could be a red herring) [22:39:35] ori: I wonder if tasks were not running on some of the new partitions for a while and then jobchron started actually seeing them today? [22:39:36] yea es typically has more jobs than most things. [22:39:49] though I though everything was restarted a few times by then [22:40:48] yes [22:41:00] it does seem awfully suspicious for the 99% job wait time to jump from 3 hours to 8 days in the course of a few minutes [22:41:10] i saw that too [22:41:16] then again, last time I thought puppet restarted the service it still needed another kick [22:41:54] so to confirm: the suspicion is that the jobs have been accumulating quietly in the redises but not getting dequeued because the jobrunners didn't know about them? [22:42:35] This isn’t the source of your main issue, but be warned that some of those wikitech jobs will requeue themselves on failure. And if they’re not running on silver they will always fail so are maybe immortal :/ [22:43:02] there should be a delayed job count graph in addition to the runnable job count [22:44:29] oh, I can just change GetJobQueueLengths --report [22:47:51] ebernhardson: how important are the incomingLinksJob? how bad would it be if we nuked them? [22:47:59] nuked whatever is currently in the queue, i mean [22:48:35] hrm, not quite as easy to add as a I thought [22:49:07] would be possible with https://gerrit.wikimedia.org/r/#/c/252608/ and friends though [22:49:16] * AaronSchulz adds to the backlog [22:49:16] ori: they should be able to run through on their own really quickly as long as they don't get into the job queue backoff list [22:49:39] (usually they end up in the backoff list, which is why i asked earlier what exactly that is) [22:49:46] the throttle makes runJobs exit after 1 job [22:49:49] i think it happens when the slaves get behind the masters? [22:49:52] at least on terbium [22:50:07] that job can't return false :S [22:50:11] it could throw an exception i suppose [22:50:17] but i don't see any logged [22:51:01] while there was a lots of lag at 1:40PM (wonder why...), lag has been low for enwiki [22:51:15] 0-1 range, says tendril [22:51:19] hmm [22:51:53] --nothrottle is fast ;) [22:52:09] why is throttling necessary? [22:52:49] well, if they were legitimately failing throttling could make sense [22:55:39] ebernhardson: I suggest bumping the throttle, unless there is a reason to be so low [22:55:58] don't those go to 2 dcs now too? [22:56:14] so the same throttle would make it worse right? 
[22:56:14] AaronSchulz: yes, so there will be a little round trip lag [22:56:32] AaronSchulz: that one job will build up the document and then send it to both dc's [22:56:53] ok [22:58:25] AaronSchulz: not seeing how i would disable or reduce throttling, that would by extening something form the core Job class? [22:58:43] $wgJobBackoffThrottling['cirrusSearchIncomingLinkCount'] = 1; [22:58:46] that is in wmf-config [22:58:49] ahh ok [22:59:03] 1 job per sec per server (roughly) [22:59:16] AaronSchulz: yea thats crazy low for how busy this job should be :) [22:59:18] we have 18 generic job servers [22:59:55] !log ori@tin Synchronized php-1.27.0-wmf.7/includes/jobqueue/JobRunner.php: temporarily disable job throttling (duration: 00m 29s) [22:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:02] (03PS2) 10Dzahn: varnish:misc: add torrus on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/255460 (https://phabricator.wikimedia.org/T119582) [23:01:02] (03CR) 10Dzahn: [C: 032] varnish:misc: add torrus on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/255460 (https://phabricator.wikimedia.org/T119582) (owner: 10Dzahn) [23:02:10] AaronSchulz: ok i'll take care of it. I have an interview for next hour but after that [23:02:26] * AaronSchulz heads to the office finally [23:05:10] (03PS3) 10Dzahn: l10nupdate: Fix duration reported for ResourceLoader purge [puppet] - 10https://gerrit.wikimedia.org/r/256754 (owner: 10BryanDavis) [23:05:17] (03CR) 10Dzahn: [C: 032] l10nupdate: Fix duration reported for ResourceLoader purge [puppet] - 10https://gerrit.wikimedia.org/r/256754 (owner: 10BryanDavis) [23:06:25] always reads it as "lion"-update [23:07:32] (03PS2) 10Dzahn: graphite: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256495 [23:07:50] (03CR) 10Dzahn: [C: 032] graphite: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256495 (owner: 10Dzahn) [23:09:25] !log restarting pybal (w/ BGP enabled) on lvs100[123] (newly-installed w/ jessie) [23:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:30] (03PS2) 10Dzahn: varnish: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256498 [23:10:24] (03CR) 10jenkins-bot: [V: 04-1] varnish: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256498 (owner: 10Dzahn) [23:11:22] (03PS3) 10Dzahn: varnish: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256498 [23:11:48] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#1850826 (10BBlack) [23:11:49] 6operations, 10Traffic: upgrade lvs1001-3 to jessie - https://phabricator.wikimedia.org/T119517#1850825 (10BBlack) 5Open>3Resolved [23:13:33] (03PS1) 10Alex Monk: Get rid of old unused $wgAllowed* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256853 (https://phabricator.wikimedia.org/T50493) [23:13:45] pretty monstrous spike of cache purges ~23:03 -> 23:08, with some trailing a bit after that [23:14:09] maybe a major template or something? 
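Putting numbers on why that throttle is "crazy low", using only the figures quoted above (1 job/sec/server and 18 generic job runners):

    servers = 18
    per_server_rate = 1          # $wgJobBackoffThrottling['cirrusSearchIncomingLinkCount']
    ceiling = servers * per_server_rate
    print(f"~{ceiling} cirrusSearchIncomingLinkCount jobs/sec cluster-wide")
    print(f"~{ceiling * 3600:,} jobs/hour, no matter how large the backlog grows")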
[23:14:27] <_joe_> bblack: the jobqueue fail [23:15:05] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1449180896199&to=1449184496199&var-site=All&var-cache_type=All&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&theme=dark [23:15:38] peaking out around 600K purges/sec during that window [23:15:45] <_joe_> actually, no [23:15:47] here's my very hazy understanding [23:16:34] labs jobs were getting enqueued on the main job queue, but then the runners we not successful at running those jobs, because they can't actually talk to the labs db etc [23:17:07] the job queue has throttle / backoff settings by job type and i *think* the failures caused it to gradually throttle the jobs down to 1 per invocation [23:17:19] 6operations: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#1850835 (10Dzahn) [23:17:28] this made the rate at which the job runners process jobs plummet [23:17:29] (03PS3) 10Dzahn: RT: move role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T119112) [23:17:35] <_joe_> and that got worse once I merged the change not to submit those jobs why? [23:17:52] (03CR) 10jenkins-bot: [V: 04-1] RT: move role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [23:17:55] so my purge spike is from the jobrunners finally catching up? [23:17:56] i'm half-guessing at this point, but i think it's because the job types were shitlisted by then [23:17:58] yes [23:18:05] the queue size is declining rapidly [23:18:11] i told the runners to ignore the throttle [23:18:23] <_joe_> bblack: actually that purge spike could be due to the surge in jobs processing [23:18:28] <_joe_> ah, that's it [23:18:38] went from 2m to 1.4m in the last 15 mins [23:18:51] 6operations, 5Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#1850845 (10Dzahn) ^ once we know we don't need email anymore on RT (or before, actually) we can just add the role to krypton (before we switch anything in DNS) to see which puppet issues we have to fix before h... [23:18:53] <_joe_> ori: probably after my change the jobqueue started actually noticing the failures? [23:19:15] <_joe_> bblack: look at the last few mins of https://grafana.wikimedia.org/dashboard/db/job-queue-rate [23:20:17] <_joe_> we're processing more than 1k jobs/sec, wow [23:20:50] i don't entirely understand why the throttle is necessary [23:21:20] <_joe_> ori: ahah funny: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&c=Jobrunners+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [23:21:30] <_joe_> the cpu load actually went down drastically [23:21:33] <_joe_> WTF? [23:21:53] <_joe_> are the jobrunners basically busy throttling themselves? [23:21:58] yes [23:21:59] probably! :) [23:22:16] what's going on? [23:22:25] nothing bad, right now [23:22:28] paravoid: are you seeing something borked? [23:22:31] no [23:22:38] then nothing! 
:) [23:22:41] heh [23:23:00] there's been ongoing jobrunner stuff for a while, but seems to be getting sorted out [23:23:12] and I reinstalled lvs1001-3, that just finished up and switched traffic back [23:23:19] the throttle limits the number of jobs that get run per invocation of the jobrunner [23:23:24] it was down to 1 [23:23:28] <_joe_> paravoid: in short: we've been sending wikitech jobs to the jobqueue for a month; i fixed that and as a result the jobqueue basically worked very bad for a few hours [23:23:37] ? [23:23:43] you *fixed* it and then it went bad? [23:23:51] <_joe_> yes [23:24:08] <_joe_> because a month's worth of wikitech jobs were in the queue [23:24:18] ok I have to ask... [23:24:35] if those jobs sat in some stuck queue for a month and nobody cared, do we have to do whatever they were doing at all? :P [23:24:42] <_joe_> as soon as ori cleaned the queue, the problem was solved [23:24:51] what did you fix exactly? [23:25:52] <_joe_> paravoid: https://gerrit.wikimedia.org/r/#/c/256698/ fixed https://gerrit.wikimedia.org/r/#/c/250170/ where the inclusion of the jobqueue definitions was made unconditional [23:26:03] <_joe_> thus making labswiki send its jobs to the jobqueue [23:26:03] bblack: yes, we do, in the general case. people don't notice that on labs because people don't edit actively on labs. [23:26:09] <_joe_> instead of processing them locally [23:26:28] <_joe_> bblack: Krinkle was doing template updates and noticed [23:26:44] i don't understand why your fix made things worse, though [23:27:02] oh thanks [23:27:09] and I thought it was me that wasn't getting it :P [23:27:22] <_joe_> ori: I guess the jobqueue wasn't able to handle the labswiki jobs anymore in some strange way [23:27:28] <_joe_> paravoid: no one is [23:27:46] <_joe_> because well, for labswiki it had no info on where redis was [23:27:57] <_joe_> since we didn't include that code anymore [23:28:10] (03CR) 10Dzahn: "fixing compiler warnings with https://gerrit.wikimedia.org/r/#/c/256855/" [puppet] - 10https://gerrit.wikimedia.org/r/256498 (owner: 10Dzahn) [23:28:34] <_joe_> (the fcgi code, the jobrunner doesn't care about labswiki ofc) [23:28:35] ori: bblack I think people might have noticed (SMW updating was worse than usual) but it was chalked up to SMW being SMW [23:28:59] <_joe_> I am not sure what part of the code talks with redis once a job is completed [23:29:20] <_joe_> but if that's the fcgi part, it didn't have the config for that redis anymore [23:29:33] <_joe_> while the jobrunner has its own redis config [23:29:34] it is possible that it was able to get farther, that the error shifted from the jobrunner setup code to inside the job itself [23:29:48] <_joe_> yup [23:29:52] and that consequently the code which 'punishes' a job type for having a high rate of failure was getting executed [23:30:20] <_joe_> but why the labswiki jobs were being executed at such a high rate?
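A toy illustration of the "punish a job type for failing" behaviour described above, showing how repeated failures can ratchet a type down to one job per runner invocation. This is not MediaWiki's actual JobRunner backoff code, just a sketch of the general mechanism under assumed rules:

    from collections import defaultdict

    # Assumed starting allowance per job type, per runner invocation.
    allowance = defaultdict(lambda: 50)

    def run_invocation(job_type, jobs, execute):
        """Run up to the current allowance of jobs; shrink it on failures."""
        done = 0
        for job in jobs[: allowance[job_type]]:
            try:
                execute(job)
                done += 1
            except Exception:
                # Each failure halves the type's allowance, down to 1 per invocation.
                allowance[job_type] = max(1, allowance[job_type] // 2)
        if done:
            # Successes let the allowance creep back up.
            allowance[job_type] = min(50, allowance[job_type] + 1)
        return done

    # A type whose jobs always fail ends up limited to 1 job per invocation, so
    # runners spend their time cycling through invocations instead of clearing
    # the backlog.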
[23:30:23] <_joe_> I have no idea [23:30:39] <_joe_> when I looked, 10% of all processed jobs were for labswiki [23:30:44] <_joe_> 4k/minute [23:30:56] <_joe_> which is a ridiculous amount of jobs [23:31:49] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1426/" [puppet] - 10https://gerrit.wikimedia.org/r/256498 (owner: 10Dzahn) [23:32:00] <_joe_> so I have no idea what caused all this, precisely [23:32:36] aaron is working on a slide deck / presentation which gives an overview of the job queue -- i don't think he'd mind me sharing https://dl.dropboxusercontent.com/u/25082461/slides/mediawiki-jobqueue.odp [23:32:54] (03PS2) 10Dzahn: fix double quoted string warnings [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256483 [23:32:57] <_joe_> great [23:33:07] <_joe_> I was about to ask an ops session on it :P [23:33:21] yeah, it's obviously ridiculous to be groping around in the dark like that [23:33:47] <_joe_> it's totally my fault, I should've harassed you guys to learn more [23:34:00] <_joe_> I just know parts of the puzzle, and it evolved since then [23:34:07] down to 1.14m jobs [23:34:29] <_joe_> but let's say the knowledge about this is not exactly common [23:34:53] <_joe_> ok time to go, seriously [23:34:56] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1427/" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256483 (owner: 10Dzahn) [23:34:59] * ori undelays _joe_ [23:35:04] good night [23:35:11] <_joe_> If I speak again before 06:00Z, ignore me [23:37:04] heh, it's like me yesterday night [23:37:08] I think I said I was leaving like 4 times [23:37:14] he's upto 3 times already [23:38:07] (03PS2) 10BryanDavis: l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [23:38:32] (03PS1) 10Dzahn: mariadb: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/256857 [23:38:48] (03CR) 10BryanDavis: [C: 031] Remove unused MWMULTIDIR variable [puppet] - 10https://gerrit.wikimedia.org/r/255957 (owner: 10Reedy) [23:39:56] (03PS2) 10Dzahn: mariadb: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/256857 [23:40:01] (03CR) 10BryanDavis: [C: 031] "LGTM but I tweaked it a bit too" [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [23:41:28] RoanKattouw: ostriches: Krenair: hi! someone can SWAT soon? I may have small CentralNotice patch... ;P [23:41:42] too busy today [23:41:57] Ah hmmm [23:42:05] did greg-g or Katie approve it? [23:42:27] I can do it [23:42:27] Krenair: I assume if it's AndyRussG the answer on the latter is 'yes'. :-) [23:42:33] (03CR) 10Dzahn: [C: 031] "not used in the script" [puppet] - 10https://gerrit.wikimedia.org/r/255957 (owner: 10Reedy) [23:42:34] Krenair: it was generally discussed on the FR-tech team [23:42:42] So did they approve it?
[23:43:04] (03PS2) 10Dzahn: Remove unused MWMULTIDIR variable [puppet] - 10https://gerrit.wikimedia.org/r/255957 (owner: 10Reedy) [23:43:11] Krenair: AndyRussG is on the FR Tech team :) so presumeably he has katie's approval :) [23:43:19] (03CR) 10Dzahn: [C: 032] Remove unused MWMULTIDIR variable [puppet] - 10https://gerrit.wikimedia.org/r/255957 (owner: 10Reedy) [23:43:33] Krenair: Now that you ask, I'm pretty sure K4 was there when we talked about it at standup, not 100% sure, but we've been talking about it on the team for a few days [23:43:34] I don't recall AndyRussG being on the list of people who can approve? [23:44:23] * yuvipanda makes AndyRussG get things signed in triplicate and reads him a poem [23:44:25] Adam Wight, who is the tech lead, is definitely aware of the plan, though not here today [23:45:37] Krenair: Katie appears to be in a meeting (judging by her IRC handle) [23:45:51] Krenair is right, if we start this "use good judgement" we'll argue about it every time. just update the list [23:45:53] greg-g is here though [23:47:04] AndyRussG: what's the patch? [23:48:23] greg-g: small adjustments to CentralNotice client-side code, mainly to handle a new banner history hide reason, and to enable us to start a small test of a "remind me later" button, which may be used a bit later in the Big English campaign [23:48:42] One sec, lemme get the gerrit changes... I was just gonna do the wmf_deploy branch patch [23:48:48] RoanKattouw: thanks in advance BTW! [23:49:34] That sounds so much more risky to fundraising than unrelated commits that I imagine you'd want to get explicit approval [23:51:21] Krenair: I'm the main CentralNotice maintainer and I can guarantee it's not risky to FR. I certainly understand and am fine with the need to do stuff by the book tho :) [23:51:40] This is only stuff that we actually want _during_ the campaign [23:51:53] (03CR) 10Dzahn: [C: 032] fix the last quoted boolean [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/256478 (owner: 10Dzahn) [23:52:01] (03CR) 10Dzahn: [V: 032] fix the last quoted boolean [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/256478 (owner: 10Dzahn) [23:52:09] "I can guarantee it's not risky" [23:52:13] ^ Famous last words :) :) [23:52:31] https://gerrit.wikimedia.org/r/256054 https://gerrit.wikimedia.org/r/256349 https://gerrit.wikimedia.org/r/256615 https://gerrit.wikimedia.org/r/256856 [23:52:34] Sure, the same could be said about any other extension developers wanting their patches deployed [23:52:56] Krenair: CentralNotice is what shows the FR banners [23:53:12] Yes, and the deployment freeze covers everything. [23:53:32] ostriches: yes, well, as "guaranteed" as one can normally be 8p [23:53:40] Krenair: well, the reason the freeze is in place is so FR's stuff doesn't get fucked up by other people's stuff (AIUC). I guess next year's freeze should be better worded :) [23:53:59] Krenair: K I'm gonna ask people in SF to find K4 IRL [23:54:25] (not sure if she's WFH today tho) [23:54:38] AndyRussG: no need, one second [23:54:44] no need to both k4, that is [23:55:15] AndyRussG: approved. [23:55:35] greg-g: K thanks so much!! 
[23:55:52] * ostriches was just making fun of AndyRussG for being so certain :p [23:55:54] (also, given the reason for this week's freeze is fundraising, it makes sense to use good judgement with fundraising proposed patches) [23:55:58] No actual objection from me :) [23:56:13] ostriches: jestful point well taken tho :) [23:56:13] AndyRussG: Could you put these on the deployment page in the usual spot for SWAT patches? [23:56:22] Easier to keep track that way [23:56:33] greg-g: you should probably say that on the freezes thread :) [23:56:54] RoanKattouw: yep was just getting to that... [23:57:15] yuvipanda: later, this is the last hour of the freeze for this week (last hour of deploys for the week) and next week is normal [23:57:27] oh right [23:57:29] ok [23:57:42] but yeah, will do before I take the last 3 weeks of Dec off :) [23:57:44] (03PS1) 10Dzahn: kafkatee: bumb submodule [puppet] - 10https://gerrit.wikimedia.org/r/256859 [23:57:53] it'll mostly be: ask ostriches :P [23:58:14] Just finished reading backlog. Interesting events _joe_ , ori [23:58:34] So sorry for setting all this in motion. [23:58:52] What's the status now?