[03:19:28] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 06Research-and-Data, 15User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2720864 (10yuvipanda) Thank you for taking it on, @bd808 / @leila. I've one person who wanted to 'opt out', I'll share that with you two.
[06:37:36] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[06:42:25] RECOVERY - Puppet staleness on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [3600.0]
[06:43:37] RECOVERY - Puppet run on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:12:36] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:14:16] 06Labs, 10Striker, 06Operations, 07LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#2720964 (10MoritzMuehlenhoff) I think we should use the same wmf-user.schema file across the labs and corp servers, but introduce a separate object class for sto...
[17:21:25] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:21:46] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:52:23] 06Labs, 10Tool-Labs, 06Operations, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2723365 (10Andrew) a:05Andrew>03None
[17:53:09] 06Labs, 10Labs-Infrastructure: Default source group (security group) allowances do not update properly - https://phabricator.wikimedia.org/T142165#2723373 (10Andrew) 05Open>03Resolved This is fixed in M and it looks like L isn't going to happen.
[17:58:53] 06Labs: Update custom sink handlers for Designate Mitaka - https://phabricator.wikimedia.org/T134280#2723382 (10Andrew) a:05Andrew>03None
[18:04:46] 06Labs, 13Patch-For-Review: Update or remove certcleaner.py - https://phabricator.wikimedia.org/T146303#2723414 (10Andrew) I added logging to the certcleaner, and it turns out to be doing some things! Here's a sample: > Removing stale puppet cert i-00000584.eqiad.wmflabs > Removing stale puppet cert i-0000...
[18:17:25] PROBLEM - Free space - all mounts on tools-docker-builder-01 is CRITICAL: CRITICAL: tools.tools-docker-builder-01.diskspace.root.byte_percentfree (<55.56%)
[18:33:24] PROBLEM - Puppet staleness on tools-worker-1003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[18:34:40] 06Labs, 13Patch-For-Review: Update or remove certcleaner.py - https://phabricator.wikimedia.org/T146303#2723498 (10Andrew) ok, I've confirmed that the properly.named.eqiad.wmflabs cert deletions are from a race. It seems harmless, but we can probably get what we need by doing a manual cleanup of misnamed certs...
[18:54:53] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2723543 (10jcrespo) My interpretation of parent/child is blocker/blockee, not "part of", but feel free to move things around if that help...
[19:04:51] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:05:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:12:43] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:15:05] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:15:05] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:18:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[19:24:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:25:58] ^ "Skipping this run, puppet agent already running at pid 7428. Notice: Caught TERM; calling stop" on multiple hosts
[19:26:08] iow puppet is hanging?
[19:26:58] no longer running when I logged in, though
[19:27:04] so probably something transient
[19:27:23] valhallasw`cloud: there is a task where we noted the puppetmaster in tools sometimes freezing up and requiring restart, unfortunately, atm
[19:28:47] I don't understand why that would cause the puppet /client/ to hang, though? shouldn't the connection just time out?
[19:32:13] it's a good question, I'm not sure; it seems like the master accepts the connection and then churns on it doing nothing useful
[19:32:29] I haven't had time to look into it much
[19:33:45] the hang might come from a file diff save or other master interaction after the manifest is delivered. I wouldn't be surprised to find that Puppet doesn't have sane network error handling code
[19:34:15] actually, I would be more surprised to find that it did
[19:37:19] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:51:20] 06Labs, 06Collaboration-Team-Triage, 10Notifications, 10wikitech.wikimedia.org: Echo notification to remind users to validate email - https://phabricator.wikimedia.org/T148440#2723734 (10Quiddity)
[19:59:56] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 06Research-and-Data, 15User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2723747 (10bd808) The list of email addresses to send the survey to can be generated by running P4254 on silver.wikimedia.org.
[20:07:20] 10Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2723764 (10harej-NIOSH) a:03harej-NIOSH
[20:12:42] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:14:53] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:15:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:15:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:15:49] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:17:55] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 06Research-and-Data, 15User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2723790 (10bd808)
[20:18:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:20:50] volans: btw, am checking out your new puppetmaster
[20:20:57] oh
[20:20:59] it's got puppet disabled
[20:20:59] hmm
[20:21:08] let me just get a new one rather than mess with it
[20:24:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:51:39] Guest69514: your bridge has lost track of your nick
[20:54:42] fun
[20:58:15] o/ yuvipanda & chasemp, could you take a look at https://phabricator.wikimedia.org/T146718 ?
[20:58:34] Essentially, I'm looking to host a big dataset somewhere that it can be accessed from Quarry.
[20:58:51] I think it would be useful to join it against the replicas, but not essential.
[20:59:03] halfak: how soon?
[20:59:13] Today if I could, but not in a big rush.
[20:59:21] I did a size analysis here: https://phabricator.wikimedia.org/T146718#2677286
[21:01:20] 10Tool-Labs-tools-Xtools: Bugs section on articleinfo returns incorrect results - https://phabricator.wikimedia.org/T148046#2723877 (10Matthewrbowker) Appears to be related to LanguageTool shutting down: http://forum.languagetool.org/t/shutting-down-wikicheck/1098
[21:03:06] I was about to say I'm not sure where this would live before christmas or so; saw jaime's comments to a similar effect
[21:08:33] this seems like a natural extension for the coming quarry/paws dedicated db servers, halfak, but even that has some question marks and no next steps; budget was allocated but it seems not everyone has the same idea of how much and what to do about it
[21:08:41] there is a task somewhere we should link in here
[21:10:34] chasemp, I wonder if we're looking at this wrong by focusing on RDBMS infrastructure for analysis. Generally, in the context of data analysis in 2016, this dataset really isn't that big.
[21:10:41] That's a bigger question
[21:11:01] In the short term, I think it sounds reasonable to wait if you and jynus agree.
[21:11:05] I have been having a similar conversation w/ milimetric
[21:11:11] re: RDBMS
[21:11:51] On the other hand, we have some slaves for analytics internally where loading in this dataset would be part of the normal operations.
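The "Skipping this run, puppet agent already running at pid 7428" symptom reported around 19:25 can be triaged by checking whether the agent lockfile names a live process. A minimal sketch, assuming Puppet 3's default lockfile location on Debian-family hosts (the path is an assumption; the real value can be confirmed with `puppet agent --configprint agent_catalog_run_lockfile`):

```python
import os

# Assumed default for Puppet 3 on Debian-family hosts; confirm with
# `puppet agent --configprint agent_catalog_run_lockfile`.
LOCKFILE = "/var/lib/puppet/state/agent_catalog_run.lock"

def stale_agent_lock(lockfile=LOCKFILE):
    """Return the PID in the lockfile if that process is dead, else None.

    None means either no lock exists or an agent really is running.
    """
    try:
        with open(lockfile) as f:
            pid = int(f.read().strip())
    except (IOError, OSError, ValueError):
        return None                  # no lockfile, or unreadable contents
    if os.path.exists("/proc/%d" % pid):
        return None                  # a process with that PID is alive
    return pid                       # lock left behind by a dead agent
```

If this returns a PID, removing the lockfile (or restarting the agent) clears the "already running" condition; a hung-but-alive agent shows up as None here and needs a look at the process itself, which matches the "master accepts the connection and churns" theory above.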
[21:11:58] (MySQL RDBMS)
[21:12:05] but there are other more foundational and holistic issues with surfacing data in another way, as our current data sanitization pipeline is nothing anyone wants to keep around
[21:12:06] So maybe it's just a disk size issue.
[21:12:18] right
[21:12:40] Privacy and applying suppression historically are difficult.
[21:20:20] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 06Research-and-Data, 15User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2723915 (10bd808) I have a mass mailing script ready to go for actually sending the email. It takes into account some feedback from last year that usi...
[21:29:45] 06Labs, 13Patch-For-Review: Update or remove certcleaner.py - https://phabricator.wikimedia.org/T146303#2723931 (10Andrew) I cleaned up all the *.eqiad.wmflabs puppet and salt certs, and turned off the cron that runs certcleaner. That /should/ be the end of it, but I want to wait a few days and then run the c...
[21:37:24] chasemp, one more question for you and yuvipanda. I'm looking to host some other datasets relevant to ORES in a way that is accessible from Quarry. I'd like to have these datasets sit in a dbname that makes sense. I've been considering creating a tools project for hosting the datasets, but I'm not sure if that is the right solution. See https://phabricator.wikimedia.org/T146722
[21:37:36] Many of these datasets would be small (20k rows or so).
[21:37:46] So they aren't a concern for the other tickets.
[21:38:21] there may be an initial language barrier; when you say create a tools project, what do you mean?
[21:38:34] tool labs project account.
[21:38:40] That multiple users could "become"
[21:39:03] gotcha ok
[21:39:14] a service user
[21:39:20] Ahh yes.
[21:39:45] I'd like to have a database name be human readable/rememberable -- like "ores_p" or "project_ores_p" rather than something like "u235742364_ores_p"
[21:44:14] A tools service user could be used to load/unload/service a web api or something that accesses an underlying user db
[21:44:28] but something on toolsdb atm isn't the same as something on the labsdb boxes
[21:44:41] isn't necessarily the same anyway
[21:45:02] chasemp, I'm looking for something that would be (1) accessible from Quarry and (2) could join to replica tables.
[21:45:11] I'll try to ask some clarifying questions on the task
[21:45:12] ah
[21:45:33] where does the data come from? production or?
[21:45:54] Hmm... Originally it came from Quarry. Then it was annotated by Wikipedians.
[21:46:25] I foresee a lot of value in these types of dataset hostings.
[21:47:04] E.g. I've been analyzing the effects of a teahouse experiment and I'd like to provide direct access to the table of experiment participants (and their experimental conditions) to Quarry users.
[21:47:51] So this question is intended to get at "What should we be doing with datasets like this?" as well as "What should I be doing with these specific datasets?"
[21:48:07] right, no good answer atm
[21:48:46] If you have the capacity (and I know labs doesn't have much free time) please take the opportunity to think creatively.
[21:48:56] situation is this: we have 2 labsdb boxes barely standing w/ no redundant storage (if 1 disk goes bad on them, it dies), and there is a mysterious amount of user tables on these that we are trying to sort out where to put
[21:49:01] I'm interested in imagining some future we could work towards
[21:49:10] so we have new servers incoming w/ a real HA-based solution in flight before christmas
[21:49:28] and that in and of itself is pretty all-consuming for this
[21:49:43] and we could latch on and do things the old-style way even now if you want, but I don't know where that ends up
[21:49:44] or at what time
[21:49:58] or sometime after christmas everything looks different, almost 100%
[21:50:05] and we hopefully have some capacity to reason
[21:50:46] so I'm not trying to give this the brush-off at all, but everything here is bad right this second
[21:50:55] and I don't know exactly what some outcomes will be over the next 2 months
[21:50:59] as far as user tables and capacity
[21:51:13] even right now I'm reworking maintain-views logic
[21:51:16] literally right now :)
[21:51:46] chasemp, I hear this. And I imagine it's hard to think past hardware/operational concerns at the moment.
[21:52:00] My question is intended to be much more -- organizational.
[21:52:15] so in theory this safely lives on the new quarry/paws labsdb hosts I think
[21:52:21] but those don't exist yet
[21:52:27] so it's a not-very-comforting answer
[21:52:43] How do we plan to organize space on the quarry/paws hosts?
[21:53:01] The holidays aren't that far away.
[21:53:17] Maybe I should be bothering yuvipanda with this question? I don't mean to badger you about it :)
[21:53:24] we have this task to reason on what to buy: https://phabricator.wikimedia.org/T146065
[21:53:48] Oh! I thought the machines would be spec'd when they were budgeted for. That didn't happen?
[21:54:05] well, it's a good question; it seems only kind of
[21:54:14] the specs changed out from underneath us, and so now what to do
[21:55:01] yuvi is going to talk to dario tomorrow I think and persist some line of reasoning on the task w/ history
[21:55:08] but what space looks like is a bit up in the air
[21:56:04] Gotcha. Maybe I'll see if I can write a strawman proposal for how space will be allocated/organized
[21:56:14] that would be cool
[21:56:24] https://etherpad.wikimedia.org/p/paws_db_organization
[21:58:44] that seems incredibly useful, could you link that to the procurement task?
[21:58:53] Sure! :)
[21:59:46] thanks, and no worries I'm feeling badgered; there are just literally no easy answers here atm
[21:59:47] 06Labs, 06Operations, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2723975 (10Halfak) I'm looking at this task because I've got a set of datasets that seem to belong on these boxes. It seems that it woul...
[21:59:51] I'm not feeling
[21:59:58] is what I meant heh
[22:00:56] ha. Na. It's cool to hear "haven't had time to think about that yet, why don't you go start thinking about that now"
[22:01:04] We're wiki people after all
[22:32:28] chasemp, I've got some things in place if you want to look. https://etherpad.wikimedia.org/p/paws_db_organization
[22:32:59] I focused on "dataset" tables since they are the new type. We already know how to handle replicas and tool dbs
[22:33:26] ok, will read
[22:43:10] 06Labs: Increase quota for tools project - https://phabricator.wikimedia.org/T146322#2724140 (10madhuvishy) We talked about it in the Labs meeting today, and the plan is to bump up quota so there's enough space for 10 more Large instances. Each large instance is 4 VCPUs/80G disk/8G RAM.
[22:48:41] 06Labs: Increase quota for tools project - https://phabricator.wikimedia.org/T146322#2724157 (10bd808)
[22:48:43] 06Labs, 10Tool-Labs: Create more trusty nodes in anticipation of the default for jsub switching to trusty - https://phabricator.wikimedia.org/T147205#2724156 (10bd808)
[22:57:21] 06Labs: Increase quota for tools project - https://phabricator.wikimedia.org/T146322#2657383 (10bd808) The current quota in the tool labs project is: * Cores: 480/512 * RAM: 983044/1000000 MB * Instances: 126/150 To make 10 large instances we need: * 40 Cores * 80GB RAM So we need to raise the quota to at leas...
[23:20:50] 06Labs, 07Puppet: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724213 (10Andrew)
[23:21:26] 06Labs, 07Puppet: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724231 (10Andrew)
[23:21:54] 06Labs, 07Puppet: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724213 (10Andrew)
[23:27:53] 06Labs, 07Puppet: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724259 (10Andrew) I should add that code blocks that use curly-braces don't break anything. For an extreme example, check out role::puppetmaster::standalone. The API says that it has no docs. But if I rem...
[23:42:51] 06Labs, 10Labs-Infrastructure: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724268 (10yuvipanda)
[23:43:20] 06Labs, 10Labs-Infrastructure: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724280 (10yuvipanda) This would also kill the final remaining use case for hiera_include in role::labs::instance, and make it clearer where the list of roles for an instance c...
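bd808's quota arithmetic in the 22:57 comment on T146322 can be sanity-checked with a few lines. The figures come straight from that comment and madhuvishy's 22:43 one (4 VCPUs / 8G RAM per Large instance); the variable names are mine:

```python
# Figures quoted in T146322: current usage/quota and the size of one
# "Large" instance (4 VCPUs / 80G disk / 8G RAM).
cores_used, cores_quota = 480, 512
ram_used_mb, ram_quota_mb = 983044, 1000000
instances_used, instances_quota = 126, 150

new_large = 10
need_cores = new_large * 4            # 40 cores
need_ram_mb = new_large * 8 * 1024    # 81920 MB

# Minimum quotas that fit the request on top of current usage:
min_cores = cores_used + need_cores           # 520 (above the current 512)
min_ram_mb = ram_used_mb + need_ram_mb        # 1064964 (above 1000000)
min_instances = instances_used + new_large    # 136 (fits within 150)
```

So both the core and RAM quotas have to be raised, while the instance-count quota already has headroom.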
[23:47:10] 06Labs, 10Labs-Infrastructure, 10Horizon: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724283 (10bd808)
[23:52:46] 06Labs, 10Labs-Infrastructure, 10Horizon: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724268 (10AlexMonk-WMF) Can work around this issue using the 'classes' hiera attribute
[23:55:09] 06Labs, 10Labs-Infrastructure, 10Horizon: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724288 (10yuvipanda) Yup, that's the current workaround. I'd like an easier way for this though.
[23:57:03] 06Labs, 10Labs-Infrastructure, 10Horizon: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724290 (10AlexMonk-WMF) yeah, agreed
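The 'classes' hiera workaround AlexMonk-WMF mentions on T148481 can be sketched as instance hiera data; it works because role::labs::instance does a hiera_include on that key (per yuvipanda's 23:43 comment). The role name below is purely illustrative:

```yaml
# Instance hiera data; hiera_include('classes') in role::labs::instance
# picks this list up. The role name here is a made-up example.
classes:
  - role::example::arbitrary_role
```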