[00:00:53] Platonides: sorry about that 'see above'. I meant I was demonstrating a difference in the errors to isolate the issue, rather than trying seek help about my command
[00:02:29] yep, sorry
[00:02:51] I gave it a quick look, and spotted the obvious
[00:53:22] PROBLEM - Puppet errors on tools-exec-1430 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:03:23] Labs, Tool-Labs, Tool-Labs-tools-Database-Queries: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361809 (MusikAnimal)
[01:03:37] Labs, Tool-Labs, Tool-Labs-tools-Database-Queries: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361822 (MusikAnimal)
[01:06:08] Labs, Tool-Labs, Tool-Labs-tools-Database-Queries: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361823 (MusikAnimal)
[01:09:36] Labs, Tool-Labs, Tool-Labs-tools-Database-Queries: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361809 (MZMcBride) @jcrespo can answer much better than I can, but in my experience, these types of data integrity issues on To...
[01:20:13] Labs, Tool-Labs, Tool-Labs-tools-Database-Queries: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361828 (MusikAnimal) I want to also point out this query took 9 seconds to finish on production.
[02:03:22] RECOVERY - Puppet errors on tools-exec-1430 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:24:24] Labs, Tool-Labs, DBA: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361851 (zhuyifei1999)
[02:30:21] Labs, Tool-Labs, DBA: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361809 (zhuyifei1999) FWIW, using https://tools.wmflabs.org/tools-info/optimizer.py, the EXPLAIN-s for both queries query is basically the same as (differ a...
[02:38:00] Labs, Tool-Labs, DBA: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361856 (zhuyifei1999) Regarding logging_logindex: | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | | 1 | SIMPLE | l...
[03:23:28] Labs, Tool-Labs, DBA: Tool Labs logging vs indexed version returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361908 (bd808)
[03:23:30] Labs, DBA, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3361909 (bd808)
[03:24:50] Labs, DBA: enwiki_p logging vs logging_userindex returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361809 (bd808)
[03:26:54] Labs, DBA, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3361916 (bd808) Linked {T168349} as a child. The report there is pretty long for pasting into this task.
[03:30:14] Labs, DBA: enwiki_p logging vs logging_userindex returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361928 (zhuyifei1999) Maybe DBAs have better ideas, but this is an optimised-to-4-min query: {P5596} The relevant EXPLAIN is: | id | select_type | table | type | possib...
[03:31:18] bd808: I don't really think it's a drift. it's just mysql grouping mechanism being weird
[03:34:08] and mysql optimizer is bad (and you can't really write a good optimizer for such declarative language as SQL anyways)
[03:37:34] zhuyifei1999_: could be. my default guess for "results different than prod" is to link to that tracking bug
[03:37:50] k
[03:39:03] there was a time when I knew lots of sql things, but I've forgotten most of them ;)
[03:39:45] yeah, sql is madness
[03:39:57] * zhuyifei1999_ goes and write c
[03:40:04] I bumped https://phabricator.wikimedia.org/T109179
[03:40:14] Since literally every time drift comes up, this gets mentioned.
[03:40:18] And it seems to have stalled.
[03:40:24] we are actually getting close
[03:40:30] Nice.
[03:40:46] I saw some notes about codfw testing, but they're from late 2015 and early 2016.
[03:40:54] I was taking to Jamie and Manuel about it in an email thread last week
[03:41:17] There's a separate task somewhere about automating data integrity checks.
[03:41:21] I think.
[03:41:36] But at the moment such a system would just tell us what we already know.
[03:42:14] T140788 is the master task for the new replicas that are using row-based replication
[03:42:14] T140788: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788
[03:42:15] Hmmm, maybe the lack of primary keys was the blocker to row-based replication.
[03:42:22] yeah
[03:44:37] the part that we are mostly waiting on is filling up the new replicas with sanitized data -- T153743
[03:44:37] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[04:02:15] Labs, DBA: enwiki_p logging vs logging_userindex returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361943 (MusikAnimal) >>!
In T168349#3361928, @zhuyifei1999 wrote: > Maybe DBAs have better ideas, but this is an optimised-to-4-min query: > > P5596 This is amazing. Th...
[05:08:21] Labs, DBA: enwiki_p logging vs logging_userindex returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361969 (zhuyifei1999) >>! In T168349#3361943, @MusikAnimal wrote: > I think something really funky is going on. The grouping mechanism don't seem to work correctly from...
[05:15:53] PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[05:55:52] RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:35:31] PROBLEM - Puppet errors on tools-exec-1403 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:49:22] PROBLEM - Puppet errors on tools-exec-1430 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:49:23] Labs, Quarry, Community-Wikimetrics, DBA, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3362051 (jcrespo)
[07:05:39] PROBLEM - Puppet errors on tools-exec-1404 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:10:30] RECOVERY - Puppet errors on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:29:20] RECOVERY - Puppet errors on tools-exec-1430 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:34:43] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3362175 (Samwilson) It is live, yes. It seems to be timing out for users with large numbers of edits. Even loading your example with just the 'general stats' se...
[07:37:24] !help
[07:37:24] weirdo: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[07:40:40] RECOVERY - Puppet errors on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:41:49] Labs, Labs-Infrastructure, Operations: Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3356455 (akosiaris) It's not related to the host. It's the Puppet CA itself as @Andrew says. On a random VM created on Mar 26 ``` sudo openssl x509 -noout -...
[07:44:46] weirdo: yes?
[07:49:01] I got an email saying my article was reviewed
[07:49:08] I don't know what that means
[07:49:38] I googled it to death
[08:10:19] weirdo: you mean wikipedia?
[08:11:16] if that's the case, you might want to ask in #wikipedia-en-help
[08:11:41] thanks
[08:13:10] np
[08:59:51] Tool-Labs-tools-fatameh: URL regexes are too loose - https://phabricator.wikimedia.org/T168363#3362362 (Tarrow)
[09:01:46] Tool-Labs-tools-fatameh: Enable Auth token for non browser session use - https://phabricator.wikimedia.org/T168364#3362375 (Tarrow)
[09:49:08] Labs, DBA: enwiki_p logging vs logging_userindex returning dramatically different results - https://phabricator.wikimedia.org/T168349#3361809 (Marostegui) In which hosts did you do the tests?
[10:02:46] Tool-Labs-tools-Other: Heavy 19-hour quries on labsdb1005 (tools-db) by s51203 at s51203__baglama2_p - https://phabricator.wikimedia.org/T168375#3362604 (jcrespo)
[10:05:06] Tool-Labs-tools-Other: Heavy 19-hour quries on labsdb1005 (tools-db) by s51203 at s51203__baglama2_p - https://phabricator.wikimedia.org/T168375#3362618 (jcrespo)
[10:05:08] Labs, Tool-Labs, DBA, Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#3362619 (jcrespo)
[10:06:02] Labs, Tool-Labs, DBA, Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#1830725 (jcrespo)
[10:06:04] Labs, Tool-Labs, DBA: s51053 (tools.jackbot) is abusing resources on labsdbs, throttle his grants - https://phabricator.wikimedia.org/T114559#3362620 (jcrespo)
[10:06:15] Labs, Tool-Labs, DBA: s51053 (tools.jackbot) is abusing resources on labsdbs, throttle his grants - https://phabricator.wikimedia.org/T114559#1699378 (jcrespo)
[10:06:17] Labs, Tool-Labs, DBA, Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#1830725 (jcrespo)
[10:08:00] Labs, Tool-Labs, DBA: s51053 (tools.jackbot) is abusing resources on labsdbs, throttle his grants - https://phabricator.wikimedia.org/T114559#3362637 (JackPotte) This should be resolved now, do you know a monitoring on which I could check it please?
[10:09:29] Labs, Tool-Labs, DBA: s51053 (tools.jackbot) is abusing resources on labsdbs, throttle his grants - https://phabricator.wikimedia.org/T114559#3362642 (jcrespo) This is resolved, I only edited it because of admin purposes (correct tracking). Sorry for the spam, email is automatic.
[14:04:58] Labs, Labs-Infrastructure, Operations, cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3363428 (Andrew)
[14:53:40] Tool-Labs-tools-fatameh: Enable Auth token for non browser session use - https://phabricator.wikimedia.org/T168364#3362375 (Tarrow) Open>Resolved a:Tarrow
[14:54:36] Labs, Labs-Infrastructure, Operations, ops-eqiad: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3363615 (RobH)
[15:14:09] Labs, MediaWiki-extensions-OpenStackManager, User-Addshore: Add "GoranSMilovanovic" to labs "bastion" project - https://phabricator.wikimedia.org/T165294#3363708 (Addshore) Open>Resolved a:Addshore
[15:22:33] Labs, DBA, User-bd808, cloud-services-team (Kanban): setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011 - https://phabricator.wikimedia.org/T168021#3353312 (JAllemandou) Hi @Marostegui I can't connect to `dewiki_p` nor `wikidatawiki_p` on `labsdb-analytics`. Should this task be reopened?
[15:23:20] Labs, DBA, User-bd808, cloud-services-team (Kanban): setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011 - https://phabricator.wikimedia.org/T168021#3363746 (Marostegui) What errors are you getting?
[15:25:04] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3363765 (Andrew)
[15:25:51] Tool-Labs-tools-Other: Heavy 19-hour quries on labsdb1005 (tools-db) by s51203 at s51203__baglama2_p - https://phabricator.wikimedia.org/T168375#3363766 (Magnus) I have rewritten the query, should work better now I hope
[15:39:40] Labs, DBA, User-bd808, cloud-services-team (Kanban): setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011 - https://phabricator.wikimedia.org/T168021#3363809 (Marostegui) I have recreated the views, can you try again? if you can show the error you are getting, that would be helpful. Als...
[16:05:33] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3363907 (RobH) a:Cmjohnson>RobH
[16:25:49] Labs, DBA, User-bd808, cloud-services-team (Kanban): setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011 - https://phabricator.wikimedia.org/T168021#3364008 (JAllemandou) I use a script checking available views from `information_schema`. For the moment it still tells me `dewiki` and `...
[16:27:52] chasemp or andrewbogott: im taking over the setup task for the new labvirt hosts in eqiad
[16:28:17] but i have questions for the networking (they are currently in 1gbe racks, but have both 1gbe and 10gbe capabitlity)
[16:28:25] and on partitioning, they are 10 * 1.6TB ssds
[16:28:43] andrewbogott: asked for a small raid for the os and a larger for the data
[16:28:44] With hardware raid, you raid the ENTIRE disk, so if you want your OS data on a different raid partition than the data, it has to be split into, at minimum, 2 of the 10 1.6TB SSDs. That would lose a substantial amount of data to just silo the OS to its own hardware raid.
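The capacity trade-off in the message above can be put into rough numbers. The sketch below is only an illustration: it assumes two-way mirroring (RAID 1 for a dedicated OS pair, RAID 10 for the data array), which the log does not confirm for the split layout, and the function and variable names are made up for this example. It uses the nominal 1.6 TB (1600 GB) disk size from the discussion.

```python
# Rough usable-capacity comparison for 10 x 1.6 TB SSDs.
# Assumption (not stated in the log): every array is two-way mirrored,
# i.e. RAID 1 for an OS pair and RAID 10 for the data disks.
DISK_GB = 1600   # nominal size of one SSD, in decimal gigabytes
DISKS = 10


def mirrored_usable_gb(n_disks, disk_gb=DISK_GB):
    """Usable capacity of a two-way mirrored array (RAID 1 or RAID 10)."""
    if n_disks < 2 or n_disks % 2:
        raise ValueError("a mirrored array needs an even number of disks, at least 2")
    return n_disks * disk_gb // 2


# Layout A: one RAID 10 across all ten disks, partitioned later for OS and data.
single_array_gb = mirrored_usable_gb(DISKS)

# Layout B: two disks set aside as a RAID 1 for the OS, eight left for data.
os_array_gb = mirrored_usable_gb(2)
data_array_gb = mirrored_usable_gb(DISKS - 2)

# Space that ends up tied to the OS array instead of the instance data.
lost_to_os_gb = single_array_gb - data_array_gb

# The same single-array size expressed in binary units (TiB).
single_array_tib = single_array_gb * 10**9 / 2**40

print(single_array_gb, data_array_gb, os_array_gb, lost_to_os_gb)
print(round(single_array_tib, 2))
```

Under these assumptions a dedicated OS pair ties up a full 1600 GB mirror for a root filesystem that needs only a small fraction of it, which is the loss described above. The single-array figure, about 7.28 TiB, is in the same range as the 7.24TB RAID 10 size quoted later in the log, assuming that number is reported in binary units.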
[16:30:13] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364014 (RobH) @andrew: These hosts were reviewed and approved for order with 10 * 1.6TB Intel S3510 SSDs. With hardware raid, you raid the ENTIRE disk,...
[16:34:19] Labs, DBA, User-bd808, cloud-services-team (Kanban): setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011 - https://phabricator.wikimedia.org/T168021#3364022 (Marostegui) I don't know what that script does but: ``` mysql:root@localhost [information_schema]> select @@hostname; +---------...
[16:34:48] robh: hm, ok… so I guess we want just one big hardware raid and then we can partition that for os and VMs.
[16:34:53] * andrewbogott braces for a day of partman
[16:34:58] robh: I'll update accordingly
[16:35:17] as far as I know the row those are in doesn't have ports for 10Gb so everything is just set up with 1G, these can be the same.
[16:35:45] andrewbogott: ill write the recipe
[16:35:52] great!
[16:35:57] i just bashed one out for dumpsdata with only 2 live hacks in testing
[16:36:03] ie: got it installed in less than 3 reboots!
[16:36:09] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364029 (Andrew) > Would that be acceptable? Yep, sounds great. Thank you.
[16:36:09] so im due for more partman hell
[16:36:10] ;]
[16:36:33] 3 reboots is a lot less than my last attempt
[16:37:01] well, im also just totally going to steal what i made for dumpsdata and use it here with a little bit of refactoring on mount points
[16:37:10] as i just this second realized they are nearly identical otherwise
[16:37:18] \o/
[16:37:33] also these only have one of the two interfaces hooked up
[16:37:38] where it seems other labvirts have 2 ports
[16:37:44] do these need to have 2 ports bonded for speed?
[16:38:38] I'm not sure.
[16:38:48] We definitely want these to be the same as the other labvirst
[16:39:01] but I would've guessed that they had one port for ssh and one port for VM neteworking
[16:39:11] chasemp: do you remember?
[16:39:29] yeah in site.pp it doesnt say
[16:39:36] but it does say labvirts that already exist have openstack::nova::partition{ '/dev/sdb': }
[16:39:44] which wont work for these if they have to define that as well
[16:39:47] (since these are hw raid)
[16:39:58] we can also make the new hosts sw raid, but seems wasteful on cpu overhead ;]
[16:40:27] theopenstack::nova::partition wont block my installation of the OS, but it seems like it may block you later ;]
[16:40:50] (i hope it can also be defined by a directory structure rather than a partition)
[16:40:59] or we may have to peel off an SSD for it?
[16:42:24] I think that specifying the partition device is in hiera and differs from labvirt to labvirt
[16:42:31] so this won't be any different
[16:42:41] cool
[16:42:51] well
[16:42:56] partition device is different
[16:42:56] e.g. role::labs::openstack::nova::compute::instance_dev: "/dev/mapper/tank-data"
[16:43:01] ohhhh
[16:43:03] ok
[16:43:16] in puppet/hieradata/hosts/labvirt1013.yaml
[16:43:17] just odd that site.pp has it for node /^labvirt100[0-9].eqiad.wmnet/ { but the others are in heira
[16:43:20] (for example)
[16:43:21] andrewbogott: robh no bonding on the labvirts
[16:43:26] sorry I missed hte ping
[16:43:30] chasemp: do they need two interfaces?
[16:43:50] right now only 1 is wired for the labs-hosts1-b-eqiad
[16:43:51] chasemp: don't we have both ports hooked up, though? One on the lab VM network and one on the labsupport (or maybe normal prod) network?
[16:43:51] yes, one in labs-hosts and one is a trunk
[16:43:59] ok, so eth0 is labs-hosts1-b-eqiad
[16:44:03] and eth1 is trunk?
[16:44:10] robh: same scheme as labvirt2003 was in codfew
[16:44:12] yeah
[16:44:20] I would've said the other way around...
[16:44:22] eth0 trunk
[16:44:29] but, can we look at an existing one and see?
[16:44:46] eth0 is labs-hosts and eth1 is trunk on existing
[16:45:08] yeah
[16:45:09] eth1.1102 is the subinterface
[16:45:11] it seems other way areound
[16:45:12] same for labnet
[16:45:17] ge-5/0/0 up up labvirt1014 eth0
[16:45:17] ge-5/0/3 up up labvirt1014 eth1
[16:45:29] and eth1 is in labs-instances1-b-eqiad
[16:45:43] unless they are labeled wrong on switch
[16:46:08] but if so then its wrong for labvirt1013 as well
[16:46:14] it has eth1 in labs-instances1-b-eqiad
[16:46:22] which seems odd, since then its primary interface is in 'trunk'
[16:46:29] which i dont see on the switch as a vlan, so its something else
[16:46:59] oh wait, found labs-hosts1-b-eqiad
[16:47:44] I'm confident eth0 is in labs-hosts1-b-eqiad, less so eth1 switch side configuraiton other than to say all logic assumes trunk
[16:47:58] and it should be consistent across labvirts
[16:48:49] robh: it seems what I'm saying and what you are saying is the same, I'm not sure what you see as 'other way around'
[16:52:10] * andrewbogott withdraws his opinion
[16:53:20] sorry, bouncer died
[16:53:52] chasemp: sorry about that, but vlans seem sensible now that i stops transposing them about.
eth0 in labs-hosts1-b-eqiad and eth1 in labs-instances1-b-eqiad
[16:54:15] robh: eth1 is a trunk technically and not in any particular vlan but labs-instances1-b-eqiad in on the allowed list
[16:54:28] I'm not sure in junos what the effect is if a port in a VLAN range and yet functioning as a trunk
[16:54:37] if that's what seems to have happened in some case
[16:54:40] also its not required for the os isntall afaik
[16:54:53] right eth1 is instances only
[16:54:57] so it seems i can just install these so they are calling in and just not running instances
[16:55:03] then you guys are only blocked on netops
[16:55:06] seems right to me
[16:55:19] robh: sure, you can assign to me post install and I'll take care of it
[16:56:29] legoktm: I need a sysadmin, are you around?
[16:57:16] legoktm: I would like to rename a user with 75k edits.
[17:01:57] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364154 (RobH) Ok, further updates. I'll write the partman recipe and get the OS isntallation done on these. However, all of these hosts will need their...
[17:02:50] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364156 (RobH)
[17:04:40] Labs, Labs-Infrastructure, Operations, ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364159 (chasemp) @Cmjohnson @RobH thanks guys, post install assign to me and I'll take care of it.
[17:11:15] PROBLEM - Puppet errors on tools-bastion-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[17:11:49] andrewbogott: actually, it seems the other labvirts1014 use a recipe called labvirt-ssd
[17:12:00] but the sizes are arbitarry and uses swap, i'd like to modify the existing recipe
[17:12:17] I think that's fine, it wont affect existing boxes anyway.
[17:12:19] but then it has to be understood when those labvirt1010-1014 reinstall they'll repairtitiong slightly
[17:12:21] yeah
[17:12:31] also the ssd recipe uses a mount point of
[17:12:37] /var/lib/nova/instances
[17:12:50] is that what you want as the instance mount point, not srv or something liek that?
[17:13:10] that's it and that's hardcoded elsewhere to match so it's not easily changed
[17:13:16] yeah, that's what I want. At least, that's how everything else is anyway. We could symlink instead if it worries you :)
[17:13:22] ok, i'll use that, nah!
[17:13:27] i'd just like to remove swap from it
[17:13:36] and move the # * 92G / to just 120GB like most things
[17:13:58] sure
[17:14:07] there was debate on swap, im not sure where it lands on labvirt usage
[17:14:44] I don't have much opinion, although there's something to be said for having consistency among servers doing the same job.
[17:14:48] How much swap is in that recipe?
[17:14:53] https://phabricator.wikimedia.org/T156955
[17:14:58] not much
[17:14:59] 8gb
[17:15:06] so i can just leave if you prefer
[17:15:26] honestly we can leave the 97gb for / if you like i dont really ahve a strong preference, i just like to suggest standardization ;]
[17:15:42] I'd rather you leave it, just so we don't have one more variable. But standardizing the OS partition is definitely fine
[17:15:51] your raid10 on these hosts is 7.24TB
[17:15:59] so you have some space
[17:16:05] sweet
[17:20:17] oh, one more thing
[17:20:23] it seems other labvirts are trusty
[17:20:32] do these need to be as well? (or can they be jessie?)
[17:20:36] yeah, these need to be trusty too for now :(
[17:20:38] ok
[17:20:40] wilco
[17:22:48] Labs, Labs-Infrastructure, Operations, ops-codfw: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3364338 (Papaul) In the process of troubleshooting the pxe boot issue on this system, I setup a test dhcp/dns/tftp server on my laptop and boot the server to it...
[17:28:11] Labs: Deprecate DSA (ssh-dss) SSH keys for Labs users - https://phabricator.wikimedia.org/T168433#3364362 (bd808)
[17:29:20] Labs, DBA, User-bd808, cloud-services-team (Kanban): setup dewiki and wikidatawiki on the labsdb1009, 1010 and 1011 - https://phabricator.wikimedia.org/T168021#3364388 (JAllemandou) I can access the views - Sorry for the false positive. However my script still don't find the DB - I'll need to loo...
[17:30:24] Labs, Labs-Infrastructure: ssh-dss (DSA) keys fail for Labs instances with "debian-9.0-stretch (experimental)" image - https://phabricator.wikimedia.org/T167267#3364398 (bd808) Open>declined Closing in favor of {T168433} after a short discussion with @faidon on irc. Affected users should generate...
[17:31:26] Labs, cloud-services-team (Kanban): Deprecate DSA (ssh-dss) SSH keys for Labs users - https://phabricator.wikimedia.org/T168433#3364362 (bd808)
[17:36:29] Labs: `maintain-meta_p --all-databases` timeout on labsdb1009 contacting uk.wikimedia.org - https://phabricator.wikimedia.org/T168436#3364444 (bd808)
[17:36:44] Labs, cloud-services-team (Kanban): `maintain-meta_p --all-databases` timeout on labsdb1009 contacting uk.wikimedia.org - https://phabricator.wikimedia.org/T168436#3364457 (bd808)
[17:37:32] Labs, Labs-Infrastructure, Operations, ops-codfw, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3319424 (Dzahn) >>!
In T167157#3364338, @Papaul wrote: > Jun 20 17:21:43 install2002 dhcpd[11106]: DHCPDISCOVER from 30:e1:71:63:5e:5c via...
[17:37:45] Labs, Labs-Infrastructure, Operations, ops-codfw, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3364465 (Papaul) Daniel find out that for 208.80.153.108 reverse lookup = 2001 and forward lookup = 1002 He fixed it and will try inst...
[17:41:17] Labs, Tool-Labs, Tools-Kubernetes: Fix or delete tools-worker-1028 and 29 - https://phabricator.wikimedia.org/T167324#3364484 (yuvipanda) Open>Resolved a:yuvipanda I just deleted these :)
[17:43:48] PROBLEM - Host tools-worker-1028 is DOWN: CRITICAL - Host Unreachable (10.68.22.23)
[17:45:08] Labs, PAWS, Tool-Labs, Tools-Kubernetes: Consider moving PAWS to its own k8s cluster, rather than using Tools' k8s cluster - https://phabricator.wikimedia.org/T167086#3364539 (yuvipanda) Going to keep it inside tools!
[17:46:16] RECOVERY - Puppet errors on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:46:26] PROBLEM - Host tools-worker-1029 is DOWN: CRITICAL - Host Unreachable (10.68.22.5)
[17:53:08] Labs, Horizon, User-bd808, cloud-services-team (Kanban): Horizon bug: hidden web proxy after deleting instance - https://phabricator.wikimedia.org/T167985#3364576 (mpopov) Open>Resolved @Andrew it works now, thank you!
[18:15:25] cloud-services-team, Operations: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3364701 (MoritzMuehlenhoff)
[18:20:48] o/
[18:21:47] I'd like to run some performance analysis on the beta cluster while I do some uploads and thumbnailing, does anyone know where I should look to find information on CPU/memory usage?
[18:25:17] Labs, DBA: enwiki_p logging vs logging_userindex returning dramatically different results - https://phabricator.wikimedia.org/T168349#3364718 (MusikAnimal) >>!
In T168349#3362544, @Marostegui wrote: > In which hosts did you do the tests? Sorry I didn't record this information. I ran `sql enwiki` on `too...
[18:30:01] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3364723 (MusikAnimal) >>! In T165401#3362175, @Samwilson wrote: > It seems to be timing out for users with large numbers of edits. I'm not sure what's going on...
[18:36:52] marktraceur: https://tools.wmflabs.org/nagf/?project=deployment-prep or https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?var-project=deployment-prep&var-server=All possibly
[18:37:04] chase here apparently irc hates me
[18:37:51] test
[18:37:57] andrewbogott hi, is it normal for horizion create instance page to look like
[18:37:57] https://phabricator.wikimedia.org/F8491115
[18:38:09] it is showing variable names instead of images.
[18:38:21] Thanks chasemp
[18:39:43] paladox: it's a race condition that pops up sometimes, if you reload the dialog it should switch back to normal
[18:39:52] Ok
[18:39:53] thanks
[18:48:50] andrewbogott i've created a stretch instance called wikistats-tuesdaytaco but trying to ssh in fails with
[18:48:51] $ ssh wikistats-tuesdaytaco
[18:48:51] Permission denied (publickey).
[18:48:51] Killed by signal 1.
[18:49:02] what project?
[18:49:06] wikistats
[18:51:08] cloud-services-team, Operations: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3364905 (MoritzMuehlenhoff) Updated kernels have been installed (plus the related base libraries/services).
[18:52:23] Labs-project-Extdist: Migrate extdist.wmflabs.org to Debian stretch - https://phabricator.wikimedia.org/T168456#3364906 (Legoktm)
[18:53:16] paladox: Googling, I see a few other people with a similar error but there's no good explanation other than 'corruption'. I haven't seen it with any test cases… want to see if it happens twice in a row?
[18:53:27] Ok
[18:53:38] i will re create it :)
[18:54:34] thanks also for googleing :)
[18:54:54] Labs, cloud-services-team (Kanban): `maintain-meta_p --all-databases` timeout on labsdb1009 contacting uk.wikimedia.org - https://phabricator.wikimedia.org/T168436#3364940 (bd808) Same result on labsdb1010 and labsdb1011.
[18:58:13] andrewbogott still happends
[18:58:24] hm, ok
[18:59:28] which is strange as doing a direct jessie -> to stretch (by that i mean doing apt-get dist-upgrade) works. Though this is a brand new instance with stretch. Could it some how be missing the config that tells the instances where our pub key is which we store in wikitech?
[19:02:28] works now
[19:02:30] paladox: I think this one just wasn't ready yet. Does it work for you now?
[19:02:31] after doing a reboot
[19:02:35] Yep
[19:02:39] thanks
[19:03:02] Labs, Tool-Labs, DBA, Stewards-and-global-tools: Throttling linkwatcher tool user as it is consuming 100% CPU - https://phabricator.wikimedia.org/T121094#1868898 (Luke081515) Any updates on this old ticket?
[19:12:07] Labs, cloud-services-team (Kanban): labmon1001 disk filling up - https://phabricator.wikimedia.org/T168344#3361694 (Luke081515) I would think a year is a first good step.
[19:19:20] andrewbogott: I am back around sorry.
[19:19:33] hashar: is now a good time?
[19:19:47] for the nodepool rate ( https://gerrit.wikimedia.org/r/#/c/358601/ ), maybe the OpenStack API has a rate limit as well ?
[19:19:54] and yeah that can be done any time for nodepool/ci side
[19:20:08] nodepool reread the yaml file automagically
[19:20:24] that will just make it send requests every 5 seconds instead of 6 secs.
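For scale, the change from one request every 6 seconds to one every 5 seconds can be converted into requests per minute. This is a small sketch with a hypothetical helper name, assuming the setting being changed is a fixed delay between requests, as the message above describes.

```python
# Convert a fixed delay between API requests into a per-minute rate.
# Assumption: the setting under discussion is a delay in seconds between
# requests, as described in the chat ("every 5 seconds instead of 6").


def requests_per_minute(delay_seconds):
    """Return how many requests fit in one minute at the given spacing."""
    if delay_seconds <= 0:
        raise ValueError("delay must be positive")
    return 60 / delay_seconds


old_rate = requests_per_minute(6)   # spacing before the change
new_rate = requests_per_minute(5)   # spacing after the change
increase_pct = (new_rate - old_rate) / old_rate * 100

print(old_rate, new_rate, round(increase_pct))
```

At these values the client goes from 10 to 12 requests per minute, a 20% increase in load on the API side.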
[19:20:40] the trouble is really figuring out what might happen on the openstack side :(
[19:22:34] andrewbogott hi, Krenair in -devtools says he carn't write a comment on https://phabricator.wikimedia.org/phame/post/view/56/watroles_returns_in_a_different_place_and_with_a_different_name_and_totally_different_code./ but seems other users can
[19:22:43] I carn't write a comment on there either.
[19:23:01] hashar: merging, we will see :)
[19:24:04] paladox: I feel like we've seen this before but I don't remember what it was. chasemp, any idea about phab blog comments?
[19:24:45] andrewbogott " I believe comments are controlled by the 'edit' policy on the blog"
[19:24:49] :]
[19:25:29] Ah, so it's not per-post
[19:25:53] I think it's a bit broken, if it's what I think it is then you can't comment on a blog post unless you can edit that blog (editing the blog doesn't affect posting, posting is controlled by the 'blog post' form, it's all rather confusing and stupid)
[19:26:30] but since posting is controlled by the form then we can allow editing on the blog and that won't allow just anyone to post to the blog
[19:27:51] twentyafterfour would you be able to fix the form please?
[19:29:44] andrewbogott: nodepool caught up and does run a query every 5 seconds :)
[19:29:50] it could just be set up the wiki way
[19:42:18] andrewbogott: looks good to me so far. I will check the impact on CI after a couple days of data
[19:45:55] Labs, cloud-services-team (Kanban): labmon1001 disk filling up - https://phabricator.wikimedia.org/T168344#3361694 (chasemp) >>! In T168344#3365032, @Luke081515 wrote: > I would think a year is a first good step. +1
[19:51:32] Labs, cloud-services-team (Kanban): `maintain-meta_p --all-databases` timeout on labsdb1009 contacting uk.wikimedia.org - https://phabricator.wikimedia.org/T168436#3365188 (bd808) p:Triage>Low ukwikimedia is in both wikimedia.dblist and closed.dblist, but not in the deleted.dblist which would kee...
[19:52:35] hashar: sounds good, thanks for sticking around. [19:54:31] 10Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 10Nodepool, and 2 others: Lower rate of Nodepool requests to OpenStack API - https://phabricator.wikimedia.org/T167803#3365191 (10hashar) Nodepool caught up with the new rate. That would reflect on the graphs: * //Tasks per minute// h... [19:55:34] andrewbogott: if all goes fine on the openstack side, I guess I will ask to lower it slightly again [19:56:38] 10Labs-project-Wikistats, 10Patch-For-Review: Wikistats 2.2 [beta] gives internal server error 500 for all csv, ssv and xml formats - https://phabricator.wikimedia.org/T165879#3365196 (10Dzahn) Nice, i see the subtask is resolved too. cool! ( i should still do the rewrites when i get to it) Also, Xqt i lik... [20:07:15] 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365221 (10Papaul) [20:57:46] 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365385 (10Papaul) [21:01:32] 10Labs, 10Labs-Infrastructure: Setup wikitech, horizon, and striker on new labweb hardware - https://phabricator.wikimedia.org/T168470#3365393 (10bd808) [21:22:04] 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365462 (10Papaul) [21:26:20] 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365467 (10Papaul) @Andrew this is complete you can take over from here. Thanks. 
[21:30:28] 10Labs, 10Patch-For-Review, 10User-bd808, 10cloud-services-team (Kanban): `maintain-meta_p --all-databases` timeout on labsdb1009 contacting uk.wikimedia.org - https://phabricator.wikimedia.org/T168436#3365470 (10bd808) a:03bd808 [21:41:25] 10Labs-project-Wikistats: numbers in rank.php wrong? - https://phabricator.wikimedia.org/T168474#3365509 (10Dzahn) [21:41:32] 10Labs-project-Wikistats: numbers in rank.php wrong? - https://phabricator.wikimedia.org/T168474#3365521 (10Dzahn) a:03Dzahn [22:10:50] 10Labs, 10cloud-services-team, 10DBA, 10wikitech.wikimedia.org: move wikitech and labstestwiki to s3 (needs discussion) - https://phabricator.wikimedia.org/T167973#3365576 (10bd808) [22:10:50] 10Labs, 10Labs-Infrastructure: Setup wikitech, horizon, and striker on new labweb hardware - https://phabricator.wikimedia.org/T168470#3365575 (10bd808) [22:12:50] 10Labs, 10Labs-Infrastructure: Setup wikitech, horizon, and striker on new labweb hardware - https://phabricator.wikimedia.org/T168470#3365393 (10bd808) [22:13:52] 10Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553#3365585 (10bd808) [22:13:52] 10Labs, 10Operations, 10wikitech.wikimedia.org, 10HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1278203 (10bd808) [22:14:49] 10Labs, 10Operations, 10wikitech.wikimedia.org, 10HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1278203 (10bd808) >>! In T98813#3135116, @greg wrote: > Added T161553 as a subtask per above comments. I removed OSM deprecation as a blocker. I think we can figure out how to... 
[22:19:54] 10Labs, 10wikitech.wikimedia.org, 10Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#3365606 (10bd808) [22:21:10] 10Labs, 10Striker, 10Operations, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#3365620 (10bd808) [22:21:12] 10Labs, 10wikitech.wikimedia.org, 10Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#3145305 (10bd808) [22:33:51] chasemp: heyas, ive installed labvirt101[56] and its odd. i can ping them from install1002, but i cannot ping or login to them from puppetmaster1001 [22:34:02] and i have to do that to access with new install key, sign puppet keys, etc... [22:34:14] oddly enough, labvirt1014 (existing labvirt) also doesnt ping from puppetmaster1001 [22:34:26] but does from install1002, so it makes me think its networking related and not related to my new isntalls [22:34:29] installs even [22:35:13] robh: my guess is that's fw related if it's install box specific [22:35:23] i mean the install box works [22:35:25] but puppetmaster doesnt [22:35:40] ie: os install is done on half of them but i cannot sign the puppet keys or login to them at all [22:35:57] just wondering if you may be aware of any recent changes to network rules that may explain it [22:36:48] I can't ping ping labvirt1001.eqiad.wmnet from puppetmaster1001 either [22:37:00] yeah, but those all call into it for puppet updates [22:37:07] so yeah, maybe firewall on puppetmaster [22:37:17] no I mean firewall in teh core routers [22:37:36] it would have had to be new since the last install of a labvirt [22:38:01] could hte move of this to puppetmaster1001 be new since the last labvirt? [22:38:55] labvirt1014 was installed last year [22:38:56] heh [22:39:23] I don't know when puppetmaster1001 came to be [22:40:46] Ubuntu 14.04.5 LTS auto-installed on Sat Aug 6 15:29:51 UTC 2016. 
[22:40:49] labvirt1014 [22:41:02] puppetmaster1001: Debian GNU/Linux 8 auto-installed on Wed Aug 24 22:45:07 UTC 2016. [22:41:09] but that doesnt mean thats when it took over as puppetmaser =P [22:41:17] either way, post labvirt1014 [22:41:20] it seems like labvirt1015-18 have puppet keys pending [22:41:29] yeah, but dont sign [22:41:39] cuz we cannot login to them to enable puppet to run after singing [22:41:43] signing [22:43:20] and the existing labvirts all call into puppet normal [22:43:24] so its something blocking ssh and ping [22:46:41] chasemp: so worst case is now i'll just have to escatae this to netops for them to check out the settings on the routers [22:46:53] but it seems like the installs ran fine, and the partitions during install seemed ok [22:47:15] so yu would do install-console next on puppmaster1001 [22:47:19] is that right? [22:47:45] basically install_console is just a wrapper for a new_install ssh key [22:47:53] that we use for the initial login and to enable puppet and trigger a run [22:47:57] I have the vague memory of something here and andrew asking folks about new installs on labs things [22:47:58] right ok [22:48:00] puppet is disabled on install [22:48:13] yeah i also vaguely recall that [22:48:20] and i thought whatever we did seemed broken [22:48:29] andrewbogott: is there some known issue with new installs on labvirts? I recall you looking at the problem when iron went away or something [22:48:30] like i recall he moved the key somewhere lese, but i think it was a server that i dont recall [22:49:00] oh [22:49:01] iron has it [22:49:10] i recall now i didnt like this solution [22:49:12] but it works. 
[22:49:31] I think it's better to firewall iron off from labs, it has no reason to touch it, and allow puppetmaster1001 ssh [22:49:33] maybe I only thought iron went away but I'm remembering the email thread and general discussion [22:49:37] but iron is 'ops bastion' so meh [22:49:54] it no longer houses the private repo or anything [22:50:00] but seems it still exists, not sure why [22:50:11] iron is a Bastion host using two factor authentication (bastionhost::twofa) [22:50:11] iron is a Experimental Yubico two factor authentication bastion (misc) [22:50:22] oh well, it works around this issue [22:50:42] robh: so you can do the first login from iron is that the deal? [22:50:46] yep [22:50:49] ok [22:51:20] fuzzy memory for teh win [22:52:20] hehe! aha.. so iron solves it [22:52:25] gtk [22:52:45] yeah, i just think it's a bad solution [22:52:53] seems no good reason for iron to talk to this vlan other than this [22:53:01] while puppetmaster1001 already has to talk to it for its normal duties [22:53:10] so allowing it additional ping and ssh access seems trivial from there versus iron. [22:53:18] (does that make sense?)
[22:53:38] so I think that the conflation of prod acl and instance acl here is what causes this [22:53:46] as in instances go through the labs-hosts vlan for transit [22:53:55] it seems like an ACL that allowed puppetmaster and install servers was changed [22:53:56] and an acl there says no 22 ever to be careful [22:54:09] oh, iron is public vlan [22:54:13] right [22:54:18] aha [22:54:44] I think this can be fixed down the road by separating out instance traffic from host traffic for labvirts totally [22:54:57] or at least that should relieve the paranoia [22:55:23] that's somewhere down the list of wishes [22:55:47] or we could suggest to put install-console on install servers [22:56:14] unless that is considered insecure [22:56:21] i mean officially, with puppet [22:56:43] I would think anything private space prod is going to suffer the same issues [22:56:52] gotta go eat :) [22:56:58] good luck robh and thanks [22:57:06] welcome [23:08:08] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Convert xtools intuition to its own repository - https://phabricator.wikimedia.org/T165708#3274486 (10kaldari) @Matthewrbowker: Can you explain this task? I believe the Intuition migration guide is about moving message keys out of the Intuition repo (as used t...
[23:08:40] 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3365759 (10RobH) p:05High>03Normal [23:22:23] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Create an XTools logo - https://phabricator.wikimedia.org/T167345#3365815 (10kaldari) [23:23:16] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Fix "Notice: Undefined index: allusers" in Adminstats when the wiki is unreachable - https://phabricator.wikimedia.org/T165707#3365819 (10kaldari) [23:27:39] robh: sorry for the delay in responding… those servers need to be set up from iron and not from the puppetmaster [23:27:46] due to how their vlan is set up, I believe. [23:27:55] oh, you're there already, great [23:28:02] Sorry I wasn't around to save you the trouble earlier :( [23:30:19] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Planning for Xtools beta - https://phabricator.wikimedia.org/T167217#3365834 (10kaldari) [23:30:37] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Planning for Xtools beta - https://phabricator.wikimedia.org/T167217#3321093 (10kaldari) [23:35:15] 10Labs, 10cloud-services-team (Kanban): Rename labs-admin mailing list to cloud-admin - https://phabricator.wikimedia.org/T167155#3365859 (10bd808) [23:37:22] 10Labs, 10User-bd808, 10cloud-services-team (Kanban): Consult with technical community on Cloud Services rebranding plan - https://phabricator.wikimedia.org/T165094#3365868 (10bd808) 05Open>03Resolved The on-wiki plan documentation has been updated based on feedback received from the consultation. See {T... [23:38:15] 10Labs, 10Horizon, 10MediaWiki-Vagrant, 10Patch-For-Review, and 2 others: Create MediaWiki Vagrant role for local devlopment of Horizon customizations - https://phabricator.wikimedia.org/T166006#3365875 (10bd808) 05Open>03Resolved Always more work to do, but the initial Horizon role is functional. 
[23:38:29] 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-eqiad, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3365877 (10RobH) a:05RobH>03Cmjohnson Chris: Please wire up eth1 on these systems and label their ports on the switch. Then you or I can take a look a... [23:49:18] 10Labs, 10Horizon, 10Patch-For-Review, 10cloud-services-team (Kanban): Fix watroles to work with new Puppet storage backend for Labs - https://phabricator.wikimedia.org/T151522#3365920 (10bd808) 05Open>03Resolved Live at https://tools.wmflabs.org/openstack-browser/puppetclass/. I also setup a redirect... [23:49:53] 10Labs, 10Labs-Infrastructure, 10Patch-For-Review, 10cloud-services-team (Kanban): Horizon puppet roles not cleared when instance is deleted - https://phabricator.wikimedia.org/T147878#3365922 (10bd808) 05Open>03Resolved