[00:41:40] I got a bunch of emails like "tools-exec-1402 : Aug 19 21:56:12 : diamond : unable to resolve host tools-exec-1402"
[00:41:45] what's happened
[00:42:05] subject:*** SECURITY information for tools-exec-1402 ***
[03:09:00] Tool-Labs-tools-DrTrigonBot---General: DRTRIGON-86 Test the re-write branch an decide what parts to migrate - https://phabricator.wikimedia.org/T61528#1556042 (jayvdb) Open>Resolved a:jayvdb Seems to have been completed.
[07:00:09] Labs, operations, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1556276 (jcrespo) I will add the user on puppet. Just for the record- on our configuration, users with hosts using dns entries are i...
[09:14:31] Hm https://tools.wmflabs.org/catscan2/cross_cats.php?language=it&project=wikiquote&category=Argomenti&depth=1&external_depth=0
[09:22:12] valhallasw`cloud: I'm starting to fill out the etherpad with Joe
[09:22:20] valhallasw`cloud: do poke if any of the OGE comments seem wrong
[09:22:25] YuviPanda: ehhhh
[09:22:31] err
[09:22:33] not the etherpad
[09:22:35] the spreadsheet
[09:22:44] for evaluation of k8s, OGE, Mesos
[09:26:44] aha
[09:28:39] YuviPanda: not in my shared-with-me folder?
[09:28:52] so you probably just sent a link at some point in the past
[09:30:27] valhallasw`cloud: just did
[09:33:43] ok maybe google sheets is not so great for chatting
[09:44:03] <_jovi_> valhallasw`cloud: thoughts on the gridengine stability question?
[10:37:54] Labs, operations: bastion-02.bastion.eqiad.wmflabs not restricted_from=(ops) like bastion-01 is - https://phabricator.wikimedia.org/T109641#1556640 (yuvipanda) Open>Resolved Done! Thanks for spotting!
[13:10:21] Labs, operations, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1556979 (jcrespo) mw1142 had the same problem tonight: https://logstash.wikimedia.org/#dashboard/temp/AU9LNp9HOkQDz4dSqpM2 Sorry I restarted instead of depool it.
[13:10:57] lol
[13:13:03] lol?
[13:17:18] (PS1) Sitic: Fix expand button size for traditional layout [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/232720
[13:17:33] (CR) Sitic: [C: 2 V: 2] Fix expand button size for traditional layout [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/232720 (owner: Sitic)
[13:57:30] liangent: I got those too; I’m investigating but I don’t think they’re serious.
[14:34:12] andrewbogott: okay
[14:34:29] liangent: I found a pretty good explanation, trying to fix now.
[15:24:05] hmm did something happen to redis (within the last hour)? I'm seeing the same problem as in https://phabricator.wikimedia.org/T101514
[15:24:38] also is http://graphite.wmflabs.org/ working? getting connection refused
[15:25:17] <_jovi_> sitic: I think the proxy blew up in the reboots
[15:25:20] <_jovi_> sitic: should be back now
[15:25:33] <_jovi_> sitic: redis was restarted during the reboots. I was going to send out an email but didn't manage to do it in time :|
[15:25:34] <_jovi_> sorry
[15:26:51] _jovi_: I encourage you to change your username back so that people can tell who the heck you are :)
[15:26:54] _jovi_: it took me a while to figure out who you are :p
[15:27:04] hehe
[15:27:16] _groovy_panda_
[15:27:21] ^^
[15:28:19] _jovi_: ah ok. I think I'm getting some really old data from redis, at least I just got 400+ error emails for old tasks
[15:29:03] andrewbogott: hehe
[15:29:36] sitic: hmm, still?
[15:30:18] YuviPanda: it’ll be another 10-15 before everything from labvirt1008 is back up. Meantime I’m going to go get a bagel.
[15:30:25] andrewbogott: cool
[15:30:29] And then after that… maybe we can do live-migration again :)
[15:30:44] well the emails have stopped. Don't know though if the data is up to date or the queue is just empty now …
[15:33:14] we'll find out soon enough!!!!!
[15:33:18] also you shouldn't get stale data
[15:36:09] yeah, it happened before https://phabricator.wikimedia.org/T101514 :-P
[15:36:40] http://graphite.wmflabs.org/render/?width=586&height=308&target=tools.tools-redis-01.redis.6379.memory.internal_view&from=-1days looks also a bit suspicious. Well redis is a cache for most people anyway so …
[15:36:54] but I deleted the old instance...
[15:36:55] let me verify
[15:37:46] nope, tools-redis is gone
[15:38:37] hmmm ok, strange
[15:38:50] it seems ok
[15:38:55] you'd have lost all pubsub if you were using that
[15:39:08] I should set up nutcracker for redis so we can fix this kind of stuff. sigh
[15:40:19] I'm already seeing enough nutcracker failures from production so I vote no ;-)
[15:40:30] hahah :P
[15:40:35] sitic: those are memcached but yeah...
[15:40:43] the alternative is to wait and use redis cluster
[15:40:58] YuviPanda: can we have a check for (some) webservices to return a 200 code?
[15:41:14] there's a bug for that somewhere...
[15:44:22] gifti: apparently there wasn't, I made https://phabricator.wikimedia.org/T109719?workflow=create
[15:44:56] thank you!
[16:05:12] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557484 (demon) >>! In T109715#1557289, @dcausse wrote: > Using dumps in esbulk format is certainly not the fastest and convenient way to replicate indices but there's one major...
[16:11:17] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557527 (demon) >>! In T109715#1557312, @yuvipanda wrote: > Do note that for labsdb we run them on real hardware just in the labs subnet and would want to do the same for this t...
[16:12:05] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557530 (yuvipanda) How much traffic can we support with just a primary? Just reads + the replication writes, I guess.
[16:13:50] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557542 (demon) >>! In T109715#1557530, @yuvipanda wrote: > How much traffic can we support with just a primary? Just reads + the replication writes, I guess. Not much, but the...
[16:14:10] https://wikitech.wikimedia.org/wiki/Special:NovaProxy isn't showing me proxies that I know to exist. Tried logging out and logging back in, still no joy
[16:14:23] hmm probably needs restart
[16:14:39] bd808: let me fix that
[16:14:52] YuviPanda: Big thing they need is memory, really.
[16:14:58] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557543 (dcausse) > My initial thought was either using snapshot/restore or a river. rivers have been deprecated [1] (they suggest using logstash), snapshots should be most effi...
[16:15:03] More you can keep on the heap, the better ES can run
[16:15:15] + fs cache
[16:15:36] ostriches: right
[16:18:02] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557556 (demon) Logstash sounds much easier than a river, good idea.
[16:18:42] bd808: try now
[16:18:53] !log project-proxy deleted empty hiera page that was causing puppet fails
[16:19:11] YuviPanda: \o/ fixed
[16:19:13] YuviPanda, are you in the loop about labs getting new hardware?
[16:19:14] !log project-proxy restarted dynamicproxy-api by hand
[16:19:21] Cyberpower678: no. why?
[16:19:22] Labs, operations, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1557558 (jcrespo) There are several databases on silver. I hope I am granting to the right one...
[16:19:43] YuviPanda, I wanted to ask Coren a question regarding it, but he's away.
[16:20:19] ok
[16:23:54] Labs, operations, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1557559 (Krenair) Yep, "labswiki" is the exact name of the correct database. Nothing else is necessary.
[16:26:28] Labs, Tool-Labs: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1557562 (scfc)
[16:26:28] Labs, Tool-Labs: Make tools-mail route mail for @tools-*.pmtpa.wmflabs correctly - https://phabricator.wikimedia.org/T63484#1557561 (scfc)
[16:31:13] Labs, Discovery, Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557574 (EBernhardson) In terms of resources the current prod cluster is 2.5TB worth of primary shards. Elasticsearch seems quite efficient in terms of writes, the issues i've s...
[16:40:16] Labs, Tool-Labs: Make sure gridengine-exec starts on boot - https://phabricator.wikimedia.org/T109728#1557608 (valhallasw) NEW
[16:42:07] Labs, Tool-Labs: Add monitoring for SGE queue status - https://phabricator.wikimedia.org/T109730#1557633 (valhallasw) NEW
[16:45:33] Labs, Tool-Labs: Add monitoring for expected load issues - https://phabricator.wikimedia.org/T109732#1557664 (valhallasw) NEW
[16:46:39] YuviPanda: if you give my draft a read-over I can actually post it
[16:47:14] valhallasw`cloud: which draft?
[16:47:26] the one you have a link for in your inbox!
[16:47:28] :D
[16:47:39] I tried google docs today
[16:47:51] let's see if that works better than etherpad in the copying department
[16:49:42] valhallasw`cloud: hah! looking
[16:51:15] YuviPanda: just the second page
[16:52:46] valhallasw`cloud: yup, the second one looks good
[16:52:53] valhallasw`cloud: I made a tiny edit
[17:01:25] Labs, operations, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1557773 (jcrespo) Open>Resolved a:jcrespo So, I added wikiadmin to puppet from tin, and that should work and resolve the is...
[17:03:12] Labs, Tool-Labs: Add monitoring for expected load issues - https://phabricator.wikimedia.org/T109732#1557779 (scfc) In T50668, I suggested for that: - Count of jobs in error state doesn't exceed 5 % of all jobs running, - count of jobs pending doesn't exceed 5 % of all jobs running.
[17:05:21] !log tools wait, what timezone is this?!
[17:05:39] * valhallasw`cloud prods labslogbot hello where are you
[17:06:26] !log tools wait, what timezone is this?!
[17:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:06:48] huh. okay
[17:07:02] oh, yes, it is consistent
[17:11:28] YuviPanda: added one more point about availability of ops
[17:12:09] Labs, operations, wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1557823 (Dzahn) jcrespo added a user/grants on the mysql side. so connections should now work from tin.
[18:32:46] Labs, operations, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1558103 (Dzahn) also opened firewall to allow connections from terbium, in addition to tin
[18:39:33] !log tools cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
[18:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:42:50] !log tools tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
[18:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:43:52] !log tools running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
[18:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:44:31] !log tools both are now at 3dbbc87
[18:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:55:42] !log labvirt1007 "only" 29G space left - but since we have 2.2T there that means 99% full
[18:55:42] labvirt1007 is not a valid project.
[18:55:56] YuviPanda: fyi, open() on python3 uses whatever encoding the locale specifies, which is ascii for the c encoding
[18:56:09] aaaah
[18:56:13] c locale*
[18:56:25] and the C locale is what you get on hosts that have no locales installed, of course...
[18:57:00] should we install the utf locale by default?
[18:57:06] or is that going to be too disruptive?
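Editor's sketch of the open() remark above, not something posted in the channel. It assumes a plain Python 3 interpreter; the helper names `preferred_encoding` and `write_then_read` are hypothetical, made up for illustration. The first helper reports the encoding that text-mode open() falls back to when none is given, and the second passes `encoding="utf-8"` explicitly so the result does not depend on the host's locale. Newer Python versions may coerce the C locale to UTF-8, so the reported default can differ from the ASCII behaviour described in the log.

```python
import locale
import os
import tempfile


def preferred_encoding() -> str:
    """Return the locale-derived encoding open() uses when none is passed."""
    return locale.getpreferredencoding(False)


def write_then_read(path: str, text: str) -> str:
    """Round-trip text through a file with an explicit encoding.

    Passing encoding= makes the result independent of the host locale.
    """
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    print("locale default:", preferred_encoding())
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.txt")
        # ascii() keeps the print itself safe on an ASCII-only stdout.
        print(ascii(write_then_read(sample, "caf\u00e9")))
```

Without the explicit `encoding=` argument, the same write could fail with a UnicodeEncodeError on a host where the default resolves to ASCII.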
[18:57:14] it's complicated
[18:57:24] and I don't even know under which locale puppet under cron runs
[18:57:46] if I login, it's whatever locale I use /at home/, which is not necessarily installed on the hosts
[18:58:00] and if that locale is not installed, it falls back to C
[18:58:02] (yay)
[19:01:18] sigh
[19:01:42] YuviPanda: maybe we can localectl LANG=C.UTF-8 but not sure
[19:02:05] huh.
[19:02:16] valhallasw@tools-web-static-02:/srv/cdnjs$ localectl
[19:02:16] System Locale: LANG=en_US.UTF-8
[19:02:22] but only en_US.utf8 is installed
[19:02:28] maybe puppet runs under a different one?
[19:02:59] things under cron typically run with LANG=C, but I think that's a result of the system locale setting
[19:03:02] again, not sure
[19:03:10] simplest option: specify encodings
[19:03:10] ;-)
[19:16:54] YuviPanda: shinken-irc is down as well
[19:17:02] more stuff broken from the network interruption I think
[19:26:51] so, redis was down. didn't we have a second redis in case the first is down? was it down too?
[19:27:02] or was that just the network failure?
[19:30:47] Labs, Labs-Infrastructure, operations: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1558240 (Dzahn) NEW
[19:31:05] Labs, Labs-Infrastructure, operations: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1558248 (Dzahn)
[19:42:03] Labs, operations, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1558286 (Dzahn) connections should also work from terbium now, so it would be possible to run the maintenance scripts where all o...
[20:48:38] Labs, operations, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1558583 (Krenair) >>! In T107547#1558286, @Dzahn wrote: > connections should also work from terbium now, so it would be possible...
[20:50:07] YuviPanda: I’m about to schedule an OpenStack upgrade for this coming Wednesday. Any objection?
[20:51:05] andrewbogott: nope, sounds good
[20:51:11] No expected downtime I guess?
[20:51:25] Possible API downtime, shouldn’t break anything within labs though
[20:54:35] andrewbogott: right
[20:54:41] Cool :) +1 from me
[21:26:27] Labs, operations, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1497611 (Krenair) (Nope.)
[22:20:35] Labs, Labs-Sprint-107, Labs-Sprint-108, Labs-Sprint-109: Evaluate kubernetes for use on Tool Labs - https://phabricator.wikimedia.org/T107993#1559101 (yuvipanda) As for ACL's to prevent one user from accessing other users' pods and services, we could possibly isolate them in a namespace per user, and...
[22:32:53] Labs, Tool-Labs: cdnjs-packages-gen fails when Puppet is run interactively - https://phabricator.wikimedia.org/T109355#1559150 (scfc) This has been fixed, apparently by https://gerrit.wikimedia.org/r/#/c/232786/.
[22:33:09] Labs, Tool-Labs: cdnjs-packages-gen fails when Puppet is run interactively - https://phabricator.wikimedia.org/T109355#1559151 (scfc) Open>Resolved a:valhallasw
[22:43:41] Labs, Tool-Labs, Design Research Backlog, Learning-and-Evaluation, Research-and-Data: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1559189 (egalvezwmf) CommTech is also running a survey. Sharing this in case the questions are helpful: https://meta.wikimedia.org/...