[14:00:26] we are going to perform maintenance on labsdb1009, it will transparently fail over to another server, but ongoing queries may fail temporarily
[14:47:18] Any idea why toolforge is being super unresponsive today?
[15:18:00] [15:00] jynus we are going to perform maintenance on labsdb1009, it will transparently fail over to another server, but ongoing queries may fail temporarily
[15:18:18] (maybe that's why, not sure)
[15:18:44] not really, that should be a one-time glitch on the db
[15:19:01] unless by toolforge you mean the databases?
[15:26:17] Samwalton9: do you have more details?
[15:28:14] I'm connected to toolforge, switched to the tool account, and any commands (just simple terminal commands) are taking like 5-10 seconds to go through
[15:29:12] (by connected I mean I'm in a terminal on tools.twltools@tools-bastion-03) Is it likely to be something I've done, or a general problem?
[15:29:33] Quite new to this!
[15:29:35] arturo: there was trouble on some bastion before, maybe someone is running heavy commands there?
[15:29:55] apparently
[15:30:09] someone is using up all the IO
[15:30:54] Technical Advice IRC meeting starting in 30 minutes in channel #wikimedia-tech, hosts: @addshore & @Christoph_Jauera_(WMDE) - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:31:21] there are a few processes in D state
[15:34:04] https://imgflip.com/i/1zf9ce
[15:34:24] Samwalton9: btw, there is tools-bastion-02 (tools-dev) if you want to use it
[15:34:42] jynus: nice
[15:35:42] Ah, thanks :)
[15:36:45] it's rarer for -02 to get overloaded because it has fewer users, but -02 is more or less 'designed to' be the overloaded bastion (it's recommended to do heavy processing like compiling on -02 rather than -03)
[15:37:07] and if you want a bastion nobody ever uses, -05
[15:37:29] ^ is a secret. /me said nothing :)
[15:42:53] jynus: you are right
[15:43:44] ?
[15:44:24] jynus: about the overload
[15:48:32] !log tools kill tools.powow on bastion-03 for hammering IO and making bastion unusable
[15:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:49:44] https://toolsadmin.wikimedia.org/tools/id/powow
[15:53:28] !log tools reboot bastion-03
[15:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:00:46] Which server is labsdb1004, how can I connect to it, and why is it breaking replication?
[16:02:23] Dispenser: context?
[16:02:36] https://phabricator.wikimedia.org/T180560
[16:08:35] Dispenser: that is toolsdb
[16:10:30] Where is this documented and how do I connect to it?
[16:11:05] it's tools.labsdb and tools.db.svc.eqiad.wmflabs
[16:12:04] docs at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#User_databases -- nothing there about the underlying hostname however
[16:17:05] for i in {1..100}; do mysql -NBh tools.db.svc.eqiad.wmflabs -e "select @@hostname"; done # Gives me labsdb1005 every time
[16:18:52] Dispenser: labsdb1004 is the slave to labsdb1005, so when the DBAs say replication is broken they mean from labsdb1005 to labsdb1004, the slave
[16:20:08] So no way for me to test it or see what's going on
[19:43:20] legoktm: about? https://gerrit.wikimedia.org/r/#/c/372213/
[19:44:15] yep
[19:44:24] looks good except I think I left some extdist references in there
[19:44:25] lemme edit
[19:45:14] chasemp: I added a new PS. thanks for figuring out how to make that pass jenkins :)
[19:45:44] legoktm: so this is a bit new wave, and it will sidestep some of the original intended "use roles" thinking horizon was built on
[19:45:59] ok
[19:46:02] andrew and I talked about how to handle the role to profile param intended locations last week
[19:46:07] do I not add the puppet class in horizon then?
[19:46:11] and this is basically our thinking, accept that profiles are the right thing
[19:46:23] legoktm: no, you totally do, but I think you have to add it in the "extra classes" section for now
[19:46:28] ok
[19:46:33] until andrewbogott unwinds his lookup stuff to not ignore profiles
[19:46:36] and I'm sorry about that
[19:46:41] we have a mismatch in expectations atm
[19:46:48] no worries
[19:46:57] so if you don't mind being a bit of a guinea pig I'll merge this
[19:47:01] sounds good to me
[19:47:02] and you can see how it goes :)
[19:47:03] kk
[19:47:24] I do have to go afk in ~3 minutes for class but I think I'll have time in the evening to test it out
[19:47:56] yeah no worries, just wanted to let you know the weirdness
[19:48:02] andrew is traveling today actually
[19:48:15] but this sets you up to have good questions for him on wth :)
[21:12:03] Why is labsweb100{1,2} serving 1.31.0-wmf.1....to enwiki?
[21:12:04] https://logstash.wikimedia.org/goto/eb7c1dba4588fd8cebe683b4be7b0a25
[21:20:27] no_justification: that may have picked up some role and never been finished, andrew was working on these and is out now but I'll take a look
[21:20:35] they are meant as silver replacements (wikitech)
[21:20:50] Yeah I figured that. If they're meant to get MW updates, they should be added to the dsh groups :)
[21:21:05] More worrying, though, is how they're serving /enwiki/ traffic
[21:21:12] I don't actually know where he is at with them, only that it isn't in any kind of prod role so we can do whatever we need to do to stop that
[21:21:19] yeah
[21:46:39] Stupid question, but if I submit a job through jsub, can I still send the stdout to a file? e.g., jsub blah blah blah command > output.txt
[21:48:22] Less stupid question: in light of one of the wiki replica servers being shut down, do I need to do anything if I don't hardcode server addresses, preferring to use as generic ones as possible like "tools.labsdb"?
[21:50:05] Is there some reason why we can't just have tools.labsdb replicated to web-cluster and analytics-cluster?
[21:51:41] Looking at a config file for one of my web apps, I specify the address format as SQL_WMF_REPLICA_ADDRESS = '{0}.labsdb'. I take it I won't need to change anything?
[22:11:00] harej: that will continue to work, but we will soon stop adding new wiki aliases to the *.labsdb dns
[22:11:43] Are all of them going to be served from one address?
[22:12:30] using SQL_WMF_REPLICA_ADDRESS = '{0}.web.db.svc.eqiad.wmflabs' would follow the new standard
[22:12:37] Got it
[22:12:46] and does that work now?
[22:12:49] you can replace "web" with "analytics" there too
[22:12:55] yes, that works now
[22:13:01] sweet, I am going to change that then
[22:14:03] the new name for "tools.labsdb" is "tools.db.svc.eqiad.wmflabs", but the old name will continue to work to get there too
[22:14:18] the address I apparently use is "tools-db"
[22:14:22] we are just going to stop adding new wikis to the *.labsdb scheme
[22:14:37] tools-db has been deprecated for a long time :)
[22:14:51] fascinating
[22:15:01] switch to "tools.db.svc.eqiad.wmflabs" for more long term joy
[22:15:34] our dns is/was a mess and we are trying to clean it up
[22:18:49] !log suggestbot Dropped database p50380g50553__ilc_s7 from c3.labsdb in preparation for shutdown, no longer needed
[22:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Suggestbot/SAL
[22:24:07] It looks like submitting a job through jsub doesn't get you access to the stdout, at least with the usual configuration. Is there any way to get access to the stdout of a job as it's being executed? Redirecting the output of the command just captures the output of jsub (i.e. "Your job has been submitted") rather than that of the thing actually being executed
[22:28:59] harej: since the code is running on a remote server (away from the submission host) you can't use a shell redirect to capture the output. The output will be written to a $jobname.out file by default that you can tail
[22:29:14] there will be some lag in that however due to NFS reads and writes
[22:29:49] if you are trying to make some kind of shell pipeline you can wrap everything up in a shell script and submit that to the grid
[22:36:17] chasemp: Anything I can do to help figure out labsweb*?
[22:36:48] no_justification: I'm on an interview call w/ someone at the moment, I was thinking I would just pull the roles they have and remove the checkout locally and andrew can do his thing next week
[22:37:17] I'm curious how they're serving traffic if they're still a WIP :\
[22:37:32] Wonder if something's detecting them as up and pooling them,
[22:37:36] s/,//
[22:37:37] must be?
[22:37:48] I mean, that's pretty creepy
[22:38:03] no_justification: if you want to poke, go for it
[22:38:11] I can't :)
[22:38:14] I can't ssh to them
[22:38:14] ohhhh
[22:38:19] is it me, or is bastion 3 really slow?
[22:38:33] harej: possible. People like running jobs in the wrong place
[22:38:44] Reedy: this is why I am trying to finally play nice and learn how jsub works
[22:38:46] chasemp: Which is what makes me think they're only half-setup. If they're a scap target, they should be reachable by deploy group
[22:38:56] harej: have a look in top...
[22:39:07] or, at top
[22:39:31] hmm, the really narrow setup truncates the usernames, so i can't wave my fist at them
[22:39:56] but yeah, that's... a lot
[22:40:02] harej: iotop -ao
[22:40:10] usually if slow it's a bad IO consumer eating NFS
[22:48:42] !log tools Rebooted tools-paws-worker-1017
[22:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:49:23] now, how can i check on the jobs currently being run?
[22:49:49] qstat if gridengine
[22:50:28] delightful, thank you
[23:03:41] no_justification: https://phabricator.wikimedia.org/T168470#3765042
[23:06:36] tyvm
[23:13:49] * zhuyifei1999_ might have started too many python threads in tools-exec-1434
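
A minimal Python sketch of the hostname scheme described in the 22:12-22:15 exchange about SQL_WMF_REPLICA_ADDRESS. The hostname patterns are the ones quoted in the log; the constant and function names (REPLICA_TEMPLATE, TOOLSDB_HOST, replica_host) are hypothetical and only illustrate how a tool's config might build addresses without relying on the older *.labsdb aliases.

```python
# Hostname patterns as quoted in the log. All names below are illustrative
# assumptions, not part of any Toolforge library or official configuration.
REPLICA_TEMPLATE = "{wiki}.{cluster}.db.svc.eqiad.wmflabs"
TOOLSDB_HOST = "tools.db.svc.eqiad.wmflabs"  # new name for "tools.labsdb"
CLUSTERS = ("web", "analytics")  # the two variants mentioned in the log


def replica_host(wiki, cluster="web"):
    """Build the wiki replica hostname for a database name such as 'enwiki'."""
    if cluster not in CLUSTERS:
        raise ValueError("cluster must be one of %r, got %r" % (CLUSTERS, cluster))
    return REPLICA_TEMPLATE.format(wiki=wiki, cluster=cluster)


if __name__ == "__main__":
    print(replica_host("enwiki"))
    print(replica_host("enwiki", cluster="analytics"))
    print(TOOLSDB_HOST)
```

Keeping the cluster as a parameter makes the "web" versus "analytics" choice explicit at the call site, which matches the advice in the log that either word can be used in the address.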