[09:05:58] (CR) MarcoAurelio: "recheck" [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/436936 (owner: Libraryupgrader)
[09:06:21] (CR) jerkins-bot: [V: -1] build: Updating mediawiki/mediawiki-codesniffer to 19.0.0 [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/436936 (owner: Libraryupgrader)
[09:10:50] (CR) MarcoAurelio: "Filed T196530." [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/436936 (owner: Libraryupgrader)
[09:14:54] hello, cannot ssh into etytree-b.etytree.eqiad.wmflabs
[09:15:06] anyone who can give me tips? thank!!
[09:15:33] I'm doing ssh -F ~/.ssh/config_labs etytree-b.etytree.eqiad.wmflabs
[09:16:19] following Accessing instances with ProxyCommand ssh option (recommended)
[09:16:19] in https://wikitech.wikimedia.org/wiki/Help:Access
[09:16:54] I get Permission denied (publickey)
[09:18:15] and added my public key to https://wikitech.wikimedia.org/wiki/Special:NovaKey
[09:18:30] sorry to wikitech
[09:19:01] Ester: did you specify ssh -i explicitly?
[09:19:15] where?
[09:19:20] ssh -F ~/.ssh/config_labs etytree-b.etytree.eqiad.wmflabs
[09:19:22] here?
[09:19:41] ah
[09:20:04] -F ~/.ssh/config_labs is missing from the proxy command
[09:20:34] what do you mean?
[09:20:43] ProxyCommand ssh -a -W %h:%p @primary.bastion.wmflabs.org
[09:20:49] ^ uses default config
[09:20:51] no -F
[09:21:42] I really don't understand why people choose to use -i or -F in the command line instead of writing it to default ~/.ssh/config
[09:22:04] sometimes you have multiple proxies
[09:22:09] could that be an answer?
[09:23:19] you can seperate the config with Host or Match blocks
[09:23:25] ok
[09:23:38] eg. I have a:
[09:23:39] so, I'm not that advanced but in other words
[09:23:42] https://www.irccloud.com/pastebin/fqC8pqIr/
[09:24:02] I should just move config_labs to config?
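
Editor's note: the point made above is that the inner ssh started by ProxyCommand reads the default config file, so options kept in a separate file passed with -F never reach it. A minimal sketch of a default ~/.ssh/config that avoids this, assuming a placeholder shell name "youruser" and a key at ~/.ssh/id_rsa (both are illustrative, not taken from anyone's actual setup):

```
# ~/.ssh/config (sketch; "youruser" and the key path are placeholders)

# Settings for the bastion itself, read by the inner ssh that ProxyCommand starts.
Host primary.bastion.wmflabs.org
    User youruser
    IdentityFile ~/.ssh/id_rsa

# Instances are reached by tunnelling through the bastion.
Host *.eqiad.wmflabs
    User youruser
    IdentityFile ~/.ssh/id_rsa
    ProxyCommand ssh -a -W %h:%p youruser@primary.bastion.wmflabs.org
```

With this in place a plain `ssh etytree-b.etytree.eqiad.wmflabs` picks up both blocks. The alternative is to keep the separate file and add `-F ~/.ssh/config_labs` to the ProxyCommand line as well.
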
[09:24:12] that applies only to github uris
[09:24:19] I recommend that way
[09:24:43] but yes you can set the -F argument in ProxyCommand
[09:27:51] after that I just do ssh -etytree-b.etytree.eqiad.wmflabs
[09:27:53] ?
[09:28:07] ssh etytree-b.etytree.eqiad.wmflabs
[09:28:28] I still get Permission denied (publickey).
[09:28:29] :(
[09:29:00] can you `ssh -vvv etytree-b.etytree.eqiad.wmflabs` and paste the logs?
[09:29:58] OpenSSH_7.3p1, OpenSSL 1.0.2k 26 Jan 2017
[09:29:58] debug1: Reading configuration data /Users/esterpantaleo/.ssh/config
[09:29:58] debug1: /Users/esterpantaleo/.ssh/config line 1: Applying options for *
[09:29:58] debug1: /Users/esterpantaleo/.ssh/config line 2: Deprecated option "useroaming"
[09:29:58] debug1: /Users/esterpantaleo/.ssh/config line 4: Applying options for *.eqiad.wmflabs
[09:29:59] debug1: /Users/esterpantaleo/.ssh/config line 7: Applying options for *.wmflabs
[09:30:02] debug1: /Users/esterpantaleo/.ssh/config line 10: Applying options for *
[09:30:04] debug3: kex names ok: [curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256]
[09:30:06] debug1: Reading configuration data /usr/local/etc/ssh/ssh_config
[09:30:08] debug1: Executing proxy command: exec ssh -a -W etytree-b.etytree.eqiad.wmflabs:22 epantaleo@primary.bastion.wmflabs.org
[09:30:10] debug1: identity file /Users/esterpantaleo/.ssh/id_rsa type 1
[09:30:13] debug1: key_load_public: No such file or directory
[09:30:15] debug1: identity file /Users/esterpantaleo/.ssh/id_rsa-cert type -1
[09:30:17] debug1: Enabling compatibility mode for protocol 2.0
[09:30:19] debug1: Local version string SSH-2.0-OpenSSH_7.3
[09:30:21] debug1: permanently_drop_suid: 501
[09:30:23] Enter passphrase for key '/Users/esterpantaleo/.ssh/id_rsa':
[09:30:26] Permission denied (publickey).
[09:30:32] after I enter the passphrase
[09:30:37] i get Permission denied (publickey)
[09:31:38] next time, can you paste it somewhere? eg.
pastebin.com, dpaste.de, phab paste, or toolforge paste?
[09:31:54] sure!! sorry!
[09:32:02] it looks like that the proxycommand still failed
[09:32:22] can you `ssh -vvv epantaleo@primary.bastion.wmflabs.org`
[09:34:06] https://pastebin.com/V89NTyzT
[09:34:10] I'm insterested in stuffs that happen after `debug1: Authentications that can continue: publickey`
[09:34:15] ok
[09:35:01] debug3: receive packet: type 60 is okay
[09:35:12] I mean it means the key is accepted
[09:35:23] 51 is declined
[09:36:29] after the password input a successful auth should have type 52
[09:37:46] so the passphrase is wrong?
[09:37:50] I don't think I can check the logs on the bastion project
[09:38:20] I don't think the passphrase is wrong. in that case it would usually re ask for the passphrase
[09:38:30] if it can't decrypt the key
[09:38:49] does ssh-ing to toolforge work?
[09:39:24] how?
[09:39:44] to the tools
[09:39:46] ok I'll try
[09:41:06] https://www.irccloud.com/pastebin/9PctgnZs/
[09:41:15] does this match your public key?
[09:41:34] it's about twice as long as mine
[09:44:03] yes it does match
[09:44:22] the slashes are ok? they look weird to me
[09:44:55] yes the slashes are part of base64 encoding
[09:44:58] ok
[09:45:22] maybe there is some weird setting on my computer
[09:45:25] could it be?
[09:46:12] idk
[09:46:39] anyways, does toolforge work? I'll check the logs on toolforge if not
[09:47:03] so I just changed the ssh key
[09:47:10] do I need to update anything
[09:47:10] ?
[09:47:22] ?
[09:47:43] so this morning I generated a new ssh key pair
[09:47:45] you need to update your public key on striker or wikitech
[09:47:50] and I updated it on wikitech
[09:47:51] that should be it
[09:48:06] ssh -i ~/.ssh/id_toolforge epantaleo@login.tools.wmflabs.org
[09:48:11] gives me permission denied
[09:48:28] * zhuyifei1999_ looks at the logs
[09:49:36] !log tools T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
[09:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:49:39] T196137: toolforge: prometheus issue is filling up email queue - https://phabricator.wikimedia.org/T196137
[09:50:47] https://www.irccloud.com/pastebin/tnEJTPeE/
[09:50:50] Ester: ^
[09:51:24] can I ask why is '-i ~/.ssh/id_toolforge' here?
[09:56:34] 17:50:46 https://www.irccloud.com/pastebin/tnEJTPeE/
[09:56:34] 17:50:49 Ester: ^
[09:56:34] 17:51:24 can I ask why is '-i ~/.ssh/id_toolforge' here?
[09:56:50] that's how i used to connect
[09:58:33] I don't really know what 'error: key_verify: error in libcrypto' means
[09:58:50] if I do ssh epantaleo@login.tools.wmflabs.org I get Permission denied (publickey,hostbased).
[09:58:53] I still find it weird that your RSA key is really long
[09:59:08] I can generate a new one
[09:59:31] ssh-keygen -t rsa -b 4096 -C
[09:59:38] I use this command
[09:59:52] https://www.irccloud.com/pastebin/bf54srsw/
[10:00:01] oh 4096 bit key
[10:01:38] I don't see any mentions of maximum key length
[10:03:02] does 2048 bit work?
[10:03:14] oh now it works
[10:03:18] sorryyy
[10:03:26] o.O?
[10:03:31] I regenerated the key
[10:03:35] not sure...
[10:05:12] well thanks!!
[10:05:15] for your help
[10:05:32] I restarted and regenerated the key
[10:07:11] o.O
[10:07:45] Jun 6 10:03:01 tools-bastion-03 sshd[21624]: Accepted publickey for epantaleo from X.X.X.X port 43044 ssh2: RSA SHA256:7ydhLzLdL9RVJRNHLF4ia5P+aijpmYMmyPgWDWqag+k
[10:07:51] k
[10:07:59] well, that was odd
[10:08:03] Ester_: that is the key fingerprint you have in wikitech, right?
[10:09:36] zhuyifei1999_: Hi, Do you have a few minutes ?
[10:09:46] yes?
[10:09:50] arturo: not sure what you ean
[10:09:51] mean
[10:10:22] * zhuyifei1999_ is switchin back and forth between IRC and watching youtube
[10:10:32] so yeah, kind of
[10:10:45] zhuyifei1999_: In toolsbeta, I am unable to run any tool on gridengine. I tried admin, test and test2. I think I may have messed up the code. Should I rebuild it ?
[10:11:17] I think it's because it's only able toone one job per queue (I should fix that)
[10:11:31] *to run
[10:11:54] use `qstat -f -u \*` to see what's going on
[10:12:07] but feel free to fix that
[10:12:12] Ester_: nevermind, I'm glad it works now
[10:13:01] thanks arturo
[10:14:12] zhuyifei1999_: ok got it. Also, can we write python 3 code in webservices ? I was getting errors on using python3 modules
[10:14:24] yes, use k8s
[10:14:34] the 'python' type is python 3
[10:14:52] oh wait, do you mean the webservice command?
[10:15:29] yes, I meant the source code of webservices command
[10:16:13] idk about the internalsof the building process, so ask bd.808 :P
[10:16:36] which module are you trying to use?
[10:17:17] zhuyifei1999_: I wanted to use configparser of python 3. It was giving me errors so for the time being I switched to configparser of python 2
[10:17:28] what error?
[10:18:46] zhuyifei1999_: I think it was 'no module named 'configparser'. And the syntax was correct, since I have already tried the same code locally.
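
Editor's note: a "no module named configparser" error is what Python 2 reports, because the standard-library module is spelled `ConfigParser` there and `configparser` only in Python 3. A minimal sketch of an import that works under either interpreter; the helper name `read_option` and the sample section and option names are hypothetical, not taken from the webservice source:

```python
import io

try:
    import configparser  # Python 3 module name
except ImportError:
    import ConfigParser as configparser  # Python 2 module name


def read_option(ini_text, section, option):
    """Parse INI text and return one option value (hypothetical helper).

    Under Python 2, ini_text must be a unicode string for io.StringIO.
    """
    parser = configparser.RawConfigParser()
    stream = io.StringIO(ini_text)
    # Python 3 offers read_file(); Python 2 only has readfp().
    reader = getattr(parser, "read_file", None) or parser.readfp
    reader(stream)
    return parser.get(section, option)


if __name__ == "__main__":
    print(read_option(u"[tool]\nname = demo\n", "tool", "name"))
```

The try/except keeps one code path for both interpreters, which matters here because the interpreter running the webservice command was not the one used for local testing.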
[10:19:11] https://docs.python.org/2/library/configparser.html
[10:19:19] it's camelcased in py2
[10:19:41] zhuyifei1999_: Yes, I have switched to that
[10:20:18] zhuyifei1999_: Thanks a lot for your help :)
[10:20:22] np
[11:17:27] can I do sco to copy files from the instance to my computer?
[11:17:32] scp
[11:17:40] is there a link with instructions?
[12:25:34] (CR) Reedy: "recheck" [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/436936 (owner: Libraryupgrader)
[14:05:54] !log shinken stopping shinken for now; too noisy
[14:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Shinken/SAL
[14:23:28] Guys, I'm having a problem here. Namely...
[14:23:29] ssh: connect to host tools-login.wmflabs.org port 22: Connection refused
[14:24:17] bd808: ^
[14:25:31] Now I'm getting Network is unreachable
[14:27:07] ssh cyberbot-exec-iabot-01.eqiad.wmflabs
[14:27:07] ssh: connect to host primary.bastion.wmflabs.org port 22: Network is unreachable
[14:27:07] ssh_exchange_identification: Connection closed by remote host
[14:27:20] I can't access my VPS's either. What's going on?
[14:28:26] zhuyifei1999_: ^
[14:28:32] andrewbogott: ^
[14:28:41] Cyberpower678: we are doing rolling reboots of the servers
[14:28:43] * zhuyifei1999_ looks
[14:28:47] Cyberpower678: there is an outage inducing maint in progress
[14:28:55] Oh
[14:28:58] see status in topic or emails to -cloud mailing list
[14:29:24] Why are the VM's getting rebooted?
[14:29:49] And are ALL VM's getting rebooted including those in my Cyberbot project?
[14:30:31] All presumably includes all
[14:30:32] Cyberpower678: please subscribe to https://lists.wikimedia.org/mailman/listinfo/cloud-admin
[14:31:59] Ugh. Another list.
[14:32:40] digest ftw
[14:32:46] andrewbogott: The email address you supplied is banned from this mailing list. If you think this restriction is erroneous, please contact the list owners at cloud-admin-owner@lists.wikimedia.org.
[14:32:49] WTF??
[14:33:06] What domain?
[14:33:10] gmail
[14:33:42] My Yahoo inbox sees literally hundreds of emails a day. Emails get lost in there.
[14:34:31] cloud-announce is probably what you want, not cloud-admin
[14:34:57] I think I'm already subscribed to it.
[14:35:27] the current maintenance is https://lists.wikimedia.org/pipermail/cloud-announce/2018-May/000050.html
[14:36:28] we are updating software on the servers that host the VMs as well as on the VMs themselves.
[14:36:39] bd808: has cyberbot been rebooted yet?
[14:36:57] bd808: oh, good point, thank you
[14:37:03] I'm a bit distracted :)
[14:37:34] andrewbogott: *nod* I can answer questions while you work on the servers :)
[14:37:55] yay, managers
[14:37:57] * Reedy hides
[14:38:44] Cyberpower678: I do not think so. It looks like things are just starting.
[14:39:56] Oh good. I shut down https://tools.wmflabs.org/iabot
[14:40:02] :S
[14:40:31] it seems i can't reach my tool
[14:40:43] mmecor: See the topic and https://lists.wikimedia.org/pipermail/cloud-announce/2018-May/000050.html
[14:40:50] Maintenance ongoing
[14:41:02] bd808: any etas
[14:41:14] "several hours"
[14:41:39] i guess this means the entire afternoon-evening
[14:42:04] Cyberpower678, mmecor: the first server that we rebooted is not starting back up as easily as expected. We are figuring out what is wrong and how to proceed
[14:42:15] Okay
[14:42:17] no prob. ok
[14:42:32] * Cyberpower678 really needs to get his job manager up and running on the tool
[14:55:43] andrewbogott: is it normal, that the reboot of the instance dwl and dwl.taxonbot lasts so long?
[14:56:09] doctaxon2: there have been issues we are working to debug, so no but it's known
[14:58:11] chasemp: okay, please keep me up to date here
[15:08:50] Technical Advice IRC meeting starting now in channel #wikimedia-tech - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:19:49] status update on the reboots: we have 2 labvirt hosts (the machines that actually run the VMs) down right now. We are working through kernel configuration issues that are keeping them from starting back up cleanly.
[15:52:08] tools-login is back up, finally
[16:29:26] status update: The issues with rebooting have been fixed (we needed another package of kernel modules installed). Reboots continue to roll across the OpenStack cluster
[17:22:55] because I can query `revision` by `rev_user_text` on production and it's fast, but on Toolforge it is not, you need to use `revision_userindex`
[17:23:13] this makes me feel like `revision` doesn't have the same indexes as production, no?
[17:23:45] or at least I wouldn't want to suggest that it does and that it would be efficient
[17:26:43] It actually has more indexes. However, in toolforge there are joins in it right now.
[17:27:09] The joins are there for backward compatibility during some refactors of the database. We've had problems with it already.
[17:27:21] The revision table has a temp table or two that are joined in as well.
[17:28:16] you can use `EXPLAIN SELECT *rest of the query*` to see if there is something you can point to as well.
[17:28:47] It's always possible that our replicas are missing something that was specially-added in production that I don't know about :)
[17:29:07] well that's what my application is for, to allow you to run EXPLAIN on Toolforge queries https://tools.wmflabs.org/sql-optimizer
[17:29:09] I tend to think that the joins are likely lagging things
[17:29:24] HAH!
nice :)
[17:29:32] next I was going to add a "schema browser", which dumbs it down and lists the indexes for each table
[17:30:14] Here's some of those joins: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#541
[17:30:52] Where people have had issues so far is that I was joining revision into page. I had to remove that. It was too slow.
[17:31:09] There should be an index for each foreign key...there could be something missing!
[17:31:17] But even with an index, it's more work than a straight select.
[17:32:26] information_schema_p contains already metadata information
[17:32:50] Good point!
[17:33:24] I guess not the index table,but that can be added
[17:33:34] I didn't even consider that we had that :)
[17:37:33] it seems the `statistics` table isn't there, is what I think we give me the indexes of every table
[17:38:07] ask for it :-)
[17:38:15] nobody asked for it before!
[17:39:10] musikanimal: this explain tool is rather nice. :)
[17:39:21] thanks!
work in progress
[17:40:01] my goal is to make it easy to figure out why your query is slow, and suggest ways to improve it
[17:40:35] bstorm_: honestly, I gave my opinion about the views- while not breaking compatibility is nice
[17:40:54] :)
[17:40:58] if production changes and that makes the query slow, I prefer we have the bare tables
[17:41:07] this is for anomie really, not for you
[17:41:15] he was the mastermind about the views
[17:41:17] :-)
[17:41:24] heh :)
[17:41:46] !log project-proxy switching primary proxy to novaproxy-02
[17:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy/SAL
[17:42:02] * anomie looks for context
[17:42:13] anomie: just badmouthing you :-)
[17:42:16] as usual
[17:42:33] ha, ok (:
[17:43:06] in any case, the views are a temporary patch until we have a binary log on-the-fly modifier
[17:43:19] (it just happens to be a 10-year temporary patch)
[17:46:38] hey, tools-bastion-03 (IO?) is super slow - someone's running a script again?
[17:46:46] maxsem@tools-bastion-03:~$ time ls
[17:46:54] real 0m14.909s
[17:48:45] MaxSem: looks like it to me but it's so slow that I can't get a prompt
[17:48:50] I will look if my cursor ever appears
[17:49:00] I have a prompt
[17:49:02] jynus: https://phabricator.wikimedia.org/T196570 I may have worded that wrong, I only a tiny bit know what I'm talking about, but hopefully it's clear what I'm asking for
[17:49:10] it does seem slow but I don't see why
[17:49:33] chasemp: what is the magic ps result?
[17:51:14] maybe the split log?
[17:52:29] the load on its host doesn't seem that high
[17:52:43] iops is usually the culprit there
[17:54:53] https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1528307681.196&target=tools.tools-bastion-03.cpu.total.iowait
[17:56:34] !log project-proxy switching primary proxy to novaproxy-01
[17:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy/SAL
[18:01:10] tools-login seems better now
[18:07:10] still very slow to connect for me
[18:07:25] Hi, tools.wmflabs.org gives 502, is it known/related to today's reboots?
[18:07:37] and a lot of iowait andrewbogott https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1528308220.324&target=tools.tools-bastion-03.cpu.total.iowait&from=-1hours
[18:08:00] the io intensive tasks there need to killed
[18:08:08] it seems like as soon as I interact with the grid it slows down...
[18:08:09] s/to/to be/
[18:08:18] chicocvenancio: how about now, better? (My grid requests are done)
[18:09:19] Hi?
[18:09:23] jem: it's not expected… probably needs a restart. bd808 is that something you can do?
[18:09:37] Ok, thanks
[18:09:56] andrewbogott: can do. did you fail over the tools proxy too, or just the vps one?
[18:10:10] I did but it's back — that was > than an hour ago
[18:10:15] other things (e.g. openstack-browser) are fine
[18:10:35] andrewbogott: still slow (5s per ls)
[18:10:55] what is the output of iotop?
[18:10:58] chicocvenancio: ok, probably not me then since I'm logged out :)
[18:11:21] !log tools.admin Pod in CrashLoopBackOff, invesitgating
[18:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[18:11:46] its usually an user doing something that seems benign, but hogs all iops
[18:12:09] iotops was all 0's when I looked before
[18:12:38] there has to be something, the iowait can't be coming from nowhere...
[18:13:04] !log tools.admin Restarting webservice (webservice --backend kubernetes php5.6 start)
[18:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[18:13:36] andrewbogott: from the graph and the start times in the ps, I'd wager on one of the dexbot's processes
[18:13:43] jem: its back for me. The cause of the crash was not obvious in the logs, but I'll try to keep an eye on it
[18:14:58] Ok, my tool is working after webservice restart, thanks, bd808
[18:15:21] chicocvenancio: there's still nothing at all on iotop or top
[18:15:31] !log tools.replag Restarting webservice, pod in CrashLoopBackOff state
[18:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.replag/SAL
[18:15:42] I'm seeing a trend here :/
[18:15:55] wmopbot keeps leaving and comming back.
[18:16:20] andrewbogott: now its good, zero iowait
[18:16:37] ls taking under 10ms
[18:16:51] paladox: the "(Nickname regained by services)" message makes me think there are multiple copies running? I don't know who operates that bot
[18:17:01] ah ok
[18:17:03] thanks
[18:17:40] danilo: is wmopbot yours?
[18:21:42] chicocvenancio: it is
[18:21:55] looks like it is stable now
[18:22:08] I issued a restart command in IRC
[18:22:58] Sagan: I think there still may be 2 copies. "Guest5872 (tools.wmop@wikimedia/bot/wmopbot)" and "wmopbot (tools.wmop@wikimedia/bot/wmopbot)"
[18:35:15] zhuyifei1999_: someone on Facebook appears to have gotten a 502 error on v2c
[18:35:20] bd808: andrewbogott: How goes it so far?
[18:36:02] !log tools.anni-me Restarted webservice, 25720 CrashLoopBackOff retries logged
[18:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anni-me/SAL
[18:37:03] I assume the reboots are still going on?
[18:37:50] !log tools.wikibugs qdel current jobs
[18:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[18:37:55] !log tools.anni-me Stopped webservice. App trying to run django, but failing to load MySQLdb module. This causes an infinite start, crash, restart loop
[18:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anni-me/SAL
[18:38:25] Cyberpower678: yup, still going on. I know we are over half way done at this point
[18:38:44] !tool anni-me
[18:38:54] bd808: sweet. Just ping me when I can log back on, and reactivate IABot.
[18:39:17] Cyberpower678: you can give it a shot now. many things are up, some are down
[18:39:52] Bastion was down two minutes ago. :p
[18:40:55] Cyberpower678: which bastion?
[18:41:13] The one I proxy to get to cyberbot, but it's up now
[18:41:40] bastion-01.bastion.eqiad.wmflabs has been running for almost 3 hours now
[18:43:03] !log tools.video2commons-socketio Webservice in CrashLoopBackOff due to failure to reach video-redis.video.eqiad.wmflabs:6379
[18:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.video2commons-socketio/SAL
[18:44:31] chicocvenancio, zhuyifei1999_: looks like tools.video2commons-socketio is yours? Its not having a fun time. 5721 CrashLoopBackOff restarts
[18:44:57] I'm not yet familiar with it
[18:45:03] but I can take a look
[18:45:40] its dying because of a failure to connect to a redis in the video project
[18:47:46] !log tools.wmf-task-samtar Restarted webservice, stuck in CrashLoopBackOff
[18:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wmf-task-samtar/SAL
[18:49:36] Reboots are all done.
[18:49:51] Cyberpower678: ^
[18:49:59] Yay
[18:50:12] Thanks :-)
[18:52:33] * halfak just came here to ask about when the reboots would be done :)
[18:52:48] bd808: I'm getting this
[18:52:51] https://www.irccloud.com/pastebin/iF0BytMh/
[18:54:55] tools-bastion-03 is slow again
[18:56:00] something is weird there
[18:56:02] and I don't know what
[18:56:27] * chicocvenancio is moving debuging v2c to tools-bastion-02
[18:56:53] yay rolling robots!
[18:56:55] reboots
[18:56:56] roboots
[18:58:47] o_O
[18:58:52] NFS requires a host-only network to be created.
[18:59:01] i should probably just wipe this machine and replace it with a newer vm
[18:59:07] brion: mw-vagrant startup failure?
[18:59:15] yeah, on jessie-compat brach so it's old
[19:00:04] I have seen this before and never found a reliable fix. I usually try `vagrant reload` 4-5 times and then if that doesn't work I reboot the underlying VM and cross my fingers
[19:00:10] heh
[19:00:28] its some problem with the loopback networking and LXC not playing nice
[19:00:51] there we go, it's happier on another run
[19:00:52] thx :D
[19:01:14] bd808: this error doesn't seem related to redis, did something change with webservice already?
[19:01:48] chicocvenancio: yeah, that one is a host mount from the k8s exec node not being seen in the Container.
[19:02:12] Hi.
[19:02:16] I honestly don't know why that would happen
[19:02:28] "tools-bastion-03" is basically unusable.
[19:02:30] hey Marybelle. what tool are you hear to ask about? ;)
[19:02:37] Is there some other bastion host that's better?
[19:02:46] Marybelle: dev.tools.wmflabs.org
[19:02:53] I usually use whatever tools-login.wmflabs.org is pointing to.
[19:03:06] I'm thinking I want to reboot tools-bastion-03
[19:03:10] it's got issues
[19:03:15] and tools-bastion-05 seems ok
[19:03:18] so I believe it's local
[19:03:34] chasemp: +1. lots of looking has not found the cause and 02 and 05 do seem to be ok
[19:03:36] A whole three hours of uptime.
:-/
[19:03:50] Thanks, I'll try one of the other hosts.
[19:04:31] Marybelle: we just finished a kernel upgrade across all the vms and hosts. A few things are being goofy on restart
[19:04:32] !log tools tools-bastion-03 is virtually unusable
[19:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:04:39] chasemp: Does tools-bastion-05 have an external host name?
[19:04:46] bd808: Ah, okay. :-)
[19:05:05] Marybelle: no probably not, but it can be reached through normal VPS bastion jump hosts
[19:05:10] !log tools.video2commons-socketio restarting webservice to see if it fixes host mount issue
[19:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.video2commons-socketio/SAL
[19:05:14] hopefully tools-bastion-03 comes back here pretty quick
[19:05:15] bd808: that did it
[19:05:39] dev.tools.wmflabs.org got me into tools-bastion-02. I'll take it.
[19:06:11] :)
[19:06:24] tools is down again.
[19:07:10] Cyberpower678: it's difficult to know what that means to you, a more specific report would be helpful
[19:07:23] login.tools.wmflabs.org seems back now and is working better for me
[19:07:25] ^I imagine it was the bastion reboot
[19:07:33] $ ssh tools-login.wmflabs.org
[19:07:33] ssh: connect to host tools-login.wmflabs.org port 22: Connection refused
[19:34:26] heh guess it is time to redo that vm... wiki: Error: You might be using an older PHP version (PHP 5.6.33-0+deb8u1).
[19:34:54] git reset HEAD~250 --hard
[19:34:56] profit
[19:40:42] aw crap i think i lost my OTP setup when i moved phones a few months ago
[19:40:45] 2fa faaaaail
[19:41:35] "Disable two-factor authentication / Enter a code from your authentication device to verify:"
[19:42:52] anybody able to reset the 2fa on my wikitech/horizon account?
[19:43:05] "my voice is my passport, verify me"
[19:46:04] 502 bad gateway for https://tools.wmflabs.org/geohack/geohack.php . Can s.o. fix this?
[19:46:55] brion: can you make a reset request file in your homedir somewhere and then let me know where to look?
[19:47:13] sure!
[19:48:15] !log tools.geohack Restarted webservice, stuck in CrashLoopBackOff
[19:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.geohack/SAL
[19:48:34] tetewww: looks like its back up now. thanks for the poke
[19:48:56] ry
[19:48:58] ty
[19:49:03] bd808: I stuck a file in ~brion on bastion.wmflabs.org. :)
[19:50:43] brion: {{done}}
[19:50:57] woot
[19:51:07] thanks bd808 ! resetting it for my new phone now :)
[19:51:27] brion: Authy might help.
[19:51:31] (In future.)
[19:51:37] this time i'm saving the scratch codes too :D
[19:52:12] James_F: oooh
[19:53:08] * bd808 has been belt-and-suspendering with printed initial code and authy
[19:53:14] brion: OTOH, it involves un- and re-setting 2FA across all your accounts, which takes forever and isn't fun. I have ~25, split across Google Auth and Authy now.
[19:53:21] I hate the authy app ui, but oh well
[19:53:35] Yeah, it's not wonderful.
[19:53:47] the iphone ui is actually horrible
[19:53:52] Not being able to distinguish by similar username accounts on similar sites?
[19:53:56] ie wikimedia properties?
[19:54:15] yeah
[19:54:26] I haven't found an easy way to change the labels
[19:54:49] hrm, horizon still not letting me in. am i using the right credentials? hrmf
[19:54:52] we could probably rethink what we write into the QR code
[19:55:03] Yeah, that's the problem
[19:55:16] should be same credentials as gerrit or ....?
hrm
[19:55:20] wgOATHAuthAccountPrefix
[19:55:49] brion: we have had some issues with case-sensitivity there, should be 'Brion VIBBER' and your developer LDAP password
[19:56:15] which should be the same password on wikitech & gerrit
[19:57:05] ahhhhhh wtf something didn't take on the 2fa
[19:57:15] lol i didn't confirm it
[19:57:20] * brion is not braining today
[19:58:12] and i'm in :D
[19:58:15] thanks all for your patience
[19:58:16] I can admit to having done that exact same thing on wikitech 2fa
[19:58:28] :)
[19:58:36] (I have done it too!)
[20:07:09] hi! are the cuts over?
[20:09:04] mmecor: yes, we are done rebooting things
[20:09:46] ok, thanks! :)
[20:23:59] anomie: is there a scheduled time when the fix will be rolled out regarding the OAuth redirect patch?
[20:25:16] !log tools Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
[20:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:25:19] Cyberpower678: It should go out with 1.32.0-wmf.7 or later. https://www.mediawiki.org/wiki/MediaWiki_1.32/Roadmap has a schedule.
[20:26:22] Oh so tomorrow. :p. Thanks. :-)
[20:30:48] Cyberpower678: For future reference, you can usually figure it out yourself by looking at the project ReleaseTaggerBot adds to the task: The 1.32.0-wmf.7 in "MW-1.32-release-notes (WMF-deploy-2018-06-05 (1.32.0-wmf.7))" is what you're looking for. You can also look at the date in there, which is typically the Tuesday it will be deployed to group 0 wikis. Group 1 usually follows on Wednesday and group 2 on Thursday, assuming no critical bugs are
[20:30:48] reported that stop the train.
[20:32:45] Or you can look at when the patch was merged. The new branch is cut sometime on Tuesday, so absent backporting anything merged before the cut goes out that week.
[21:04:20] https://tools.wmflabs.org/add-information/?image=Map-heart-054.jpg
[21:04:35] broken: 502 Bad Gateway
[21:05:27] ok i think i got ogvjs-testing.wmflabs.org rebooted and running, but my mac doesn't see it in DNS. probably a propagation issue when it was briefly deleted :(
[21:06:44] !log tools.add-information Restarted webservice stuck in CrashLoopBackOff
[21:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.add-information/SAL
[21:07:08] yannf: I think I got it running again
[21:07:32] ok, thanks
[21:08:23] it works
[21:10:25] !log tools Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
[21:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:16:01] Reedy: wikibugs looks to be working fine now right? I just saw your ticket
[21:16:32] legoktm: It's more the sudo error in the middle
[21:26:05] Reedy: hmm, but all 3 jobs are running now?
[21:26:10] did you retry it?
[21:26:14] No
[21:26:47] hmm
[21:26:54] I think wb2-phab started on its own then?
[21:27:10] I think it started
[21:27:19] sudo didn't say it failed, just it was a non 0 result
[21:28:00] But it also took a while before the bot came back to irc
[21:28:02] the start at timestamp is 45 minutes before the other two jobs
[21:30:05] But I killed them all...
[21:30:09] God knows
[21:30:11] It's confusing
[22:00:41] !log tools Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
[22:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:00:44] T196589: Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart - https://phabricator.wikimedia.org/T196589
[22:03:47] andrewbogott hi, some how with the cloud reboots it now caused /srv to be messed up on phab-tin
[22:03:54] /dev/mapper/vd-second--local--disk 484M 443M 13M 98% /srvv
[22:05:45] ah doing
[22:05:46] umount /srv
[22:05:48] worked
[22:26:45] god dammit
[22:26:54] No wiki found / Sorry, we were not able to work out what wiki you were trying to view. Please specify a valid Host header.
[22:29:25] * brion does the vagrant destroy && vagrant up dance
[22:29:40] it keeps freezing up port 8080
[22:29:48] is there a way other than rebooting the host to fix that? sigh
[22:30:15] brion sudo service apache2 restart
[22:30:20] in vagrant
[22:30:28] i get that too and doing sudo service apache2 restart
[22:30:31] paladox: there is no vagrant anymore, i destroyed it
[22:30:31] fixes it for me
[22:30:34] heh
[22:30:45] brion you will have to do that when you start a new vagrant too
[22:31:33] brion: that's an artifact of the LXC and bridge network failure stuff :/
[22:31:35] well the great thing about these scripts is you can re-run everything easy :D
[22:31:47] fun :D
[22:36:46] !log tools.best-image Stopped webservice. Python2 venv seems to be corrupt. Possibly created from bastion instead of using `webservice --backend=kubernetes python2 shell`?
[22:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.best-image/SAL
[22:42:14] chicocvenancio: do you know how to get petscan up and running?
I see a SAL message that you did it once [22:42:24] reported down on T196568 [22:42:24] T196568: 502 Bad Gateway using multiple WMF Labs Tools - https://phabricator.wikimedia.org/T196568 [22:43:05] bd808: not off the top of my head [22:43:15] * chicocvenancio checks if message rings bells [22:43:57] the other scary thing in its SAL is something about y.uvi starting the server in a screen :/ [22:44:55] I vaguely remember there being a script there [22:45:29] * bd808 reads update_live_site.sh and shudders [22:46:02] looks like run.sh may be the magic. [22:49:33] !log petscan Started a screen as magnus and then ~magnus/petscan/run.sh inside it [22:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan/SAL [22:50:37] !log petscan https://petscan.wmflabs.org/ active again. (T196568) [22:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan/SAL [22:50:39] T196568: 502 Bad Gateway using multiple WMF Labs Tools - https://phabricator.wikimedia.org/T196568 [23:03:56] !log tools.paws-beta masking swap partitions on k8s cluster [23:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.paws-beta/SAL [23:08:34] !log tools.paws-beta rebooted all nodes to verify it can survive, all good [23:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.paws-beta/SAL [23:08:58] ok i'm back up and running on mine, minus having working math. 
i'll leave that to later :D [23:15:04] !log tools.enwnbot `kubectl delete deploy/enwnbot` to remove what looks to be an orphan Kubernetes deployment [23:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.enwnbot/SAL [23:16:53] !log tools.ft `kubectl delete deploy/ft` to remove what looks to be an orphan Kubernetes deployment [23:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ft/SAL [23:17:57] !log tools.himo `kubectl delete deploy/himo` to remove what looks to be an orphan Kubernetes deployment [23:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.himo/SAL [23:20:16] !log tools.loltools shutting down webservice. It's in an infinite restart loop for "java.lang.UnsupportedClassVersionError: com/typesafe/config/ConfigException : Unsupported major.minor version 52.0" [23:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.loltools/SAL [23:25:17] !log tools.lyan `kubectl delete deploy/lyan` to remove what looks to be an orphan Kubernetes deployment [23:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lyan/SAL [23:31:07] !log tools.not-in-the-other-language `kubectl delete deploy/not-in-the-other-language` to remove what looks to be an orphan Kubernetes deployment [23:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.not-in-the-other-language/SAL [23:33:49] !log tools.r96340-bot Stopped webservice. 
Stuck in an infinite restart loop and apparently an 11-month-old "hello world" prototype app [23:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.r96340-bot/SAL [23:39:23] !log tools.readmore Deleted 7 kubernetes replicasets that look to have been left from kubernetes cluster issues about a year ago [23:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.readmore/SAL [23:40:51] !log tools.readmore webservice --backend=kubernetes python start [23:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.readmore/SAL [23:42:57] !log tools.sparqlblocks Stopped webservice. In an infinite restart loop due to missing nodejs modules [23:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sparqlblocks/SAL [23:43:35] tgr: ^ that sparqlblocks one is apparently yours. The npm install is messed up somehow for it. [23:45:00] bd808: is it causing problems? [23:45:00] !log tools.verification-pages Stopped webservice. In an infinite restart loop due to "Could not find libv8-6.3.292.48.1 in any of the sources" [23:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.verification-pages/SAL [23:45:28] tgr: It was just infinitely crashing and restarting [23:45:36] so I stopped it [23:45:45] after being stopped, I mean [23:46:00] nope, just wanted to make sure you knew about it [23:46:12] cool, thanks [23:46:21] I don't expect to work on it any time soon [23:46:48] the irc ping saved me from writing on your talk page like I'm doing for others :) [23:58:05] anomie: cool. Good to know