[08:03:57] !log tools.zppixbot-test investigating incident T254348
[08:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[08:03:59] T254348: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348
[08:21:22] ^ That makes no sense
[08:21:46] Nothing is playing with the memory now that isn't core and the connection is closing before it starts
[08:22:04] * RhinosF1 will try and reboot the pod after he finds a sopel dev cc Reception123
[08:29:29] !log tools.zppixbot-test Recovery not expected before 10am
[08:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[08:29:41] ^ that's ~30 mins
[08:56:30] RhinosF1: No access atm
[08:57:54] Reception123: exact same issue as last time
[08:58:28] crashes, find_lines spam, keeps crashing, eventually exhausts and kills itself
[09:05:12] !help can someone help me understand https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&from=1591670485852&to=1591670576430&var-namespace=tool-zppixbot-test
[09:05:12] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[09:05:51] RhinosF1: what do you need to understand exactly?
[09:05:53] Our bot was crashing at 04:46
[09:06:03] arturo: whether that links to the reason we died
[09:06:53] 04:46 is UTC
[09:07:06] what's in the logs?
[09:07:29] arturo: bot disconnects due to a network timeout and loops around until it dies
[09:07:42] https://phabricator.wikimedia.org/T254348#6204886
[09:08:07] with random issues with the bot's memory that don't add up in between
[09:08:20] that looks to me like a source code problem
[09:08:38] if trigger.sender not in bot.memory['find_lines']:
[09:08:39] KeyError: 'find_lines'
[09:09:28] arturo: that is, but it should auto-reboot. It shouldn't even access that. I'm confused why the auto-reboot failed
[09:09:47] them logs cover a 20s period
[09:10:09] where it restarts 14 times and then exhausts itself and stops
[09:10:22] what TZ is the grafana graph in?
[09:10:35] not sure what auto reboot means. Kubernetes can detect crashed pods and restart them if they fail a health check
[09:11:05] sopel will reconnect if it fails
[09:11:14] for some reason that's never started
[09:11:16] it seems to be both.. a source code issue but caused by networking issues https://github.com/sopel-irc/sopel/issues/1865
[09:11:32] I filed that issue
[09:11:49] yea
[09:11:58] it can normally recover but as soon as I've upgraded to python 3.7 it failed to do that
[09:12:20] and then I looked at grafana to check resource use
[09:12:25] and there's that spike
[09:12:40] but if grafana is UTC then that's 65 mins prior
[09:13:26] that graph shows 3 pods on 4 containers
[09:13:39] we only have 2 pods on 2 deployments
[09:14:28] there was a crash that did recover at 02:09 UTC
[09:14:36] so either side of that spike
[09:15:42] that lasted until 02:41
[09:16:08] which I think might have been a pod restart
[09:19:18] arturo: am I right that grafana is UTC
[09:23:39] * RhinosF1 sees that CPU quota limits and requests were equal
[09:24:08] RhinosF1: seems to me like it shows me the time in whatever my laptop is set to
[09:24:38] f.e. if i click "last 5 minutes" it shows the time in EDT which happens to be my system time right now.. for ..reasons
[09:25:02] mutante: so https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&from=1591670482346&to=1591670575937&var-namespace=tool-zppixbot-test would be 02:41/2 UTC?
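
The question just above can be checked without a Grafana login, on the assumption that the `from` and `to` values in the dashboard URL are Unix epoch timestamps in milliseconds (which is timezone-independent, whatever the dashboard displays). The helper name `grafana_range_utc` is made up for this sketch:

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs


def grafana_range_utc(url):
    """Return the (from, to) bounds of a Grafana URL as UTC datetimes.

    Assumes both query parameters hold epoch milliseconds.
    """
    query = parse_qs(urlparse(url).query)

    def to_utc(ms):
        return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)

    return to_utc(query["from"][0]), to_utc(query["to"][0])


# The URL pasted at 09:25:02 in the log.
url = (
    "https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/"
    "kubernetes-namespace-resources?orgId=1&from=1591670482346"
    "&to=1591670575937&var-namespace=tool-zppixbot-test"
)

start, end = grafana_range_utc(url)
print(start.strftime("%Y-%m-%d %H:%M:%S"), "to", end.strftime("%H:%M:%S"), "UTC")
# prints: 2020-06-09 02:41:22 to 02:42:55 UTC
```

Under that assumption the selected window is 02:41 to 02:42 UTC, which lines up with the recovery that "lasted until 02:41" rather than with the 04:46 crash.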
[09:26:40] i can't tell
[09:26:47] I think it might be
[09:26:55] if so, that's the restart
[09:27:39] https://github.com/cockroachdb/cockroach/issues/20428
[09:27:47] Change the default timezone of Grafana dashboards from UTC to Local browser time #
[09:28:06] it is MacFan's restart then
[09:28:17] * RhinosF1 looks at MacFan4000 and wonders what he did
[09:28:34] "However, you can easily change the time zone for your dashboards when logged-in to grafana. Simply click the cog icon, navigate to the General tab and adjust the Timezone drop-down."
[09:28:38] RhinosF1: ^ that ?
[09:29:01] I don't have a grafana account
[09:29:30] is it the wikitech login?
[09:29:36] or shell username?
[09:30:03] he just deleted the pods so I call that a fluke
[09:30:32] RhinosF1: never used it .. but wikitech login seems to work..
[09:31:21] RhinosF1: yea, under settings there are time zone options.. "Default", "Local browser time" or "UTC"
[09:31:34] doesn't tell us what Default is.. but try it
[09:33:58] mutante: gerrit+toolsadmin password is incorrect
[09:34:10] using both "rhinosf1" and "RhinosF1"
[09:34:17] and "rhinosf1@gmail.com"
[09:35:41] RhinosF1: maybe you would have to ask for being member in a specific LDAP group.. but i have no idea about grafana-labs setup
[09:35:50] seems like a group though if i can login and you cant
[09:36:26] it is definitely not the email address
[09:36:50] your ldap user seems to have a lot more access
[09:37:00] but i think it's local time
[09:37:06] for a production service i could look it up in puppet
[09:37:09] if so, I'm rolling this back
[09:37:20] yea, i think local time is default
[09:38:35] ah.. maybe i found the backend.. checking
[09:40:09] Given a spike in resources on restart (if you confirm local time) and the mess that it seems to be in, I think I'm going to roll back python.
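
For the source-code half of the problem, the `KeyError: 'find_lines'` quoted at 09:08 means a handler read `bot.memory['find_lines']` before anything had created that key. A minimal sketch of a defensive guard follows. It uses a plain dict as a stand-in for `bot.memory`, and the function names `ensure_find_lines` and `collect_line` are hypothetical; this is not sopel's actual code, nor the fix adopted upstream.

```python
def ensure_find_lines(memory):
    """Create the 'find_lines' store if it is missing, then return it."""
    return memory.setdefault("find_lines", {})


def collect_line(memory, sender, nick, text, limit=10):
    """Record one message, keeping at most `limit` lines per nick per channel."""
    lines = ensure_find_lines(memory)
    channel = lines.setdefault(sender, {})
    history = channel.setdefault(nick, [])
    history.append(text)
    while len(history) > limit:
        history.pop(0)  # drop the oldest line first
    return history


# Stand-in for bot.memory in a state where setup never populated the key.
memory = {}
collect_line(memory, "#channel", "somenick", "hello")
print(memory)
```

A guard like this only stops the handler from raising; it would not explain why the reconnect loop gave up after 14 restarts.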
[09:41:45] !log tools.zppixbot-test delete deployment to begin rolling back T254246 due to T254348
[09:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[09:41:48] T254246: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246
[09:41:49] T254348: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348
[09:41:55] RhinosF1: it says that 'all viewers should be logged in as anonymous' or so.. that's all i know
[09:42:04] mutante: oh
[09:42:08] * RhinosF1 rolling back
[09:43:14] Reception123: we're rolling back to python 3.5
[09:47:03] !log rolled back sopelbot to python 3.5 -- END due to T254348 (cc T254246)
[09:47:04] RhinosF1: Unknown project "rolled"
[09:47:06] T254246: Upgrade ZppixBot docker image to python 3.7 - https://phabricator.wikimedia.org/T254246
[09:47:06] T254348: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348
[09:47:14] !log tools.zppixbot-test rolled back sopelbot to python 3.5 -- END due to T254348 (cc T254246)
[09:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[09:47:40] !log tools.zppixbot-test outage over
[09:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[09:48:29] Thanks for the help
[10:27:42] arturo you there?
[10:27:53] yup hauskatze
[10:28:17] arturo: Since yesterday, https://logstash-beta.wmflabs.org is down
[10:28:30] I rebooted the logstash instances via Horizon
[10:28:36] But that didn't fix the problem
[10:28:51] I was wondering if you guys could see if there's some backend issues or something?
[10:29:17] I am doing some mw-core patches and I'd love to see if they cause errors before reaching production
[10:29:29] in deployment-prep?
[10:29:34] yep
[10:29:42] ops said ask here
[10:30:39] I filed T254801 yesterday
[10:30:39] T254801: Logstash-Beta cannot be accessed: 504 Gateway Time-out - https://phabricator.wikimedia.org/T254801
[10:31:31] the VM backing that FQDN is up and running, apparently hauskatze
[10:31:49] I don't have time right now for a more in-depth investigation
[10:32:11] the instance was apparently created by Krenair
[10:32:19] It's not loading here, nor the alias kibana4.wmflabs.org either
[11:11:25] !log tools.wmde-access deployed 5ed8014383 (fix for toolforge.org migration of other tools and migrate to toolforge.org ourselves)
[11:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wmde-access/SAL
[12:07:52] !log tools.zppixbot restart webservice as canonical
[12:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[13:19:15] Hello Guys, I am a Wikimedia GSoC Student. I want to access the https://toolsadmin.wikimedia.org/tools/id/commons-android-app but I am unable to.
[13:20:47] Why are you unable to?
[13:21:18] I am getting Access denied error.
[13:21:23] From what? When doing what?
[13:21:25] This is the procedure I followed, let me know if I am doing this correctly or not.
[13:21:25] madhurgupta10@commons-android-app.eqiad.wmflabs
[13:22:28] Where did you get that command/hostname from?
[13:22:46] https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances
[13:22:47] As far as I can see, it's a tool, not a cloud project
[13:23:09] Could you please let me know which command should I execute?
[13:23:30] You want to ssh to tools-login.wmflabs.org, and then probably use `become commons-android-app`
[13:25:15] I am new to tools, how can I SSH into it?
[13:26:43] ssh username@tools-login.wmflabs.org
[13:27:17] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Getting_started_with_Toolforge_-_Quickstart
[13:27:59] I get the following message
[13:28:00] $ ssh madhurgupta10@tools-login.wmflabs.org
[13:28:00] (ECDSA) to the list of known hosts.
[13:29:28] I think you lost part of your message
[13:29:51] !pastebin
[13:29:58] well, unhelpful bot is unhelpful
[13:31:24] Is it asking you if you want to accept the key/add the host to the list of known hosts?
[13:32:30] https://imgur.com/Hs4NntN
[13:33:29] Have you uploaded your SSH key?
[13:33:37] yes
[13:33:46] To where?
[13:33:50] Is it loaded into an SSH agent?
[13:33:54] https://toolsadmin.wikimedia.org/profile/settings/ssh-keys/
[13:34:42] I added SSH here
[13:37:24] It is working now
[13:37:28] I missed this
[13:37:29] ssh-add key
[13:37:35] Thanks Reedy
[13:37:39] np :)
[13:37:47] you'll need to do ssh-add after a reboot
[13:37:59] I am using Git Bash on Windows
[13:38:11] do I have to do it every time I reboot?
[13:38:20] yeah
[13:38:27] it's loaded into a ssh agent in memory
[13:38:27] okay thanks
[13:38:33] got it
[13:38:42] I also want to access Database Tables
[13:39:46] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
[13:42:54] What should be the value for this - enwiki.analytics.db.svc.eqiad.wmflabs in my case?
[13:43:36] Depends what database tables you're trying to access
[13:44:25] I will be working on this one https://toolsadmin.wikimedia.org/tools/id/commons-android-app
[13:54:30] After executing `become commons-android-app` I can see the following
[13:54:31] Yes, but are you trying to access the database table of that tool (does it have one?), or are you wanting one of the wikis databases?
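
If the goal turns out to be a wiki replica rather than the tool's own database, the hostname asked about at 13:42 appears to follow a per-wiki pattern. The sketch below rests on assumptions taken from the Help:Toolforge/Database conventions rather than from this chat: that replica hosts are named `<wiki>.<cluster>.db.svc.eqiad.wmflabs` with cluster `analytics` or `web`, and that the replica database is named `<wiki>_p`. The helper `replica_target` is hypothetical.

```python
def replica_target(wiki, cluster="analytics"):
    """Return (hostname, database name) for a wiki replica.

    Assumes the '<wiki>.<cluster>.db.svc.eqiad.wmflabs' host pattern
    and the '<wiki>_p' database naming convention.
    """
    if cluster not in ("analytics", "web"):
        raise ValueError(f"unknown replica cluster: {cluster!r}")
    return f"{wiki}.{cluster}.db.svc.eqiad.wmflabs", f"{wiki}_p"


# 'enwiki' reproduces the hostname quoted in the chat;
# 'commonswiki' is the database name for Wikimedia Commons.
print(replica_target("enwiki"))
print(replica_target("commonswiki"))
```

For Commons this would give `commonswiki.analytics.db.svc.eqiad.wmflabs` and `commonswiki_p`; actual connections also need the credentials file provided to the tool account.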
[13:54:57] tools.commons-android-app@tools-sgebastion-07:~$
[13:55:42] https://toolsadmin.wikimedia.org/tools/id/commons-android-app this contains the database of the Wikimedia Commons Android App users upload data which I need to access
[13:56:27] https://github.com/commons-app/commonsmisc here is the repo
[14:01:21] !log admin icinga downtime everything cloud* lab* for 2h (T253780)
[14:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:01:23] T253780: Upgrade cloudservices nodes to Debian Buster - https://phabricator.wikimedia.org/T253780
[14:09:51] !log admin stopping puppet, all designate services and all pdns services on cloudservices1004 for T253780
[14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:09:53] T253780: Upgrade cloudservices nodes to Debian Buster - https://phabricator.wikimedia.org/T253780
[14:12:25] madhurgupta10: Looking at the tool, it has a user database (`sql tools`), but it's empty
[14:12:32] It looks like it just queries the commons database
[14:13:14] So you can probably use `sql commonswiki`
[14:15:13] Reedy: Thanks for all the help, I will get in touch with my GSoC mentor about the tables.
[14:17:18] !log tools.bd808-test restarted webservice to check health of the k8s system, DNS servers being upgraded
[14:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bd808-test/SAL
[15:25:30] !log admin icinga downtime everything cloud* lab* for 2h more (T253780)
[15:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:25:32] T253780: Upgrade cloudservices nodes to Debian Buster - https://phabricator.wikimedia.org/T253780
[16:11:09] Cyberpower678: try `webservice start --canonical`
[16:11:34] I already did. I'm testing my changes.
[16:11:55] great
[16:12:22] Looks like my WMF grants work.
[16:13:14] arturo: now I just need to update my Miraheze grants.
[16:45:05] hi bd808 hope you're having a good day
[16:45:21] may I have few minutes of your time?
[16:46:07] hauskatze: !ask
[16:47:13] bd808: so logstash-beta is down
[16:47:22] hauskatze: not my problem :)
[16:47:37] bd808: it looks like nobody's problem :)
[16:47:43] ask deployment-prep maintainers. Probably in -releng
[16:47:47] releng said cloud, cloud says now dev/null
[16:48:01] ping-pong
[16:48:08] :)
[16:48:18] I'm happy to tell releng not my problem too.
[16:48:29] lol
[16:48:47] hauskatze: is there a task?
[16:49:00] bd808: it's a nginx/timeout issue; maybe something in the cloudvirt hosts?
[16:49:05] bd808: sure, let me fetch
[16:49:34] https://phabricator.wikimedia.org/T254801
[16:56:43] hauskatze: I left a comment. The host is misconfigured
[16:57:00] bd808: thanks, I knew you could help :)
[16:57:19] I have horizon access, maybe I can take a look
[17:03:14] hauskatze: we could fix everything, but we don't have time to do it. That's what I tried to transmit to you this morning
[17:07:30] arturo: I was just asking for advice
[17:09:13] anomie: is there documentation somewhere laying out how the OAuth tables are structured and located?
[17:09:37] Specifically, the one holding OAuth Callback URLs.
[17:10:20] Cyberpower678: anomie isn't around much these days. The tables are in the metawiki db, but also not exposed to the wiki replicas
[17:10:46] the tables are documented in the code for the OAuth extension
[17:11:00] bd808: I would hope not. This is for a different wikifarm IABot runs on. I need to migrate my grants there.
[17:11:38] we actually have a task to expose them, but haven't finished the security sign-off
[17:12:20] Oh. What for?
[17:13:09] bd808: also confirming that Wikimania has been pushed back to next year.
[17:13:33] Cyberpower678: https://www.mediawiki.org/wiki/Extension:OAuth and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/OAuth/+/master/schema/OAuth.sql
[17:13:47] yes, wikimania 2020 is cancelled. Not surprising
[17:18:33] Horizon's down expected?
[17:19:01] DNS Updates
[17:19:02] ah
[17:19:03] k
[17:30:02] bd808: I don't think I quite follow your OAuth callback explanation.
[17:32:13] Cyberpower678: Not that I know of offhand, your best bet would be to look at the schema file bd808 linked.
[17:33:06] anomie: it's all taken care of. Thank you though. BTW, can you help explain what bd808 is trying to tell me https://phabricator.wikimedia.org/T254857#6207098
[17:40:07] Cyberpower678: If it's OAuth 1, he's talking about the oauth_callback parameter when you hit the /initiate endpoint. But you may just be using "oob" instead of actually specifying the callback URL. If it's OAuth 2, it's redirect_uri to the /oauth2/ endpoints (and may be optional depending on which method you're using, I'm too lazy to look up the details).
[17:41:23] anomie: yes, I am using oob. My OAuth engine was based on your OAuth Hello World example.
[17:43:28] But if that's supposed to be the callback URL, then I guess that makes sense.
[17:44:35] It's fine to use oob, that just tells MediaWiki to use the callback URL on file in the database.
[17:46:41] anomie: thanks for the explanation
[17:48:50] And if you do specify a URL, MW will insist that it matches the one in the DB anyway (to prevent certain kinds of attacks). So specifying it is only useful for the "prefix" feature, to append a token of some sort to the callback URL rather than using cookies or if you're doing something weird with multiple endpoints.
[18:19:05] !log tools.my-first-flask-oauth-tool Updated to use python3.7, enabled --canonical, new OAuth grant
[18:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.my-first-flask-oauth-tool/SAL
[20:38:03] bd808: bash isn't showing any results
[20:38:19] * RhinosF1 wanted the first/final quote for Texas
[20:46:09] RhinosF1: ugh. that's a new crash. Let me see if I can figure out the problem
[20:46:20] bd808: ty
[20:46:40] * RhinosF1 wishes things would crash at the right time like when I'm trying to debug
[20:48:56] lol
[20:49:34] * RhinosF1 needs somehow to trick his bot in ~20 mins to thinking there's a ping timeout
[20:49:47] Or to wait a few days
[20:50:03] firewall rule?
[20:50:41] It's on toolforge
[20:52:54] * RhinosF1 knows he won't be waiting long anyway
[21:02:05] * RhinosF1 has an idea
[21:17:23] !log tools.zppixbot-test - messing stuff up to test T254348
[21:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[21:17:26] T254348: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348
[21:20:24] * bd808 wonders what the heck is wrong with the bash tool
[21:20:47] no error logs, but no content from the elasticsearch backend
[21:23:34] oh.. well that would do it...
[21:24:50] !log tools.bash Credentials missing from es7 cluster for tool access. Fallout from T254491
[21:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bash/SAL
[21:24:52] T254491: Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491
[21:36:28] * RhinosF1 just proved one bug's cause
[21:36:47] 1/3 python update blockers have a confirmed cause now
[21:38:33] RhinosF1: bash is alive again
[21:38:45] bd808: yey!
[21:38:57] bash is one of my top tools
[21:42:23] It was a fun one to make. :)
[21:43:01] * RhinosF1 uses it when he needs a laugh
[21:44:55] !log tools.zppixbot-test keeping logging_channel off with debug logs and hoping the bot times out to get some decent non log spammed logs to try and fix bug 2/3 in T254348
[21:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[21:44:57] T254348: ZppixBot-test fails restarts correctly on python 3.7 after crash - https://phabricator.wikimedia.org/T254348
[21:54:15] bd808: I use it whenever I need a laugh. It's one of the most uplifting tools
[21:54:40] * RhinosF1 goes back to hoping there is a timeout on the bot before DEBUG logs go insane
[22:55:27] !log clouddb-services Reset the passwords for T254931
[22:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[22:55:30] T254931: missing postgres passwords on - https://phabricator.wikimedia.org/T254931
[23:00:47] bstorm_: can I be picky and say the title of that task doesn't make sense. on what?
[23:03:13] I can fix that
[23:03:25] bstorm_: {{done}} I got nerd sniped
[23:04:02] ty bd808
[23:07:50] lol
[23:07:53] Ok cool