[01:03:00] Dispenser: ;)
[08:40:06] hello. I sent a task to the grid yesterday at 23 and it seems that it hasn't start yet :S
[08:40:17] job id: 2029528
[08:42:50] mmecor: let me check
[08:46:24] it is running
[08:47:07] consuming quite a lot of IO, in fact
[08:47:33] did you expect output it hasn't given you yet?
[08:50:36] sqlite in toolforge is not that performant
[08:50:41] mmecor: ^
[08:51:45] :S
[08:52:30] one sec
[08:52:32] you can use the toolsdb for database
[08:52:51] i know, but it is safer to use sqlite for certain things
[08:52:56] since there are no connections
[08:53:06] actually i am also using the toolsdb
[08:53:11] but later
[08:53:20] well, in toolforge there is the NFS connection
[08:53:22] to store the *real* data
[08:53:42] and at 2gb the sqlite is quite heavy
[08:54:11] i am creating a 6GB db in sqlite (and my code in my laptop takes 6 GB to create it)
[08:54:21] because it is parsing the wikidata dump
[08:54:30] 30 GB compressed into 6GB in sqlite
[08:54:55] it seems it runs half the speed that in my laptop
[08:57:09] it's fine
[08:58:12] if it is a one time thing, I think it is fine, will take a while due to the sqlite over nfs deal
[08:58:33] if it will be a recurrent job, mysql really is the way to go
[08:58:43] once a month?
[08:59:05] the thing is that i am surprised that there is no output in the .out. there should be prints
[08:59:26] well, it'll probably run a LOT faster with mysql
[08:59:57] but where is the problem, in storing in sqlite file or in reading the wikidata dump?
[09:00:10] because the wikidata dump is also stored using nfs, right?
[09:00:52] the dump is /mnt/nfs/labstore-secondary-tools-project/wcdo/latest-all.json.gz ?
[09:01:28] yes
[09:02:18] the thing is that i am surprised that there is no output in the .out. there should be prints -> any idea why?
[09:03:16] brb
[09:04:01] it is definitely the sqlite db slowing it down
[09:04:41] strace ONLY shows calls to lseek, read and write to it.
[09:05:32] it is probably some loop selecting things many times
[09:13:17] I'm still at a loss about the print statements
[09:27:20] btw, mmecor, you don't need to download the dump
[09:27:49] its there in /public/dumps/public/wikidatawiki/entities/latest-all.json.gz
[10:12:26] thanks chicocvenancio for the info
[10:13:58] so i can read it directly from that location. good
[11:08:28] !log tools aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
[11:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:33:59] !log tools removing unused kernel packages in ubuntu nodes
[11:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:36:02] !log tools (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic (T188911)
[11:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:36:07] T188911: toolforge: cleanup kernel packages - https://phabricator.wikimedia.org/T188911
[11:41:02] !log tools aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did in canary servers last week and it went fine. So run in fleet-wide
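
A minimal sketch of the kind of loader discussed above (08:40 to 09:27): streaming the compressed Wikidata JSON dump into a SQLite file with batched transactions, reading straight from the shared dump path chicocvenancio points to rather than a private copy. This is an illustration only, not mmecor's actual job; the table layout, batch size, and pragmas are assumptions, and the pragmas trade durability for fewer NFS round trips, which only makes sense for a one-shot build.

```python
import gzip
import json
import sqlite3

DUMP = "/public/dumps/public/wikidatawiki/entities/latest-all.json.gz"  # shared dump, no download needed
DB = "wikidata_subset.sqlite"  # hypothetical output file in the tool's NFS home

def load(dump_path=DUMP, db_path=DB, batch=10_000):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=OFF")   # fewer NFS round trips; acceptable for a one-shot build
    conn.execute("PRAGMA synchronous=OFF")
    conn.execute("CREATE TABLE IF NOT EXISTS entity (qid TEXT PRIMARY KEY, enlabel TEXT)")
    rows = []
    with gzip.open(dump_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):        # the dump is one JSON entity per line inside [ ... ]
                continue
            item = json.loads(line)
            label = item.get("labels", {}).get("en", {}).get("value")
            rows.append((item["id"], label))
            if len(rows) >= batch:            # one transaction per batch instead of per entity
                conn.executemany("INSERT OR REPLACE INTO entity VALUES (?, ?)", rows)
                conn.commit()
                rows.clear()
    if rows:
        conn.executemany("INSERT OR REPLACE INTO entity VALUES (?, ?)", rows)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    load()
```
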
[11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:42:13] !log tools clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
[11:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:58:46] !log tools T188994 upgrading packages in jessie nodes from the oldstable source
[12:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:58:51] T188994: toolforge: upgrade - https://phabricator.wikimedia.org/T188994
[13:21:45] !log tools T188994 in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
[13:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:21:50] T188994: toolforge: package upgrades as part of the new workflow - https://phabricator.wikimedia.org/T188994
[13:32:11] hi, i noticed https://tools.wmflabs.org/guc/ doesnt work
[13:32:29] https://tools.wmflabs.org/guc/?user=83.220.237.95 <- example
[13:35:44] Wiki13: I see `Error: MySQL login data not found at`. We probably better open a bug in phab for the tool maintainer
[13:36:01] yea I noticed that aswell
[13:36:55] but it has worked up until earlier today
[13:37:05] so something must have changed
[13:37:13] yeah
[13:38:05] mmmm I've been doing package upgrades in toolforge, but I can't see anything related to mysql client libs
[13:38:31] I saw something like that yeah
[13:38:36] around 2 hours ago
[13:39:14] you have an exact timeframe on when the issue started?
[13:39:17] arturo: it is `$uinfo = posix_getpwuid(posix_geteuid());` giving a different dir
[13:39:53] chicocvenancio: that could be because some some pam library being upgraded?
[13:39:57] I don't quite see how upgrades would change that, but maybe you do?
[13:40:11] chicocvenancio: where did you see that line?
[13:40:49] arturo: The Settings.php file
[13:41:01] in which machine?
[13:41:31] I think might mean the tools directory?
[13:41:34] he
[13:41:47] less /data/project/guc/labs-tools-guc/src/Settings.php
[13:42:05] want me to open a phab ticket for this/
[13:42:06] ?
[13:42:28] Wiki13: sure, put chicocvenancio and me in subsribers
[13:42:38] k
[13:42:39] doing
[13:46:27] chicocvenancio: do you know where is this tool executing? I would like to inspect the runtime
[13:46:29] arturo, chicocvenancio: T189001
[13:46:30] T189001: Global User Contributions complains about replica conf file - https://phabricator.wikimedia.org/T189001
[13:48:35] thanks Wiki13
[13:49:41] https://www.irccloud.com/pastebin/jWGIxzE9
[13:49:58] oh, I was looking at https://tools.wmflabs.org/admin/oge/status
[14:04:43] https://tools.wmflabs.org/replag/ wut
[14:04:55] I don't think this is... normal
[14:05:28] see above too revi, there is something else going on
[14:05:30] guc is broken
[14:05:34] saw it
[14:05:38] and I checked replag
[14:05:46] and something super duper weird is going on
[14:05:48] arturo, chicocvenancio: ^
[14:06:09] I recall tools/replag doesn't look like that
[14:06:17] with this "s1 9223372036854775807" unreasonably high replag
[14:06:39] mmm
[14:06:55] is that replication time between main databases and wiki replicas? (cc jynus )
[14:07:05] arturo: in theory yes by shard
[14:07:26] all lag values are the same so
[14:07:33] odds are pretty damn good it's just something weird not not actual lag
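
The lag value shown for every shard, 9223372036854775807, is exactly 2**63 - 1, the largest signed 64-bit integer, which supports chasemp's read that this is a sentinel or overflow from a failed measurement rather than real lag. Below is a small sketch of how a consumer of those numbers might guard against it; the threshold and the handling of the sentinel are assumptions, not how the replag tool actually works.

```python
INT64_MAX = 2**63 - 1  # == 9223372036854775807, the value shown for every shard

def interpret_lag(seconds):
    """Render a replication-lag reading defensively: the int64 sentinel (or a
    missing reading) means "could not measure", not astronomical real lag."""
    if seconds is None or seconds >= INT64_MAX:
        return "unknown (sentinel value, measurement failed)"
    if seconds > 300:
        return "lagging: %ds" % seconds
    return "ok: %ds" % seconds

print(INT64_MAX)                           # 9223372036854775807
print(interpret_lag(9223372036854775807))  # unknown (sentinel value, measurement failed)
print(interpret_lag(2))                    # ok: 2s
```
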
[14:11:10] other tool (which probably uses replica db) is fine - https://tools.wmflabs.org/meta/crossactivity/-revi
[14:11:19] and my script that uses replica saves just fine
[14:12:29] maybe the same kind of issue on replag as guc has?
[14:12:54] guc uses same replica as I do
[14:13:00] so if my script work, guc must work
[14:13:17] ah I see, weird
[14:13:21] has a maintainer tried restarting guc w/ webservice restart --backend=kubernetes?
[14:14:20] let's see who is the maint.
[14:14:30] krinkle and luxo
[14:14:36] I just checked
[14:14:38] luxo is kinda gone
[14:14:49] so Krinkle should be the one to reboot?
[14:14:56] there is an incident upstream to the wikireplicas going on (see wikimedia-operations) so I wouldn't get too worried about replag reports atm, it's likely bogus or can wait for upstream issues to settle
[14:15:27] yeah, but replag shows huh but replica itself is fine is weird anyway
[14:15:28] :P
[14:15:42] I know they're fixing it, but just feeling weird :P
[14:15:43] I will try to restart guc then if it's down, as it won't hurt it :)
[14:15:50] Yeah replag has the same issue as guc
[14:15:56] https://www.irccloud.com/pastebin/8RTDwo8L
[14:16:12] I kinda guessed it :P
[14:16:25] boom
[14:16:34] might want to restart that one aswell chasemp
[14:17:05] hm
[14:17:16] I wonder if on the same host
[14:18:01] Same PHP function
[14:18:05] was on tools-worker-1020 chasemp, accroding to logs posted earlier in this channel
[14:18:07] https://www.irccloud.com/pastebin/bCmbHsDG
[14:18:18] yeah seems they both are on 1020
[14:18:22] worker that is
[14:18:23] so
[14:18:34] I don't think so
[14:19:00] I dont how many other tools are on that, so they might all have issues atm
[14:19:19] chasemp: https://www.irccloud.com/pastebin/52F86Bcp
[14:20:01] maybe some host update really did cause trouble on workers
[14:21:09] arturo: ^ curiuos taht multiple workers are in teh same boat here
[14:21:15] yeah -_-
[14:21:21] hmm https://phabricator.wikimedia.org/T188998
[14:21:28] Wiki13: beats me
[14:22:42] so similar symptoms on 1006, 1013, and 1020 arturo
[14:22:42] chasemp: another tool with issues, orphantalk
[14:22:46] yeah
[14:22:47] ah
[14:23:48] !log tools downtime icinga alert for k8s workers ready
[14:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:24:06] https://phabricator.wikimedia.org/T189001#4027007 are the packages that changed
[14:24:08] according to -ops, there is an ongoing outage, not sure if related
[14:25:10] I thought it was related?
[14:25:14] or... that was the outage
[14:25:29] related to replag probably, related to tools not being to read replica.my.cnf doubtful
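
For context on the guc failure discussed at 13:39 to 13:42 and the replica.my.cnf reports here: Toolforge tools conventionally locate their database credentials by resolving the effective uid to a home directory, which is what guc's `posix_getpwuid(posix_geteuid())` line does in PHP. Below is a rough Python analogue, handy for checking from inside a pod whether that lookup still lands on the tool's home; guc itself is PHP, and the error message and return shape here are illustrative.

```python
import configparser
import os
import pwd

def replica_credentials():
    """Locate replica.my.cnf the way guc's Settings.php does: resolve the
    *effective* uid to a passwd entry and look in that entry's home dir."""
    entry = pwd.getpwuid(os.geteuid())  # fails, or points elsewhere, if NSS/LDAP is broken
    cnf_path = os.path.join(entry.pw_dir, "replica.my.cnf")
    if not os.path.exists(cnf_path):
        raise FileNotFoundError(
            "MySQL login data not found at %s (uid=%d, home=%s)"
            % (cnf_path, entry.pw_uid, entry.pw_dir)
        )
    parser = configparser.ConfigParser()
    parser.read(cnf_path)
    return {"user": parser["client"]["user"], "password": parser["client"]["password"]}

if __name__ == "__main__":
    print(replica_credentials()["user"])
```
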
[14:25:51] arturo: I'm going to initiative a depool and reboot of workers here
[14:26:05] doing 1001 and 1002 first to see if they come back arlight
[14:26:53] k
[14:27:19] !log tools reboot tools-worker-100[12]
[14:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:27:35] !log tools multiple tools running on k8s workers report issues reading replica.my.cnf file atm
[14:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:37:14] !log tools.replag @tools-bastion-03:~$ webservice restart --backend=kubernetes
[14:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.replag/SAL
[14:38:49] replag seems to work again
[14:38:57] https://tools.wmflabs.org/replag/
[14:40:48] It looks like its fixed now chasemp
[14:40:58] https://tools.wmflabs.org/guc/?user=83.220.237.95 returns edits
[14:41:01] some things are fixed but not all
[14:41:08] is running in tools-worker-1012.tools.eqiad.wmflabs, which was just rebooted
[14:41:22] I've drained and restarted a few nodes and restarted a few services
[14:41:27] ah ok
[14:41:28] I did not reboot 1012 arturo did you?
[14:41:47] you are right chasemp, not rebooted, 53 days uptime
[14:43:14] \ㅐ/
[14:43:19] whooops
[14:43:21] \o/
[14:48:00] how does one get the logs from a pod?
[14:50:52] Access and error.log in tool home
[14:51:49] For webservices pods
[14:53:52] Otherwise, `kubectl logs podname`
[14:54:30] arturo: ^
[14:54:51] thanks chicocvenancio
[14:55:27] I tried `$ kubectl logs guc-3456753348-2l1w6` but since the pod just rebooted, there is nothing
[14:56:22] That's a webservices pod
[14:57:20] Tail access.log and error.log
[14:58:14] !log tools drain and reboot tools-worker-1010
[14:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:03:03] chicocvenancio: I'm no rebooting nodes. Could you try to get some error lines so we could understand what happened?
[15:03:03] !log tools rebooted tools-worker 1001-1008
[15:03:04] !log tools drain and reboot tools-worker-1011
[15:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:05:07] arturo: what other tools were affected?
[15:05:24] chicocvenancio: I don't know :-/
[15:05:55] chasemp: I can't ssh to tools-worker-1011 after the reboot. Well I could briefly, but not a second time. `Connection closed by UNKNOWN port 65535`
[15:06:23] arturo: ack, sec
[15:08:46] arturo: it's up for me as root at least
[15:08:59] !log tools tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
[15:09:01] chasemp: ldap lookup failing?
[15:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:09:27] maybe arturo idk yet
[15:09:33] I re-cordoned it
[15:10:43] https://www.irccloud.com/pastebin/j8w4U2KR/
[15:10:54] arturo: did pam update?
[15:11:00] during package upgrades I mean
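
A sketch of the depool-and-reboot cycle being run by hand above, assuming `kubectl` admin access on the cluster and leaving the actual reboot to Horizon/OpenStack; the flag set is the basic one and the node name is just an example.

```python
import subprocess

def run(*cmd):
    """Echo and run a command, failing loudly if it fails."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

def depool(node):
    """Stop scheduling on the worker and evict its pods so they restart on healthy nodes."""
    run("kubectl", "cordon", node)
    run("kubectl", "drain", node, "--ignore-daemonsets", "--force")
    # ... reboot the instance out of band (Horizon / OpenStack) ...

def repool(node):
    """Put the worker back once it is reachable and healthy again."""
    run("kubectl", "uncordon", node)

if __name__ == "__main__":
    depool("tools-worker-1011.tools.eqiad.wmflabs")
```
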
[15:11:12] Mar 6 15:10:01 tools-worker-1011 CRON[7187]: pam_unix(cron:account): could not identify user (from getpwnam(prometheus))
[15:11:19] it seems so, let me confirm
[15:11:40] arturo: I had momentary weirdness on 1004 and 1006
[15:11:46] so I'm not sure whats up here
[15:12:46] arturo: let's stop rebooting
[15:12:48] something is up
[15:12:59] 1004 has issues
[15:13:09] chasemp: no pam package updates, but now I'm unsure if some pam update was triggered as part of other package upgrade (ssh-ldap, libnss-ldapd)
[15:13:24] arturo:
[15:13:27] rush@tools-worker-1004:~$ file /home/rush/.bash_profile
[15:13:27] /home/rush/.bash_profile: cannot open `/home/rush/.bash_profile' (No such file or directory)
[15:13:34] on 1004 I cannot read a file in my own home
[15:13:37] and I think it's ldap related
[15:13:41] k
[15:13:41] potentially
[15:13:48] arturo: did an ldap package update?
[15:13:58] yes
[15:14:05] libnss-ldapd and ssh-ldap
[15:14:47] chasemp: isn't the nsswitch config in puppet?
[15:14:55] https://www.irccloud.com/pastebin/99hOEZxs/
[15:14:59] is that correct ^^ ?
[15:15:14] arturo: I'm not sure but I would expect yes
[15:15:21] maybe the package creates the original and it was never put in puppet
[15:17:48] arturo:
[15:17:49] Mar 6 15:17:38 tools-worker-1004 nslcd[11175]: [6afb66] request denied by validnames optio
[15:17:55] idk what that is
[15:18:26] that may be "normal"
[15:19:31] ok, I would try reinstalling both nslcd and nsswitch related packages in a node, let's say tools-worker-1012
[15:19:57] tools-worker-1011 better, which is not allowing me in using my ldap account
[15:22:27] Mar 6 15:22:04 tools-worker-1011 sshd[16261]: fatal: Access denied for user aborrero by PAM account configuration [preauth]
[15:22:32] so this is a PAM issue
[15:23:31] potentially
[15:23:51] I'm not totally convinced of anything other than things are wrong atm
[15:34:34] !log tools.orphantalk Restarting webservice (T188998)
[15:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.orphantalk/SAL
[15:34:37] T188998: Connection Error at OrphanTalk Tools - https://phabricator.wikimedia.org/T188998
[15:35:20] bd808: did that start working post restart and if so on what worker?
[15:36:06] chasemp: yes, it is working now and its on ... tools-worker-1003.tools.eqiad.wmflabs
[15:36:17] it was on tools-worker-1013.tools.eqiad.wmflabs before restart
[15:36:29] bd808: ok ack and that new one is one of the newly rebooted
[15:50:11] !log tools Rebooting tools-worker-1011
[15:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:15:01] !log tools Reboot tools-docker-registry-02 T189018
[16:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:15:08] T189018: Instances (maybe only Jessie?) are having issues with NFS/LDAP - https://phabricator.wikimedia.org/T189018
[16:23:25] ok, so chicocvenancio , all tools webservices should be working, right?
[16:24:11] I hope so
[16:24:12] :D
[16:25:09] ok, lets keep our eyes opened in case we need to inspect something else
[16:31:40] bd808: thanks for restarting orphantalk
[16:35:17] All IRCCloud showed me initially was “Krinkle should be the one to reboot?” And I was very confused, thinking someone is asking me to reboot myself...
[16:36:02] * Krinkle gets breakfast
[16:36:09] chasemp: also thanks :-)
[16:51:01] Krinkle: yw. we broke it so it only seemed fair :)
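
The failures above (cron unable to getpwnam(prometheus), a user unable to open files in their own NFS home, sshd denying an LDAP account at the PAM account stage) all reduce to name-service lookups. Below is a stdlib-only probe one could run on a suspect worker to separate "NSS/LDAP cannot resolve the user" from "the home directory is unreachable"; the user list is illustrative.

```python
import os
import pwd

USERS = ["prometheus", "aborrero", "rush"]  # illustrative: one local service user, two LDAP accounts

def probe(users=USERS):
    """Check whether NSS (files + LDAP via nslcd) resolves each user and
    whether the resolved home directory is readable over NFS."""
    for name in users:
        try:
            entry = pwd.getpwnam(name)
        except KeyError:
            print("%s: getpwnam FAILED (NSS/LDAP lookup broken?)" % name)
            continue
        home_ok = os.access(entry.pw_dir, os.R_OK)
        print("%s: uid=%d home=%s readable=%s" % (name, entry.pw_uid, entry.pw_dir, home_ok))

if __name__ == "__main__":
    probe()
```
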
[18:24:40] anyone else seeing an error on https://horizon.wikimedia.org/project/puppet/ ?
[18:25:03] "Something went wrong!" " Try refreshing the page. If that doesn't help, contact your local administrator."
[18:25:26] twentyafterfour: yes, andrewbogott is working on it
[18:25:38] ok thanks!
[18:25:41] chicocvenancio: I am?
[18:25:48] :o
[18:26:17] andrewbogott: aren't you? cloud VPS puppet outage?
[18:26:32] yes, although it's not obvious to me that those things are connected
[18:26:36] they might be
[18:27:15] ah.. yeah it could be unrelated. I'll let you work on that, I think I can work around this for now
[18:27:59] andrewbogott i think they are as apache2 was stopped
[18:28:02] https://horizon.wikimedia.org/project/puppet/ is hit and miss for me, btw
[18:28:43] it is veeeeery slow when does work, though (over 20s)
[18:29:07] it's unclear to me why "prefix puppet" works
[18:29:21] both are hit and miss for me
[18:43:37] twentyafterfour: better now?
[18:44:10] andrewbogott: loading.....
[18:44:17] seems to work!
[18:44:22] cool
[18:44:26] sorry for the interruption
[18:44:51] is puppetmaster up and running now?
[18:44:53] twentyafterfour: if you're interested you can also try newhorizon.wikimedia.org — the puppet bits should be much the same there
[18:45:00] although it might be /slightly/ faster
[18:45:08] twentyafterfour: yeah, in theory everything is back to normal-ish
[18:45:30] I guess I need to wait for puppet to run on my instance since it still hasn't done it's first puppet run and I can't log in
[18:56:25] andrewbogott: I was having the session flapping issues on newhorizon this morning. Login, back to login, login again, a couple of pages work, back to login.
[18:57:27] it kind of seemed like only 1 of the 2 hosts was talking to memcached or something, but I didn't try to debug (was during the ldap mess)
[19:02:19] bd808: I see that too. In theory horizon is talking to nutcracker — when I had horizon doing its own memcached pool I didn't see that issue
[19:02:43] so I'm curious if the same issue happens with sessions in toolsadmin and wikitech (since they should also be hitting nutcracker)
[19:04:43] !log wmflabsdotorg Added BryanDavis (self) as project admin
[19:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wmflabsdotorg/SAL
[19:58:38] https://phabricator.wikimedia.org/T180636#4028234 What are you looking for? There some things I think should change, like including red links
[19:59:47] jynus: ^, I could probably post an abbreviated relevant bits.
[20:33:09] (PS1) Andrew Bogott: labs common.yaml: reformat k8s_infrastructure_users data [labs/private] - https://gerrit.wikimedia.org/r/416763
[20:33:34] (CR) Andrew Bogott: [V: 2 C: 2] labs common.yaml: reformat k8s_infrastructure_users data [labs/private] - https://gerrit.wikimedia.org/r/416763 (owner: Andrew Bogott)
[20:34:31] so can anyone help me figure out why I can't log in to a newly created instance? it's in the phabricator project and newly created as of a few minutes ago... but I it deny's my publickey when I try to ssh to it
[20:34:58] twentyafterfour: what is the project?
[20:35:08] phabricator
[20:35:23] lfs-storage.phabricator.eqiad.wmflabs
[20:35:56] twentyafterfour ah
[20:36:00] hmm, "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized"
[20:36:03] twentyafterfour you applied the puppet classes
[20:36:10] before it fully initialised
[20:37:00] paladox: hmm, I see... there should be a button in horizon to force a puppet run ;)
[20:37:14] twentyafterfour i mean puppet is likly failing
[20:37:15] :)
[20:37:37] twentyafterfour i will remove the class then do a force reboot through the horizion ui
[20:38:10] paladox: that's what I meant - instead of rebooting, forcing a puppet run in horizon would be convenient feature to have :)
[20:38:33] ah i see
[20:38:35] now I can log in
[20:38:44] thanks, that was the problem
[20:39:30] :)
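
What the wished-for Horizon button would amount to is an on-host agent run, the same `puppet agent -t -v` that appears in the SAL entries earlier in the day. A tiny sketch, assuming sudo rights on the instance.

```python
import subprocess
import sys

def force_puppet_run():
    """Run the Puppet agent once in the foreground, as an operator would after
    attaching new puppet classes, instead of rebooting the instance."""
    result = subprocess.run(["sudo", "puppet", "agent", "--test", "--verbose"])
    # --test implies --detailed-exitcodes: 0 = no changes, 2 = changes applied successfully
    if result.returncode not in (0, 2):
        sys.exit("puppet run failed (exit code %d)" % result.returncode)

if __name__ == "__main__":
    force_puppet_run()
```
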