[00:02:46] ok
[00:04:41] done: https://phabricator.wikimedia.org/T241868
[00:10:43] paladox: could you add the fqdn of that instance to the bug report?
[00:11:05] y'all are always thinking that we know where to look for things :)
[00:11:15] oh!
[00:11:16] sorry
[00:11:18] will do
[00:11:42] done!
[00:13:05] paladox: also paste that whole output of the run with "bash -x"
[00:13:21] root@deploy1001:/home/paladox# bash -x /sbin/parted -s /dev/vda mkpart primary
[00:13:23] and what you showed me about the return code 1
[00:13:24] root@deploy1001:/home/paladox# bash -x /sbin/parted -s /dev/vda mkpart primary
[00:13:24] /sbin/parted: /sbin/parted: cannot execute binary file
[00:13:28] aha
[00:13:40] well.. that isnt a bash script
[00:13:45] yup
[00:13:45] only the one using parted is
[00:14:06] bash -x /usr/local/sbin/make-instance-vg /dev/vda
[00:14:08] /usr/local/sbin/make-instance-vg
[00:14:09] is the bash script
[00:14:30] root@deploy1001:/home/paladox# bash -x /usr/local/sbin/make-instance-vg /dev/vda
[00:14:30] + device=/dev/vda
[00:14:31] ++ /sbin/parted -s /dev/vda print free
[00:14:31] ++ /bin/sed -e 's/ */ /g'
[00:14:31] ++ /bin/grep 'Free Space'
[00:14:32] ++ /usr/bin/cut -d ' ' -f 2,3
[00:14:33] ++ /usr/bin/tail -n 1
[00:14:34] + /sbin/parted -s /dev/vda mkpart primary
[00:14:35] + echo '/usr/local/sbin/make-instance-vg: failed to create new partition'
[00:14:37] + exit 1
[00:14:38] err
[00:14:39] sorry
[00:14:43] i thought i was in a different channel :)
[00:15:04] paladox: just dump it all in the ticket inside code blocks
[00:15:34] ok
[00:15:42] the disk is all allocated already
[00:15:55] check `df -h`
[00:16:14] /dev/vda2 40G 1.9G 36G 5% /
[00:16:21] there is 40G nominally, and / is 36G
[00:16:26] ohh
[00:16:38] i presumed that andrew fixed that (so that it defaults to 20g again)
[00:16:43] so we don't even need to do anything about enlarging /srv ?
[00:17:06] well, that's nice too
[00:17:20] then we just remove those extra roles..even better
[00:17:31] yup
[00:17:31] anything that makes it closer to prod is good in my book
[00:20:30] paladox: remove the LVM role on both phab instances.. also removed ALL Hiera from Horizon on both.. all in repo now..either common or ./host/
[00:20:43] removed.. you got the deployment_server
[01:06:28] !log tools.admin-beta Moving to new kubernetes cluster
[01:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin-beta/SAL
[01:09:36] !log tools.fourohfour kubectl scale --replicas=3 deployment/fourohfour
[01:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fourohfour/SAL
[01:10:23] just noticed that k8s.tools.eqiad1.wikimedia.cloud host pointing at 172.16.0.99, didn't realise we'd started using those names yet
[01:11:44] yeah, that's the load balancer name for the new k8s cluster
[01:12:29] its not in general use yet, but wikimedia.cloud is going to replace .wmflabs
[01:17:09] !log tools.os-deprecation Moving to new kubernetes cluster
[01:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.os-deprecation/SAL
[02:00:37] !log tools.jouncebot Moving to new Kubernetes cluster
[02:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jouncebot/SAL
[10:56:44] What is the best way to do the authentication for a bot running on toolforge (started with cron)?
[11:14:43] oauth, with the tokens stored in a chmod 600 file, I think
[11:16:22] unless you can think of something crypto... I can't
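
The bash -x trace above shows only fragments of /usr/local/sbin/make-instance-vg. A minimal sketch of what the free-space check appears to do, reconstructed from that trace (variable names and the exact mkpart arguments are assumptions, not the real script), is:

    #!/bin/bash
    # Sketch reconstructed from the bash -x trace above; not the actual script.
    device="$1"   # e.g. /dev/vda

    # Take the last "Free Space" range that parted reports (start/end are fields 2 and 3).
    free_range=$(/sbin/parted -s "$device" print free \
        | /bin/sed -e 's/  */ /g' \
        | /bin/grep 'Free Space' \
        | /usr/bin/cut -d ' ' -f 2,3 \
        | /usr/bin/tail -n 1)

    # On a fully allocated disk $free_range is empty, so mkpart runs with no start/end
    # arguments, fails, and the script bails out exactly as in the trace above.
    # ($free_range is intentionally unquoted so start and end become two arguments.)
    /sbin/parted -s "$device" mkpart primary $free_range || {
        echo "/usr/local/sbin/make-instance-vg: failed to create new partition"
        exit 1
    }

That matches the df -h output a few lines later: /dev/vda is already fully partitioned, so there is no free range for mkpart to use and the script exits 1.
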
[15:39:14] I agree with zhuyifei1999_. Best practice is https://www.mediawiki.org/wiki/OAuth/Owner-only_consumers. If for some reason you can't implement OAuth in your bot, then use https://www.mediawiki.org/wiki/Manual:Bot_passwords
[15:59:41] !log cyberbot moving VM cyberbot-db-01 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241873)
[15:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cyberbot/SAL
[15:59:44] T241873: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241873
[16:01:31] !log devtools moving vm puppetmaster-1001 from cloudvirt1024 to cloudvirt1009 due to hardware error T241884
[16:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[16:01:37] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[16:02:03] !log tools moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241873)
[16:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:02:29] oops, I'm using the wrong phab task
[16:04:09] !log tools moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:05:21] !log meza moving VM meza-full from cloudvirt1024 to cloudvirt1003 due to hardware error T241884
[16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Meza/SAL
[16:05:32] I messed up where I am moving it in my first log message :(
[16:06:02] !log tools moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:06:07] no problem :-)
[16:07:39] !log tools moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:07:41] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[16:08:42] !log tools moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:09:30] !log video moving VMs encoding04 and encoding05 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
[16:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[16:09:57] !log tools moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:11:22] !log tools moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:13:44] !log tools moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
[16:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:13:47] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[16:16:20] !log tools Draining tools-worker-10{05,12,28} due to hardware errors (T241884)
[16:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
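
For the cron-driven bot question earlier in the log, the chmod 600 approach zhuyifei1999_ mentions is mostly about file permissions on the shared Toolforge filesystem. A small illustrative shell sketch follows; the file name and key names are made up, and the actual values would come from an owner-only OAuth consumer or, failing that, a bot password:

    # Create the credentials file with owner-only permissions before any secrets go into it.
    install -m 600 /dev/null "$HOME/.bot-credentials.ini"

    cat >> "$HOME/.bot-credentials.ini" <<'EOF'
    # Tokens from an owner-only OAuth consumer (or a bot password as a fallback).
    consumer_key    = ...
    consumer_secret = ...
    access_token    = ...
    access_secret   = ...
    EOF

    # Confirm nothing else on the shared grid/Kubernetes hosts can read it.
    ls -l "$HOME/.bot-credentials.ini"

The bot's cron job then reads the tokens from that file instead of having them hard-coded in its source.
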
[16:28:25] andrewbogott: VM just dies
[16:28:25] *died
[16:28:25] Cyberpower678: we are relocating cyberbot-db-01 because hardware errors
[16:28:25] sorry about that
[16:28:25] What is that 5th move now. :p
[16:28:26] andrew is now in midnight
[16:28:26] Cyberpower678: I know, it's in the SAL. I'm really sorry for that. Hardware issues :-/
[16:28:26] It was on 1024. A really fast one too
[16:28:26] :-(
[16:28:26] 1024 is dying
[16:28:26] RAID controller issues again
[16:28:26] * Cyberpower678 is beginning to wonder if his DB VM is causing these failures
[16:28:26] and I'm afraid the new destination 1009 won't be very fast
[16:28:26] Wherever it goes, so does HW trouble
[16:28:38] your DB is stress testing the RAID controller :-P
[16:29:50] do we have a record of what host it was on and when?
[16:30:39] Krenair: you can guess it from https://wikitech.wikimedia.org/wiki/Nova_Resource:Cyberbot/SAL
[16:30:44] cloudvirt1013, 1009, 1024
[16:30:47] arturo: it stress tests anything it's sitting on.
[16:30:49] :p
[16:30:54] Sorry about that though.
[16:31:03] IABot does a lot of work xwiki
[16:31:25] it's ok
[16:32:38] It all started when the VM was initially moved during Wikimania last year. ;-)
[16:34:54] !log admin icinga downtime cloudvirt1024 for 2 months because hardware errors (T241884)
[16:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:34:57] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[16:35:27] Cyberpower678: the VM is now 30% migrated
[16:35:43] arturo: my private server has had disk failures lately too. Took 90 hours to rebuild.
[16:36:16] BTW do you keep a copy of this VM data somewhere else?
[16:36:25] Yes. My private server.
[16:36:31] cool
[16:36:41] It does a weekly push of it to my personal server sitting next to me.
[16:37:09] As long as the local disk space is large enough.
[16:37:27] 👍 good idea, especially if the VM is attracting disk failures
[16:37:27] arturo: I'm running out of space and need more.
[16:38:43] The DB VM creates a local compressed dump file of it, and then SFTPs it to my server. But the VM is running out of disk to do that. Is it possible to pipeline that data over SFTP as the gzip and mysqldump are running?
[16:40:51] perhaps, my SFTP kungfu is not very strong
[16:41:22] I'm thinking it should be possible. SFTP simply opens a write socket to the remote server and sends the file through.
[16:41:27] but with standard SSH you could do something like `cat file | ssh myserver "cat file <"`
[16:42:00] Doesn't that still write to the local disk first?
[16:42:35] something like `mysqldump | gzip whatever | ssh myserver "cat file <"`
[16:42:57] I'll give that a try, and see what happens.
[16:44:34] So you guys use RAID 10?
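
The redirection arturo is gesturing at is easier to read written out. A minimal streaming version that never touches the VM's local disk (host, database name and target path are placeholders) would be:

    # Stream the dump straight over SSH; nothing is written to the local disk.
    mysqldump --single-transaction dbname \
        | gzip \
        | ssh backup@my.server.example "cat > /var/backups/db/$(date +%Y-%m-%d).sql.gz"

This is essentially the command Cyberpower678 spells out in full a little further down in the log.
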
[16:47:20] !log tools moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
[16:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:47:29] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[16:54:07] !log tools moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
[16:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:54:10] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[16:54:27] Cyberpower678, yes
[16:56:44] if you search for cloudvirt on https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/install_server/files/autoinstall/netboot.cfg and then find the files under https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/install_server/files/autoinstall/partman/custom/
[16:58:26] (ignore cloudvirt200[1-3] that's codfw1dev)
[17:00:04] there's some comments in those files about how things are set up
[17:27:47] mysqldump -u dbuser -pXXXXXXXX dbname \
[17:27:48] | gzip | cat | ssh -i ~/.ssh/id_rsa_backup backup@my.server.com \
[17:27:48] 'cat > /var/backups/services/my_service/db/$(date +"%Y-%m-%d").sql.gz'
[17:27:52] arturo: something like that maybe?
[17:32:11] arturo: what's the status of the migration
[17:32:49] give me a sec
[17:36:29] Cyberpower678: 64%
[17:38:25] * Cyberpower678 continues waiting
[17:39:21] the network is exasperatingly slow for this VM, I don't know why
[17:39:39] about half the speed other VMs migrated
[17:39:48] this one is running at about 20MB/s
[17:39:58] others ~45MB/s
[17:40:11] so like another hour to go?
[17:40:42] * Cyberpower678 can only imagine how much work he'll be able to do on it then. :/
[17:40:44] according to rsync, yes `203,699,519,488 66% 38.56MB/s 0:42:31`
[18:06:35] !log tools Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
[18:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:10:35] !log tools kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
[18:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:11:42] !log tools Shutdown tools-worker-1029
[18:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:31:13] arturo: is it finished?
[18:31:35] I still can't route to it.
[18:31:38] Cyberpower678: 93%
[18:45:15] bd808: just out of curiosity, are all the fast Cloudvirts in one way or another out of service atm
[18:45:30] arturo: ^
[18:46:28] its night for a.rturo. We picked 2 servers that obviously had space for the evacuation. No initial concern about which ones those were
[18:46:42] bd808: fair enough
[18:47:26] we are down 3 (4?) cloudvirts right now for various hardware problems, so the options are not numerous
[18:48:19] bd808: this is more or less an inquiry regarding the current state of available cloudvirts.
[18:51:33] Looks like the VM sprang to life.
[18:52:01] Cyberpower678: its Saturday morning/evening. We had 2 cloudvirts alert for hardware issues in the last 5 hours and we just evacuated one of them. There will be some examination of overall state, but not right now. :)
[18:52:18] :-)
[18:52:24] Thank you.
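
Krenair's netboot.cfg and partman links earlier in this stretch of the log point into the operations/puppet repository. With a local checkout the same lookup is just the following (the checkout location is an assumption; the file paths are the ones quoted above):

    # Which partman recipe each cloudvirt gets at install time:
    grep -n 'cloudvirt' modules/install_server/files/autoinstall/netboot.cfg

    # The recipes themselves, whose comments describe the RAID/partition layout:
    ls modules/install_server/files/autoinstall/partman/custom/
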
[19:04:16] andrewbogott: I see encoding04 & encoding05 had a reboot at 16 UTC with a microarchitecture change (expected: Broadwell, actual: Ivy Bridge). is this expected?
[19:07:24] zhuyifei1999_: midnight for andrew, the VMs were relocated to other cloudvirt due to HW issues. See SAL
[19:08:51] zhuyifei1999_: out of curiosity, why is the architecture being looked at?
[19:09:28] arturo: ok I see. are most cloudvirts Ivy Bridge and I just lucked out having my VMs on Broadwell before?
[19:10:15] Cyberpower678: optimized locally compiled ffmpeg that expects Broadwell... SIGILL on Ivy Bridge
[19:10:33] Oh.
[19:11:05] What are you encoding on there?
[19:11:14] OOC
[19:11:35] videos? it runs video2commons
[19:12:03] with ffmpeg, I would assume videos, I was just wondering what.
[19:12:08] zhuyifei1999_: the farm is a mix. We have something like 6 different generations of hardware in the cloudvirt pool
[19:13:31] ok. is there some documentation on the CPUs we have?
[19:14:14] nope
[19:14:38] I mean, I can recompile to Ivy Bridge now, but if it relocates to some even older generation it would break again
[19:16:14] zhuyifei1999_: why not build a script to detect the architecture and what it is expecting, and if they don't match, automate the compile process.
[19:16:46] argh...
[19:18:23] I am totally confused. I sure did not start a job on december 6th … but it suddenly appeared?
[19:21:06] Wurgl, can we have more details? maybe someone else started one?
[19:22:20] arturo: bd808: andrewbogott: Just leaving this here for now, doesn't need immediate action. Current IO wait is sitting at 90% on average, with spikes as high as 150%.
[19:22:59] Cyberpower678: stop hitting your database so hard ;)
[19:23:09] bd808: unavoidable
[19:23:18] IABot does xwiki work
[19:23:31] so batch your writes?
[19:23:38] Does that already
[19:23:39] so tracking so much state?
[19:23:46] ?
[19:23:47] *stop tracking
[19:24:06] ?
[19:25:06] using a mysql db as a job queue is always going to be high contention. And you have said recently that you are maxing out your huge disk too. These things make me think that your data management plan could use some rethinking
[19:25:51] We will try to find you a faster hypervisor, but that's not going to happen for a few days
[19:26:01] I'm open to suggestions, but it's not the job table that's using that data, it's the URL metadata.
[19:26:48] Tracking whether or not it's alive or not, what archive URLs it should be using, or not.
[19:29:29] Cyberpower678: is externallinks_global the big space hog?
[19:29:37] Yep
[19:30:59] And that table is eventually going to track every external link that ever existed on every page of every wiki?
[19:31:20] Yes.
[19:31:54] Every article is more appropriate. IABot only works with articles.
[19:32:05] that's obviously not sustainable
[19:32:37] Why not. You have a zillion schemas with externallinks tables in them
[19:32:38] article count grows unbounded, edit count grows unbounded, therefore link count grows unbounded
[19:32:54] Those track all the external links on wiki
[19:33:17] Cyberpower678: yes, but spread over 800+ dbs on dozens of hosts
[19:33:40] in a design that includes sharding for horizontal spread
[19:33:56] I'm not opposed to sharding either.
[19:34:07] But the tracking is necessary.
[19:38:25] Besides a lot of URLs overlap multiple times over multiple articles over multiple projects. This setup eliminates that overlap.
[19:39:00] Costing less space.
[19:39:52] I'm still very open to improving the setup though.
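
Cyberpower678's suggestion of detecting the CPU before running the optimized binary can be done with a small wrapper. The following is only a sketch (paths and the rebuild command are assumptions), and it leans on the fact that a Broadwell-tuned build uses AVX2 instructions that Ivy Bridge does not have, which is what produces the SIGILL:

    #!/bin/bash
    # Hypothetical wrapper around a locally optimized ffmpeg build.
    FFMPEG="$HOME/bin/ffmpeg-optimized"   # assumed install location

    # A trivial invocation is enough to trip SIGILL if the binary was tuned for a
    # newer microarchitecture than the cloudvirt the VM currently lives on.
    if ! "$FFMPEG" -version >/dev/null 2>&1; then
        echo "optimized ffmpeg does not run on this CPU; rebuilding for the local host" >&2
        # (cd "$HOME/src/ffmpeg" && ./configure --extra-cflags='-march=native' && make -j"$(nproc)")
    fi

    exec "$FFMPEG" "$@"

Rebuilding with -march=native on whichever host the VM lands on sidesteps the question of which CPU generation it happens to be running on.
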
[19:48:02] Krenair: https://phabricator.wikimedia.org/T241902
[19:50:10] thanks
[20:00:29] I love reading through our docs and finding pmtpa references
[20:00:53] that's only 6 years ago! ;)
[20:01:45] Isn't 29 days enough?
[20:13:23] Wurgl, I don't suppose you have any mail from tools from around the 6th do you?
[20:15:29] I do not, but I use google mail and maybe I trashed it. Oldest mail in trash is Dec 11th
[20:15:50] I've got stuff from it for persondata from 23rd october and 22nd december :/
[20:17:32] 22nd? I got no mail on 2nd?
[20:17:37] -2+22
[20:18:09] no "SGE 8.1.9: Job 3385355 failed" ?
[20:18:14] According to the logfile I started that job
[20:18:22] 2019-12-06 00:15:24 Start process_templatedata.php
[20:18:39] But it was running fine
[20:25:32] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Accounting doesn't seem to be working as the file hasn't been updated for months: -rw-r--r-- 1 sgeadmin sgeadmin 909M Mar 25 2019 /data/project/.system/accounting
[20:29:27] how is qacct supposed to work...
[20:32:59] https://tools.wmflabs.org/sge-status/ <-- when you look here: there are a lot of jobs started months ago with "n/a" in column CPU
[20:38:45] With wikihistory I see a similar behaviour
[20:39:19] This has two continuous jobs running. Just two. But currently there are four?
[20:41:09] Two have this strange "n/a" in column CPU on the website? I have seen this "n/a" just after starting, but these processes claim to have started in October?
[21:51:14] Hi, I’m wondering if someone can help me with resetting 2fa on my wikitech account please? :)
[21:51:15] I have the recovery keys
[21:54:10] paladox, if you have a recovery code what do you need help with exactly?
[22:08:23] Krenair: there’s no reset 2fa or use recovery code option
[22:08:24] When I try to login
[22:08:51] paladox, what if you give it the recovery code as the TOTP code?
[22:08:59] Oh
[22:09:04] * paladox tries
[22:10:55] That worked!
[22:10:55] Thanks!
[22:12:10] Though doesn’t seem to work to disable it
[22:12:23] IIRC those recovery codes are single-use, you're not trying to use the same one are you?
[22:17:35] Yup
[22:20:03] Ah that worked!
[22:20:04] Thanks!
[22:20:39] paladox, just remember to remove your notes of the old codes that won't work anymore
[22:21:40] and replace all your codes ideally
[22:30:29] Yup
[22:30:30] (Already done :))
[22:39:05] Krenair: outdated docs :( The stretch grid tracks state in /data/project/.system_sge/
[22:39:58] it's less the docs and more the scripts they point to
[22:41:30] hmm: -rw-r--r-- 1 sgeadmin sgeadmin 115M Jan 4 22:41 /data/project/.system_sge/gridengine/default/common/accounting
[22:41:35] maybe that's what I want
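
Once the live accounting file is known, qacct can be pointed at it explicitly instead of the stale default path. A hedged example using the job number quoted earlier in the log; qacct's -f flag selects an alternate accounting file in standard grid engine, though whether that particular job is recorded in this file is not something the log confirms:

    # Query the stretch grid's accounting file directly.
    qacct -j 3385355 -f /data/project/.system_sge/gridengine/default/common/accounting
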