[00:05:45] 6Labs, 10Tool-Labs: can't login to tools-shadow-01 - https://phabricator.wikimedia.org/T104781#1477501 (10yuvipanda) Root key doesn't work either - I think the instance is dead. @Coren? was it DOA or is there something you did inside that needs rescuing? [00:43:30] 10Tool-Labs-tools-meetbot: Update meetbot to not hang out in 'wikimedia-office' but instead 'wikimedia-meeting' - https://phabricator.wikimedia.org/T103404#1477726 (10Krenair) [01:41:04] 6Labs: Investigate spikes in Labs NFS network usage - https://phabricator.wikimedia.org/T95392#1477872 (10yuvipanda) 5Open>3Invalid a:3yuvipanda Oh well. [01:41:45] 6Labs, 6operations: lvm 'others20150715' snapshot full on labstore1001 - https://phabricator.wikimedia.org/T106601#1477876 (10yuvipanda) 5Open>3Resolved a:3yuvipanda The snapshot has been deleted by @Coren [01:50:53] 6Labs, 10Tool-Labs, 10Wikimedia-Git-or-Gerrit: git clone operations/mediawiki-config on tool labs fail: recursion detected in die_errno handler - https://phabricator.wikimedia.org/T106393#1477887 (10zhuyifei1999) 5Open>3Resolved a:3zhuyifei1999 Retried just now. Cannot reproduce the error anymore. (weird) [03:35:18] 6Labs, 10Tool-Labs: Reenable backups for /home and /data/project - https://phabricator.wikimedia.org/T63103#1478027 (10yuvipanda) [03:35:19] 6Labs, 5Patch-For-Review: Replicate data between codfw and eqiad - https://phabricator.wikimedia.org/T85606#1478026 (10yuvipanda) [03:36:59] 6Labs, 10Labs-Infrastructure: Recover /data/scratch/ content - https://phabricator.wikimedia.org/T106324#1478031 (10yuvipanda) 5Open>3declined a:3yuvipanda I think this is too late and /data/scratch was obliterated during the recovery. /data/scratch should also not contain valuable information. [03:38:55] 6Labs, 10Labs-Vagrant: failing puppet - https://phabricator.wikimedia.org/T106442#1478043 (10yuvipanda) Seems to work fine for me? [03:39:08] 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1478044 (10yuvipanda) [03:39:34] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-107, 5Patch-For-Review: nfs-exports-daemon hangs, prevents new instances from accessing nfs - https://phabricator.wikimedia.org/T106076#1478045 (10yuvipanda) 5Open>3Resolved [03:39:54] 6Labs, 10Tool-Labs, 6Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1478046 (10yuvipanda) @leila has graciously accepted to help run this one :) [03:40:52] 6Labs, 10Labs-Infrastructure, 6Security, 10wikitech.wikimedia.org, 7Security-Other: Huge rash of bot accounts on wikitech - https://phabricator.wikimedia.org/T105350#1478050 (10yuvipanda) 5Open>3Invalid a:3yuvipanda AFAICT this was a misunderstanding, do re-open if not. [03:42:27] 6Labs: Investigate per-project open security group policy - https://phabricator.wikimedia.org/T104894#1478053 (10yuvipanda) @Negative24 ping? Were your issues resolved? Was ferm the issue? [03:44:42] 6Labs, 10Tool-Labs, 10Wikimania-Hackathon-2015: Workshop: Doing Research on Wikimedia things as a volunteer - tools and communities - https://phabricator.wikimedia.org/T91062#1478055 (10yuvipanda) 5Open>3Resolved [03:46:43] 6Labs, 10Tool-Labs: role::relic - changes not applied by puppet? on which node or instance is it? 
- https://phabricator.wikimedia.org/T104537#1478059 (10yuvipanda) /me pokes @Coren [03:47:06] 6Labs: Investigate why novaadmin was no longer projectadmin of the puppet3-diffs project - https://phabricator.wikimedia.org/T104440#1478060 (10yuvipanda) 5Open>3Invalid a:3yuvipanda CNR [03:52:17] 6Labs, 10Analytics-Cluster, 10wikitech.wikimedia.org: Include role::analytics::hadoop roles in default list of labs puppet groups - https://phabricator.wikimedia.org/T70391#1478063 (10yuvipanda) 5Open>3declined I have cleaned the default roles to a minimum, and these should just be project specific roles... [03:53:17] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Labs-Vagrant, 10MediaWiki-Vagrant, and 2 others: Update Vagrant role for Extension:OpenStackManager - https://phabricator.wikimedia.org/T103874#1478065 (10yuvipanda) This is a lost cause, IMO. Anything we set up on MWV won't match production in any way or f... [03:54:28] 6Labs: Re-evaluate use of NFS in WMT project - https://phabricator.wikimedia.org/T103750#1478066 (10yuvipanda) So... it currently has only /data/project afaict, should we say 'it needs that, and that alone' and close the ticket? Or can that be re-evaluated too? [03:54:52] 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Grant shell user right with project memberships and remove autocreation of shell requests - https://phabricator.wikimedia.org/T97334#1478067 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Wooo [03:56:56] 6Labs: Find alternative solutions for video project's use of NFS - https://phabricator.wikimedia.org/T102402#1478070 (10yuvipanda) @matanya can I kill everything except /data/scratch then? Do you still want the ~2T of data in /data/project? [03:59:04] 6Labs: Get rid of Gluster Copy and PMTPA NFS Copies from labstore1001 - https://phabricator.wikimedia.org/T102390#1478071 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This was done. [03:59:29] 6Labs, 6operations: New instances stuck unable to run puppet (and no sshing in!) - https://phabricator.wikimedia.org/T101916#1478074 (10yuvipanda) 5Open>3Resolved a:3yuvipanda New images were built. [04:01:22] 6Labs: Remove old backups-of-backups from NFS - https://phabricator.wikimedia.org/T99061#1478079 (10yuvipanda) 5Open>3Resolved a:3yuvipanda These were gone during the outage. [04:02:50] 6Labs, 10Labs-Infrastructure: The Salt minion client id should be the FQDN, not ec2_instance_id - https://phabricator.wikimedia.org/T71502#1478086 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This is the case now. [04:03:43] 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1478090 (10yuvipanda) I'm tempted to mark as declined and hope that Horizon fixes things... [04:04:05] 6Labs, 10Labs-Infrastructure: Oddity about service groups "awb" in Tools pre and post transition - https://phabricator.wikimedia.org/T65754#1478093 (10yuvipanda) Was this done? [04:04:38] 6Labs, 10Labs-Infrastructure: role::mediawiki-install::labs in an eqiad instance thinks to be in pmtpa - https://phabricator.wikimedia.org/T64370#1478096 (10yuvipanda) 5Open>3declined a:3yuvipanda That role is no longer available on wikitech by default, and should be killed from puppet too. 
[04:05:14] 6Labs, 10Labs-Infrastructure: MediaWiki files set up by role::mediawiki-install::labs don't have proper permissions - https://phabricator.wikimedia.org/T64368#1478100 (10yuvipanda) 5Open>3declined a:3yuvipanda Using that role has been unsupported since availability of labs-vagrant, instances should proba... [04:05:59] 6Labs, 10Labs-Infrastructure: Provide Redis feed of recent changes for Wikimedia wikis - https://phabricator.wikimedia.org/T61721#1478103 (10yuvipanda) 5Open>3declined a:3yuvipanda RCStream is probably good enough, and do not think we'll have much of a load change from providing a redis proxy. [04:06:31] 6Labs, 10Labs-Infrastructure: provide bastion redundancy via DNS round robin - https://phabricator.wikimedia.org/T59834#1478107 (10yuvipanda) 5Open>3declined Let's not do this, this will confuse people running screen and what not. We have redundancy now by being able to switch over the IP address in case s... [04:07:18] 6Labs, 10Labs-Infrastructure: Have shell requests marked as uncompleted or completed automatically - https://phabricator.wikimedia.org/T47456#1478110 (10yuvipanda) 5Open>3Invalid a:3yuvipanda No more shell requests!!!1 [04:08:07] 6Labs, 10Labs-Infrastructure: default labs MediaWiki config will generate https links - https://phabricator.wikimedia.org/T58389#1478114 (10yuvipanda) 5Open>3declined a:3yuvipanda Old and unmaintained puppet code, use labs-vagrant instead! [04:11:08] 6Labs, 10Labs-Infrastructure: Add "open for all" project feature - start with bastion - https://phabricator.wikimedia.org/T46173#1478118 (10yuvipanda) 5Open>3declined a:3yuvipanda Let's not do this - adding people to a project is trivial, and we do not have a block based on the shellmanagers group anymore. [04:12:15] 6Labs, 10Wikimedia-Labs-General, 5Patch-For-Review: /etc/mailname is set to "labs-vmbuilder-precise.eqiad.wmflabs" - https://phabricator.wikimedia.org/T66962#1478122 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Definitely not the case anymore. [04:13:08] 6Labs, 10Wikimedia-Labs-General: Just get rid of creepy default vimrc - https://phabricator.wikimedia.org/T51339#1478126 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I see it on really old instances now, but don't think this is set by default anymore. [04:13:40] 6Labs, 10Wikimedia-Labs-General, 7JavaScript: WMFLabs Graphite: Dashboard is empty (Uncaught exception in javascript) - https://phabricator.wikimedia.org/T73742#1478130 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I do not see those errors anymore? [04:14:32] 6Labs, 10Wikimedia-Labs-General: Rename project bots to wm-bot - https://phabricator.wikimedia.org/T57691#1478134 (10yuvipanda) Can projects even be renamed? [04:15:29] 6Labs, 10Wikimedia-Labs-General: Fix virt1000 OAI errors - https://phabricator.wikimedia.org/T87079#1478137 (10yuvipanda) 5Open>3Invalid a:3yuvipanda What is OAI? virt1000 is no longer alive. I assume this can be closed (re-open if this is still a problem) [04:18:09] 6Labs: Renaming scheme for labs servers - https://phabricator.wikimedia.org/T95042#1478141 (10yuvipanda) 5Open>3Resolved This was all done anyway, so it's ok :) [04:19:36] 6Labs, 6operations: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1478148 (10yuvipanda) 5Open>3Invalid a:3yuvipanda Marking as invalid because there's no unpuppetized (or otherwise) bond0 now. 
[05:44:41] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1404 is CRITICAL tools.tools-webgrid-lighttpd-1404.diskspace.root.byte_percentfree (<30.00%) [06:27:26] 6Labs: Find alternative solutions for video project's use of NFS - https://phabricator.wikimedia.org/T102402#1478219 (10Matanya) I still need that data. until i figure out how to upload those files given commons limitation. [06:44:45] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1404 is OK All targets OK [07:39:36] 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1478300 (10Nemo_bis) > I'm tempted to mark as declined and hope that Horizon fixes things... Nope. Bug can still be reproduced. [07:41:05] 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1478301 (10Nemo_bis) [08:19:42] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 20.00% of data above the critical threshold [0.0] [08:20:00] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 30.00% of data above the critical threshold [0.0] [08:22:12] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 22.22% of data above the critical threshold [0.0] [08:23:30] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 20.00% of data above the critical threshold [0.0] [08:25:16] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL 22.22% of data above the critical threshold [0.0] [08:25:44] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 40.00% of data above the critical threshold [0.0] [08:25:50] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 40.00% of data above the critical threshold [0.0] [08:25:52] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 30.00% of data above the critical threshold [0.0] [08:26:36] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 40.00% of data above the critical threshold [0.0] [08:27:06] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 22.22% of data above the critical threshold [0.0] [08:29:02] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 20.00% of data above the critical threshold [0.0] [08:29:36] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 60.00% of data above the critical threshold [0.0] [08:29:54] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 60.00% of data above the critical threshold [0.0] [08:29:55] PROBLEM - Puppet failure on tools-master is CRITICAL 30.00% of data above the critical threshold [0.0] [08:31:38] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 50.00% of data above the critical threshold [0.0] [08:33:36] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 50.00% of data above the critical threshold [0.0] [08:33:42] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 20.00% of data above the critical threshold [0.0] [08:34:24] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 44.44% of data above the critical threshold [0.0] [08:34:28] PROBLEM - Puppet failure on tools-mailrelay-02 is CRITICAL 50.00% of data above the critical threshold [0.0] [08:38:32] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 40.00% of data above the critical threshold [0.0] [08:39:28] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 44.44% of data above the critical 
threshold [0.0] [08:39:57] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 40.00% of data above the critical threshold [0.0] [08:39:57] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 60.00% of data above the critical threshold [0.0] [08:40:56] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 20.00% of data above the critical threshold [0.0] [08:41:10] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 33.33% of data above the critical threshold [0.0] [08:43:40] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 30.00% of data above the critical threshold [0.0] [08:44:31] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 22.22% of data above the critical threshold [0.0] [08:44:41] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL 60.00% of data above the critical threshold [0.0] [08:46:02] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 30.00% of data above the critical threshold [0.0] [08:46:32] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 22.22% of data above the critical threshold [0.0] [08:47:16] PROBLEM - Puppet failure on tools-submit is CRITICAL 44.44% of data above the critical threshold [0.0] [08:48:24] PROBLEM - Puppet failure on tools-mail is CRITICAL 50.00% of data above the critical threshold [0.0] [08:48:58] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 60.00% of data above the critical threshold [0.0] [08:49:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 66.67% of data above the critical threshold [0.0] [08:49:40] PROBLEM - Puppet failure on tools-shadow is CRITICAL 60.00% of data above the critical threshold [0.0] [08:59:41] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0] [08:59:59] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [09:00:11] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL 55.56% of data above the critical threshold [0.0] [09:00:46] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0] [09:02:12] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0] [09:03:34] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [09:04:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [09:04:54] RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0] [09:05:17] RECOVERY - Puppet failure on tools-exec-1219 is OK Less than 1.00% above the threshold [0.0] [09:05:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [09:05:51] RECOVERY - Puppet failure on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0] [09:06:36] RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0] [09:06:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [09:07:06] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [09:08:42] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0] [09:09:02] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [09:09:30] RECOVERY - Puppet failure on tools-mailrelay-02 is OK Less than 1.00% above the threshold [0.0] [09:09:56] RECOVERY 
- Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [09:10:15] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-105, 3Labs-Sprint-106: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1478439 (10Magnus) Three weeks in, and still broken? Just created a new tool, do I need to use another tool's credentials yet again? [09:13:34] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [09:13:42] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [09:14:22] RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0] [09:14:56] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0] [09:16:09] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0] [09:18:36] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0] [09:19:00] how to tell bigbrother to reread .bigbrotherrc? [09:19:30] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [09:19:42] RECOVERY - Puppet failure on tools-exec-1214 is OK Less than 1.00% above the threshold [0.0] [09:19:56] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [09:21:00] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [09:21:51] 6Labs, 10pywikibot-core: pywikipedia.org down? - https://phabricator.wikimedia.org/T106311#1478487 (10Chmarkine) >>! In T106311#1476249, @valhallasw wrote: > the question is: what should it be replaced with...? CNAME to wikimedia.org? [09:23:20] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [09:23:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [09:24:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0] [09:24:28] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [09:24:40] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [09:26:03] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [09:26:29] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [09:27:15] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0] [09:35:13] RECOVERY - Puppet failure on tools-exec-1404 is OK Less than 1.00% above the threshold [0.0] [10:07:32] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0] [10:55:02] 6Labs, 10Incident-20150617-LabsNFSOutage, 3Labs-Sprint-102, 3Labs-Sprint-103, and 3 others: Audit projects' use of NFS, and remove it where not necessary - https://phabricator.wikimedia.org/T102240#1478671 (10JohnLewis) [10:55:04] 6Labs: Re-evaluate use of NFS in WMT project - https://phabricator.wikimedia.org/T103750#1478669 (10JohnLewis) 5Open>3Resolved Yeah, currently it needs that and that alone. We'll see if we can deprecate it in future but for now, it is needed. 
[12:49:32] 6Labs, 10Continuous-Integration-Infrastructure, 10Labs-Infrastructure: Increase number of Jenkins slaves to spread load and prevent browser test failures on beta - https://phabricator.wikimedia.org/T72049#1478963 (10hashar) that was a request to #labs-infrastructure to bump the project quota. [14:18:54] 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1479124 (10bmansurov) It started working when I created a new instance. I may have chosen a different version of Ubuntu initially. I'm not sure though. [14:32:30] 6Labs, 10Tool-Labs: Reenable backups for /home and /data/project - https://phabricator.wikimedia.org/T63103#1479142 (10scfc) 5Open>3declined (IMHO this was a duplicate of T85608 (or vice versa), so declining this as well.) [15:00:17] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Labs-Vagrant, 10MediaWiki-Vagrant, and 2 others: Update Vagrant role for Extension:OpenStackManager - https://phabricator.wikimedia.org/T103874#1479197 (10scfc) I don't want to update the role to "really" match production (i. e. "where you can do stuff just... [15:12:09] 10MediaWiki-extensions-OpenStackManager: OpenStackManager: Nova resource pages are endlessly prepended with line break whitespace on each update by Labslogbot - https://phabricator.wikimedia.org/T58316#1479235 (10scfc) 5Open>3Resolved a:3scfc A cursory look at the (deleted and before that moved) page refer... [15:15:59] 10MediaWiki-extensions-OpenStackManager: OpenStackManager: Nova resource pages are endlessly prepended with line break whitespace on each update by Labslogbot - https://phabricator.wikimedia.org/T58316#1479250 (10scfc) a:5scfc>3None [15:23:42] 6Labs, 10Labs-Infrastructure, 7Composer, 7Upstream: Composer activity from Labs hosts can be rate limited by GitHub - https://phabricator.wikimedia.org/T106452#1479271 (10bd808) a:5bd808>3None GitHub does not have a whitelist capability for anonymous API requests. They suggest that we use authenticated... [15:29:16] 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1479280 (10scfc) [15:50:23] 10MediaWiki-extensions-OpenStackManager: Special:NovaInstance should restrict project filter to projects where the current user is an administrator - https://phabricator.wikimedia.org/T106820#1479311 (10scfc) 3NEW [15:50:59] 6Labs, 10Tool-Labs: Permission issues and/or failure to load Ruby environment on trusty - https://phabricator.wikimedia.org/T106170#1479319 (10MusikAnimal) The problem does indeed come and go, and lately (past several days) I don't think I've had any issues. I have a log file for the Ruby script and I see that... [15:51:16] 10MediaWiki-extensions-OpenStackManager: Special:NovaInstance should restrict project filter to projects where the current user is an administrator - https://phabricator.wikimedia.org/T106820#1479320 (10scfc) [15:55:35] 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1479341 (10yuvipanda) Indeed, but nobody is planning on working on it afaict... [17:25:28] 6Labs, 10Tool-Labs: Permission issues and/or failure to load Ruby environment on trusty - https://phabricator.wikimedia.org/T106170#1479921 (10scfc) @notconfusing's question was related to inter-project data transfer, AFAIUI. 
There is certainly the possibility of an issue with (lack of) NFS synchronicity (CMI... [17:31:25] 6Labs, 10Tool-Labs, 6Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1479950 (10aripstra) Adding this to the Design research work board for awareness. Please reach out if you need feedback on your survey or anything. [17:31:40] 6Labs, 10Tool-Labs, 6Learning-and-Evaluation, 6WMF-Design-Research: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1479955 (10aripstra) [17:40:46] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-105, 3Labs-Sprint-106: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1480035 (10yuvipanda) (did a manual run just now, should have it for new tools atm). Disruption due to wikimania + current priority being getting cross DC backups of... [17:46:42] 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1480096 (10scfc) p:5Normal>3Lowest Then let's adjust the priority to reflect that. But the bug is quite clearly still there. [17:53:31] Coren: should we stop the current runs and run them via systemd instead? [17:54:40] YuviPanda: Hm. If you feel it really necessary - otherwise I'd be happy to start them after that run is done. They're not so long anymore; last time they took <24h [17:54:52] Coren: yeah but that's saturday :D [17:55:19] YuviPanda: True, but that's why - I'd rather be around start-to-finish when we let them loose automatically first. [17:55:33] Coren: we aren't running them automatically [17:55:33] Coren: these don't restart at all. [17:55:39] Coren: they just run and then when done... stop. [17:55:41] Hm. True. [17:55:46] I have a Restart=No in there [17:55:56] Coren: the only thing this gives us is 1. validation that the scripts are ok, 2. logging properly [17:56:00] Oh hah - others is actually already done. [17:56:11] nice [17:56:20] I'm still hacking around puppet [17:56:41] Ah, cool, and everything worked fine at the remote end too - the temporary snapshot got discarded once the rsync completed. [17:56:42] 6Labs, 10Tool-Labs: Permission issues and/or failure to load Ruby environment on trusty - https://phabricator.wikimedia.org/T106170#1480192 (10scfc) @MusikAnimal, sorry, I posted my reply without checking if comments had come in since I opened the browser's tab. I just looked at `qacct -j exec -o tools.musikb... [17:57:47] YuviPanda: Means you can test the systemd thing with others at no risk [17:57:55] Coren: cool. [17:58:08] And it should complete really fast too. [17:58:10] Coren: so do we still have any deletion steps left? [17:58:17] Coren: I guess we need to remove local snapshot when done? [17:58:53] YuviPanda: We'll want to clean up the local snapshots at intervals. At least any time there is less than 6T available in the vg (so there is room for another backup) [17:59:14] Coren: why 6T? [17:59:30] YuviPanda: Room enough for 3 new snapshots. [17:59:43] Coren: are we doing all snapshots at 2T? [18:00:42] Ah, no, 1T. Dunno why I was thinking 2T [18:00:46] Coren: we should also do alerts for when snapshots are going to get full [18:00:49] So 3T being the low water mark. [18:00:52] the others one got full the other day [18:01:15] YuviPanda: Yeah, they have limited lifetime. I don't think it's worth alerting - just discard the ones getting full. [18:01:34] Coren: also how did you get around the ssh key problem? 
[18:03:16] YuviPanda: I added the full set of signatures to ~/.ssh/known_hosts so that it doesn't get trampled over by puppet with ssh-keyscan. It's not clear how to automate that though, short of adding more key types to the global known hosts (which seems like a bad idea to me) [18:03:45] 6Labs, 10Labs-Infrastructure: Transition service groups to new globally unique names and UIDs - https://phabricator.wikimedia.org/T60997#1480233 (10scfc) [18:03:46] 6Labs, 10Labs-Infrastructure: Oddity about service groups "awb" in Tools pre and post transition - https://phabricator.wikimedia.org/T65754#1480230 (10scfc) 5Open>3Invalid a:3scfc No (AFAIK), but if I look now at https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup, the "tools.awb" and "tools.loc... [18:03:57] Coren: ugh, please let's not do manual hacks without filing bugs and being loud at them. broken windows and all that - this is how we ended up in our current situation, let's not go back towards that again. [18:04:06] I'll work on a script that (a) discards snapshots getting too full and (b) discards the oldest remaining snapshots until at least 3T is left. [18:04:39] YuviPanda: Hm, yes. You're right of course - this requires a bug being filed. [18:05:26] Coren: yes, and I feel very, very strongly about manual hacks, esp. on the NFS systems :) So let's not do that at all, no matter how insignificant, without proper documentation elsewhere. [18:09:35] Coren: so... Jul 24 18:09:17 labstore1002 storage-replicate[10803]: CRITICAL:root:unable to create local snapshot (labstore-others20150724): Logical volume "others20150724" already exists in volume group "labstore" [18:09:45] Coren: I think we should add a more granular timestamp to that. [18:10:31] Hm, yes - the original intent was daily so that was okay but if you want to do it more frequently then it needs at least HHMM [18:11:02] Coren: let's go full hog and do HHMMSS. am uploading patch now [18:30:10] 6Labs, 3Labs-Sprint-105: Do a manual backup of labstore1002 - https://phabricator.wikimedia.org/T104882#1480318 (10yuvipanda) 5Open>3Invalid Yup [18:39:53] Coren: hmm, killing the process leaves the lockfile as is... [18:39:53] Jul 24 18:39:27 labstore1002 storage-replicate[13903]: WARNING:root:Skipping replication; already in progress since 2015-07-24% H%:M:53 [18:39:59] even though the process isn't alive. [18:40:18] YuviPanda: Hm. How did you kill it? [18:40:26] Coren: service stop. [18:40:29] maybe that's why [18:40:55] YuviPanda: Ah. Hm. [18:41:26] YuviPanda: I'm pretty sure we want an aborted rsync to require manual intervention as a rule, really - because the destination is possibly inconsistent. [18:41:39] Coren: aren't we creating new snapshots in the dest as well? [18:41:48] Coren: so it shouldn't be inconsistent, right? esp. with second level granularity now [18:42:29] YuviPanda: The /snapshot/ is known to be consistent - which is why there is the need for a manual intervention (or at least a human making a call) [18:42:43] YuviPanda: After an aborted backup there are two choices: [18:43:31] YuviPanda: Either you drop the "live" fs and make the snapshot the canonical fs - returning to a known consistent state - or you restart the copy over the partial previous one. [18:43:51] YuviPanda: In theory, the former is the safest option. [18:43:59] what do you mean by 'live' fs? [18:44:37] YuviPanda: the destination has a - say - 'tools' lv. That's the target of the rsync. Before the rsync starts, it creates a snapshot. 
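A minimal sketch of the cleanup policy described at [18:04:06], assuming LVM reporting via lvs/vgs: discard snapshots that are nearly full, then drop the oldest remaining snapshots until at least 3T is free in the volume group. The VG name and the 90% fullness cutoff are illustrative assumptions, not the script that was actually written.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the cleanup policy above: (a) discard snapshots
# that are getting too full, (b) drop the oldest remaining snapshots until
# at least 3T is free in the VG. The VG name, the 90% cutoff and the exact
# lvs/vgs field lists are illustrative assumptions.
import subprocess

VG = 'labstore'
LOW_WATER_BYTES = 3 * 2**40  # 3T low-water mark, per the discussion above
FULL_PCT = 90.0              # assumed definition of "getting too full"

def report(cmd, fields):
    out = subprocess.check_output(
        [cmd, '--noheadings', '--separator', '|', '--units', 'b',
         '--nosuffix', '-o', fields, VG], text=True)
    return [line.strip().split('|') for line in out.splitlines() if line.strip()]

def snapshots():
    # lv_attr starts with 's' for snapshot LVs ('o' marks the origin)
    return [(name, float(pct or 0))
            for name, attr, pct in report('lvs', 'lv_name,lv_attr,snap_percent')
            if attr.startswith('s')]

def vg_free_bytes():
    return int(float(report('vgs', 'vg_free')[0][0]))

def discard(name):
    subprocess.check_call(['lvremove', '-f', '%s/%s' % (VG, name)])

# (a) snapshots that are nearly full will be dropped anyway; do it now
for name, pct in snapshots():
    if pct >= FULL_PCT:
        discard(name)

# (b) oldest first: the YYYYMMDDHHMMSS suffix sorts lexicographically
for name, _ in sorted(snapshots()):
    if vg_free_bytes() >= LOW_WATER_BYTES:
        break
    discard(name)
```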
[18:45:01] YuviPanda: After the rsync completes successfully, the snapshot is just dropped since the destination 'tools' is now known to be equal to the source. [18:45:35] YuviPanda: An aborted copy has the destination with 'tools' (inconsistent) and 'toolsYYMMDDHHSS' which is the snapshot pre-rsync [18:45:47] aha! I see. [18:45:57] but this is an rsync right, so it theoretically should be ok with option 2 [18:46:57] YuviPanda: It *should* be, yes, and it would be 99% of the time at least. But we don't know _what_ caused the rsync to fail by that time and if it's something like out-of-space or an I/O error it wouldn't be. [18:47:14] YuviPanda: Certainly, option 2 works if you wilfully stopped a backup that had no issues. [18:47:29] Coren: oh I agree it should require manual intervention. I just want to automate 1. alerting and 2. bringing it back up. [18:47:30] YuviPanda: But making that call, IMO, needs a human. [18:47:43] * Coren nods. [18:47:49] so I should be able to run a simple script that starts it back up after checking, and have a flowchart I can follow [18:47:59] (1) is simple once these are continuous - alert if the process isn't running. [18:48:28] We agree then. Alerting is simple - the presence of the lock dir at the beginning of a run is always an error state since we know we run it in a loop. [18:49:50] Cleanup requires (a) delete one of the two fs at destination (b) unmount the source snapshot if applicable (c) rmdir the lock directory. [18:50:23] (a), for option 2, is just "lvremove the readonly snapshot" [18:52:55] Coren: hmm, so I see [18:52:58] others backup owi-aos--- 5.00t [18:53:01] others20150724183453 backup swi-a-s--- 1.00t others 0.00 [18:53:17] Coren: I can't actually drop others, can I? doesn't the snapshot rely on others being there? [18:53:20] sorry, still being n00by [18:53:28] I'm trying to understand option 1 (remove live FS) [18:56:35] You can, but it needs more steps. You have to umount and deactivate the origin, then merge the snapshot back into the origin. [18:56:42] (Then remount it) [18:57:44] So strictly speaking, you aren't removing the origin I guess, you're rolling it back. In practice, the result is the same though. [18:58:04] You're keeping the old extents and dropping the new ones. [18:59:58] Coren: hmm, (2) sounds simpler... :D [19:00:44] Heh. It is. :-) But having (1) possible means we can recover from a broken copy. [19:01:00] Coren: so I can just rm the lock directory on source and start again? that will mean: 1. snapshot is still present on dest, can recover if needed 2. no need to delete anything. [19:01:37] YuviPanda: Hmmm. Yes, but keep in mind the destination has very little elbow room for extra snapshots. [19:02:02] YuviPanda: So you'd want to delete the extra snapshot asap after the copy. [19:02:19] Coren: right. in this case I feel ok deleting the extra snapshot before the copy as well, but let's see. [19:02:42] Coren: also - if there's no room for extra snapshots, only those new snapshots would fail, right? won't affect the existing ones? [19:02:58] Right. [19:04:25] Coren: https://wikitech.wikimedia.org/wiki/NFS_Backups is that accurate? [19:05:28] Added a step [19:06:10] Also, are you the one who mounted tools20150715 in /tmp? [19:06:54] Coren: ah, yes. recovered some files for Cyberpower678 [19:06:57] can be unmounted now [19:07:01] kk [19:07:30] Coren: ok, verifying those steps now [19:09:39] Coren: hmm, the lockfile was gone... [19:10:00] ... gone? 
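For reference, "option 1" (rolling the destination back to the pre-rsync snapshot) is roughly the sequence below. This is a hedged sketch with made-up VG/LV/mountpoint names, not a tested runbook; lvconvert --merge folds a snapshot back into its origin, and LVM defers the merge until the next activation if the origin is still in use.

```python
#!/usr/bin/env python3
# Hypothetical sketch of "option 1": roll the destination LV back to the
# pre-rsync snapshot by merging the snapshot into its origin. VG, LV,
# snapshot and mountpoint names are illustrative, not the real layout.
import subprocess

VG, ORIGIN, SNAP = 'backup', 'others', 'others20150724183453'
MOUNTPOINT = '/srv/others'

def run(*cmd):
    print('+', ' '.join(cmd))
    subprocess.check_call(cmd)

run('umount', MOUNTPOINT)                              # origin must not be in use
run('lvchange', '-an', '%s/%s' % (VG, ORIGIN))         # deactivate the origin
run('lvconvert', '--merge', '%s/%s' % (VG, SNAP))      # fold the old extents back
run('lvchange', '-ay', '%s/%s' % (VG, ORIGIN))         # reactivate; merge completes
run('mount', '/dev/%s/%s' % (VG, ORIGIN), MOUNTPOINT)  # remount the rolled-back fs
```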
[19:10:08] as in, it was deleted before I got to it [19:10:45] There's already a -others backup running atm [19:11:01] Or did you just start it? [19:11:03] Coren: I just started it [19:11:14] Coren: I checked to see if lockdir existed, and it didn't so I started it anyway [19:11:16] omg. I think I know. [19:11:25] lulz [19:11:33] go on... [19:11:33] Dumb idiotic bug. [19:12:11] Well, apparently, even if __enter__ fails, __exit__ will be invoked. [19:12:25] Because there is no status flag or anything. [19:13:02] 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1480496 (10yuvipanda) Can this be closed then? [19:13:03] So while the script dies noting the lock dir is there, it'll remove it as it exits (and that will work because there is no script running in it) [19:13:15] The fix is trivial. [19:16:28] 6Labs: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1480517 (10yuvipanda) Backup recovery steps in process of being documented at https://wikitech.wikimedia.org/wiki/NFS_Backups [19:17:00] Coren, ping [19:17:40] Cyberpower678: Yes? [19:18:17] Coren, have you been able to allocating resources for Cyberbot? [19:18:25] *look into [19:43:37] 6Labs, 3Labs-Sprint-105, 5Patch-For-Review: Automate snapshots / backups of labstore - https://phabricator.wikimedia.org/T105027#1480637 (10yuvipanda) [19:43:38] 6Labs: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1480638 (10yuvipanda) [19:54:24] 6Labs: Investigate per-project open security group policy - https://phabricator.wikimedia.org/T104894#1480677 (10Negative24) @yuvipanda I don't really know how I would investigate the issue. The issue was fixed by explicitly opening the port so this wasn't much of a priority for me. I take a quick look in about... [20:04:23] 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1480724 (10bmansurov) 5Open>3Invalid a:3bmansurov Closing since the error didn't happen the second time I tried. [20:06:30] 6Labs: Investigate per-project open security group policy - https://phabricator.wikimedia.org/T104894#1480734 (10scfc) I'm sorry, when reading the task description I somehow missed that you explained that the behaviour changed when the security group definition was changed. Thus it cannot be related to `ferm` (... [20:22:34] YuviPanda: does the content model for Hiera: strip comments and newlines on purpose or by accident? [20:28:13] bd808: the normalization is on purpose, I think (to check that the yaml is valid), but stripping comments is a bit odd [20:28:58] *nod* it makes annotating the mess of settings I need for a trebuchet cluster hard [20:29:31] I'm also not sure where the yaml content model is defined, it doesn't seem standard? :/ [20:30:01] I pinged YuviPanda because as I recall he made it happen [20:40:58] Coren, umm... [20:40:58] you still there? :D [20:41:36] Cyberpower678: Yeah, but I didn't get any cycles to consider your thing since I've been back from Mexico. I've a bit of email and phab catchup to do, and you're in that pile. :-) [20:42:03] Heh. [20:42:09] Coren, thanks. :-) [20:56:32] bd808: accident, actually :) it's the stupid spyc. [20:56:43] bd808: switching to a sane yaml library that round trips should fix things perhaps [20:56:46] spyc? yuck [20:57:33] which extension is that baked into? [20:58:07] bd808: OpenStackManager [20:58:27] oh. 
I don't think I'm going to muck about with that one [20:59:01] you should obviously be using yaml from pecl though :) [20:59:14] bd808: indeed, unfortunately nobody wants to, particularly because there's no way to actually test it at all... [20:59:31] bd808: although, the YAML part is fairly self contained - it's just a YAML ContentHandler, with absolutely nothing fancy at all [20:59:53] is that the only reason spyc is in there? [21:00:47] bd808: yup [21:01:47] Coren: nice! others is only 20miuns [21:01:48] *mins [21:01:49] I should submit a patch to spyc to let it use yaml_parse if it's available [21:02:15] it has that for syck which is unmaintained and buggy as hell [21:02:29] YuviPanda: Yep. [21:05:17] bd808: as a workaround, you can add a 'comment key' to the actual key, e.g. "toollabs::active_proxy": tools-webproxy-02 \n "toollabs::active_proxy_comment": blah blah blah [21:05:23] ugly, but better than nothing at all [21:05:41] *nod* yeah [21:07:05] YuviPanda: alternatively, can we just kill the pre-save transform and just check for validity on save? [21:07:21] we could. we could even check for validity only on read [21:07:47] although... random crap is often still valid yaml [21:08:03] indeed, so a pre-save transform won't fix anything [21:08:09] besides, if puppet doesn't recognize it, it'll barf. [21:08:12] which is the ultimate test anyway [21:08:26] well, yeah, but we'd like to have yaml errors caught in advance [21:08:37] my yaml parser wouldn't round trip from yaml->php->yaml with comments either actually [21:08:44] buuuut [21:08:52] I can't think of any yaml parser that would [21:09:06] we can do the beautify on display instead of on the source? [21:10:20] YuviPanda: https://github.com/wikimedia/mediawiki-extensions-OpenStackManager/compare/master...valhallasw:patch-1 [21:10:48] then the comments only show up in the edit window, but that's OK I'd say, and you see a parsed tree in the display/preview [21:11:00] valhallasw`cloud: we can just not beautify at all! [21:11:16] YuviPanda: it's not about beautify, it's about checking the yaml means what you think it means [21:11:35] valhallasw`cloud: that's practically not doable in wikitech alone, no? [21:11:41] either way, I'm happy to merge your patch later today [21:12:06] not to check the meaning, but to check whether the syntax is correct [21:12:12] basically what I'd otherwise use https://yaml-online-parser.appspot.com/ for [21:19:21] YuviPanda: also it's untested, obviously :> [21:44:04] 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481075 (10Krinkle) 3NEW [21:48:16] 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481106 (10Krinkle) The `.name` column is also NULL for the same rows. [21:55:28] Coren: would you mind if I converted the script to using exception style over return style? 
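To make that question concrete, here is a minimal sketch of a lock-directory context manager in the exception style asked about above, incorporating the trivial fix for the bug from [19:12:11]: remember whether __enter__ actually acquired the lock, and only then remove it in __exit__. Class, path and function names are hypothetical; this is not the actual storage-replicate code.

```python
#!/usr/bin/env python3
# Hypothetical sketch: a lock-directory context manager in "exception style",
# with the status flag whose absence caused the bug discussed at [19:12:11].
# __exit__ only removes the lock dir if this process actually created it, so
# a second, refused invocation can no longer delete the first one's lock.
import errno
import os

class ReplicationInProgress(Exception):
    """Raised instead of returning a status code."""

class LockDir:
    def __init__(self, path):
        self.path = path
        self.acquired = False  # the missing status flag

    def __enter__(self):
        try:
            os.mkdir(self.path)  # atomic: fails if the dir already exists
        except OSError as e:
            if e.errno == errno.EEXIST:
                raise ReplicationInProgress(self.path)
            raise
        self.acquired = True
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.acquired:  # never remove a lock we did not take
            os.rmdir(self.path)
        return False       # let exceptions propagate

# Usage sketch (path and function are placeholders):
# with LockDir('/var/run/storage-replicate.tools.lock'):
#     replicate()
```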
[21:56:56] valhallasw`cloud: https://phabricator.wikimedia.org/T105721 [21:57:56] YuviPanda: I'm not entirely sure what your definition of 'service' is there [21:58:06] valhallasw`cloud: 'things labs team is responsible for' [21:58:08] hmm [21:58:18] mmm [21:58:19] 'things labs team is responsible for that can have an easily checked test written for them' :P [21:58:29] 'Tool labs' is, in some sense, a whole batch of services [21:58:35] ideally they're overlapping 100%, but unfortunately we aren't in an ideal world [21:58:44] (login hosts, SGE execution, mail delivery, ...) [21:58:47] Cyberpower678: Did you delete WikiHistory again? [21:58:48] valhallasw`cloud: yes, those already have p.catchpoint.com/ui/Entry/PD/V/A.RNP-Ov-jSUbDu8Jdg/ErLK [21:59:00] I just got back and bigbrother seems to be vomiting [21:59:47] valhallasw`cloud: some of the things from there should be moved back too, maybe [22:05:18] YuviPanda: also, what's the goal of the list? communicating what is/isn't labs' responsibility? alerting? reporting? [22:05:34] what's the goal of the reporting? [22:06:00] valhallasw`cloud: metrics. 'quarterly goal' is 99.5% uptime for all labs services [22:07:52] right, so the goal is identifying the weak points [22:08:36] valhallasw`cloud: yes, all the 'points' first and then the weak points [22:08:53] 6Labs: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1481158 (10valhallasw) [22:10:14] What about things like LDAP? are those just part of 'instance availability'? [22:11:41] valhallasw`cloud: should probably add LDAP there I think [22:12:33] 6Labs: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1481161 (10valhallasw) [22:15:40] (03Restored) 10BryanDavis: Add empty releases/id_rsa.upload [labs/private] - 10https://gerrit.wikimedia.org/r/225251 (owner: 10BryanDavis) [22:16:00] YuviPanda: can you merge https://gerrit.wikimedia.org/r/#/c/225251/ for me? [22:16:14] it's just junk to fix a prod/labs mismatch [22:16:33] YuviPanda: I guess routing issues are mostly outside of the scope, although it would be interesting to measure. Hard to do a yes/no uptime measurement for, though [22:18:02] Yes [22:18:51] And outside the labs team anyway - those are core ops networking [22:18:53] And anything affecting labs will affect prod too [22:27:05] 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481206 (10Krenair) Ew, it tries to parse InitialiseSettings: https://github.com/wikimedia/operations-software/blob/master/maintain-repl... [23:11:27] 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481393 (10scfc) (Not sure if it is already possible or worthy of a RFE task, but if there was an "API" to `InitialiseSettings.php` inde... [23:11:59] 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481397 (10Krenair) a:3Krenair [23:45:19] andrewbogott, Coren: around?