[00:05:45] <wikibugs>	 6Labs, 10Tool-Labs: can't login to tools-shadow-01 - https://phabricator.wikimedia.org/T104781#1477501 (10yuvipanda) Root key doesn't work either - I think the instance is dead. @Coren? was it DOA or is there something you did inside that needs rescuing?
[00:43:30] <wikibugs>	 10Tool-Labs-tools-meetbot: Update meetbot to not hang out in 'wikimedia-office' but instead 'wikimedia-meeting' - https://phabricator.wikimedia.org/T103404#1477726 (10Krenair)
[01:41:04] <wikibugs>	 6Labs: Investigate spikes in Labs NFS network usage - https://phabricator.wikimedia.org/T95392#1477872 (10yuvipanda) 5Open>3Invalid a:3yuvipanda Oh well.
[01:41:45] <wikibugs>	 6Labs, 6operations: lvm 'others20150715' snapshot full on labstore1001 - https://phabricator.wikimedia.org/T106601#1477876 (10yuvipanda) 5Open>3Resolved a:3yuvipanda The snapshot has been deleted by @Coren
[01:50:53] <wikibugs>	 6Labs, 10Tool-Labs, 10Wikimedia-Git-or-Gerrit: git clone operations/mediawiki-config on tool labs fail: recursion detected in die_errno handler - https://phabricator.wikimedia.org/T106393#1477887 (10zhuyifei1999) 5Open>3Resolved a:3zhuyifei1999 Retried just now. Cannot reproduce the error anymore. (weird)
[03:35:18] <wikibugs>	 6Labs, 10Tool-Labs: Reenable backups for /home and /data/project - https://phabricator.wikimedia.org/T63103#1478027 (10yuvipanda)
[03:35:19] <wikibugs>	 6Labs, 5Patch-For-Review: Replicate data between codfw and eqiad - https://phabricator.wikimedia.org/T85606#1478026 (10yuvipanda)
[03:36:59] <wikibugs>	 6Labs, 10Labs-Infrastructure: Recover /data/scratch/ content - https://phabricator.wikimedia.org/T106324#1478031 (10yuvipanda) 5Open>3declined a:3yuvipanda I think this is too late and /data/scratch was obliterated during the recovery. /data/scratch should not also contain valuble information.
[03:38:55] <wikibugs>	 6Labs, 10Labs-Vagrant: failing puppet - https://phabricator.wikimedia.org/T106442#1478043 (10yuvipanda) Seems to work fine for me?
[03:39:08] <wikibugs>	 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1478044 (10yuvipanda)
[03:39:34] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-107, 5Patch-For-Review: nfs-exports-daemon hangs, prevents new instances from accessing nfs - https://phabricator.wikimedia.org/T106076#1478045 (10yuvipanda) 5Open>3Resolved
[03:39:54] <wikibugs>	 6Labs, 10Tool-Labs, 6Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1478046 (10yuvipanda) @leila has graciously accepted to help run this one :)
[03:40:52] <wikibugs>	 6Labs, 10Labs-Infrastructure, 6Security, 10wikitech.wikimedia.org, 7Security-Other: Huge rash of bot accounts on wikitech - https://phabricator.wikimedia.org/T105350#1478050 (10yuvipanda) 5Open>3Invalid a:3yuvipanda AFAICT this was a misunderstanding, do re-open if not.
[03:42:27] <wikibugs>	 6Labs: Investigate per-project open security group policy - https://phabricator.wikimedia.org/T104894#1478053 (10yuvipanda) @Negative24 ping? Were your issues resolved? Was ferm the issue?
[03:44:42] <wikibugs>	 6Labs, 10Tool-Labs, 10Wikimania-Hackathon-2015: Workshop: Doing Research on Wikimedia things as a volunteer - tools and communities - https://phabricator.wikimedia.org/T91062#1478055 (10yuvipanda) 5Open>3Resolved
[03:46:43] <wikibugs>	 6Labs, 10Tool-Labs: role::relic - changes not applied by puppet? on which node or instance is it? - https://phabricator.wikimedia.org/T104537#1478059 (10yuvipanda) /me pokes @Coren
[03:47:06] <wikibugs>	 6Labs: Investigate why novaadmin was no longer projectadmin of the puppet3-diffs project - https://phabricator.wikimedia.org/T104440#1478060 (10yuvipanda) 5Open>3Invalid a:3yuvipanda CNR
[03:52:17] <wikibugs>	 6Labs, 10Analytics-Cluster, 10wikitech.wikimedia.org: Include role::analytics::hadoop roles in default list of labs puppet groups - https://phabricator.wikimedia.org/T70391#1478063 (10yuvipanda) 5Open>3declined I have cleaned the default roles to a minimum, and these should just be project specific roles...
[03:53:17] <wikibugs>	 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Labs-Vagrant, 10MediaWiki-Vagrant, and 2 others: Update Vagrant role for Extension:OpenStackManager - https://phabricator.wikimedia.org/T103874#1478065 (10yuvipanda) This is a lost cause, IMO. Anything we set up on MWV won't match production in any way or f...
[03:54:28] <wikibugs>	 6Labs: Re-evaluate use of NFS in WMT project - https://phabricator.wikimedia.org/T103750#1478066 (10yuvipanda) So... it currently has only /data/project afaict, should we say 'it needs that, and that alone' and close the ticket? Or can that be re-evaluated too?
[03:54:52] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Grant shell user right with project memberships and remove autocreation of shell requests - https://phabricator.wikimedia.org/T97334#1478067 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Wooo
[03:56:56] <wikibugs>	 6Labs: Find alternative solutions for video project's use of NFS - https://phabricator.wikimedia.org/T102402#1478070 (10yuvipanda) @matanya can I kill everything except /data/scratch then? Do you still want the ~2T of data in /data/project?
[03:59:04] <wikibugs>	 6Labs: Get rid of Gluster Copy and PMTPA NFS Copies from labstore1001 - https://phabricator.wikimedia.org/T102390#1478071 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This was done.
[03:59:29] <wikibugs>	 6Labs, 6operations: New instances stuck unable to run puppet (and no sshing in!) - https://phabricator.wikimedia.org/T101916#1478074 (10yuvipanda) 5Open>3Resolved a:3yuvipanda New images were built.
[04:01:22] <wikibugs>	 6Labs: Remove old backups-of-backups from NFS - https://phabricator.wikimedia.org/T99061#1478079 (10yuvipanda) 5Open>3Resolved a:3yuvipanda These were gone during the outage.
[04:02:50] <wikibugs>	 6Labs, 10Labs-Infrastructure: The Salt minion client id should be the FQDN, not ec2_instance_id - https://phabricator.wikimedia.org/T71502#1478086 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This is the case now.
[04:03:43] <wikibugs>	 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1478090 (10yuvipanda) I'm tempted to mark as declined and hope that Horizon fixes things...
[04:04:05] <wikibugs>	 6Labs, 10Labs-Infrastructure: Oddity about service groups "awb" in Tools pre and post transition - https://phabricator.wikimedia.org/T65754#1478093 (10yuvipanda) Was this done?
[04:04:38] <wikibugs>	 6Labs, 10Labs-Infrastructure: role::mediawiki-install::labs in an eqiad instance thinks to be in pmtpa - https://phabricator.wikimedia.org/T64370#1478096 (10yuvipanda) 5Open>3declined a:3yuvipanda That role is no longer available on wikitech by default, and should be killed from puppet too.
[04:05:14] <wikibugs>	 6Labs, 10Labs-Infrastructure: MediaWiki files set up by role::mediawiki-install::labs don't have proper permissions - https://phabricator.wikimedia.org/T64368#1478100 (10yuvipanda) 5Open>3declined a:3yuvipanda Using that role has been unsupported since availability of labs-vagrant, instances should proba...
[04:05:59] <wikibugs>	 6Labs, 10Labs-Infrastructure: Provide Redis feed of recent changes for Wikimedia wikis - https://phabricator.wikimedia.org/T61721#1478103 (10yuvipanda) 5Open>3declined a:3yuvipanda RCStream is probably good enough, and do not think we'll have much of a load change from providing a redis proxy.
[04:06:31] <wikibugs>	 6Labs, 10Labs-Infrastructure: provide bastion redundancy via DNS round robin - https://phabricator.wikimedia.org/T59834#1478107 (10yuvipanda) 5Open>3declined Let's not do this, this will confuse people running screen and what not. We have redundancy now by being able to switch over the IP address in case s...
[04:07:18] <wikibugs>	 6Labs, 10Labs-Infrastructure: Have shell requests marked as uncompleted or completed automatically - https://phabricator.wikimedia.org/T47456#1478110 (10yuvipanda) 5Open>3Invalid a:3yuvipanda No more shell requests!!!1
[04:08:07] <wikibugs>	 6Labs, 10Labs-Infrastructure: default labs MediaWiki config will generate https links - https://phabricator.wikimedia.org/T58389#1478114 (10yuvipanda) 5Open>3declined a:3yuvipanda Old and unmaintained puppet code, use labs-vagrant instead!
[04:11:08] <wikibugs>	 6Labs, 10Labs-Infrastructure: Add "open for all" project feature - start with bastion - https://phabricator.wikimedia.org/T46173#1478118 (10yuvipanda) 5Open>3declined a:3yuvipanda Let's not do this - adding people to a project is trivial, and we do not have a block based on the shellmanagers group anymore.
[04:12:15] <wikibugs>	 6Labs, 10Wikimedia-Labs-General, 5Patch-For-Review: /etc/mailname is set to "labs-vmbuilder-precise.eqiad.wmflabs" - https://phabricator.wikimedia.org/T66962#1478122 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Definitely not the case anymore.
[04:13:08] <wikibugs>	 6Labs, 10Wikimedia-Labs-General: Just get rid of creepy default vimrc - https://phabricator.wikimedia.org/T51339#1478126 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I see it on really old instances now, but don't think this is set by default anymore.
[04:13:40] <wikibugs>	 6Labs, 10Wikimedia-Labs-General, 7JavaScript: WMFLabs Graphite: Dashboard is empty (Uncaught exception in javascript) - https://phabricator.wikimedia.org/T73742#1478130 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I do not see those errors anymore?
[04:14:32] <wikibugs>	 6Labs, 10Wikimedia-Labs-General: Rename project bots to wm-bot - https://phabricator.wikimedia.org/T57691#1478134 (10yuvipanda) Can projects even be renamed?
[04:15:29] <wikibugs>	 6Labs, 10Wikimedia-Labs-General: Fix virt1000 OAI errors - https://phabricator.wikimedia.org/T87079#1478137 (10yuvipanda) 5Open>3Invalid a:3yuvipanda What is OAI? virt1000 is no longer alive. I assume this can be closed (re-open if this is still a problem)
[04:18:09] <wikibugs>	 6Labs: Renaming scheme for labs servers - https://phabricator.wikimedia.org/T95042#1478141 (10yuvipanda) 5Open>3Resolved This was all done anyway, so it's ok :)
[04:19:36] <wikibugs>	 6Labs, 6operations: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1478148 (10yuvipanda) 5Open>3Invalid a:3yuvipanda Marking as invalid because there's no unpuppetized (or otherwise) bond0 now.
[05:44:41] <shinken-wm>	 PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1404 is CRITICAL tools.tools-webgrid-lighttpd-1404.diskspace.root.byte_percentfree (<30.00%)
[06:27:26] <wikibugs>	 6Labs: Find alternative solutions for video project's use of NFS - https://phabricator.wikimedia.org/T102402#1478219 (10Matanya) I still need that data. until i figure out how to upload those files given commons limitation.
[06:44:45] <shinken-wm>	 RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1404 is OK All targets OK
[07:39:36] <wikibugs>	 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1478300 (10Nemo_bis) > I'm tempted to mark as declined and hope that Horizon fixes things...  Nope. Bug can still be reproduced.
[07:41:05] <wikibugs>	 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1478301 (10Nemo_bis)
[08:19:42] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 20.00% of data above the critical threshold [0.0]
[08:20:00] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 30.00% of data above the critical threshold [0.0]
[08:22:12] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 22.22% of data above the critical threshold [0.0]
[08:23:30] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 20.00% of data above the critical threshold [0.0]
[08:25:16] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL 22.22% of data above the critical threshold [0.0]
[08:25:44] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 40.00% of data above the critical threshold [0.0]
[08:25:50] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 40.00% of data above the critical threshold [0.0]
[08:25:52] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 30.00% of data above the critical threshold [0.0]
[08:26:36] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 40.00% of data above the critical threshold [0.0]
[08:27:06] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 22.22% of data above the critical threshold [0.0]
[08:29:02] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 20.00% of data above the critical threshold [0.0]
[08:29:36] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 60.00% of data above the critical threshold [0.0]
[08:29:54] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 60.00% of data above the critical threshold [0.0]
[08:29:55] <shinken-wm>	 PROBLEM - Puppet failure on tools-master is CRITICAL 30.00% of data above the critical threshold [0.0]
[08:31:38] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 50.00% of data above the critical threshold [0.0]
[08:33:36] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 50.00% of data above the critical threshold [0.0]
[08:33:42] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 20.00% of data above the critical threshold [0.0]
[08:34:24] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 44.44% of data above the critical threshold [0.0]
[08:34:28] <shinken-wm>	 PROBLEM - Puppet failure on tools-mailrelay-02 is CRITICAL 50.00% of data above the critical threshold [0.0]
[08:38:32] <shinken-wm>	 PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 40.00% of data above the critical threshold [0.0]
[08:39:28] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 44.44% of data above the critical threshold [0.0]
[08:39:57] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 40.00% of data above the critical threshold [0.0]
[08:39:57] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 60.00% of data above the critical threshold [0.0]
[08:40:56] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 20.00% of data above the critical threshold [0.0]
[08:41:10] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 33.33% of data above the critical threshold [0.0]
[08:43:40] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 30.00% of data above the critical threshold [0.0]
[08:44:31] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 22.22% of data above the critical threshold [0.0]
[08:44:41] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL 60.00% of data above the critical threshold [0.0]
[08:46:02] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 30.00% of data above the critical threshold [0.0]
[08:46:32] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 22.22% of data above the critical threshold [0.0]
[08:47:16] <shinken-wm>	 PROBLEM - Puppet failure on tools-submit is CRITICAL 44.44% of data above the critical threshold [0.0]
[08:48:24] <shinken-wm>	 PROBLEM - Puppet failure on tools-mail is CRITICAL 50.00% of data above the critical threshold [0.0]
[08:48:58] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 60.00% of data above the critical threshold [0.0]
[08:49:18] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 66.67% of data above the critical threshold [0.0]
[08:49:40] <shinken-wm>	 PROBLEM - Puppet failure on tools-shadow is CRITICAL 60.00% of data above the critical threshold [0.0]
[08:59:41] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0]
[08:59:59] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0]
[09:00:11] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL 55.56% of data above the critical threshold [0.0]
[09:00:46] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0]
[09:02:12] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0]
[09:03:34] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0]
[09:04:38] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0]
[09:04:54] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0]
[09:05:17] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1219 is OK Less than 1.00% above the threshold [0.0]
[09:05:49] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0]
[09:05:51] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0]
[09:06:36] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0]
[09:06:38] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0]
[09:07:06] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0]
[09:08:42] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0]
[09:09:02] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0]
[09:09:30] <shinken-wm>	 RECOVERY - Puppet failure on tools-mailrelay-02 is OK Less than 1.00% above the threshold [0.0]
[09:09:56] <shinken-wm>	 RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0]
[09:10:15] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-105, 3Labs-Sprint-106: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1478439 (10Magnus) Three weeks in, and still broken? Just created a new tool, do I need to use another tool's credentials yet again?
[09:13:34] <shinken-wm>	 RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0]
[09:13:42] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0]
[09:14:22] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0]
[09:14:56] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0]
[09:16:09] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0]
[09:18:36] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0]
[09:19:00] <eranroz>	 how to tell bigbrother to reread .bigbrotherrc?
[09:19:30] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0]
[09:19:42] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1214 is OK Less than 1.00% above the threshold [0.0]
[09:19:56] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0]
[09:21:00] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0]
[09:21:51] <wikibugs>	 6Labs, 10pywikibot-core: pywikipedia.org down? - https://phabricator.wikimedia.org/T106311#1478487 (10Chmarkine) >>! In T106311#1476249, @valhallasw wrote: > the question is: what should it be replaced with...?  CNAME to wikimedia.org?
[09:23:20] <shinken-wm>	 RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0]
[09:23:58] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0]
[09:24:20] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0]
[09:24:28] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0]
[09:24:40] <shinken-wm>	 RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0]
[09:26:03] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0]
[09:26:29] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0]
[09:27:15] <shinken-wm>	 RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[09:35:13] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1404 is OK Less than 1.00% above the threshold [0.0]
[10:07:32] <shinken-wm>	 PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[10:55:02] <wikibugs>	 6Labs, 10Incident-20150617-LabsNFSOutage, 3Labs-Sprint-102, 3Labs-Sprint-103, and 3 others: Audit projects' use of NFS, and remove it where not necessary - https://phabricator.wikimedia.org/T102240#1478671 (10JohnLewis)
[10:55:04] <wikibugs>	 6Labs: Re-evaluate use of NFS in WMT project - https://phabricator.wikimedia.org/T103750#1478669 (10JohnLewis) 5Open>3Resolved Yeah, currently it needs that and that alone. We'll see if we can deprecate it in future but for now, it is needed.
[12:49:32] <wikibugs>	 6Labs, 10Continuous-Integration-Infrastructure, 10Labs-Infrastructure: Increase number of Jenkins slaves to spread load and prevent browser test failures on beta - https://phabricator.wikimedia.org/T72049#1478963 (10hashar) that was a request to #labs-infrastructure to bump the project quota.
[14:18:54] <wikibugs>	 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1479124 (10bmansurov) It started working when I created a new instance. I may have chosen a different version of Ubuntu initially. I'm not sure though.
[14:32:30] <wikibugs>	 6Labs, 10Tool-Labs: Reenable backups for /home and /data/project - https://phabricator.wikimedia.org/T63103#1479142 (10scfc) 5Open>3declined (IMHO this was a duplicate of T85608 (or vice versa), so declining this as well.)
[15:00:17] <wikibugs>	 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Labs-Vagrant, 10MediaWiki-Vagrant, and 2 others: Update Vagrant role for Extension:OpenStackManager - https://phabricator.wikimedia.org/T103874#1479197 (10scfc) I don't want to update the role to "really" match production (i. e. "where you can do stuff just...
[15:12:09] <wikibugs>	 10MediaWiki-extensions-OpenStackManager: OpenStackManager: Nova resource pages are endlessly prepended with line break whitespace on each update by Labslogbot - https://phabricator.wikimedia.org/T58316#1479235 (10scfc) 5Open>3Resolved a:3scfc A cursory look at the (deleted and before that moved) page refer...
[15:15:59] <wikibugs>	 10MediaWiki-extensions-OpenStackManager: OpenStackManager: Nova resource pages are endlessly prepended with line break whitespace on each update by Labslogbot - https://phabricator.wikimedia.org/T58316#1479250 (10scfc) a:5scfc>3None
[15:23:42] <wikibugs>	 6Labs, 10Labs-Infrastructure, 7Composer, 7Upstream: Composer activity from Labs hosts can be rate limited by GitHub - https://phabricator.wikimedia.org/T106452#1479271 (10bd808) a:5bd808>3None GitHub does not have a whitelist capability for anonymous API requests. They suggest that we use authenticated...
[15:29:16] <wikibugs>	 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1479280 (10scfc)
[15:50:23] <wikibugs>	 10MediaWiki-extensions-OpenStackManager: Special:NovaInstance should restrict project filter to projects where the current user is an administrator - https://phabricator.wikimedia.org/T106820#1479311 (10scfc) 3NEW
[15:50:59] <wikibugs>	 6Labs, 10Tool-Labs: Permission issues and/or failure to load Ruby environment on trusty - https://phabricator.wikimedia.org/T106170#1479319 (10MusikAnimal) The problem does indeed come and go, and lately (past several days) I don't think I've had any issues. I have a log file for the Ruby script and I see that...
[15:51:16] <wikibugs>	 10MediaWiki-extensions-OpenStackManager: Special:NovaInstance should restrict project filter to projects where the current user is an administrator - https://phabricator.wikimedia.org/T106820#1479320 (10scfc)
[15:55:35] <wikibugs>	 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1479341 (10yuvipanda) Indeed, but nobody is planning on working on it afaict...
[17:25:28] <wikibugs>	 6Labs, 10Tool-Labs: Permission issues and/or failure to load Ruby environment on trusty - https://phabricator.wikimedia.org/T106170#1479921 (10scfc) @notconfusing's question was related to inter-project data transfer, AFAIUI.  There is certainly the possibility of an issue with (lack of) NFS synchronicity (CMI...
[17:31:25] <wikibugs>	 6Labs, 10Tool-Labs, 6Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1479950 (10aripstra) Adding this to the Design research work board for awareness. Please reach out if you need feedback on your survey or anything.
[17:31:40] <wikibugs>	 6Labs, 10Tool-Labs, 6Learning-and-Evaluation, 6WMF-Design-Research: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1479955 (10aripstra)
[17:40:46] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-105, 3Labs-Sprint-106: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1480035 (10yuvipanda) (did a manual run just now, should have it for new tools atm).  Disruption due to wikimania + current priority being getting cross DC backups of...
[17:46:42] <wikibugs>	 6Labs, 10Labs-Infrastructure: "The specified resource does not exist" when you try to configure an instance and are not a projectadmin - https://phabricator.wikimedia.org/T67379#1480096 (10scfc) p:5Normal>3Lowest Then let's adjust the priority to reflect that.  But the bug is quite clearly still there.
[17:53:31] <YuviPanda>	 Coren: should we stop the current runs and run them via systemd instead?
[17:54:40] <Coren>	 YuviPanda: Hm.  If you feel it really necessary - otherwise I'd be happy to start them after that run is done.  They're not so long anymore; last time they took <24h
[17:54:52] <YuviPanda>	 Coren: yeah but that's saturday :D
[17:55:19] <Coren>	 YuviPanda: True, but that's why - I'd rather be around start-to-finish when we let them loose automatically first.
[17:55:33] <YuviPanda>	 Coren: we aren't running them automatically
[17:55:33] <YuviPanda>	 Coren: these don't restart at all.
[17:55:39] <YuviPanda>	 Coren: they just run and then when done... stop.
[17:55:41] <Coren>	 Hm.  True.
[17:55:46] <YuviPanda>	 I have a Restart=No in there
[17:55:56] <YuviPanda>	 Coren: the only thing this gives us is 1. validation that the scripts are ok, 2. logging properly
[17:56:00] <Coren>	 Oh hah - others is actually already done.
[17:56:11] <YuviPanda>	 ncie
[17:56:20] <YuviPanda>	 I'm still hacking around puppet
[17:56:41] <Coren>	 Ah, cool, and everything worked fine at the remote end too - the temporary snapshot got discarded once the rsync completed.
[17:56:42] <wikibugs>	 6Labs, 10Tool-Labs: Permission issues and/or failure to load Ruby environment on trusty - https://phabricator.wikimedia.org/T106170#1480192 (10scfc) @MusikAnimal, sorry, I posted my reply without checking if comments had come in since I opened the browser's tab.  I just looked at `qacct -j exec -o tools.musikb...
[17:57:47] <Coren>	 YuviPanda: Means you can test the systemd thing with others at no risk
[17:57:55] <YuviPanda>	 Coren: cool.
[17:58:08] <Coren>	 And it should complete really fast too.
[17:58:10] <YuviPanda>	 Coren: so do we still have any deletion steps left? 
[17:58:17] <YuviPanda>	 Coren: I guess we need to remove local snapshot when done?
[17:58:53] <Coren>	 YuviPanda: We'll want to clean up the local snapshots at interval.  At least any time there is less than 6T available in the vg (so there is room for another backup)
[17:59:14] <YuviPanda>	 Coren: why 6T?
[17:59:30] <Coren>	 YuviPanda: Room enough for 3 new snapshots.
[17:59:43] <YuviPanda>	 Coren: are we doing all snapshots at 2T?
[18:00:42] <Coren>	 Ah, no, 1T.  Dunno why I was thinking 2T
[18:00:46] <YuviPanda>	 Coren: we should also do alerts for when snapshots are going to get full
[18:00:49] <Coren>	 So 3T being the low water mark.
[18:00:52] <YuviPanda>	 the others one got full the other day
[18:01:15] <Coren>	 YuviPanda: Yeah, they have limited lifetime.  I don't think it's worth alerting - just discard the ones getting full.
[18:01:34] <YuviPanda>	 Coren: also how did you get around the ssh key problem?
[18:03:16] <Coren>	 YuviPanda: I added the full set of signatures to ~/.ssh/known_hosts so that it doesn't get trampled over by puppet with ssh-keyscan.  It's not clear how to automate that though, short of adding more key types to the global known hosts (which seems like a bad idea to me)
[18:03:45] <wikibugs>	 6Labs, 10Labs-Infrastructure: Transition service groups to new globally unique names and UIDs - https://phabricator.wikimedia.org/T60997#1480233 (10scfc)
[18:03:46] <wikibugs>	 6Labs, 10Labs-Infrastructure: Oddity about service groups "awb" in Tools pre and post transition - https://phabricator.wikimedia.org/T65754#1480230 (10scfc) 5Open>3Invalid a:3scfc No (AFAIK), but if I look now at https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup, the "tools.awb" and "tools.loc...
[18:03:57] <YuviPanda>	 Coren: ugh, please let's not do manual hacks without filing bugs and being loud at them. broken windows and all that - this is how we ended up at our current situation, let's not go back towards that again.
[18:04:06] <Coren>	 I'll work on a script that (a) discards snapshots getting too full and (b) discards the oldest remaining snapshots until at least 3T is left.
[18:04:39] <Coren>	 YuviPanda: Hm, yes.  You're right of course - this requires a bug being filed.
[18:05:26] <YuviPanda>	 Coren: yes, and I feel very, very strongly about manual hacks, esp. on the NFS systems :) So let's not do that at all, no matter how insignificant without proper documentation elsewhere.
[18:09:35] <YuviPanda>	 Coren: so... ul 24 18:09:17 labstore1002 storage-replicate[10803]: CRITICAL:root:unable to create local snapshot (labstore-others20150724): Logical volume "others20150724" already exists in volume group "labstore"
[18:09:45] <YuviPanda>	 Coren: I think we should add more granular timestamp to that.
[18:10:31] <Coren>	 Hm, yes - the original intent was daily so that was okay but if you want to do it more frequently then it needs at least HHMM
[18:11:02] <YuviPanda>	 Coren: let's go full hog and do HHMMSS. am uploading patch now
[18:30:10] <wikibugs>	 6Labs, 3Labs-Sprint-105: Do a manual backup of labstore1002 - https://phabricator.wikimedia.org/T104882#1480318 (10yuvipanda) 5Open>3Invalid Yup
[18:39:53] <YuviPanda>	 Coren: hmm, killing the process leaves the lockfile as is...
[18:39:53] <YuviPanda>	 Jul 24 18:39:27 labstore1002 storage-replicate[13903]: WARNING:root:Skipping replication; already in progress since 2015-07-24% H%:M:53
[18:39:59] <YuviPanda>	 even thought he process isn't alive.
[18:40:18] <Coren>	 YuviPanda: Hm.  How did you kill it?
[18:40:26] <YuviPanda>	 Coren: service stop.
[18:40:29] <YuviPanda>	 maybe that's why
[18:40:55] <Coren>	 YuviPanda: Ah.  Hm.
[18:41:26] <Coren>	 YuviPanda: I'm pretty sure we want an aborted rsync to require manual intervention as a rule, really - because the destination is possibly inconsistent.
[18:41:39] <YuviPanda>	 Coren: aren't we creating new snapshots in the dest as well?
[18:41:48] <YuviPanda>	 Coren: so it shouldn't be inconsistent, right? esp. with second level granularity now
[18:42:29] <Coren>	 YuviPanda: The /snapshot/ is known to be consistent - which is why there is the need for a manual intervention (or at least a human making a call)
[18:42:43] <Coren>	 YuviPanda: After an aborted backup there are two choices:
[18:43:31] <Coren>	 YuviPanda: Either you drop the "live" fs and make the snapshot the canonical fs - returning to a known consistent state - or you restart the copy over the partial previous one.
[18:43:51] <Coren>	 YuviPanda: In theory, the former is the safest option.
[18:43:59] <YuviPanda>	 what do you mean by 'live' fs?
[18:44:37] <Coren>	 YuviPanda: the destination has a - say - 'tools' lv.  That's the target of the rsync.  Before the rsync starts, it creates a snapshot.
[18:45:01] <Coren>	 YuviPanda: After the rsync completes successfuly, the snapshot is just dropped since the destination 'tools' is now known to be equal to the source.
[18:45:35] <Coren>	 YuviPanda: An aborted copy has the destination with 'tools' (inconsistent) and 'toolsYYMMDDHHSS' which is the snapshot pre-rsync
[18:45:47] <YuviPanda>	 aha! I see.
[18:45:57] <YuviPanda>	 but this is an rsync right, so it theoretically should b ok with option 2
[18:46:57] <Coren>	 YuviPanda: It *should* be, yes, and it would be 99% of the time at least.  But we don't know _what_ caused the rsync to fail by that time and if it's something like out-of-space or an I/O error it wouldn't be.
[18:47:14] <Coren>	 YuviPanda: Certainly, option 2 works if you wilfully stopped a backup that had no issues.
[18:47:29] <YuviPanda>	 Coren: oh I agree it should require manual intervention. I just want to automate 1. alerting and 2. bringing it back up.
[18:47:30] <Coren>	 YuviPanda: But making that can, IMO, needs a human.
[18:47:43] * Coren nods.
[18:47:49] <YuviPanda>	 so I should be able to run a simple script that starts it back up after checking, and have a flowchart I can follow
[18:47:59] <YuviPanda>	 (1) is simple once these are continuous - alert if the process isn't running.
[18:48:28] <Coren>	 We agree then.  Alerting is simple - the presence of the lock dir at the beginning of a run is always an error state since we know we run it in a loop.
[18:49:50] <Coren>	 Cleanup requires (a) delete one of the two fs at destination (b) unmount the source snapshot if applicable (c) rmdir the lock directory.
[18:50:23] <Coren>	 (a), for option 2, is just "lvremove the readonly snapshot"
[18:52:55] <YuviPanda>	 Coren: hmm, so I see 
[18:52:58] <YuviPanda>	   others               backup owi-aos---  5.00t                                                    
[18:53:01] <YuviPanda>	   others20150724183453 backup swi-a-s---  1.00t      others 0.00                                   
[18:53:17] <YuviPanda>	 Coren: I can't actually drop others, can I? doesn't the snapshot rely on others being there?
[18:53:20] <YuviPanda>	 sorry, still being n00by
[18:53:28] <YuviPanda>	 I'm trying to understand option 1 (remove live FS)
[18:56:35] <Coren>	 You can, but it needs more steps.  You have to umount and deactivate the origin, then merge the snapshot back into the origin.
[18:56:42] <Coren>	 (The remount it)
[18:57:44] <Coren>	 So strictly speaking, you aren't removing the origin I guess, you're rolling it back.  In practice, the result is the same though.
[18:58:04] <Coren>	 You're keeping the old extents and dropping the new ones.
[18:59:58] <YuviPanda>	 Coren: hmm, (2) sounds simpler... :D
[19:00:44] <Coren>	 Heh.  It is.  :-)  But having (1) possible means we can recover from a broken copy.
[19:01:00] <YuviPanda>	 Coren: so I can just rm the lock directory on source and start again? that will mean: 1. snapshot is still present on dest, can recover if needed 2. no need to delete anything.
[19:01:37] <Coren>	 YuviPanda: Hmmm.  Yes, but keep in mind the destination has very little elbow room for extra snapshots.
[19:02:02] <Coren>	 YuviPanda: So you'd want to delete the extra snapshot asap after the copy.
[19:02:19] <YuviPanda>	 Coren: right. in this case I feel ok deleting the extra snapshot before the copy as well, but let's see.
[19:02:42] <YuviPanda>	 Coren: also - if there's no room for extra snapshots, only those new snapshots would fail, right? won't affect the existing ones?
[19:02:58] <Coren>	 Right.
[19:04:25] <YuviPanda>	 Coren: https://wikitech.wikimedia.org/wiki/NFS_Backups is that accurate?
[19:05:28] <Coren>	 Added a steup
[19:06:10] <Coren>	 Also, are you the one who mounted tools20150715 in /tmp?
[19:06:54] <YuviPanda>	 Coren: ah, yes. recovered some files for Cyberpower678 
[19:06:57] <YuviPanda>	 can be unmounted now
[19:07:01] <Coren>	 kk
[19:07:30] <YuviPanda>	 Coren: ok, verifying those steps now
[19:09:39] <YuviPanda>	 Coren: hmm, the lockfile was gone...
[19:10:00] <Coren>	 ... gone?
[19:10:08] <YuviPanda>	 as in, it was deleted before I got to it
[19:10:45] <Coren>	 There's already a -others backup running atm
[19:11:01] <Coren>	 Or did you just start it?
[19:11:03] <YuviPanda>	 Coren: I just started it
[19:11:14] <YuviPanda>	 Coren: I checked to see if lockdir existed, and it didn't so I started it anyway
[19:11:16] <Coren>	 omg.  I think I know.
[19:11:25] <Coren>	 lulz
[19:11:33] <YuviPanda>	 go on...
[19:11:33] <Coren>	 Dumb idiotic bug.
[19:12:11] <Coren>	 Well, apparently, even if __enter__ fails, __exit__ will be invoked.
[19:12:25] <Coren>	 Because there is no status flag or anything.
[19:13:02] <wikibugs>	 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1480496 (10yuvipanda) Can this be closed then?
[19:13:03] <Coren>	 So while the script dies noting the lock dir is there, it'll remove it as it exits (and that will work because there is no script running in it)
[19:13:15] <Coren>	 The fix is trivial.
[19:16:28] <wikibugs>	 6Labs: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1480517 (10yuvipanda) Backup recovery steps in process of being documented at https://wikitech.wikimedia.org/wiki/NFS_Backups
[19:17:00] <Cyberpower678>	 Coren, ping
[19:17:40] <Coren>	 Cyberpower678: Yes?
[19:18:17] <Cyberpower678>	 Coren, have you been able to allocating resources for Cyberbot?
[19:18:25] <Cyberpower678>	 *look into
[19:43:37] <wikibugs>	 6Labs, 3Labs-Sprint-105, 5Patch-For-Review: Automate snapshots / backups of labstore - https://phabricator.wikimedia.org/T105027#1480637 (10yuvipanda)
[19:43:38] <wikibugs>	 6Labs: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1480638 (10yuvipanda)
[19:54:24] <wikibugs>	 6Labs: Investigate per-project open security group policy - https://phabricator.wikimedia.org/T104894#1480677 (10Negative24) @yuvipanda I don't really know how I would investigate the issue. The issue was fixed by explicitly opening the port so this wasn't much of a priority for me. I take a quick look in about...
[20:04:23] <wikibugs>	 6Labs, 10Labs-Vagrant: failing puppet on instance testing-instructions - https://phabricator.wikimedia.org/T106442#1480724 (10bmansurov) 5Open>3Invalid a:3bmansurov Closing since the error didn't happen the second time I tried.
[20:06:30] <wikibugs>	 6Labs: Investigate per-project open security group policy - https://phabricator.wikimedia.org/T104894#1480734 (10scfc) I'm sorry, when reading the task description I somehow missed that you explained that the behaviour changed when the security group definition was changed.  Thus it cannot be related to `ferm` (...
[20:22:34] <bd808>	 YuviPanda: does the content model for Hiera:<project> strip comments and newlines on purpose or by accident?
[20:28:13] <valhallasw`cloud>	 bd808: the normalization is on purpose, I think (to check that the yaml is valid), but stripping comments is a bit odd
[20:28:58] <bd808>	 *nod* it makes annotating the mess of settings I need for a trebuchet cluster hard
[20:29:31] <valhallasw`cloud>	 I'm also not sure where the yaml content model is defined, it doesn't seem standard? :/
[20:30:01] <bd808>	 I pinged YuviPanda because as I recall he made it happen
[20:40:58] <Cyberpower678>	 Coren, umm...
[20:40:58] <Cyberpower678>	 you still there? :D
[20:41:36] <Coren>	 Cyberpower678: Yeah, but I didn't get any cycles to consider your thing since I've been back from Mexico.  I've a bit of email and phab catchup to do, and you're in that pile.  :-)
[20:42:03] <Cyberpower678>	 Heh.
[20:42:09] <Cyberpower678>	 Coren, thanks. :-)
[20:56:32] <YuviPanda>	 bd808: accident, actually :) it's the stupid spyc.
[20:56:43] <YuviPanda>	 bd808: switching to a sane yaml library that round trips should fix things perhaps
[20:56:46] <bd808>	 spyc? yuck
[20:57:33] <bd808>	 which extension is that baked into?
[20:58:07] <YuviPanda>	 bd808: OpenStackManager
[20:58:27] <bd808>	 oh. I don't think I'm going to muck about with that one
[20:59:01] <bd808>	 you should obviously be using yaml from pecl though :)
[20:59:14] <YuviPanda>	 bd808: indeed, unfortunately nobody wants to, particularly because there's no way to actually test it at all...
[20:59:31] <YuviPanda>	 bd808: although, the YAML part is fairly self contained - it's just a YAML ContentHandler, with absolutely nothing fancy at all
[20:59:53] <bd808>	 is that the only reason spyc is in there?
[21:00:47] <YuviPanda>	 bd808: yup
[21:01:47] <YuviPanda>	 Coren: nice! others is only 20miuns
[21:01:48] <YuviPanda>	 *mins
[21:01:49] <bd808>	 I should submit a patch to spyc to let it use yaml_parse if it's available
[21:02:15] <bd808>	 it has that for syck which is unmaintained and buggy as hell
[21:02:29] <Coren>	 YuviPanda: Yep.
[21:05:17] <valhallasw`cloud>	 bd808: as a workaround, you can add a 'comment key' to the actual key, e.g.   "toollabs::active_proxy": tools-webproxy-02 \n "toollabs::active_proxy_comment": blah blah blah
[21:05:23] <valhallasw`cloud>	 ugly, but better than nothing as all
[21:05:41] <bd808>	 *nod* yeah
[21:07:05] <valhallasw`cloud>	 YuviPanda: alternatively, can we just kill the pre-save transform and just check for validity on save?
[21:07:21] <YuviPanda>	 we could. we could even check for validity only on read
[21:07:47] <valhallasw`cloud>	 although... random crap is often still valid yaml
[21:08:03] <YuviPanda>	 indeed, so a pre-save transform won't fix anything
[21:08:09] <YuviPanda>	 besdies, if puppet doesn't recognize it it'll barf.
[21:08:12] <YuviPanda>	 which is the ultimate test anyway
[21:08:26] <valhallasw`cloud>	 well, yeah, but we'd like to have yaml errors caught in advance
[21:08:37] <bd808>	 my yaml parser wouldn't round trip from yaml->php->yaml with comment either actually
[21:08:44] <valhallasw`cloud>	 buuuut
[21:08:52] <bd808>	 I can't think of any yaml parser that would
[21:09:06] <valhallasw`cloud>	 we can do the beautify on display instead of on the source?
[21:10:20] <valhallasw`cloud>	 YuviPanda: https://github.com/wikimedia/mediawiki-extensions-OpenStackManager/compare/master...valhallasw:patch-1
[21:10:48] <valhallasw`cloud>	 then the comments only show up in the edit window, but that's OK I'd say, and you see a parsed tree in the display/preview
[21:11:00] <YuviPanda>	 valhallasw`cloud: we can just not beautify at all!
[21:11:16] <valhallasw`cloud>	 YuviPanda: it's not about beautify, it's about checkign the yaml means what you think it means
[21:11:35] <YuviPanda>	 valhallasw`cloud: that's practically not doable in wikitech alone, no?
[21:11:41] <YuviPanda>	 either way, I'm happy to merge your patch later today
[21:12:06] <valhallasw`cloud>	 not to check the meaning, but to check whether the syntax is correct
[21:12:12] <valhallasw`cloud>	 basically what I'd otherwise use https://yaml-online-parser.appspot.com/ for
[21:19:21] <valhallasw`cloud>	 YuviPanda: also it's untested, obviously :>
[21:44:04] <wikibugs>	 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481075 (10Krinkle) 3NEW
[21:48:16] <wikibugs>	 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481106 (10Krinkle) The `.name` column is also NULL for the same rows.
[21:55:28] <YuviPanda>	 Coren: would you mind if I converted the script to using exception style over return style?
[21:56:56] <YuviPanda>	 valhallasw`cloud: https://phabricator.wikimedia.org/T105721
[21:57:56] <valhallasw`cloud>	 YuviPanda: I'm not entirely sure what your definition of 'service' is there
[21:58:06] <YuviPanda>	 valhallasw`cloud: 'things labs team is responsible for'
[21:58:08] <YuviPanda>	 hmm
[21:58:18] <valhallasw`cloud>	 mmm
[21:58:19] <YuviPanda>	 'things labs team is responsible for that can have an easily checked test written for them' :P
[21:58:29] <valhallasw`cloud>	 'Tool labs' is, in some sense, a whole batch of services
[21:58:35] <YuviPanda>	 ideally they're overlapping 100%, but unfortunately we aren't in an ideal world
[21:58:44] <valhallasw`cloud>	 (login hosts, SGE execution, mail delivery, ...)
[21:58:47] <SigmaWP>	 Cyberpower678: Did you delete WikiHistory again
[21:58:48] <YuviPanda>	 valhallasw`cloud: yes, those already have p.catchpoint.com/ui/Entry/PD/V/A.RNP-Ov-jSUbDu8Jdg/ErLK
[21:59:00] <SigmaWP>	 I just got back and bigbrother seems to be vomiting
[21:59:47] <YuviPanda>	 valhallasw`cloud: some of the things from there should be moved back too, maybe
[22:05:18] <valhallasw`cloud>	 YuviPanda: also, what's the goal of the list? communication what is/isn't labs' responsibility? alerting? reporting?
[22:05:34] <valhallasw`cloud>	 what's the goal of the reporting?
[22:06:00] <YuviPanda>	 valhallasw`cloud: metrics. 'quarterly goal' is 99.5% uptime for all labs services
[22:07:52] <valhallasw`cloud>	 right, so the goal is identifying the weak points
[22:08:36] <YuviPanda>	 valhallasw`cloud: yes, all the 'points' first and then the weak points
[22:08:53] <wikibugs>	 6Labs: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1481158 (10valhallasw)
[22:10:14] <valhallasw`cloud>	 What about things like LDAP? are those just part of 'instance availability'?
[22:11:41] <YuviPanda>	 valhallasw`cloud: should probably add LDAP there I tihnk
[22:12:33] <wikibugs>	 6Labs: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1481161 (10valhallasw)
[22:15:40] <grrrit-wm>	 (03Restored) 10BryanDavis: Add empty releases/id_rsa.upload [labs/private] - 10https://gerrit.wikimedia.org/r/225251 (owner: 10BryanDavis)
[22:16:00] <bd808>	 YuviPanda: can you merge https://gerrit.wikimedia.org/r/#/c/225251/ for me?
[22:16:14] <bd808>	 it's just junk to fix a prod/labs mismatch
[22:16:33] <valhallasw`cloud>	 YuviPanda: I guess routing issues are mostly outside of the scope, although it would be interesting to measure. Hard to do a yes/no uptime measurement for, though
[22:18:02] <YuviPanda>	 Yes 
[22:18:51] <YuviPanda>	 And outside the labs team anyway - those are core ops networking 
[22:18:53] <YuviPanda>	 And anything affecting labs will affect prod too
[22:27:05] <wikibugs>	 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481206 (10Krenair) Ew, it tries to parse InitialiseSettings: https://github.com/wikimedia/operations-software/blob/master/maintain-repl...
[23:11:27] <wikibugs>	 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481393 (10scfc) (Not sure if it is already possible or worthy of a RFE task, but if there was an "API" to `InitialiseSettings.php` inde...
[23:11:59] <wikibugs>	 6Labs, 10Tool-Labs, 10Labs-Infrastructure, 7Regression: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1481397 (10Krenair) a:3Krenair
[23:45:19] <Krenair>	 andrewbogott, Coren: around?