[00:09:45] <wikibugs_>	 10MediaWiki-extensions-OpenStackManager, 10Wikimedia-log-errors: PHP Fatal Error: Call to a member function doLogout() on a non-object - https://phabricator.wikimedia.org/T168750#3389273 (10demon) p:05Triage>03Normal The failure in `getUnscopedToken()` in obvious: ``` lang=php, name=nova/OpenStackNovaContr...
[00:32:01] <legoktm>	 Krinkle, twentyafterfour: all channel configs should auto-deploy within a few minutes
[01:02:02] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1434 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:29:05] <wikibugs_>	 10Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Restrict creating service groups to white-listed projects - https://phabricator.wikimedia.org/T158328#3389422 (10bd808)
[01:29:06] <wikibugs_>	 10Labs, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs: The future of service groups and service users on Labs - https://phabricator.wikimedia.org/T162945#3389420 (10bd808)
[02:07:02] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1434 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:33:03] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1434 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:08:01] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1434 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:35:07] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[04:00:07] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:38] <Esther>	 Hmmmmmmmmm. https://toolsadmin.wikimedia.org/ doesn't require 2FA.
[04:23:41] <Esther>	 I got in. \o/
[06:34:01] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1434 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[06:40:56] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1420 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:42:16] <shinken-wm>	 PROBLEM - Puppet errors on tools-puppetmaster-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[07:09:03] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1434 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:15:57] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1420 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:17:17] <shinken-wm>	 RECOVERY - Puppet errors on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:36:13] <wikibugs_>	 10Tool-Labs-tools-Other: MediaWiki instance on Tool Labs used as a spam relay - https://phabricator.wikimedia.org/T169040#3389639 (10Magnus) 05Open>03Resolved Spam deleted, database locked. The "actual" tool at https://tools.wmflabs.org/comprende/ still works in read-only mode.
[08:18:35] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:53:36] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:59:58] <wikibugs_>	 10Tool-Labs-tools-Other: CropTool also for .tif(f) image files? - https://phabricator.wikimedia.org/T169183#3390026 (10MarcoAurelio) CropTool issue tracker can be found at https://github.com/danmichaelo/croptool -- maybe @danmichaelo can port this report there?
[10:08:50] <wikibugs_>	 10Labs, 10DBA, 10User-Urbanecm: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193#3390099 (10Urbanecm)
[10:10:31] <wikibugs_>	 10Labs, 10DBA, 10User-Urbanecm: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193#3390114 (10Marostegui) As the database isn't created in production yet, please let us know when it is done, so we can sanitize it on sanitarium hosts and labs. And create the views.  ``` ro...
[12:21:50] <wikibugs_>	 (03Draft1) 10Paladox: Disable men-check for puppet-paladox3 [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362182
[12:22:05] <wikibugs_>	 (03PS2) 10Paladox: Disable men-check for puppet-paladox3 [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362182
[12:22:08] <wikibugs_>	 (03CR) 10Paladox: [V: 032 C: 032] Disable men-check for puppet-paladox3 [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362182 (owner: 10Paladox)
[12:44:28] <wikibugs_>	 10Labs, 10Labs-Infrastructure: Restarting tools after NFS issues - https://phabricator.wikimedia.org/T169210#3390570 (10Magnus)
[14:08:17] <shinken-wm>	 PROBLEM - Puppet errors on tools-puppetmaster-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:18:17] <shinken-wm>	 RECOVERY - Puppet errors on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:19:35] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[14:54:35] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:02:11] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3365588 (10Halfak) Weird.  We have been getting restarts when there's a labs-wide kernel update.  What makes you think we aren't getting updates and what should we...
[15:09:07] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3391158 (10Paladox) @halfak they update the labs machine that hosts the vm. So the machines got updated but not the vms.
[15:20:12] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:22:27] <shinken-wm>	 PROBLEM - High iowait on tools-grid-master is CRITICAL: CRITICAL: tools.tools-grid-master.cpu.total.iowait (>28.57%)
[15:23:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1008 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:23:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1429 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0]
[15:24:25] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:24:43] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:25:01] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.005 second response time
[15:29:01] <bd808>	 shinken complaints about tools hosts and services are expected as the NFS server reboots happen
[15:29:33] <paladox>	 hi, it seems sudo is failing for me. When i do sudo it freezes.
[15:30:36] <bd808>	 paladox: when do run which command via sudo?
[15:30:42] <bd808>	 and on what host(s)?
[15:30:45] <paladox>	 sudo su
[15:30:47] <paladox>	 all hosts
[15:31:08] <bd808>	 paladox: specifically what project?
[15:31:11] <paladox>	 git
[15:31:24] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3391332 (10RobH)
[15:31:27] <chasemp>	 bd808: so http://wikitech.wikimedia.org/ seems to be having issues...cannot possbly be related to our stuff
[15:31:32] <chasemp>	 ..or so I would think
[15:31:34] <chasemp>	 can you confirm?
[15:31:43] <paladox>	 host is puppet-paladox3
[15:31:57] <bd808>	 chasemp: I'm starting my review meeting with Victoria right now
[15:32:04] <paladox>	 phabricator project also
[15:32:07] <chasemp>	 bd808: k
[15:33:56] <paladox>	 i got warnnings like:
[15:33:56] <paladox>	 PROBLEM - puppet on Puppet-Paladox is UNKNOWN: <Timeout exceeded.><Terminated by signal 9 (Killed).>
[15:34:06] <paladox>	 which got me to ssh and sudo su freezes.
[15:34:13] <paladox>	 happends with the ores project too
[15:36:32] <Betacommand>	 chasemp: eta for reboots to finish?
[15:38:36] <paladox>	 im guessing the wikitech problem is affecting sudo?
[15:41:24] <madhuvishy>	 there's some unrelated ldap issues ongoing
[15:44:43] <bd808>	 chasemp: wikitech works well with curl form silver. issue must be upstream somewhere
[15:44:49] <bd808>	 curl -v https://wikitech.wikimedia.org/wiki/Main_Page --resolve 'wikitech.wikimedia.org:443:127.0.0.1'
[15:45:53] <chasemp>	 bd808: ldap was down for awhile in eqiad and may still be killing keystone and killing wikitch I think
[15:47:25] <shinken-wm>	 RECOVERY - High iowait on tools-grid-master is OK: OK: All targets OK
[15:48:02] <paladox>	 wikitech is back
[15:48:06] <paladox>	 and sudo works now :)
[15:48:21] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1009 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:48:51] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1433 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [0.0]
[15:48:59] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:49:19] <shinken-wm>	 PROBLEM - Puppet errors on tools-package-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:49:25] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1005 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [0.0]
[15:49:45] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1006 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:49:49] <shinken-wm>	 PROBLEM - Puppet errors on tools-static-11 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:49:53] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1023 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:20] <shinken-wm>	 PROBLEM - Puppet errors on tools-services-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:38] <shinken-wm>	 PROBLEM - Puppet errors on tools-static-10 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:40] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:42] <shinken-wm>	 PROBLEM - Puppet errors on tools-paws-worker-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:42] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1425 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:48] <shinken-wm>	 PROBLEM - Puppet errors on tools-docker-registry-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:50:58] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:51:04] <shinken-wm>	 PROBLEM - Puppet errors on tools-paws-master-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:51:05] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:51:08] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:51:12] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1017 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:51:12] <shinken-wm>	 PROBLEM - Puppet errors on tools-k8s-master-01 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[15:51:15] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1021 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [0.0]
[15:51:47] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:52:11] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1015 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:52:13] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1424 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[15:52:25] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1411 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:52:27] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1002 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[15:52:27] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1439 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:52:31] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1413 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[15:52:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:52:39] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:52:49] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[15:54:20] <andrewbogott>	 We just suffered an ldap outage (not just a cloud thing, wmf-wide) which caused many many things to break.  For the most part things should be in the process of recovering and coming back online.  Best to give it 10-15 minutes before reporting new issues :)
[15:56:59] <addshore>	 oooh, apparently I was not in here
[15:57:59] <chasemp>	 !log tools reboot tools-docker-registery-01 for nfs 
[15:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:02:13] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1424 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:23] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:29] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:35] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:52] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:42] <madhuvishy>	 !log tools reboot tools-worker-1022 tools-worker-1009
[16:04:44] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:05:20] <shinken-wm>	 RECOVERY - Puppet errors on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:05:42] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1425 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:06:48] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:07:27] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1439 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:10:45] <chasemp>	 !log tools tools-flannel-etcd-01:~$ sudo service etcd restart
[16:10:47] <shinken-wm>	 RECOVERY - Puppet errors on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:22:42] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:27:20] <chasemp>	 !log tools restart k8s components on master (madhu)
[16:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:27:24] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:27:54] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3391527 (10Halfak) I'm surprised to find out that we aren't getting regular updates on these vms.  Why has that happened and how do we change it?
[16:29:01] <wikibugs_>	 10Labs, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Unable to add service group to service groups - https://phabricator.wikimedia.org/T128400#3391534 (10bd808)
[16:29:05] <wikibugs_>	 10Striker, 10Epic, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#3391533 (10bd808)
[16:32:34] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3391538 (10Paladox) The only way to change that is to run a cron script that does apt-get update and then apt-get upgrade -y
[16:32:51] <halfak>	 o/ 
[16:33:15] <halfak>	 I've just learned that labs/cloud VMs don't install any OS updates by default. 
[16:33:18] <halfak>	 Is that right?  
[16:33:38] <chasemp>	 halfak: outage in progress so we are all caught up
[16:33:46] <chasemp>	 all of opsen pretty much even
[16:33:51] <halfak>	 gotcha. 
[16:33:55] * halfak holds question
[16:33:57] <halfak>	 godspeed. 
[16:35:06] <paladox>	 halfak you can do a cronscript that does it :)
[16:35:28] <halfak>	 paladox, roger that.  Was just hoping that maybe cloud folks had some preferred method.  
[16:35:34] * halfak likes standards and recommendations :) 
[16:35:47] <paladox>	 ok :)
[16:38:22] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:38:50] <bd808>	 halfak: short answer is that it is up you as the owner of the VM. we push major security patches but not other things generally.
[16:39:18] <halfak>	 bd808, gotcha.  So there's no recommended pattern.  What do you think of the cron job strategy?
[16:39:19] <bd808>	 maybe worth a phab task to discuss what 'best practices' might be
[16:39:27] <halfak>	 cool
[16:39:28] <halfak>	 will make
[16:40:03] <halfak>	 bd808, what are you calling these VMs these days?  <something> vps <something>?
[16:40:16] <halfak>	 I like "cloud nodules"
[16:40:18] <halfak>	 ;0 
[16:40:23] <halfak>	 *;) 
[16:40:34] <bd808>	 the new name for the whole pile is 'Wikimedia VPS'
[16:40:41] <halfak>	 Cool will use that in the task
[16:42:34] <paravoid>	 wasn't supposed to be Wikimedia Cloud VPS after all?
[16:42:39] <paravoid>	 due to trademark reasons or something?
[16:46:29] <wikibugs_>	 10cloud-services-team, 10Scoring-platform-team: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391616 (10Halfak)
[16:46:55] <wikibugs_>	 10cloud-services-team, 10Scoring-platform-team: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391605 (10Halfak)
[16:47:10] <paravoid>	 there used to be unattended-upgrades installed in labs VMs in the past, not sure what happened with that
[16:47:32] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3365588 (10Halfak) I've talked to @bd808 in #wikimedia-cloud and he doesn't think there's a best practice for this yet so I created {T169247} and set that as a blo...
[16:48:19] <wikibugs_>	 10Labs, 10cloud-services-team, 10Scoring-platform-team: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391628 (10Paladox)
[16:48:36] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10ORES, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3391629 (10Halfak)
[16:48:56] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10cloud-services-team, 10Scoring-platform-team, 10Documentation: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391630 (10Halfak)
[16:49:00] <paravoid>	 in any case, it's not as simple as just running a cronjob or a puppet role
[16:49:10] <paravoid>	 in many cases you have to restart services or even reboot the box
[16:49:12] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10cloud-services-team, 10Scoring-platform-team, 10Documentation: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391605 (10Halfak) I'd like to see a puppet class that is enabled by default that sets up a...
[16:50:32] <paravoid>	 having something that installs packages automatically isn't a bad idea (at least for non-production VMs, as it can cause issues too), but don't assume this makes you secure
[16:52:00] <paladox>	 i always install updates :). I have a check that notifys me wwhen there are updates. (which it does everyday) it can even install them without me needing to do that but i need to manually verify that the update wont break anything.
[16:52:25] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10cloud-services-team, 10Scoring-platform-team, 10Documentation: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391637 (10Halfak) Looks like the cron strategy is recommended practice. https://help.ubuntu...
[16:52:52] <paravoid>	 halfak: ^^
[16:53:54] <halfak>	 paravoid, +1 thanks.  For sure not secure, but *better* for sure. 
[16:54:19] <halfak>	 In when it comes to labs and not having direct ops support for the VMs, I need cron-ops to do its best ;) 
[16:55:03] <andrewbogott>	 we do use unattended upgrades on labs VMs.  Only for things that don't require reboot though.
[16:55:15] <halfak>	 andrewbogott, oh!  good to know. 
[16:55:32] <paladox>	 "unattended upgrades" when do those run?
[16:55:52] <paladox>	 I am always getting updates and having to run the updates (includes stuff that does not need restarting)
[16:55:52] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10cloud-services-team, 10Scoring-platform-team, 10Documentation: Document recommended process for installing OS upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247#3391605 (10faidon) Labs used to have unattended-upgrades install fleet-wide, not sure what h...
[16:56:46] <paravoid>	 yeah, unsurprisingly cron-ops on a virtualized shared infrastructure doesn't offer much security :P
[17:00:21] <paravoid>	 ftr, as Bryan knows, I'd really prefer if WMCS called VPS "Wikimedia Cloud VPS" which stays in the cloud namespace
[17:00:53] <paravoid>	 I left a comment in the consultation (and suggested "Wikimedia Cloud Machines" as an alternative), but it was disregarded
[17:01:14] <halfak>	 I'm happy with "cloud" names. :)  
[17:01:23] <halfak>	 WMCVPS
[17:01:26] <paravoid>	 yeah, it's much more consistent/unambiguous IMHO
[17:01:28] <halfak>	 OMGWTFBBQ
[17:03:16] <paravoid>	 the page renames happened even before the consultation ended, it didn't feel like it was much up for discussion :(
[17:12:57] <madhuvishy>	 !log drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
[17:12:57] <stashbot>	 madhuvishy: Unknown project "drain"
[17:13:03] <madhuvishy>	 !log tools drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
[17:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:16:11] <shinken-wm>	 RECOVERY - Puppet errors on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:18:39] <shinken-wm>	 PROBLEM - SSH on tools-worker-1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:20:10] <madhuvishy>	 !log tools drain cordon reboot uncordon tools-worker-1012 tools-worker-1003
[17:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:20:15] <andrewbogott>	 !log tools rebooting tools-static-10
[17:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:21:08] <shinken-wm>	 PROBLEM - Host tools-static-10 is DOWN: CRITICAL - Host Unreachable (10.68.22.238)
[17:22:08] <andrewbogott>	 !log rebooting tools-static-11
[17:22:09] <stashbot>	 andrewbogott: Unknown project "rebooting"
[17:22:39] <bd808>	 !log tools rebooting tools-static-11
[17:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:23:32] <shinken-wm>	 RECOVERY - SSH on tools-worker-1003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[17:23:56] <shinken-wm>	 RECOVERY - Host tools-static-10 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[17:28:27] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1439 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:28:59] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:29:29] <shinken-wm>	 PROBLEM - SSH on tools-worker-1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:31:03] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3391861 (10Papaul)
[17:31:03] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:31:35] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3391862 (10Papaul)
[17:33:33] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3391868 (10Papaul)
[17:34:16] <shinken-wm>	 RECOVERY - SSH on tools-worker-1007 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[17:34:48] <shinken-wm>	 RECOVERY - Puppet errors on tools-static-11 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:35:20] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3391869 (10Papaul)
[17:35:38] <shinken-wm>	 RECOVERY - Puppet errors on tools-static-10 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:37:07] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10Papaul) Port information   eth0   ge-8/0/0 eth1   ge-8/0/3
[17:37:17] <madhuvishy>	 !log tools drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
[17:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:39:19] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3391879 (10Papaul) Port information   ge-1/0/13
[17:39:24] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:40:40] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:40:54] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3391882 (10Papaul) port information   ge-1/0/17
[17:42:09] <wikibugs_>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3391883 (10Papaul) port information   ge-1/0/13
[17:48:38] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1008 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:48:58] <madhuvishy>	 andrewbogott: sync up now?
[17:49:07] <andrewbogott>	 aaaalmost
[17:50:49] <TabbyCat>	 bd808: that wikitech troll... it may be wise to CU the accounts and set rangeblocks in place, just sayin'
[17:52:40] <bd808>	 TabbyCat: probably not a horrible idea. I don't have CU (and don't really want it)
[17:53:16] <TabbyCat>	 bd808: I can understand, I don't have CU there either (I do on other wmf wikis).
[17:53:40] <TabbyCat>	 blacklisting with antispoof the names he targets would be a good idea, just pinged you over there
[17:54:47] <bd808>	 they are very clueful about how on-wiki procedures work. willing to age accounts and using many tricks to make the harassing names
[17:55:22] <bd808>	 its perpetual whack-a-mole. Nothing that is going to automatically keep them out
[17:55:53] <TabbyCat>	 well, a cell with a lock and no internet access could do it
[17:56:04] <harej>	 (is this a specific person known to us?)
[17:56:33] <TabbyCat>	 banned dewiki troll most probably
[17:56:54] <TabbyCat>	 since he targets mainly dewiki users and one in particular
[17:57:10] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:00:36] <madhuvishy>	 !log tools drain cordon reboot uncordon tools-worker-1015
[18:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:01:51] <Freddy2001>	 I have a problem with webproxies. Anyone here, who can help me setting up a wiki?
[18:03:28] <bd808>	 Freddy2001: on a Labs instance or somewhere else?
[18:03:29] <madhuvishy>	 Freddy2001: hi! we have been dealing with some issues all morning, and people are in meetings now/taking a break. Make a ticket if you don't hear back in a bit :)
[18:05:20] <Freddy2001>	 yes on labs. it is not a issue related to the reboots.
[18:06:45] <bd808>	 Freddy2001: ok. what sort of problem are you having?
[18:07:56] <shinken-wm>	 PROBLEM - SSH on tools-worker-1017 is CRITICAL: Connection refused
[18:07:59] <bd808>	 one thing that often trips people up is not having the right ports open in the security groups applied to the instance
[18:08:01] <Freddy2001>	 how can i set up a wiki on a labs instance, which just works with one webproxy-subdomain? I would use apache vhosts, but now all domains point to /var/www/
[18:08:29] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1439 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:08:57] <bd808>	 Freddy2001: one way to do it is with https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs
[18:09:35] <Freddy2001>	 i do not want vagrant, because i need to run another webservice with apache on port 80
[18:09:43] <SMalyshev>	 are nfs reboots for labs done?
[18:10:19] <bd808>	 Freddy2001: mw-vagrant doesn't take over port 80, it uses 8080 so that still may work for you
[18:11:06] <bd808>	 SMalyshev: yes. we are still keeping an eye on a few lingering issues, but the main reboots are done for this morning
[18:11:09] <Freddy2001>	 i just want webproxy 1 pointing to /var/www/html und webproxy 2 pointing to /var/www/wiki
[18:11:36] <Freddy2001>	 both running on apache
[18:11:40] <SMalyshev>	 bd808: ok, I was going to run a long reload, so don't want NFS to go away from under it :) I guess it's ok now to start it
[18:11:43] <bd808>	 Freddy2001: ok. so you will need 2 apache vhosts locally
[18:11:49] <Freddy2001>	 yes
[18:12:29] <bd808>	 you could either run them on different ports or use the hostname passed in via the proxy to pick the right one
[18:12:30] <Freddy2001>	 but what exactly do i need in the config? the usual configuation of vhosts does not work for me here…
[18:12:44] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1422 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:12:56] <shinken-wm>	 RECOVERY - SSH on tools-worker-1017 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[18:14:28] <bd808>	 Freddy2001: the easiest thing to do would be to add a "ServerName name_of_proxy.wmflabs.org" to each vhost I think
[18:14:41] <bd808>	 what is your usual config?
[18:14:56] <Freddy2001>	 yes i have tried this already but it does not work
[18:15:17] <bd808>	 https://httpd.apache.org/docs/2.4/vhosts/name-based.html
[18:15:30] <Freddy2001>	 https://httpd.apache.org/docs/2.4/vhosts/examples.html ← thats what i tried
[18:15:30] <bd808>	 Freddy2001: can you share the config that isn't working for you?
[18:16:11] <valhallasw`cloud>	 Freddy2001: what host are we talking about?
[18:18:24] <Freddy2001>	 https://pastebin.com/0PpRgH4M
[18:19:47] <bd808>	 Freddy2001: try using "<VirtualHost *:80>" for both blocks
[18:20:29] <bd808>	 that's the binding on the local vm. those hostnames are going to resolve to ips that are not on your machine at all
[18:21:13] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:21:43] <bd808>	 that matches the https://httpd.apache.org/docs/2.4/vhosts/examples.html#purename method
[18:22:13] <bd808>	 apache will look at the Host header in the request and use it to chose the correct vhost
[18:22:47] <bd808>	 it does that by matching the Host header against the ServerName directive(s) in each vhost
[18:27:41] <Freddy2001>	 no, i still face the same problem
[18:28:25] <bd808>	 Freddy2001: what is the outcome? All requests end up going to the same vhost or something else?
[18:28:39] <Freddy2001>	 https://pastebin.com/7na4yCvL ← thats my config now
[18:29:07] <Freddy2001>	 but all requests point to /var/www without any subfolders
[18:29:34] <bd808>	 so neither vhost there is working
[18:29:53] <bd808>	 I think the ":80" n the ServerName may be causing problems
[18:30:14] <Freddy2001>	 without i get exactly the same
[18:30:46] <bd808>	 Freddy2001: can I log into the VM and look around?
[18:31:14] <Freddy2001>	 whats your shellname?
[18:31:33] <bd808>	 bd808. I'm labsroot so I just need to know where to look
[18:32:20] <Freddy2001>	 hence i do not need to grant you anything on horizon? 
[18:32:41] <bd808>	 nope. just tell me the VMs name
[18:32:47] <Freddy2001>	 have a look at the apache config of the instance vrt, please
[18:33:03] <bd808>	 cool. let me take a look
[18:34:09] <madhuvishy>	 !log tools drain cordon reboot uncordon tools-worker-1018 tools-worker-1023
[18:34:24] <madhuvishy>	 !log tools reboot tools-cron-01
[18:34:49] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1020 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:37:04] <bd808>	 Freddy2001: I may have fixed it. Take a look
[18:37:35] <bd808>	 All I did was remove the ":80" parts and then restart apache to read the changed config
[18:37:52] <Freddy2001>	 great! thank you very much
[18:37:57] <bd808>	 yw
[18:38:07] <madhuvishy>	 !log tools reboot tools-exec-1414
[18:38:15] <Freddy2001>	 that is exactly what i want to have
[18:38:30] <bd808>	 :) now you can get to the real work
[18:38:32] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:38:48] <Freddy2001>	 i tried removing the :80, but maybe i did sth wrong?
[18:39:01] <madhuvishy>	 !log tools reboot tools-exec-1413
[18:39:32] <bd808>	 Freddy2001: did you remember to run `service apache2 restart` after changing the config? That's the only thing I can think of
[18:40:00] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1012 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[18:40:32] <madhuvishy>	 !log tools reboot tools-exec-1419
[18:40:34] <Freddy2001>	 yes i did. but did you edited sites-available or sites-enabled?
[18:41:13] <valhallasw`cloud>	 the entries in sites-enabled should be symlinks to files in sites-available
[18:41:15] <shinken-wm>	 PROBLEM - High iowait on tools-webgrid-lighttpd-1419 is CRITICAL: CRITICAL: tools.tools-webgrid-lighttpd-1419.cpu.total.iowait (>33.33%)
[18:41:39] <Freddy2001>	 w
[18:41:49] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1025 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:41:58] <bd808>	 I think I was in sites-enabled, but I already closed the window so I don't have scrollback to check
[18:42:11] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1011 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:42:16] <Freddy2001>	 here can check if they are symlinks?
[18:42:16] <bd808>	 but ideally they are symlinks as valhallasw`cloud says
[18:42:37] <bd808>	 ls -l should show you
[18:43:16] <shinken-wm>	 PROBLEM - High iowait on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: tools.tools-webgrid-lighttpd-1403.cpu.total.iowait (>33.33%)
[18:43:23] <Freddy2001>	 yes it shows me a symlink
[18:45:42] <shinken-wm>	 RECOVERY - Puppet errors on tools-paws-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:46:50] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1026 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:47:04] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:48:27] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1002 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:49:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1008 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:49:51] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:49:54] <shinken-wm>	 PROBLEM - Puppet errors on tools-docker-registry-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:50:26] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:50:42] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1027 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:51:02] <shinken-wm>	 RECOVERY - Puppet errors on tools-paws-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:52:44] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1422 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:53:16] <shinken-wm>	 RECOVERY - High iowait on tools-webgrid-lighttpd-1403 is OK: OK: All targets OK
[18:55:26] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:57:13] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1021 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:57:51] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:01:57] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1016 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:02:11] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:02:11] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1017 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:03:13] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1015 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:03:18] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[19:04:16] <shinken-wm>	 RECOVERY - Puppet errors on tools-package-builder-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:04:18] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1009 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:04:49] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1441 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[19:05:24] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1019 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[19:05:31] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:06:26] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1414 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:06:28] <shinken-wm>	 PROBLEM - Puppet errors on tools-checker-02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[19:07:01] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.019 second response time
[19:08:35] <shinken-wm>	 PROBLEM - Puppet errors on tools-checker-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[19:08:45] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1422 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:10:32] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1401 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[19:12:02] <shinken-wm>	 PROBLEM - Puppet errors on tools-paws-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:15:55] <shinken-wm>	 PROBLEM - Host tools-worker-1005 is DOWN: CRITICAL - Host Unreachable (10.68.23.47)
[19:16:16] <shinken-wm>	 RECOVERY - High iowait on tools-webgrid-lighttpd-1419 is OK: OK: All targets OK
[19:17:14] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:18:08] <shinken-wm>	 PROBLEM - Host tools-worker-1004 is DOWN: CRITICAL - Host Unreachable (10.68.22.78)
[19:18:30] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:19:59] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:20:31] <shinken-wm>	 RECOVERY - Host tools-worker-1005 is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms
[19:20:43] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1006 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[19:21:51] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1026 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:23:25] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:25:03] <shinken-wm>	 PROBLEM - Host tools-worker-1007 is DOWN: CRITICAL - Host Unreachable (10.68.23.53)
[19:25:23] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:25:23] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:25:41] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1027 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:26:13] <shinken-wm>	 PROBLEM - Host tools-worker-1010 is DOWN: CRITICAL - Host Unreachable (10.68.20.94)
[19:27:02] <shinken-wm>	 RECOVERY - Puppet errors on tools-paws-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:27:02] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:28:18] <shinken-wm>	 PROBLEM - Host tools-worker-1015 is DOWN: CRITICAL - Host Unreachable (10.68.18.101)
[19:28:48] <shinken-wm>	 PROBLEM - Host tools-worker-1014 is DOWN: CRITICAL - Host Unreachable (10.68.21.26)
[19:29:18] <andrewbogott>	 I'm kicking shinken again
[19:32:14] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:32:18] <shinken-wm>	 PROBLEM - Host tools-worker-1017 is DOWN: CRITICAL - Host Unreachable (10.68.22.118)
[19:32:50] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:33:02] <shinken-wm>	 RECOVERY - Host tools-worker-1015 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[19:33:24] <bd808>	 !log tools.zoomviewer Stopping webservice. We are having NFS load problems and this tool seems to be contributing dramatically on tools-webgrid-lighttpd-1422
[19:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zoomviewer/SAL
[19:33:50] <shinken-wm>	 PROBLEM - Host tools-worker-1020 is DOWN: CRITICAL - Host Unreachable (10.68.17.223)
[19:34:20] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:35:27] <Matthew_>	 Tool Labs is appearing to be very fliky right now - tools are intermittently timing out.  Known iue?
[19:35:29] <Matthew_>	 *issue
[19:35:40] <Reedy>	 yes
[19:35:47] <Matthew_>	 Okay, thank you :)
[19:35:48] <Reedy>	 NFS issues
[19:35:52] <Matthew_>	 Ah.
[19:36:27] <shinken-wm>	 RECOVERY - Host tools-worker-1004 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms
[19:37:19] <shinken-wm>	 PROBLEM - Host tools-worker-1025 is DOWN: CRITICAL - Host Unreachable (10.68.22.147)
[19:37:26] <bd808>	 TL;DR is that we rebooted to get a kernel security update and may be having NFS server problems as a result using the new kernel
[19:37:59] <bd808>	 We are slowly shutting things down to try and get the load to level off on the NFS server side
[19:38:12] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:38:14] <Matthew_>	 Okay.  
[19:38:26] <bd808>	 its a not fun day :/
[19:38:39] <Matthew_>	 Understandable.  Thank you for looking into it though.
[19:39:48] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1441 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:40:02] <shinken-wm>	 RECOVERY - Host tools-worker-1025 is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms
[19:40:22] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:40:30] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:40:44] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:41:26] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:41:26] <shinken-wm>	 RECOVERY - Puppet errors on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:43:37] <shinken-wm>	 RECOVERY - Puppet errors on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:45:03] <shinken-wm>	 RECOVERY - Host tools-worker-1007 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms
[19:45:31] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:46:51] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1025 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:48:10] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:49:58] <shinken-wm>	 RECOVERY - Host tools-worker-1014 is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[19:50:40] <shinken-wm>	 RECOVERY - Host tools-worker-1016 is UP: PING OK - Packet loss = 0%, RTA = 4.59 ms
[19:50:43] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:50:57] <shinken-wm>	 RECOVERY - Host tools-worker-1018 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[19:51:01] <shinken-wm>	 RECOVERY - Host tools-worker-1017 is UP: PING OK - Packet loss = 0%, RTA = 3.20 ms
[19:51:23] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:55:42] <madhuvishy>	 !log tools Killed liangent-php jobs and usrd-tools jobs
[19:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:56:58] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:57:12] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:03:18] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:09:59] <shinken-wm>	 RECOVERY - Host tools-worker-1020 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms
[20:11:09] <wikibugs_>	 10Labs, 10Tool-Labs: 502s and kubernetes-based tool labs service not restarting - https://phabricator.wikimedia.org/T169263#3392379 (10mahmoud)
[20:13:46] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1422 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:16:16] <wikibugs_>	 10Labs, 10Tool-Labs: 502s and kubernetes-based tool labs service not restarting - https://phabricator.wikimedia.org/T169263#3392459 (10Matthewrbowker) I can tell you the IRC issue: there's been a rebranding.  New channel is #wikimedia-cloud.  @bd808 informed me they were looking into it.  Don't know if there's...
[20:19:19] <wikibugs_>	 10Labs, 10Tool-Labs: 502s and kubernetes-based tool labs service not restarting - https://phabricator.wikimedia.org/T169263#3392464 (10bd808) The Kubernetes cluster is not allowing new job submissions globally right now. We are in the midst of a NFS related event following today's planned server reboot for ker...
[20:19:49] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:21:38] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:22:36] <wikibugs_>	 (03Draft1) 10Paladox: Testing mutante php 7 apache change [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362281
[20:22:38] <wikibugs_>	 (03PS2) 10Paladox: Testing mutante php 7 apache change [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362281
[20:24:07] <wikibugs_>	 (03CR) 10Paladox: [V: 032 C: 032] Testing mutante php 7 apache change [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362281 (owner: 10Paladox)
[20:25:59] <wikibugs_>	 (03PS1) 10Paladox: Revert "Testing mutante php 7 apache change" [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362291
[20:26:03] <wikibugs_>	 (03CR) 10Paladox: [V: 032 C: 032] Revert "Testing mutante php 7 apache change" [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362291 (owner: 10Paladox)
[20:26:55] <wikibugs_>	 (03Draft1) 10Paladox: Retrying mutante php 7.0 change part 2 [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362292
[20:26:57] <wikibugs_>	 (03PS2) 10Paladox: Retrying mutante php 7.0 change part 2 [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362292
[20:26:59] <wikibugs_>	 (03CR) 10Paladox: [V: 032 C: 032] Retrying mutante php 7.0 change part 2 [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362292 (owner: 10Paladox)
[20:27:21] <mhashemirc>	 heyyy rebranded irc how we doin, kubernetes seems sad, but i'm sure you're all on top of it :)
[20:27:49] <mhashemirc>	 will i need to manually restart my service once the NFS/kubernetes issue passes?
[20:28:09] <wikibugs_>	 10Labs, 10Tool-Labs: Homedir for user whym is very large (>60G) - https://phabricator.wikimedia.org/T169265#3392568 (10bd808)
[20:28:43] <wikibugs_>	 (03Draft1) 10Paladox: Fix class [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362293
[20:28:47] <wikibugs_>	 (03PS2) 10Paladox: Fix class [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362293
[20:29:12] <wikibugs_>	 (03CR) 10Paladox: [V: 032 C: 032] Fix class [labs/icinga2] - 10https://gerrit.wikimedia.org/r/362293 (owner: 10Paladox)
[20:31:26] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:33:21] <andrewbogott>	 !log tools depooling, rebooting, and repooling every lighttpd node three at a time
[20:41:37] <bd808>	 mhashemirc: If its running via `webservice` I think it will start up automatically once we allow new Kubernetes pods to be created.
[20:42:04] <bd808>	 We will send a follow up to https://lists.wikimedia.org/pipermail/labs-announce/2017-June/000243.html when things are in better shape, so you might want to check it then
[20:44:07] <wikibugs_>	 10Labs, 10Tool-Labs: 502s and kubernetes-based tool labs service not restarting - https://phabricator.wikimedia.org/T169263#3392709 (10bd808) We finally got an announcement of the ongoing issue up on the mailing lists: https://lists.wikimedia.org/pipermail/labs-announce/2017-June/000243.html  WE will follow up...
[20:50:10] <madhuvishy>	 !log tools deppoling, rebooting and repooling all grid exec nodes
[20:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:59:50] <gwicke>	 hello cloud people; I would like to verify that HHVM request timeouts have indeed been fixed, and am considering using deployment-prep instances for this
[21:00:36] <gwicke>	 any hints on  the magic incantation to curl a PHP script temporarily dropped into the local MW checkout?
[21:02:59] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1016 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:03:31] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:03:37] <gwicke>	 nm, I think I have it figured out
[21:03:53] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1419 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:04:40] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1429 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:10:49] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1441 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:15:10] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:16:26] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:16:40] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1027 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[21:18:11] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:18:59] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1419 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:19:25] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[21:19:59] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.008 second response time
[21:33:22] <Niharika>	 madhuvishy: Scripted meeting?
[21:33:57] <madhuvishy>	 Niharika: ugh sorry dealing with outage all day - can you tell Angel I can't make it
[21:34:08] <Niharika>	 No worries. 
[21:38:00] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:53:02] <shinken-wm>	 RECOVERY - Puppet errors on tools-paws-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:54:24] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:56:25] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:56:41] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1027 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:03:36] <legoktm>	 bad wikibugs
[22:05:41] <bd808>	 legoktm: the job grid and k8s clusters are still a bit sad
[22:06:03] <legoktm>	 there was a ghost process running so it was scraping phab twice
[22:06:24] <legoktm>	 and I think the gerrit part is dead entirely
[22:06:34] <legoktm>	 bd808: is all the reboots done for now?
[22:07:42] <bd808>	 legoktm: unsure. teh load average on the NFS server is bouncing around wildly and we are still trying to figure out if it is safe to walk away for the night
[22:08:11] <legoktm>	 ok :/
[22:08:29] <bd808>	 like spikes up to 100+ and the historic load was always less than 3
[22:09:30] <bd808>	 in the last hour its been a high as 107 and as low as 3 :/
[22:12:06] <wikibugs>	 (03CR) 10Legoktm: "test" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/362003 (owner: 10Gilles)
[22:13:55] <legoktm>	 bd808: in hopefully good news, I updated one of my tools to not clone all MW extensions + skins + libraries to NFS
[22:14:07] <bd808>	 \o/
[22:14:16] <bd808>	 less NFS use is always better
[22:32:22] <mutante>	 i am currently seeing a "Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)" on an instance (VPS) that worked fine yesterday, but it seems it's just this one and there is other stuff going on, so i'll just delete it and use a new one
[23:00:32] <andrewbogott>	 mutante: we can't think of a reason why that would relate to the current problems… did you build a new one and did it come up ok?
[23:00:58] <andrewbogott>	 and if not ,is the old one still there?
[23:01:14] <madhuvishy>	 !log tools Uncordoned all k8s-workers
[23:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:10:46] <mutante>	 andrewbogott: i deleted, made a new one, it doesn't have the problem, it's ok
[23:10:53] <andrewbogott>	 mutante: ok then :)
[23:11:04] <mutante>	 andrewbogott: yes, i also don't know why, but ok :)
[23:26:12] <wikibugs>	 10Labs: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket" - https://phabricator.wikimedia.org/T169281#3393352 (10madhuvishy)
[23:40:51] <wikibugs>	 10Labs, 10Tool-Labs: Homedir for user cosmiclattes is very large (>60G) - https://phabricator.wikimedia.org/T169283#3393452 (10bd808)