[00:07:30] 10PAWS, 06Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2651046 (10leila) Sounds good, @DarTar. Closing it. [00:07:40] 10PAWS, 06Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2651047 (10leila) 05Open>03Resolved [01:06:02] Hm.. cvn instances have been puppet-failing for the last 20 days. https://tools.wmflabs.org/nagf/?project=cvn&nagf-range=month#h_overview_puppetagent [01:08:39] why are they failing? [01:11:42] 10PAWS, 06Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2651177 (10Legoktm) 05Resolved>03declined [02:20:38] PROBLEM - Puppet staleness on tools-exec-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0] [07:17:01] 06Labs, 10Labs-Infrastructure, 10DBA, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2651434 (10Marostegui) This happened again and I have updated the bug report with Percona and MariaDB with the following information - gdb stacktraces - show engi... [07:18:04] 06Labs, 10Labs-Infrastructure, 10DBA, 07Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2651436 (10Marostegui) [07:45:52] PROBLEM - Puppet staleness on tools-k8s-master-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:17:26] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0] [10:45:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [11:20:52] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:28] 06Labs, 10Continuous-Integration-Infrastructure, 06Operations, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2651917 (10hashar) I have refreshed the package on https://people.wikimedia.org/~hashar/debs/nodepool_0.1.1-wmf5/ .... [12:57:50] 06Labs, 10Labs-Infrastructure, 10DBA, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2651977 (10Marostegui) MariaDB just answered: // Thanks. Lets wait and see what Percona guys come up with, TokuDB is their area of expertise. If they claim it's... [13:28:26] PROBLEM - SSH on tools-checker-02 is CRITICAL: Server answer [13:33:28] RECOVERY - SSH on tools-checker-02 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0) [13:33:49] PROBLEM - High iowait on tools-webgrid-lighttpd-1205 is CRITICAL [13:33:50] PROBLEM - High iowait on tools-exec-1215 is CRITICAL [13:33:51] PROBLEM - High iowait on tools-webgrid-lighttpd-1406 is CRITICAL [13:33:52] PROBLEM - High iowait on tools-exec-1217 is CRITICAL [13:33:52] PROBLEM - High iowait on tools-webgrid-lighttpd-1204 is CRITICAL [13:33:54] PROBLEM - High iowait on tools-flannel-etcd-01 is CRITICAL [13:33:54] PROBLEM - High iowait on tools-k8s-etcd-03 is CRITICAL [13:33:56] PROBLEM - High iowait on tools-worker-1019 is CRITICAL [13:33:58] PROBLEM - High iowait on tools-exec-1206 is CRITICAL [13:34:02] PROBLEM - High iowait on tools-precise-dev is CRITICAL [13:34:06] PROBLEM - High iowait on tools-webgrid-lighttpd-1207 is CRITICAL [13:34:08] PROBLEM - High iowait on tools-webgrid-lighttpd-1407 is CRITICAL [13:34:10] PROBLEM - High iowait on tools-webgrid-lighttpd-1208 is CRITICAL [13:34:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL [13:34:16] PROBLEM - High iowait on tools-worker-1006 is CRITICAL [13:34:16] PROBLEM - High iowait on tools-proxy-01 is CRITICAL [13:34:17] PROBLEM - High iowait on tools-webgrid-lighttpd-1401 is CRITICAL [13:34:18] PROBLEM - High iowait on tools-exec-1408 is CRITICAL [13:34:26] PROBLEM - High iowait on tools-webgrid-lighttpd-1402 is CRITICAL [13:34:26] PROBLEM - High iowait on tools-webgrid-lighttpd-1413 is CRITICAL [13:34:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL [13:34:32] PROBLEM - High iowait on tools-webgrid-lighttpd-1206 is CRITICAL [13:34:34] PROBLEM - High iowait on tools-webgrid-generic-1403 is CRITICAL [13:34:36] PROBLEM - High iowait on tools-worker-1020 is CRITICAL [13:34:38] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1407 is CRITICAL [13:34:44] PROBLEM - High iowait on tools-webgrid-lighttpd-1404 is CRITICAL [13:34:46] PROBLEM - High iowait on tools-webgrid-lighttpd-1414 is CRITICAL [13:34:48] PROBLEM - High iowait on tools-flannel-etcd-02 is CRITICAL [13:34:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL [13:34:49] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL [13:34:49] PROBLEM - Puppet staleness on tools-logs-02 is CRITICAL [13:34:50] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1401 is CRITICAL [13:34:51] PROBLEM - High iowait on tools-webgrid-generic-1401 is CRITICAL [13:34:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL [13:34:52] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1413 is CRITICAL [13:34:53] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL [13:34:54] PROBLEM - Puppet staleness on tools-elastic-02 is CRITICAL [13:34:54] PROBLEM - Free space - all mounts on tools-worker-1017 is CRITICAL [13:34:55] PROBLEM - Free space - all mounts on tools-checker-01 is CRITICAL [13:34:56] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1408 is CRITICAL [13:34:57] PROBLEM - Free space - all mounts on tools-k8s-master-02 is CRITICAL [13:34:57] PROBLEM - High iowait on tools-exec-1406 is CRITICAL [13:34:58] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1203 is CRITICAL [13:34:59] PROBLEM - Puppet staleness on tools-webgrid-generic-1402 is CRITICAL [13:35:00] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1408 is CRITICAL [13:35:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL [13:35:01] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL [13:35:03] PROBLEM - Puppet staleness on tools-exec-1408 is CRITICAL [13:35:04] PROBLEM - Puppet staleness on tools-worker-1004 is CRITICAL [13:35:04] PROBLEM - Free space - all mounts on tools-worker-1020 is CRITICAL [13:35:05] PROBLEM - Free space - all mounts on tools-worker-1009 is CRITICAL [13:35:06] PROBLEM - High iowait on tools-exec-1202 is CRITICAL [13:35:07] PROBLEM - Puppet run on tools-checker-01 is CRITICAL [13:35:07] PROBLEM - High iowait on tools-worker-1008 is CRITICAL [13:35:08] PROBLEM - Puppet staleness on tools-webgrid-generic-1403 is CRITICAL [13:35:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL [13:35:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL [13:35:10] PROBLEM - Puppet staleness on tools-redis-1001 is CRITICAL [13:36:31] PROBLEM - High iowait on tools-worker-1005 is CRITICAL [13:36:31] PROBLEM - High iowait on tools-exec-1210 is CRITICAL [13:36:33] PROBLEM - High iowait on tools-exec-1219 is CRITICAL [13:36:37] PROBLEM - High iowait on tools-redis-1001 is CRITICAL [13:36:39] PROBLEM - Puppet run on tools-exec-cyberbot is CRITICAL [13:36:39] PROBLEM - High iowait on tools-exec-1407 is CRITICAL [13:36:41] PROBLEM - High iowait on tools-webgrid-lighttpd-1409 is CRITICAL [13:36:43] PROBLEM - High iowait on tools-docker-registry-01 is CRITICAL [13:36:44] PROBLEM - High iowait on tools-worker-1014 is CRITICAL [13:36:44] PROBLEM - High iowait on tools-exec-1214 is CRITICAL [13:36:44] PROBLEM - High iowait on tools-bastion-03 is CRITICAL [13:36:45] PROBLEM - High iowait on tools-k8s-master-02 is CRITICAL [13:36:45] PROBLEM - High iowait on tools-bastion-02 is CRITICAL [13:36:48] PROBLEM - High iowait on tools-services-01 is CRITICAL [13:36:48] PROBLEM - High iowait on tools-webgrid-lighttpd-1405 is CRITICAL [13:36:52] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL [13:36:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL [13:36:52] PROBLEM - Puppet staleness on tools-puppetmaster-02 is CRITICAL [13:36:53] PROBLEM - High iowait on tools-worker-1018 is CRITICAL [13:36:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL [13:36:53] PROBLEM - Puppet run on tools-elastic-03 is CRITICAL [13:36:54] PROBLEM - High iowait on tools-exec-1410 is CRITICAL [13:36:55] PROBLEM - Free space - all mounts on tools-exec-1202 is CRITICAL [13:36:56] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1404 is CRITICAL [13:36:57] PROBLEM - Free space - all mounts on tools-k8s-etcd-02 is CRITICAL [13:36:57] PROBLEM - Puppet staleness on tools-exec-1405 is CRITICAL [13:36:59] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL [13:37:00] PROBLEM - Puppet staleness on tools-services-01 is CRITICAL [13:37:00] PROBLEM - High iowait on tools-exec-1211 is CRITICAL [13:37:02] PROBLEM - Puppet staleness on tools-exec-1203 is CRITICAL [13:37:02] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1405 is CRITICAL [13:37:04] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1410 is CRITICAL [13:37:05] PROBLEM - High iowait on tools-worker-1011 is CRITICAL [13:37:05] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL [13:37:05] PROBLEM - High iowait on tools-worker-1013 is CRITICAL [13:37:07] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL [13:37:08] PROBLEM - Free space - all mounts on tools-exec-1205 is CRITICAL [13:37:11] PROBLEM - Free space - all mounts on tools-static-10 is CRITICAL [13:37:11] PROBLEM - High iowait on tools-elastic-02 is CRITICAL [13:37:12] PROBLEM - Free space - all mounts on tools-grid-shadow is CRITICAL [13:37:13] PROBLEM - Puppet run on tools-prometheus-02 is CRITICAL [13:37:14] PROBLEM - Puppet staleness on tools-exec-1216 is CRITICAL [13:37:14] PROBLEM - High iowait on tools-elastic-01 is CRITICAL [13:37:15] PROBLEM - Free space - all mounts on tools-bastion-03 is CRITICAL [13:37:15] PROBLEM - Free space - all mounts on tools-worker-1014 is CRITICAL [13:37:16] PROBLEM - Puppet run on tools-grid-master is CRITICAL [13:37:16] PROBLEM - Puppet staleness on tools-proxy-02 is CRITICAL [13:37:50] Lord. [13:38:10] yeah idk what's up here [13:38:28] If it comes back on I'll quiet it. [13:38:41] got the same in -releng for deployment-prep. So I guess shinken exploded [13:39:18] I tried sshing to the last couple of those hosts, they seem to work [13:39:24] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL [13:39:24] PROBLEM - Puppet run on tools-k8s-master-02 is CRITICAL [13:39:24] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1403 is CRITICAL [13:39:24] I wonder why shinken thinks they're broken [13:39:25] PROBLEM - Free space - all mounts on tools-worker-1007 is CRITICAL [13:39:26] PROBLEM - Free space - all mounts on tools-elastic-02 is CRITICAL [13:39:26] yeah so far spot checking all seem fine... [13:39:27] PROBLEM - Free space - all mounts on tools-exec-1201 is CRITICAL [13:39:41] @q Shanmugamp7 [13:39:47] Bah. [13:39:48] troubles with the shinken instance itself? [13:39:52] Sorry. [13:40:04] @uq Shanmugamp7 [13:40:07] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.0.0 [libirc v. 1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features [13:40:07] @help [13:40:14] I fixed it tom29739 [13:40:23] godog: seems possible, I'm still lookng for confirmation of an issue within tools itself [13:40:36] It's not just tools chasemp [13:40:38] 18<hashar18> got the same in -releng for deployment-prep. So I guess shinken exploded [13:40:46] I know: add, changepass, channel-info, channellist, commands, configure, drop, github-, github+, github-off, github-on, grant, grantrole, help, info, instance, join, language, notify, optools-off, optools-on, optools-permanent-off, optools-permanent-on, part, rc-ping, rc-restart, reauth, recentchanges-bot-off, recentchanges-bot-on, recentchanges-minor-off, recentchanges-minor-on, recentchanges-off, recentchanges-on, reload, restart, revoke, revokerole, seen, seen-host, seen-off, seen-on, seenrx, suppress-off, suppress-on, systeminfo, system-rm, time, traffic-off, traffic-on, translate, trustadd, trustdel, trusted, uptime, verbosity--, verbosity++, wd, whoami [13:40:46] @commands [13:42:03] it's super weird the high iowait was triggered, I assumed NFS issues but can't find any yet [13:42:12] it could be totally false positive [14:03:51] I'm going to import something something in labsdb in enwiki replica. It might take some time :) [14:07:54] 06Labs, 10Labs-Infrastructure, 10DBA, 05Security: MySQL password has been deployed in clear text to labtestweb2001, potentially other hosts - https://phabricator.wikimedia.org/T146146#2652122 (10jcrespo) [14:10:26] 06Labs, 10Labs-Infrastructure, 10DBA, 05Security: MySQL password has been deployed in clear text to labtestweb2001, potentially other hosts - https://phabricator.wikimedia.org/T146146#2652137 (10jcrespo) [14:27:55] yuvipanda: Hey, what do you think about this? https://quarry.wmflabs.org/query/12647 [14:28:29] More number in weighted sum means more quality [14:52:20] hello [14:53:38] I'm looking for some documentation about local replicas of the database on paws, but I can't find any. Is there someone who could please help me? Thanks [15:05:50] 06Labs, 06Operations, 07Tracking: Performance test new secondary labstore HA cluster - https://phabricator.wikimedia.org/T146153#2652274 (10madhuvishy) [15:06:09] 06Labs, 06Operations, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster - https://phabricator.wikimedia.org/T146154#2652289 (10madhuvishy) [15:06:52] Sigfrido: the db's accessed via paws are the same replicas available to the rest of labs [15:09:53] 06Labs, 06Operations, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster [tracking] - https://phabricator.wikimedia.org/T146154#2652316 (10madhuvishy) [15:20:01] chasemp: thanks, but while trying to connect to, for example, "enwiki.labsdb" I get "Access denied for user 'tools.paws'@'10.68.21.130' (using password: NO)". From the documentation in https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database I expected that the credentials where per tool based [15:23:05] Sigfrido, what command are you running?? [15:23:10] s/??/?/ [15:23:28] you shouldn't get 'using password: NO' [15:23:57] you also shouldn't get the host '10.68.21.130' [15:24:05] or the user 'tools.paws' now that I think about it [15:26:51] Krenair: connector = pymysql.connect(host='enwiki.labsdb') [15:27:21] right, sorry, '10.68.21.130' is the connecting host. that's fine [15:27:46] but you do need to read your mysql credentials file and use the login details from that [15:28:37] Some time ago I wrote some code for it: [15:28:39] config = ConfigParser() [15:28:39] config.read('replica.my.cnf') [15:28:39] user = config.get('client', 'user')[1:-1] # Strip first and last characters - just apostrophes [15:28:40] password = config.get('client', 'password')[1:-1] # Strip first and last characters - just apostrophes [15:28:56] er, you also need `from ConfigParser import ConfigParser` at the beginning [15:29:30] Krenair: thanks a lot, I'm going to try it soon [15:29:47] now I have to catch a train, maybe I'll check the logs online [15:29:52] bye [15:42:08] !log tools move floating ip from tools-checker-02 (failed) to tools-checker-01 [15:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [16:16:12] 10Striker, 15User-bd808: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710#2652471 (10bd808) a:03bd808 [16:17:29] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 13Patch-For-Review, 15User-bd808: Modernize the admin tool's codebase - https://phabricator.wikimedia.org/T140254#2652477 (10bd808) 05Open>03stalled [16:40:26] 06Labs, 06Operations, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster [tracking] - https://phabricator.wikimedia.org/T146154#2652606 (10chasemp) [16:50:46] Krenair: just tested the code, it seems that either replica.my.cnf is empty for paws or maybe I'm missing something. config.read('replica.my.cnf') returns an empty list, like config.sections() [16:51:04] do you have a replica.my.cnf in your home directory there? [16:53:49] Krenair: ok, then I misunderstood the docs. I believed that the configuration file belonged to the tool, but it should be in my jupyter home instead. Am I correct? [16:54:29] don't know about jupyter or paws specifics [16:54:39] I know that on tools-login I have a replica.my.cnf, and each tool created also gets one of it's own [17:01:52] Krenair: the end-users for the tool "paws" log in using their mediawiki credentials, they may not have a labs account or a replica.my.cnf file. I'm going to browse phabricator tasks about that and see if it is a known issue or a feature still in development. Thanks [17:02:08] yuvipanda, ^ [17:03:30] the docs are not so useful yet -- https://www.mediawiki.org/wiki/PAWS [17:04:04] bd808: that's why I'm here ;-) [17:11:56] hi Sigfrido [17:12:20] Sigfrido: the mysql credentials are passed in as environment variables. look at http://paws-public.wmflabs.org/paws-public/User:YuviPanda/replicahelper.ipynb for an example [17:17:51] yuvipanda: thanks! [17:20:18] !log tools reboot tools-checker-02 [17:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:33:16] !log tools reboot tools-puppetmaster-01 [17:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:40:32] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2652872 (10yuvipanda) [17:49:48] !log tools webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting [17:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:53:58] !log repool tools-webgrid-lighttpd-1412 [17:53:58] repool is not a valid project. [17:54:52] !log tools repool tools-webgrid-lighttpd-1412 [17:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:55:06] 06Labs, 06Operations: Puppet broken on labcontrol1002 - https://phabricator.wikimedia.org/T145185#2652951 (10Andrew) 05Open>03Resolved This was fixed with a hiera change a while ago. [17:55:49] !log reboot tools-exec-1410 [17:55:49] reboot is not a valid project. [17:58:12] !log tools reboot tools-exec-1410 [17:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [17:58:34] tx andrewbogott :) [18:29:03] RECOVERY - Free space - all mounts on tools-worker-1018 is OK: OK: tools.tools-worker-1018.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) [18:29:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:04] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:04] RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:05] RECOVERY - Free space - all mounts on tools-exec-1403 is OK: OK: tools.tools-exec-1403.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:06] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:07] RECOVERY - Puppet staleness on tools-worker-1025 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:07] RECOVERY - Puppet staleness on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:09] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:10] RECOVERY - Free space - all mounts on tools-worker-1006 is OK: OK: tools.tools-worker-1006.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) tools.tools-worker-1006.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:12] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:12] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:13] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1409 is OK: OK: tools.tools-webgrid-lighttpd-1409.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:15] RECOVERY - Free space - all mounts on tools-grid-master is OK: OK: tools.tools-grid-master.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:15] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:16] RECOVERY - Free space - all mounts on tools-worker-1015 is OK: OK: tools.tools-worker-1015.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) [18:29:16] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:17] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1403 is OK: OK: tools.tools-webgrid-lighttpd-1403.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:18] RECOVERY - Puppet run on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:19] RECOVERY - Puppet staleness on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:20] RECOVERY - Free space - all mounts on tools-worker-1007 is OK: OK: tools.tools-worker-1007.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) tools.tools-worker-1007.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:21] RECOVERY - Free space - all mounts on tools-elastic-02 is OK: OK: All targets OK [18:29:25] Okay [18:29:27] Now I have fixed it [18:29:31] Thanks to ottomata [18:29:39] \o/ [18:30:00] I'm going to leave it quieted while I have dinner [18:30:24] as it has like several hundred checks to correct [18:45:23] I'm merging some changes to the self hosted puppetmaster stuff, expect some restarts causing transient puppet issues [18:48:11] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:48:17] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:48:25] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:49:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:49:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:49:23] ^ transient from me doing things [18:49:45] PROBLEM - Puppet run on tools-puppetmaster-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:49:53] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:49:59] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:50:00] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [18:50:14] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:50:24] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:50:44] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:50:48] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [18:51:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:51:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:51:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:51:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:51:32] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:51:40] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:51:42] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:52:04] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:55:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:59:46] RECOVERY - Puppet run on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:06:40] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [19:23:33] Puppet started failing on 3/4 cvn project instances as of August 31. [19:23:35] Sep 20 18:41:52 cvn-app5 puppet-agent[28267]: Caching catalog for cvn-app5.cvn.eqiad.wmflabs [19:23:35] Sep 20 18:41:54 cvn-app5 puppet-agent[28267]: Applying configuration version '1474396064' [19:23:35] Sep 20 18:41:55 cvn-app5 puppet-agent[28267]: (/Stage[main]/Base::Labs/User[root]/password) changed password [19:23:36] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: (/Stage[main]/Role::Labs::Nfsclient/File[/data/scratch]) Not removing directory; use 'force' to override [19:23:36] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: (/Stage[main]/Role::Labs::Nfsclient/File[/data/scratch]) Not removing directory; use 'force' to override [19:23:36] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: Could not remove existing file [19:23:37] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: (/Stage[main]/Role::Labs::Nfsclient/File[/data/scratch]/ensure) change from directory to link failed: Could not remove existing file [19:23:37] Sep 20 18:42:12 cvn-app5 puppet-agent[28267]: Finished catalog run in 18.30 seconds [19:23:37] (Sorry) [19:24:38] 06Labs: PUppet runs broken on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2600477 (10Krinkle) The same happened on cvn project instances. {F4493091 size=full} ``` Sep 20 18:41:52 cvn-app5 pup... [19:24:53] 06Labs: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653561 (10Krinkle) [19:25:05] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653564 (10Krinkle) p:05Triage>03Unbreak! [19:25:11] bd808: :) [19:25:19] Krinkle, which puppetmaster does it use? [19:25:30] Default. [19:25:38] It's a plain instance with some extra packages and services installed locally. [19:25:44] No role or master changes. [19:25:44] RECOVERY - Puppet run on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [0.0] [19:26:14] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [19:26:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [19:27:02] RECOVERY - Puppet run on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:12] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:16] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:26] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:58] Krinkle: the fix on the boxes I ran into it on was to rmdir the /data/scratch dir [19:29:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:25] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:55] RECOVERY - Puppet run on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:59] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:01] RECOVERY - Puppet run on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:06] The breakage was related to some changes that madhuvishy made to where the NFS volumes mount and very old base images as I recall [19:30:13] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:23] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:26] bd808: Ah, probably no ensure=>absent for something that has been gone for a while? [19:30:44] yeah I think something like that [19:30:49] RECOVERY - Puppet run on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:59] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:31] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:36] Krinkle: yeah - it was a weird edge case - I merged a patch this morning - should no longer fail [19:31:39] RECOVERY - Puppet run on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [19:32:04] Yeah cvn-app5 has this: [19:32:05] | image | ubuntu-12.04-precise (deprecated 2014-04-17) (ff0a06ae-e7bf-4533-a3aa-176c366fdb4a) | [19:32:13] Krinkle, that's very old [19:32:44] I fixed it on some integration hosts that were similarly ancient [19:33:01] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653577 (10Krinkle) Affected instances: * cvn-app4 (created 2 years, 5 months ago; ubuntu-12.04-precise (deprecated 2014-04-17)) * cvn-app5 (created 2 years, 5... [19:33:05] cvn-apache8 (created 1 year, 5 months ago; ubuntu-14.04-trusty (deprecated 2015-06-13)) [19:33:10] This one is also affected [19:33:10] Krinkle: all the puppet code assumes /data/scratch etc are nfs mounts - those are all handled. but these instances seem to just have them as directories. i put in a conditional to make this not happen [19:33:17] I think majority of instances are >1 year old, no? [19:33:20] do you wanna try running puppet now to see [19:33:49] I ran `sudo rmdir /data/scratch` (safe, not recursive) on one of them (cvn-app5) [19:33:52] The other ones unchanged. [19:33:58] I'll see how the next puppet run goes [19:34:27] madhuvishy: /usr/local/sbin/puppet-run ? [19:34:53] Krinkle: uhh puppet agent -tv I'd think [19:35:19] Krinkle: you shouldn't have to remove those directories either [19:35:23] puppet should succeed [19:35:31] madhuvishy: I was told not to use that because it runs with different environment than what puppet does (specifically, it used to break some chmod related things, since 'sudo' from a real user vs. puppet from cron is different) [19:35:40] But that may've been fixed between now and a year ago [19:35:46] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:36:09] Krinkle: aah I'm not sure then [19:36:17] chasemp: do you know anything about ^ [19:36:29] Krinkle: sudo -i puppet agent -tv should be fine [19:37:14] The runs from ~18:30 UTC failed on all three. The more recent run from ~19:15 or ~19:30 (last 15min) succeeded it seems [19:37:16] puppet-run works or puppet agent --test [19:37:17] Yay [19:37:27] I don't know what the exeption was but it should't exist anymore [19:37:32] I wonder if most instances were really created >1 year ago [19:37:39] puppet-run is safer I suppose w/ timeouts and some magic [19:37:44] and it's how cron does it [19:37:56] Yeah, it has a random backoff delay to avoid cron churning on the master at the same time [19:38:18] And the puppet-run script also used to contain some chmod and umode normalisation [19:38:46] since by default puppet will run with whatever you run it with (aside from requiring sudo) [19:39:19] chasemp: madhuvishy: So which commit fixed it? [19:40:19] Krinkle: https://gerrit.wikimedia.org/r/#/c/308941/ [19:40:33] Thanks [19:40:34] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653593 (10Krinkle) 05Open>03Resolved Fixed by . [19:40:58] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653596 (10Krinkle) [19:42:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:48:21] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#2653614 (10Andrew) This is now fixed in the scripts, but won't be in place until new images are built. [19:55:45] andrewbogott: hello! Does labs supports resizing an instance (eg changing the flavor)? [19:55:51] nope [19:55:55] I wish. [19:56:03] seems compute nodes needs passwordless ssh which I guess we dont have :] [19:56:05] thx! [19:56:06] Openstack claims to support it but I've never seen an instance survive the operation. [19:56:34] There used to be a rebuild option where you could change the flavour. [19:56:35] hence why it is disabled in horizon I guess [19:56:46] That'd basically recreate the instance. [19:56:53] I wish we could get rid of the flavors and instead pick the right num of cpu/ram/disk [19:57:09] ^ that'd be great [19:57:52] It's why people pick extra large instances often, because they need all that disk. [20:00:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:54] andrewbogott: and to get a new flavor added to a project that needs novaadmin right? So I just have to fill a task? [20:01:14] yeah, I can create a new flavor [20:01:42] we have a limited available flavor w/ big disk etc afaiu andrewbogott [20:02:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [20:10:28] PROBLEM - Host tools-webgrid-lighttpd-1417 is DOWN: CRITICAL - Host Unreachable (10.68.20.188) [20:10:38] RECOVERY - Puppet staleness on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [3600.0] [20:11:12] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653709 (10hashar) [20:11:43] andrewbogott: https://phabricator.wikimedia.org/T146209 for a 8 vCPU, 8G RAM and 60G disk [20:12:01] can probably use just 6G of RAM though [20:13:14] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [20:16:21] !log deployment-prep enabled trusty-backports on deployment-puppetmaster [20:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [20:19:29] 06Labs, 10Labs-Infrastructure: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#2653788 (10AlexMonk-WMF) [20:27:48] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2653808 (10Andrew) [20:27:50] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653807 (10Andrew) [20:30:47] andrewbogott: if you can get us the new flavor tonight that would be nice. That is more or less on the way to migrate to Jessie :] [20:32:39] 06Labs, 10Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2653821 (10madhuvishy) [20:33:47] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2653843 (10Andrew) [20:33:49] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653840 (10Andrew) 05Open>03Resolved a:03Andrew [20:33:53] hashar: done, I think [20:34:14] !log tools Created new instance tools-webgrid-lighttpd-1415 (T146212) [20:34:16] T146212: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212 [20:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:34:23] !log tools Created new instance tools-webgrid-lighttpd-1416 (T146212) [20:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:34:42] !log tools Created new instance tools-webgrid-lighttpd-1418 (T146212) [20:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:37:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [20:42:22] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [20:43:00] 06Labs: Undo labtest realm hacks - https://phabricator.wikimedia.org/T146150#2653877 (10Aklapper) [20:44:14] Any ideas how should I set dir="rtl"/dir="ltr" when displaying results from the API? Trying to get fawiki working better [20:46:48] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:47:10] ^ working on it [20:48:24] andrewbogott: got the new flavor, you are awesome thank you very much :) [20:48:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:53:21]