[00:07:30] 10PAWS, 06Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2651046 (10leila) Sounds good, @DarTar. Closing it. [00:07:40] 10PAWS, 06Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2651047 (10leila) 05Open>03Resolved [01:06:02] Hm.. cvn instances have been puppet-failing for the last 20 days. https://tools.wmflabs.org/nagf/?project=cvn&nagf-range=month#h_overview_puppetagent [01:08:39] why are they failing? [01:11:42] 10PAWS, 06Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2651177 (10Legoktm) 05Resolved>03declined [02:20:38] PROBLEM - Puppet staleness on tools-exec-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0] [07:17:01] 06Labs, 10Labs-Infrastructure, 10DBA, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2651434 (10Marostegui) This happened again and I have updated the bug report with Percona and MariaDB with the following information - gdb stacktraces - show engi... [07:18:04] 06Labs, 10Labs-Infrastructure, 10DBA, 07Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2651436 (10Marostegui) [07:45:52] PROBLEM - Puppet staleness on tools-k8s-master-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:17:26] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0] [10:45:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [11:20:52] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:28] 06Labs, 10Continuous-Integration-Infrastructure, 06Operations, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2651917 (10hashar) I have refreshed the package on https://people.wikimedia.org/~hashar/debs/nodepool_0.1.1-wmf5/ .... [12:57:50] 06Labs, 10Labs-Infrastructure, 10DBA, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2651977 (10Marostegui) MariaDB just answered: // Thanks. Lets wait and see what Percona guys come up with, TokuDB is their area of expertise. If they claim it's... 
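
The "Puppet staleness" alerts in this stretch fire when an instance has not completed a Puppet run for longer than the bracketed threshold, 43200 seconds (12 hours). The production check works off data shipped to Graphite, but the quantity it measures can be approximated locally; a minimal sketch, assuming the Puppet 3 default state path used on these instances:

```python
import time
import yaml  # requires PyYAML

# Assumed location of the agent's run summary; adjust if the statedir differs.
SUMMARY = '/var/lib/puppet/state/last_run_summary.yaml'
STALE_AFTER = 43200  # seconds, the [43200.0] threshold in the alerts above

with open(SUMMARY) as f:
    summary = yaml.safe_load(f)

age = time.time() - summary['time']['last_run']
print('last puppet run %d seconds ago: %s'
      % (age, 'STALE' if age > STALE_AFTER else 'ok'))
```
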
[13:28:26] PROBLEM - SSH on tools-checker-02 is CRITICAL: Server answer [13:33:28] RECOVERY - SSH on tools-checker-02 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0) [13:33:49] PROBLEM - High iowait on tools-webgrid-lighttpd-1205 is CRITICAL [13:33:50] PROBLEM - High iowait on tools-exec-1215 is CRITICAL [13:33:51] PROBLEM - High iowait on tools-webgrid-lighttpd-1406 is CRITICAL [13:33:52] PROBLEM - High iowait on tools-exec-1217 is CRITICAL [13:33:52] PROBLEM - High iowait on tools-webgrid-lighttpd-1204 is CRITICAL [13:33:54] PROBLEM - High iowait on tools-flannel-etcd-01 is CRITICAL [13:33:54] PROBLEM - High iowait on tools-k8s-etcd-03 is CRITICAL [13:33:56] PROBLEM - High iowait on tools-worker-1019 is CRITICAL [13:33:58] PROBLEM - High iowait on tools-exec-1206 is CRITICAL [13:34:02] PROBLEM - High iowait on tools-precise-dev is CRITICAL [13:34:06] PROBLEM - High iowait on tools-webgrid-lighttpd-1207 is CRITICAL [13:34:08] PROBLEM - High iowait on tools-webgrid-lighttpd-1407 is CRITICAL [13:34:10] PROBLEM - High iowait on tools-webgrid-lighttpd-1208 is CRITICAL [13:34:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL [13:34:16] PROBLEM - High iowait on tools-worker-1006 is CRITICAL [13:34:16] PROBLEM - High iowait on tools-proxy-01 is CRITICAL [13:34:17] PROBLEM - High iowait on tools-webgrid-lighttpd-1401 is CRITICAL [13:34:18] PROBLEM - High iowait on tools-exec-1408 is CRITICAL [13:34:26] PROBLEM - High iowait on tools-webgrid-lighttpd-1402 is CRITICAL [13:34:26] PROBLEM - High iowait on tools-webgrid-lighttpd-1413 is CRITICAL [13:34:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL [13:34:32] PROBLEM - High iowait on tools-webgrid-lighttpd-1206 is CRITICAL [13:34:34] PROBLEM - High iowait on tools-webgrid-generic-1403 is CRITICAL [13:34:36] PROBLEM - High iowait on tools-worker-1020 is CRITICAL [13:34:38] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1407 is CRITICAL [13:34:44] PROBLEM - High iowait on tools-webgrid-lighttpd-1404 is CRITICAL [13:34:46] PROBLEM - High iowait on tools-webgrid-lighttpd-1414 is CRITICAL [13:34:48] PROBLEM - High iowait on tools-flannel-etcd-02 is CRITICAL [13:34:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL [13:34:49] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL [13:34:49] PROBLEM - Puppet staleness on tools-logs-02 is CRITICAL [13:34:50] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1401 is CRITICAL [13:34:51] PROBLEM - High iowait on tools-webgrid-generic-1401 is CRITICAL [13:34:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL [13:34:52] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1413 is CRITICAL [13:34:53] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL [13:34:54] PROBLEM - Puppet staleness on tools-elastic-02 is CRITICAL [13:34:54] PROBLEM - Free space - all mounts on tools-worker-1017 is CRITICAL [13:34:55] PROBLEM - Free space - all mounts on tools-checker-01 is CRITICAL [13:34:56] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1408 is CRITICAL [13:34:57] PROBLEM - Free space - all mounts on tools-k8s-master-02 is CRITICAL [13:34:57] PROBLEM - High iowait on tools-exec-1406 is CRITICAL [13:34:58] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1203 is CRITICAL [13:34:59] PROBLEM - Puppet staleness on tools-webgrid-generic-1402 is CRITICAL [13:35:00] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1408 is CRITICAL [13:35:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL [13:35:01] PROBLEM - 
Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL [13:35:03] PROBLEM - Puppet staleness on tools-exec-1408 is CRITICAL [13:35:04] PROBLEM - Puppet staleness on tools-worker-1004 is CRITICAL [13:35:04] PROBLEM - Free space - all mounts on tools-worker-1020 is CRITICAL [13:35:05] PROBLEM - Free space - all mounts on tools-worker-1009 is CRITICAL [13:35:06] PROBLEM - High iowait on tools-exec-1202 is CRITICAL [13:35:07] PROBLEM - Puppet run on tools-checker-01 is CRITICAL [13:35:07] PROBLEM - High iowait on tools-worker-1008 is CRITICAL [13:35:08] PROBLEM - Puppet staleness on tools-webgrid-generic-1403 is CRITICAL [13:35:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL [13:35:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL [13:35:10] PROBLEM - Puppet staleness on tools-redis-1001 is CRITICAL [13:36:31] PROBLEM - High iowait on tools-worker-1005 is CRITICAL [13:36:31] PROBLEM - High iowait on tools-exec-1210 is CRITICAL [13:36:33] PROBLEM - High iowait on tools-exec-1219 is CRITICAL [13:36:37] PROBLEM - High iowait on tools-redis-1001 is CRITICAL [13:36:39] PROBLEM - Puppet run on tools-exec-cyberbot is CRITICAL [13:36:39] PROBLEM - High iowait on tools-exec-1407 is CRITICAL [13:36:41] PROBLEM - High iowait on tools-webgrid-lighttpd-1409 is CRITICAL [13:36:43] PROBLEM - High iowait on tools-docker-registry-01 is CRITICAL [13:36:44] PROBLEM - High iowait on tools-worker-1014 is CRITICAL [13:36:44] PROBLEM - High iowait on tools-exec-1214 is CRITICAL [13:36:44] PROBLEM - High iowait on tools-bastion-03 is CRITICAL [13:36:45] PROBLEM - High iowait on tools-k8s-master-02 is CRITICAL [13:36:45] PROBLEM - High iowait on tools-bastion-02 is CRITICAL [13:36:48] PROBLEM - High iowait on tools-services-01 is CRITICAL [13:36:48] PROBLEM - High iowait on tools-webgrid-lighttpd-1405 is CRITICAL [13:36:52] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL [13:36:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL [13:36:52] PROBLEM - Puppet staleness on tools-puppetmaster-02 is CRITICAL [13:36:53] PROBLEM - High iowait on tools-worker-1018 is CRITICAL [13:36:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL [13:36:53] PROBLEM - Puppet run on tools-elastic-03 is CRITICAL [13:36:54] PROBLEM - High iowait on tools-exec-1410 is CRITICAL [13:36:55] PROBLEM - Free space - all mounts on tools-exec-1202 is CRITICAL [13:36:56] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1404 is CRITICAL [13:36:57] PROBLEM - Free space - all mounts on tools-k8s-etcd-02 is CRITICAL [13:36:57] PROBLEM - Puppet staleness on tools-exec-1405 is CRITICAL [13:36:59] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL [13:37:00] PROBLEM - Puppet staleness on tools-services-01 is CRITICAL [13:37:00] PROBLEM - High iowait on tools-exec-1211 is CRITICAL [13:37:02] PROBLEM - Puppet staleness on tools-exec-1203 is CRITICAL [13:37:02] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1405 is CRITICAL [13:37:04] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1410 is CRITICAL [13:37:05] PROBLEM - High iowait on tools-worker-1011 is CRITICAL [13:37:05] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL [13:37:05] PROBLEM - High iowait on tools-worker-1013 is CRITICAL [13:37:07] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL [13:37:08] PROBLEM - Free space - all mounts on tools-exec-1205 is CRITICAL [13:37:11] PROBLEM - Free space - all mounts on tools-static-10 is CRITICAL [13:37:11] PROBLEM - High iowait on tools-elastic-02 is CRITICAL [13:37:12] 
PROBLEM - Free space - all mounts on tools-grid-shadow is CRITICAL [13:37:13] PROBLEM - Puppet run on tools-prometheus-02 is CRITICAL [13:37:14] PROBLEM - Puppet staleness on tools-exec-1216 is CRITICAL [13:37:14] PROBLEM - High iowait on tools-elastic-01 is CRITICAL [13:37:15] PROBLEM - Free space - all mounts on tools-bastion-03 is CRITICAL [13:37:15] PROBLEM - Free space - all mounts on tools-worker-1014 is CRITICAL [13:37:16] PROBLEM - Puppet run on tools-grid-master is CRITICAL [13:37:16] PROBLEM - Puppet staleness on tools-proxy-02 is CRITICAL [13:37:50] Lord. [13:38:10] yeah idk what's up here [13:38:28] If it comes back on I'll quiet it. [13:38:41] got the same in -releng for deployment-prep. So I guess shinken exploded [13:39:18] I tried sshing to the last couple of those hosts, they seem to work [13:39:24] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL [13:39:24] PROBLEM - Puppet run on tools-k8s-master-02 is CRITICAL [13:39:24] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1403 is CRITICAL [13:39:24] I wonder why shinken thinks they're broken [13:39:25] PROBLEM - Free space - all mounts on tools-worker-1007 is CRITICAL [13:39:26] PROBLEM - Free space - all mounts on tools-elastic-02 is CRITICAL [13:39:26] yeah so far spot checking all seem fine... [13:39:27] PROBLEM - Free space - all mounts on tools-exec-1201 is CRITICAL [13:39:41] @q Shanmugamp7 [13:39:47] Bah. [13:39:48] troubles with the shinken instance itself? [13:39:52] Sorry. [13:40:04] @uq Shanmugamp7 [13:40:07] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.0.0 [libirc v. 1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features [13:40:07] @help [13:40:14] I fixed it tom29739 [13:40:23] godog: seems possible, I'm still lookng for confirmation of an issue within tools itself [13:40:36] It's not just tools chasemp [13:40:38] 18<hashar18> got the same in -releng for deployment-prep. So I guess shinken exploded [13:40:46] I know: add, changepass, channel-info, channellist, commands, configure, drop, github-, github+, github-off, github-on, grant, grantrole, help, info, instance, join, language, notify, optools-off, optools-on, optools-permanent-off, optools-permanent-on, part, rc-ping, rc-restart, reauth, recentchanges-bot-off, recentchanges-bot-on, recentchanges-minor-off, recentchanges-minor-on, recentchanges-off, recentchanges-on, reload, restart, revoke, revokerole, seen, seen-host, seen-off, seen-on, seenrx, suppress-off, suppress-on, systeminfo, system-rm, time, traffic-off, traffic-on, translate, trustadd, trustdel, trusted, uptime, verbosity--, verbosity++, wd, whoami [13:40:46] @commands [13:42:03] it's super weird the high iowait was triggered, I assumed NFS issues but can't find any yet [13:42:12] it could be totally false positive [14:03:51] I'm going to import something something in labsdb in enwiki replica. It might take some time :) [14:07:54] 06Labs, 10Labs-Infrastructure, 10DBA, 05Security: MySQL password has been deployed in clear text to labtestweb2001, potentially other hosts - https://phabricator.wikimedia.org/T146146#2652122 (10jcrespo) [14:10:26] 06Labs, 10Labs-Infrastructure, 10DBA, 05Security: MySQL password has been deployed in clear text to labtestweb2001, potentially other hosts - https://phabricator.wikimedia.org/T146146#2652137 (10jcrespo) [14:27:55] yuvipanda: Hey, what do you think about this? 
https://quarry.wmflabs.org/query/12647 [14:28:29] A higher number in the weighted sum means higher quality [14:52:20] hello [14:53:38] I'm looking for some documentation about local replicas of the database on paws, but I can't find any. Is there someone who could please help me? Thanks [15:05:50] 06Labs, 06Operations, 07Tracking: Performance test new secondary labstore HA cluster - https://phabricator.wikimedia.org/T146153#2652274 (10madhuvishy) [15:06:09] 06Labs, 06Operations, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster - https://phabricator.wikimedia.org/T146154#2652289 (10madhuvishy) [15:06:52] Sigfrido: the db's accessed via paws are the same replicas available to the rest of labs [15:09:53] 06Labs, 06Operations, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster [tracking] - https://phabricator.wikimedia.org/T146154#2652316 (10madhuvishy) [15:20:01] chasemp: thanks, but while trying to connect to, for example, "enwiki.labsdb" I get "Access denied for user 'tools.paws'@'10.68.21.130' (using password: NO)". From the documentation in https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database I expected that the credentials were per-tool based [15:23:05] Sigfrido, what command are you running?? [15:23:10] s/??/?/ [15:23:28] you shouldn't get 'using password: NO' [15:23:57] you also shouldn't get the host '10.68.21.130' [15:24:05] or the user 'tools.paws' now that I think about it [15:26:51] Krenair: connector = pymysql.connect(host='enwiki.labsdb') [15:27:21] right, sorry, '10.68.21.130' is the connecting host. that's fine [15:27:46] but you do need to read your mysql credentials file and use the login details from that [15:28:37] Some time ago I wrote some code for it: [15:28:39] config = ConfigParser() [15:28:39] config.read('replica.my.cnf') [15:28:39] user = config.get('client', 'user')[1:-1] # Strip first and last characters - just apostrophes [15:28:40] password = config.get('client', 'password')[1:-1] # Strip first and last characters - just apostrophes [15:28:56] er, you also need `from ConfigParser import ConfigParser` at the beginning [15:29:30] Krenair: thanks a lot, I'm going to try it soon [15:29:47] now I have to catch a train, maybe I'll check the logs online [15:29:52] bye [15:42:08] !log tools move floating ip from tools-checker-02 (failed) to tools-checker-01 [15:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [16:16:12] 10Striker, 15User-bd808: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710#2652471 (10bd808) a:03bd808 [16:17:29] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 13Patch-For-Review, 15User-bd808: Modernize the admin tool's codebase - https://phabricator.wikimedia.org/T140254#2652477 (10bd808) 05Open>03stalled [16:40:26] 06Labs, 06Operations, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster [tracking] - https://phabricator.wikimedia.org/T146154#2652606 (10chasemp) [16:50:46] Krenair: just tested the code, it seems that either replica.my.cnf is empty for paws or maybe I'm missing something. config.read('replica.my.cnf') returns an empty list, like config.sections() [16:51:04] do you have a replica.my.cnf in your home directory there? [16:53:49] Krenair: ok, then I misunderstood the docs. I believed that the configuration file belonged to the tool, but it should be in my jupyter home instead. Am I correct?
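
Pulling the fragments above together: a minimal sketch of connecting to the replicas from a Tool Labs account, assuming a replica.my.cnf in the home directory as Krenair describes (on PAWS the credentials are handed over differently, as explained below). The `enwiki_p` database name follows the usual `_p` replica naming convention.

```python
import os
import pymysql
from ConfigParser import ConfigParser  # `configparser` on Python 3

config = ConfigParser()
config.read(os.path.expanduser('~/replica.my.cnf'))
user = config.get('client', 'user').strip("'")          # values are quoted in the file
password = config.get('client', 'password').strip("'")

conn = pymysql.connect(host='enwiki.labsdb', db='enwiki_p',
                       user=user, password=password)
with conn.cursor() as cur:
    cur.execute('SELECT 1')
    print(cur.fetchone())
```
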
[16:54:29] don't know about jupyter or paws specifics [16:54:39] I know that on tools-login I have a replica.my.cnf, and each tool created also gets one of its own [17:01:52] Krenair: the end-users for the tool "paws" log in using their mediawiki credentials, they may not have a labs account or a replica.my.cnf file. I'm going to browse phabricator tasks about that and see if it is a known issue or a feature still in development. Thanks [17:02:08] yuvipanda, ^ [17:03:30] the docs are not so useful yet -- https://www.mediawiki.org/wiki/PAWS [17:04:04] bd808: that's why I'm here ;-) [17:11:56] hi Sigfrido [17:12:20] Sigfrido: the mysql credentials are passed in as environment variables. look at http://paws-public.wmflabs.org/paws-public/User:YuviPanda/replicahelper.ipynb for an example [17:17:51] yuvipanda: thanks! [17:20:18] !log tools reboot tools-checker-02 [17:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:33:16] !log tools reboot tools-puppetmaster-01 [17:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:40:32] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2652872 (10yuvipanda) [17:49:48] !log tools webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting [17:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:53:58] !log repool tools-webgrid-lighttpd-1412 [17:53:58] repool is not a valid project. [17:54:52] !log tools repool tools-webgrid-lighttpd-1412 [17:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:55:06] 06Labs, 06Operations: Puppet broken on labcontrol1002 - https://phabricator.wikimedia.org/T145185#2652951 (10Andrew) 05Open>03Resolved This was fixed with a hiera change a while ago. [17:55:49] !log reboot tools-exec-1410 [17:55:49] reboot is not a valid project.
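
The two failed entries above ("repool is not a valid project.", "reboot is not a valid project.") show the shape the logging bot expects: the first word after !log is the project, everything after it is the message. A toy illustration of that parsing only, not the actual morebots/wm-bot code, with an assumed project list:

```python
VALID_PROJECTS = {'tools', 'deployment-prep', 'tools.paste'}  # assumed subset for the example

def parse_log(line):
    # "!log <project> <message>" - the first token is always treated as the project
    project, _, message = line[len('!log '):].partition(' ')
    if project not in VALID_PROJECTS:
        return '%s is not a valid project.' % project
    return 'Logged to %s SAL: %s' % (project, message)

print(parse_log('!log reboot tools-exec-1410'))        # reboot is not a valid project.
print(parse_log('!log tools reboot tools-exec-1410'))  # Logged to tools SAL: ...
```
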
[17:58:12] !log tools reboot tools-exec-1410 [17:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [17:58:34] tx andrewbogott :) [18:29:03] RECOVERY - Free space - all mounts on tools-worker-1018 is OK: OK: tools.tools-worker-1018.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) [18:29:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:04] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:04] RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:05] RECOVERY - Free space - all mounts on tools-exec-1403 is OK: OK: tools.tools-exec-1403.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:06] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:07] RECOVERY - Puppet staleness on tools-worker-1025 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:07] RECOVERY - Puppet staleness on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:09] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:10] RECOVERY - Free space - all mounts on tools-worker-1006 is OK: OK: tools.tools-worker-1006.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) tools.tools-worker-1006.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:12] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:12] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:13] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1409 is OK: OK: tools.tools-webgrid-lighttpd-1409.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:15] RECOVERY - Free space - all mounts on tools-grid-master is OK: OK: tools.tools-grid-master.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:15] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:16] RECOVERY - Free space - all mounts on tools-worker-1015 is OK: OK: tools.tools-worker-1015.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) [18:29:16] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:17] RECOVERY - Free space - all mounts on tools-webgrid-lighttpd-1403 is OK: OK: tools.tools-webgrid-lighttpd-1403.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:18] RECOVERY - Puppet run on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:29:19] RECOVERY - Puppet staleness on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:29:20] RECOVERY - Free space - all mounts on tools-worker-1007 is OK: OK: tools.tools-worker-1007.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) tools.tools-worker-1007.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [18:29:21] RECOVERY - Free space - all mounts on tools-elastic-02 is OK: OK: All targets OK [18:29:25] Okay [18:29:27] Now I have fixed it [18:29:31] Thanks to ottomata [18:29:39] \o/ [18:30:00] I'm going to leave it quieted while I have 
dinner [18:30:24] as it has like several hundred checks to correct [18:45:23] I'm merging some changes to the self hosted puppetmaster stuff, expect some restarts causing transient puppet issues [18:48:11] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:48:17] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:48:25] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:49:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:49:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:49:23] ^ transient from me doing things [18:49:45] PROBLEM - Puppet run on tools-puppetmaster-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:49:53] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:49:59] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:50:00] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [18:50:14] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:50:24] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:50:44] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:50:48] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [18:51:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:51:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:51:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:51:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:51:32] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:51:40] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:51:42] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:52:04] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:55:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:59:46] RECOVERY - Puppet run on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:06:40] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [19:23:33] Puppet started failing on 3/4 cvn project instances as of August 31. 
[19:23:35] Sep 20 18:41:52 cvn-app5 puppet-agent[28267]: Caching catalog for cvn-app5.cvn.eqiad.wmflabs [19:23:35] Sep 20 18:41:54 cvn-app5 puppet-agent[28267]: Applying configuration version '1474396064' [19:23:35] Sep 20 18:41:55 cvn-app5 puppet-agent[28267]: (/Stage[main]/Base::Labs/User[root]/password) changed password [19:23:36] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: (/Stage[main]/Role::Labs::Nfsclient/File[/data/scratch]) Not removing directory; use 'force' to override [19:23:36] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: (/Stage[main]/Role::Labs::Nfsclient/File[/data/scratch]) Not removing directory; use 'force' to override [19:23:36] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: Could not remove existing file [19:23:37] Sep 20 18:42:11 cvn-app5 puppet-agent[28267]: (/Stage[main]/Role::Labs::Nfsclient/File[/data/scratch]/ensure) change from directory to link failed: Could not remove existing file [19:23:37] Sep 20 18:42:12 cvn-app5 puppet-agent[28267]: Finished catalog run in 18.30 seconds [19:23:37] (Sorry) [19:24:38] 06Labs: PUppet runs broken on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2600477 (10Krinkle) The same happened on cvn project instances. {F4493091 size=full} ``` Sep 20 18:41:52 cvn-app5 pup... [19:24:53] 06Labs: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653561 (10Krinkle) [19:25:05] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653564 (10Krinkle) p:05Triage>03Unbreak! [19:25:11] bd808: :) [19:25:19] Krinkle, which puppetmaster does it use? [19:25:30] Default. [19:25:38] It's a plain instance with some extra packages and services installed locally. [19:25:44] No role or master changes. 
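
The failing resource above is Puppet trying to replace a plain /data/scratch directory with a symlink and refusing to remove it; the fix applied further down in this log is a plain rmdir. A rough sketch of the distinction involved, not the Puppet code itself:

```python
import os

path = '/data/scratch'

if os.path.ismount(path):
    print('%s is a real NFS mount - leave it to the mount handling' % path)
elif os.path.islink(path):
    print('%s is already the symlink Puppet wants - nothing to do' % path)
elif os.path.isdir(path) and not os.listdir(path):
    # the manual fix used below: non-recursive removal of the empty leftover dir
    os.rmdir(path)
    print('removed empty leftover directory %s' % path)
else:
    print('%s is a non-empty directory - inspect before forcing anything' % path)
```
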
[19:25:44] RECOVERY - Puppet run on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [0.0] [19:26:14] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [19:26:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [19:27:02] RECOVERY - Puppet run on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:12] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:16] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:26] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:58] Krinkle: the fix on the boxes I ran into it on was to rmdir the /data/scratch dir [19:29:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:25] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:55] RECOVERY - Puppet run on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:59] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:01] RECOVERY - Puppet run on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:06] The breakage was related to some changes that madhuvishy made to where the NFS volumes mount and very old base images as I recall [19:30:13] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:23] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:26] bd808: Ah, probably no ensure=>absent for something that has been gone for a while? [19:30:44] yeah I think something like that [19:30:49] RECOVERY - Puppet run on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:59] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:31] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:36] Krinkle: yeah - it was a weird edge case - I merged a patch this morning - should no longer fail [19:31:39] RECOVERY - Puppet run on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [19:32:04] Yeah cvn-app5 has this: [19:32:05] | image | ubuntu-12.04-precise (deprecated 2014-04-17) (ff0a06ae-e7bf-4533-a3aa-176c366fdb4a) | [19:32:13] Krinkle, that's very old [19:32:44] I fixed it on some integration hosts that were similarly ancient [19:33:01] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653577 (10Krinkle) Affected instances: * cvn-app4 (created 2 years, 5 months ago; ubuntu-12.04-precise (deprecated 2014-04-17)) * cvn-app5 (created 2 years, 5... [19:33:05] cvn-apache8 (created 1 year, 5 months ago; ubuntu-14.04-trusty (deprecated 2015-06-13)) [19:33:10] This one is also affected [19:33:10] Krinkle: all the puppet code assumes /data/scratch etc are nfs mounts - those are all handled. 
but these instances seem to just have them as directories. i put in a conditional to make this not happen [19:33:17] I think majority of instances are >1 year old, no? [19:33:20] do you wanna try running puppet now to see [19:33:49] I ran `sudo rmdir /data/scratch` (safe, not recursive) on one of them (cvn-app5) [19:33:52] The other ones unchanged. [19:33:58] I'll see how the next puppet run goes [19:34:27] madhuvishy: /usr/local/sbin/puppet-run ? [19:34:53] Krinkle: uhh puppet agent -tv I'd think [19:35:19] Krinkle: you shouldn't have to remove those directories either [19:35:23] puppet should succeed [19:35:31] madhuvishy: I was told not to use that because it runs with different environment than what puppet does (specifically, it used to break some chmod related things, since 'sudo' from a real user vs. puppet from cron is different) [19:35:40] But that may've been fixed between now and a year ago [19:35:46] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:36:09] Krinkle: aah I'm not sure then [19:36:17] chasemp: do you know anything about ^ [19:36:29] Krinkle: sudo -i puppet agent -tv should be fine [19:37:14] The runs from ~18:30 UTC failed on all three. The more recent run from ~19:15 or ~19:30 (last 15min) succeeded it seems [19:37:16] puppet-run works or puppet agent --test [19:37:17] Yay [19:37:27] I don't know what the exception was but it shouldn't exist anymore [19:37:32] I wonder if most instances were really created >1 year ago [19:37:39] puppet-run is safer I suppose w/ timeouts and some magic [19:37:44] and it's how cron does it [19:37:56] Yeah, it has a random backoff delay to avoid cron churning on the master at the same time [19:38:18] And the puppet-run script also used to contain some chmod and umode normalisation [19:38:46] since by default puppet will run with whatever you run it with (aside from requiring sudo) [19:39:19] chasemp: madhuvishy: So which commit fixed it? [19:40:19] Krinkle: https://gerrit.wikimedia.org/r/#/c/308941/ [19:40:33] Thanks [19:40:34] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653593 (10Krinkle) 05Open>03Resolved Fixed by . [19:40:58] 06Labs, 07Regression: Puppet broken since August 31 on some instances due to labstore::nfs_mount changes - https://phabricator.wikimedia.org/T144460#2653596 (10Krinkle) [19:42:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:48:21] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#2653614 (10Andrew) This is now fixed in the scripts, but won't be in place until new images are built. [19:55:45] andrewbogott: hello! Does labs support resizing an instance (eg changing the flavor)? [19:55:51] nope [19:55:55] I wish. [19:56:03] seems compute nodes need passwordless ssh which I guess we don't have :] [19:56:05] thx! [19:56:06] Openstack claims to support it but I've never seen an instance survive the operation. [19:56:34] There used to be a rebuild option where you could change the flavour. [19:56:35] hence why it is disabled in horizon I guess [19:56:46] That'd basically recreate the instance.
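
For reference, the resize operation discussed just above does exist in the upstream OpenStack API; per the conversation it is disabled in this Horizon and rarely survives here. A sketch of what the call looks like with python-novaclient, using placeholder credentials and instance/flavor names:

```python
from novaclient import client

# Placeholder credentials and endpoint - not the actual Labs values.
nova = client.Client('2', 'username', 'password', 'myproject',
                     'https://keystone.example.org:5000/v2.0')

server = nova.servers.find(name='my-instance')
flavor = nova.flavors.find(name='m1.large')

nova.servers.resize(server, flavor)   # instance moves to VERIFY_RESIZE
nova.servers.confirm_resize(server)   # ...and the resize must then be confirmed
```
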
[19:56:53] I wish we could get rid of the flavors and instead pick the right num of cpu/ram/disk [19:57:09] ^ that'd be great [19:57:52] It's why people pick extra large instances often, because they need all that disk. [20:00:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:54] andrewbogott: and to get a new flavor added to a project that needs novaadmin right? So I just have to fill a task? [20:01:14] yeah, I can create a new flavor [20:01:42] we have a limited available flavor w/ big disk etc afaiu andrewbogott [20:02:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [20:10:28] PROBLEM - Host tools-webgrid-lighttpd-1417 is DOWN: CRITICAL - Host Unreachable (10.68.20.188) [20:10:38] RECOVERY - Puppet staleness on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [3600.0] [20:11:12] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653709 (10hashar) [20:11:43] andrewbogott: https://phabricator.wikimedia.org/T146209 for a 8 vCPU, 8G RAM and 60G disk [20:12:01] can probably use just 6G of RAM though [20:13:14] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [20:16:21] !log deployment-prep enabled trusty-backports on deployment-puppetmaster [20:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [20:19:29] 06Labs, 10Labs-Infrastructure: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#2653788 (10AlexMonk-WMF) [20:27:48] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2653808 (10Andrew) [20:27:50] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653807 (10Andrew) [20:30:47] andrewbogott: if you can get us the new flavor tonight that would be nice. 
That is more or less on the way to migrate to Jessie :] [20:32:39] 06Labs, 10Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2653821 (10madhuvishy) [20:33:47] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2653843 (10Andrew) [20:33:49] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653840 (10Andrew) 05Open>03Resolved a:03Andrew [20:33:53] hashar: done, I think [20:34:14] !log tools Created new instance tools-webgrid-lighttpd-1415 (T146212) [20:34:16] T146212: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212 [20:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:34:23] !log tools Created new instance tools-webgrid-lighttpd-1416 (T146212) [20:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:34:42] !log tools Created new instance tools-webgrid-lighttpd-1418 (T146212) [20:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:37:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [20:42:22] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [20:43:00] 06Labs: Undo labtest realm hacks - https://phabricator.wikimedia.org/T146150#2653877 (10Aklapper) [20:44:14] Any ideas how should I set dir="rtl"/dir="ltr" when displaying results from the API? Trying to get fawiki working better [20:46:48] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:47:10] ^ working on it [20:48:24] andrewbogott: got the new flavor, you are awesome thank you very much :) [20:48:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:53:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:06:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [21:07:15] PROBLEM - Host tools-webgrid-lighttpd-1414 is DOWN: CRITICAL - Host Unreachable (10.68.20.250) [21:08:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [21:13:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [21:17:06] !log tools Pooled new sge exec node tools-webgrid-lighttpd-1415 (T146212) [21:17:09] T146212: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212 [21:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [21:21:14] !log tools.paste Restarted webservice, was returning 502s [21:23:39] ^ That bug still exists. 
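
On the dir="rtl" question asked above: one way (an assumption about the asker's setup, not what they actually ended up doing) is to ask the target wiki's own API for its directionality using the requests library; the siteinfo general section carries an rtl flag on right-to-left wikis such as fawiki.

```python
import requests

def wiki_dir(api_url):
    r = requests.get(api_url, params={
        'action': 'query', 'meta': 'siteinfo',
        'siprop': 'general', 'format': 'json',
    })
    general = r.json()['query']['general']
    return 'rtl' if 'rtl' in general else 'ltr'

# e.g. wrap API results in <div dir="...">
print(wiki_dir('https://fa.wikipedia.org/w/api.php'))  # expected: rtl
print(wiki_dir('https://en.wikipedia.org/w/api.php'))  # expected: ltr
```
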
[21:23:48] !log tools Pooled new sge exec node tools-webgrid-lighttpd-1416 (T146212) [21:23:49] T146212: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212 [21:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [21:26:39] 06Labs, 10Adminbot: Get a cloak for morebots & labs-morebots - https://phabricator.wikimedia.org/T140547#2468752 (10Paladox) Hi, I can file for the request. But I need to know the owner of the bot with access, since the request will include verifying that you own the bot. [21:32:22] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [21:37:20] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22) [22:05:59] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [22:06:09] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [22:09:29] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170) [22:13:08] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [22:34:23] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2653709 (10greg) (This isn't a quota increase request, if that was a mis-aligned upstream task addition ;) ). [22:46:32] Hi everyone [22:46:53] My webservice is down and I'm getting errors when I try to webservice start [22:47:44] File "/usr/bin/webservice-runner", line 26, in - proxy.register(port) - File "/usr/lib/python2.7/dist-packages/toollabs/webservice/proxy.py", line 31, in register - current_ip = socket.gethostbyname(socket.getfqdn()) [22:59:47] Hi ? [22:59:49] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2654467 (10AlexMonk-WMF) [22:59:51] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 07HHVM: OpenStack flavor for beta cluster deployment servers - https://phabricator.wikimedia.org/T146209#2654466 (10AlexMonk-WMF) [23:00:32] Solved :) [23:08:39] And again the same [23:11:58] hi all, I have created some weeks ago a new tool called `wscontest`, I have put together a simple web page for describing the tool. I have created the web page locally and then rsync'd it to /data/project/wscontest/public_html/ [23:12:55] however if I go to https://tools.wmflabs.org/wscontest/ I get an error [23:13:01] what am I missing? [23:14:31] CristianCantoro: you need to do webservice start at least once, I just did it for you and it works now [23:14:31] I don't see an error CristianCantoro [23:14:35] ah [23:15:05] Uhm... Anyone? https://www.irccloud.com/pastebin/huxREz97/ [23:15:09] yuvipanda: how can I do that? [23:15:17] Any help for me, please? [23:15:30] jem: doesn't it give any more information ? [23:15:33] jem: I just got that too. [23:15:39] Platonides: Same error as mine. [23:15:42] See my paste. [23:15:45] Ah, I see [23:15:46] which tool is it? [23:15:46] Yes [23:15:50] yuvipanda: xtools [23:16:18] (I tried a few other tools at the beginning and they seemed Ok) [23:16:50] hmm [23:17:07] it's as if it can't get the domain name for localhost [23:17:12] madhuvishy|food: around? I wonder if it's failing on the new lighttpd nodes [23:17:39] Krenair: didn't you do something around this recently? [23:18:02] around what?
[23:19:10] Krenair: around resolving a host's own fqdn [23:19:21] not really, no [23:19:27] I'll take a look though [23:20:02] ok [23:20:26] Oh, right [23:20:29] yuvipanda [23:20:36] What host did jem's host run on? [23:20:49] Because I think I know exactly which bug this is, just need to confirm [23:20:59] I don't really have an easy way to confirm [23:21:05] Okay [23:21:07] but my suspicion is the new hosts [23:21:10] Which host did any of the jobs that failed in that way? [23:21:25] Can I get that info in some way? [23:21:32] they're all failing in the same way, so I don't have an easy way to find out :) [23:21:42] Krenair: try tools-webgrid-lighttpd-1418.eqiad.wmflabs or 16? [23:21:46] Nothing in my log. [23:22:14] Yep [23:22:15] Hah [23:22:16] It's this [23:22:42] krenair@tools-webgrid-lighttpd-1418:~$ hostname -f; python -c 'import socket; print(socket.getfqdn())' [23:22:42] tools-webgrid-lighttpd-1418.tools.eqiad.wmflabs [23:22:42] ci-jessie-wikimedia-194525.contintcloud.eqiad.wmflabs [23:22:51] My favourite labs bug [23:23:22] 200.20.68.10.in-addr.arpa domain name pointer tools-webgrid-lighttpd-1418.tools.eqiad.wmflabs. [23:23:23] 200.20.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-187714.contintcloud.eqiad.wmflabs. [23:23:23] 200.20.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-194525.contintcloud.eqiad.wmflabs. [23:24:02] !log tools depool tools-webgrid-lighttpd-1416 and 1418, they aren't in actual working order [23:24:05] https://phabricator.wikimedia.org/T115194 [23:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [23:24:09] Matthew_: jem should work now (I just depooled the new nodes) [23:24:30] WORKSFORME. [23:24:32] Thanks :) [23:24:32] yuvipanda, there's nothing wrong with those hosts, just the labs dns system [23:24:37] madhuvishy|food: chasemp I depooled the new hosts since they were running into different problems [23:24:45] Krenair: something wrong 'about' these hosts maybe? :) [23:24:50] nope [23:26:18] Krenair: they share an IP that used to belong to something else, I guess? [23:26:35] didn't check 1416 [23:26:39] 1418 does [23:26:47] right [23:27:05] yep, 1416 has the same issue [23:28:27] 06Labs, 10Tool-Labs: Add 3 webgrid-lighttpd trusty nodes to tools project - https://phabricator.wikimedia.org/T146212#2654520 (10yuvipanda) I depooled these - they were running into issues wrt https://phabricator.wikimedia.org/T115194, causing problems when people started webservices: ``` File "/usr/bin/webs... [23:29:06] Krenair: can you comment on ^ as to what's happening? I only have a vague idea [23:29:37] What more do you want to know? [23:30:31] Krenair: is there a way to fix this for the particular instances so madhuvishy|food can go ahead and finish pooling these? [23:30:35] without attempting to fix the problem at large? [23:31:20] If there was I wouldn't do it [23:31:27] I want this bug fixed properly [23:31:33] Not just brushed out of the way of the tools project [23:33:09] Actually, yes, I think there is.
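
The hostname -f versus socket.getfqdn() mismatch shown above is exactly what trips webservice-runner (its traceback earlier in the log calls socket.gethostbyname(socket.getfqdn())). A small check along the same lines, assuming the instance's short name resolves through the usual search domain:

```python
import socket

short = socket.gethostname()            # e.g. tools-webgrid-lighttpd-1418
ip = socket.gethostbyname(short)        # forward lookup via the search domain
reverse = socket.gethostbyaddr(ip)[0]   # whichever PTR the resolver hands back

print(short, ip, reverse)
if not reverse.startswith(short + '.'):
    print('stale PTR: reverse DNS still points at a deleted instance on this IP')
```
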
[23:33:22] huh [23:33:41] reading backscroll [23:34:07] The only thing I've found about this bug that I haven't yet written on the ticket is sink_nova_fixed_multi appears to create the reverse records, but not delete them on instance deletion [23:34:39] Krenair: given that the ops offsite is next week and we ran out of trusty compute nodes, I doubt we'll be able to actually fix it 'at large' before that, but would not want to leave the grid in a resource starved state when travelling [23:34:46] yuvipanda: Random question. Does Tool Labs have infrastructure in place to return 403s? Specifically if it sees automated request? [23:35:01] Matthew_: there are ways to block individual IPs or user agents [23:35:25] yuvipanda: Has that been done for uptimerobot? Or is that an individual tool creation problem? [23:35:53] i've no idea what uptimerobot is :) [23:36:03] if you want a ua blocked do create a ticket :) [23:36:04] Okay. So that answers that question. [23:36:24] yuvipanda: Yes, Ok here, thanks [23:36:27] https://uptimerobot.com - xtools-ec and xtools-articleinfo are returning 403 when polled by this service. [23:36:46] hey andrewbogott [23:43:37] yuvipanda: Krenair: hmmm maybe that's why gridengine-exec on 1418 wouldn't start [23:44:18] kept complaining about not being able to bind socket [23:44:55] I'd log into the designate database [23:47:16] Krenair: you don't have access to it, right? [23:47:37] I'm also curious now - is this failing dns lookups, rather than rdns lookups? [23:47:57] right. [23:48:13] just in labtest, where we don't seem to have this issue - or haven't found a case of it there yet [23:48:22] yuvipanda, it's just bad rdns data [23:48:36] ptr records that haven't been cleaned up properly [23:48:48] select * from designate.recordsets where domain_id = '8d114f3c815b466cbdd49b91f704ea60' and name = '200.20.68.10.in-addr.arpa.'; [23:49:37] madhuvishy: can you work with Krenair and investigate? [23:53:20] select records.data, records.managed_resource_id from records join recordsets on records.recordset_id = recordsets.id where records.domain_id = '8d114f3c815b466cbdd49b91f704ea60' and recordsets.name = '200.20.68.10.in-addr.arpa.' and recordsets.type = 'PTR'; [23:54:35] and [23:54:44] https://www.irccloud.com/pastebin/pyVYS5aD/ [23:56:54] and just to be very sure [23:58:44] select deleted_at, uuid from nova.instances where display_name = 'ci-trusty-wikimedia-164229'; [23:59:20] hmmm
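
A way to see every PTR record attached to the recycled address (10.68.20.200 from the host output above) without touching the designate database, assuming the dnspython package is available:

```python
import dns.resolver      # dnspython
import dns.reversename

rev = dns.reversename.from_address('10.68.20.200')
for rdata in dns.resolver.query(rev, 'PTR'):
    print(rdata.target)  # should list the three names seen earlier
```
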