[00:01:01] (PS1) Dzahn: add shell account for phuedx and add to mortals [operations/puppet] - https://gerrit.wikimedia.org/r/112150
[00:05:21] (PS2) Dzahn: add shell account for phuedx and add to mortals [operations/puppet] - https://gerrit.wikimedia.org/r/112150
[00:06:54] (CR) Dzahn: [C: 1] "actually we already have another ssmith, so it's good this is a nickname and can match labs" [operations/puppet] - https://gerrit.wikimedia.org/r/112150 (owner: Dzahn)
[00:24:21] (PS1) Jforrester: Enable wgTemplateDataUseGUI on MediaWiki.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112156
[00:40:26] mutante: Are you Daniel Zahn?
[00:40:36] yes
[00:40:55] ok... aude just replied what I wanted to ask you about in the rt ticket
[00:40:58] thanks, aude :D
[00:41:49] feel free to query if you have questions about the key etc
[00:42:30] mutante: As aude said... can you just create the patch, leave the ssh key empty and I then amend
[00:43:14] hoo: eh, yea, we can try that, fair
[01:06:09] (PS1) Dzahn: add shell account for hoo, admins restricted [operations/puppet] - https://gerrit.wikimedia.org/r/112168
[01:07:12] mutante: mh... deploy access would be nice
[01:08:57] (CR) Dzahn: "hoo will amend with the SSH key, please don't use the labs key for prod though. also might be admins::mortals for deployment part and sho" [operations/puppet] - https://gerrit.wikimedia.org/r/112168 (owner: Dzahn)
[01:09:45] hoo: i thought that part can go right into code review then
[01:10:06] mutante: Ok
[01:10:12] you wanted to amend anyways, so suggest what you want to request:)
[01:10:22] then of course it needs review from others anwyays
[01:10:30] Ok, so I can just add mortals?
[01:10:34] * myself to
[01:10:47] Is this a trick?
[01:10:56] :-)
[01:11:09] hoo: yes, mortals means software deployer
[01:11:13] When was the last volunteer shell account made?
[01:11:28] isnt it fun that he also amends to the actual change :p
[01:11:29] hashars?
[01:11:35] i dunno:)
[01:11:49] Krenair: You'd have to look at [[m:root]].
[01:12:03] Can that be trusted
[01:12:18] hah, no.
[01:12:25] There are actually quite a few volunteer shell users.
[01:12:27] Or there were.
[01:12:29] doesn't have dates/times
[01:12:34] Yes Gloria, but when was the last one made?
[01:12:36] history
[01:12:38] :P
[01:12:53] Krenair: Right, you'd have to look at the current list and audit.
[01:14:07] git log admins.pp
[01:14:12] shrug
[01:14:22] Might predate git ;)
[01:14:27] yes
[01:17:47] puppet/manifests$ grep -o "class.*inherits" admins.pp | cut -d " " -f2 | sort | xargs
[01:21:27] (PS2) Hoo man: add shell account for hoo, admins restricted/ admin mortals [operations/puppet] - https://gerrit.wikimedia.org/r/112168 (owner: Dzahn)
[01:23:41] (CR) Hoo man: "Added a fresh ssh key, also added myself to admin::mortals" [operations/puppet] - https://gerrit.wikimedia.org/r/112168 (owner: Dzahn)
[01:24:25] Gloria for shell >.>
[01:26:19] hoo or hooman
[01:26:19] class hoo inherits foo :P
[01:26:19] i kept thinking it
[01:26:19] but you are "hoo" in labs and LDAP, so ...
[01:26:19] givenName: hoo
[01:26:19] cn: Hoo man
[01:26:21] sn: hoo
[01:26:21] :p
[01:26:35] hahaha
[01:27:29] Is there any point being in mortals AND restricted?
[01:28:12] i don't think so, no
[01:28:34] will amend
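The audit question above, when the last volunteer shell account was added, can be approximated from the puppet repository's history, since accounts are defined in admins.pp. A minimal sketch building on the grep one-liner quoted above; the class name accounts::example is a hypothetical placeholder, not an account from the log:

    # List the account/admin classes defined in admins.pp (the one-liner above, unpacked)
    grep -o "class.*inherits" manifests/admins.pp | cut -d " " -f2 | sort

    # Show the commit that first introduced a given account class; -S selects
    # commits that add or remove the string, --reverse puts the oldest first
    git log --reverse --format='%ad %h %s' -S 'accounts::example' -- manifests/admins.pp | head -n 1

As noted in the channel, accounts older than the repository's git import would still predate this history.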
[01:28:57] What does the restricted group do?
[01:29:05] (PS3) Hoo man: add shell account for hoo, admins mortals [operations/puppet] - https://gerrit.wikimedia.org/r/112168 (owner: Dzahn)
[01:29:12] give you a shell on bastion hosts
[01:29:26] but not deploying stuff to cluster
[01:29:38] Ok, so then you have to be allowed to only specific machines?
[01:29:42] s/stuff/mw-config
[01:30:02] yea, you'd get restricted AND a specific node
[01:30:07] for special cases
[01:30:20] so you can get to bastion and then jump from there elsewhere
[01:30:27] that would be in site.pp though
[01:30:44] well, first of all you just need an account to use
[01:31:31] so, add account in admins.pp, then use it like include accounts::fooo in a node in site.pp
[01:31:53] and/or add to the admin classes in admins.pp depending on what it's all for
[01:37:37] (CR) Dzahn: "you should talk to springle about the db part, so the suggestion was that you get admins::restricted to be able to jump to a bastion host " [operations/puppet] - https://gerrit.wikimedia.org/r/112168 (owner: Dzahn)
[01:40:33] Shouldn't need anything
[01:40:36] just use sql dbname
[01:42:17] not being in mortals in unfortunate
[01:45:41] hoo, you're not going to be in mortals now?
[01:45:58] maybe
[01:49:27] (PS1) Dzahn: decom 'harmon' - rm from site/dsh/partman/dhcp [operations/puppet] - https://gerrit.wikimedia.org/r/112171
[01:51:25] (CR) Hoo man: "According to Reedy the DB part doesn't need any further changes. Not being in mortals would be very unfortunate." [operations/puppet] - https://gerrit.wikimedia.org/r/112168 (owner: Dzahn)
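The flow described above (define the account in admins.pp, then either add it to an admin class like admins::mortals or include it on a specific node in site.pp) can be sanity-checked from a puppet checkout. A rough sketch, reusing the hypothetical accounts::fooo from the conversation:

    # Is the account defined, and do any admin classes (mortals, restricted, ...) pull it in?
    grep -n "accounts::fooo" manifests/admins.pp

    # Is it included directly on a node in site.pp (the special-case path)?
    grep -n "include accounts::fooo" manifests/site.pp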
[01:53:11-01:54:11] PROBLEM on ms-be1001: DPKG, Disk space, RAID, puppet disabled and all swift services (object/container/account server, auditor, replicator, updater, reaper) went CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds (swift-container-server: Timeout while attempting connection)
[01:54:22] (CR) Dzahn: "no services running per Ariel, has been reclaimed as spare and not doing anything it seems" [operations/puppet] - https://gerrit.wikimedia.org/r/112171 (owner: Dzahn)
[01:55:23] eh, that box is still up
[02:13:36] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-08 02:13:35+00:00
[02:13:45] Logged the message, Master
[02:27:14] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-08 02:27:14+00:00
[02:27:21] Logged the message, Master
[02:27:47] (CR) Byfserag: "We have consensus." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123)
[02:28:06] (CR) Byfserag: [C: 1] Add transwiki import options for zh.wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123)
[02:47:46] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-08 02:47:46+00:00
[02:47:54] Logged the message, Master
[03:58:12] (PS1) Andrew Bogott: Replace some nova-network configs for eqiad. [operations/puppet] - https://gerrit.wikimedia.org/r/112186
[03:58:14] (PS1) Andrew Bogott: Add the use_neutron switch. [operations/puppet] - https://gerrit.wikimedia.org/r/112187
[03:59:09] (CR) jenkins-bot: [V: -1] Add the use_neutron switch. [operations/puppet] - https://gerrit.wikimedia.org/r/112187 (owner: Andrew Bogott)
[04:07:16] (PS2) Andrew Bogott: Add the use_neutron switch. [operations/puppet] - https://gerrit.wikimedia.org/r/112187
[04:09:00] (CR) Andrew Bogott: [C: 2] Replace some nova-network configs for eqiad. [operations/puppet] - https://gerrit.wikimedia.org/r/112186 (owner: Andrew Bogott)
[04:12:34] (PS3) Andrew Bogott: Add the use_neutron switch. [operations/puppet] - https://gerrit.wikimedia.org/r/112187
[04:18:09] (CR) Andrew Bogott: [C: 2] Add the use_neutron switch. [operations/puppet] - https://gerrit.wikimedia.org/r/112187 (owner: Andrew Bogott)
[04:31:11] PROBLEM - MySQL Idle Transactions on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:31:11] PROBLEM - MySQL InnoDB on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:32:21] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 140 statistics
[04:33:11] RECOVERY - MySQL InnoDB on db1021 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[04:33:11] RECOVERY - MySQL Processlist on db1021 is OK: OK 1 unauthenticated, 0 locked, 1 copy to table, 11 statistics
[04:34:02] RECOVERY - MySQL Idle Transactions on db1021 is OK: OK longest blocking idle transaction sleeps for 0 seconds
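The MySQL Processlist check above counts connections by thread state (unauthenticated, locked, copy to table, statistics) and alerts on thresholds; the burst of 140 threads in the 'statistics' state is what tripped it. The same breakdown can be pulled by hand; a sketch, assuming shell access and credentials on the database host:

    # Group current connections by thread state, roughly what the Icinga check counts
    mysql -e "SELECT state, COUNT(*) AS n FROM information_schema.processlist GROUP BY state ORDER BY n DESC;"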
[04:36:46] (CR) Zhuyifei1999: [C: 1] Add transwiki import options for zh.wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123)
[04:40:49] (PS1) Andrew Bogott: Add role::nova::network to labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/112190
[04:43:14] (CR) Andrew Bogott: [C: 2] Add role::nova::network to labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/112190 (owner: Andrew Bogott)
[04:58:21] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:00:48] ^ my fault, it'll be right back
[05:03:02] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[05:21:47] (PS1) Andrew Bogott: Use 10g interface for labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/112191
[05:22:44] (CR) Andrew Bogott: [C: 2] Use 10g interface for labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/112191 (owner: Andrew Bogott)
[05:30:48] (PS1) Andrew Bogott: Set up eth4.1118 on labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/112192
[05:33:03] (CR) Andrew Bogott: [C: 2] Set up eth4.1118 on labnet1001 [operations/puppet] - https://gerrit.wikimedia.org/r/112192 (owner: Andrew Bogott)
[05:34:58] RECOVERY - Host labnet1001 is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms
[07:26:18] PROBLEM - SSH on ms-be1001 is CRITICAL: Server answer:
[07:38:18] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[07:58:48] PROBLEM - Disk space on virt10 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 44147 MB (3% inode=99%):
[09:10:33] (PS1) Andrew Bogott: Added auth_uri for nova. [operations/puppet] - https://gerrit.wikimedia.org/r/112196
[09:12:30] (CR) Andrew Bogott: [C: 2] Added auth_uri for nova. [operations/puppet] - https://gerrit.wikimedia.org/r/112196 (owner: Andrew Bogott)
[09:12:32] (CR) TTO: (bug 61014) add he.wiki checkusers additional rights (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111985 (owner: Matanya)
[09:18:28] PROBLEM - SSH on ms-be1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:27:50] (PS1) Springle: Appears to need same fix as If0117511497ee1457e63d2710f1c3e29f4f44bc0 [operations/puppet] - https://gerrit.wikimedia.org/r/112197
[09:29:38] (CR) Springle: [C: 2] Appears to need same fix as If0117511497ee1457e63d2710f1c3e29f4f44bc0 [operations/puppet] - https://gerrit.wikimedia.org/r/112197 (owner: Springle)
[09:31:01] !log powercycling ms-be1001, 'soft lockup cpu stuck' on console, no login prompt
[09:31:10] Logged the message, Master
[09:32:28] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:33:32] lies, it's already back
[09:33:38-09:34:18] RECOVERY on ms-be1001: Host UP (RTA = 0.30 ms), SSH, Disk space, RAID (optimal, 14 logical, 14 physical), DPKG, puppet and all swift object/container/account services OK again
[09:42:16-09:42:57] RECOVERY - MySQL disk space is OK: DISK OK on db35, db1017, db1038, es1006, db1001, db71, db74, es1002, db1023, es8, db1030, db1037, db1047, es1010, db1006, es1001 and db1035
[09:59:26] oh my :-D
[09:59:43] thanks sprin gle
[10:04:31-10:05:21] RECOVERY - MySQL disk space is OK: DISK OK on db1020, es1005, es1007, db1033, db38, db67, db68, db1040, db1027, db1002, db1043, db1051, db1045, db1028, db1039, db1042, db1019, db1031, db1048, db69, db63, es4, db1005, db1010, db1016, db1024, db1026, db1036, db1046, db1059, db1060, es1003, es1008, db1022, db1011, db1055, db1058, db1003, db1004, db1029, es1009, db1052, db1049, db1021, db48, db1018, es1004, db1015, db1050 and db73
[10:05:36] more like it
[10:11:42] hey
[10:11:59] I saw icinga spam, then saw I just saw it's recoveries
[10:12:01] nevermind me :)
[10:20:44] :)
[10:34:04] RECOVERY - MySQL disk space on es7 is OK: DISK OK
[10:34:04] RECOVERY - MySQL disk space on db1007 is OK: DISK OK
[10:34:34] RECOVERY - MySQL disk space on db72 is OK: DISK OK
[10:34:54] RECOVERY - MySQL disk space on db1041 is OK: DISK OK
[10:50:24] (PS1) Andrew Bogott: Set network_api_class for nova-network [operations/puppet] - https://gerrit.wikimedia.org/r/112199
[10:52:38] (CR) Andrew Bogott: [C: 2] Set network_api_class for nova-network [operations/puppet] - https://gerrit.wikimedia.org/r/112199 (owner: Andrew Bogott)
[10:57:37] heh, you weren't kidding when you were saying you'd work today
[11:38:04] hm. maybe I'll join in the fun tomorrow
[11:38:16] or maybe I'll ask for my revocation to be merged
[11:38:21] I'll flip a coin to choose
[11:44:10] I just counted and I've had 24 bugs fixed in salt without writing a single line of code.
[11:51:12] Ryan_Lane: Want to help me debug a nova-network problem as a distraction from deployment?
[11:51:29] andrewbogott: sure. what's the issue?
[11:51:45] Compute node says "Timeout while waiting on RPC response - topic: "network.virt1001"
[11:52:02] nova service-list thinks that nova-network is just fine, but nothing ever appears in the network log.
[11:52:19] nova-manage service list?
[11:52:38] what's the network node say in its logs?
[11:52:41] is it picking up a job?
[11:52:54] that's the thing, the network node is pretty much silent.
[11:52:55] do you have a network defined? nova-manage network list
[11:53:10] Well -- when I issue commands to network from the controller it gets them. But it gets nothing from the compute node.
[11:53:43] usually I see issues like this when the network isn't defined
[11:53:46] So, right now I'm just trying to delete a failed instance. The network log says /nothing/ about it. And then later on compute reports a timeout on an rps call.
[11:54:05] Oh, um…
[11:54:09] # nova network-list
[11:54:11] +--------------------------------------+-------+---------------+
[11:54:12] | ID                                   | Label | Cidr          |
[11:54:14] +--------------------------------------+-------+---------------+
[11:54:15] | 5003d989-9969-4414-a2f4-720c4a9380ad | vmnet | 10.68.16.0/24 |
[11:54:15] +--------------------------------------+-------+---------------+
[11:54:21] vmnet?
[11:54:28] that's a weird netowrk name
[11:54:31] *network name
[11:54:33] that's just what it named it.
[11:54:40] is it the same network as defined in nova's config?
[11:54:54] that's a reasonable question, looking...
[11:55:52] I have no clue why nova requires that to be defined in the database and the config file, but that's the stupidity that exists currently :)
[11:56:10] hm, not clear to me that a network /is/ named in the nova config. I don't see it anywhere.
[11:56:15] no?
[11:56:36] doesn't look like it. But, lemme see if I can kill that network and recreate with the same name as in tampa...
[11:56:55] fixed_range isn't defined?
[11:57:17] Oh! Sorry, I thought you meant is 'vmnet' specified in the config, which it isn't.
[11:57:20] But, yeah, same range.
[11:57:23] ah
[11:57:25] fixed_range=10.68.16.0/24
[11:57:30] yeah, the name isn't terribly important
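As the exchange above notes, nova-network of this era wants the network defined in two places: fixed_range in nova.conf and a matching row in the database. Killing and recreating the database side to match would look roughly like the following; flag spellings varied between OpenStack releases and the label eqiad-vmnet is a made-up example, so treat this as illustrative rather than exact:

    # nova.conf already carries: fixed_range=10.68.16.0/24
    # Recreate the matching network row with an explicit label instead of the
    # auto-picked "vmnet" (flags as in folsom/havana-era nova-manage)
    nova-manage network create --label=eqiad-vmnet --fixed_range_v4=10.68.16.0/24 \
        --num_networks=1 --network_size=256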
[11:57:40] what's the hostname for the network node?
[11:57:40] ok, that's what I figured since nova just picked one for me :/
[11:57:50] labnet1001
[11:57:55] compute node is virt1001
[11:57:59] controller virt1000
[11:58:03] Just running those three at the moment.
[11:58:17] heh
[11:58:28] /var/log/nova/nova-network.log
[11:58:33] I see a traceback there
[11:58:46] 'AdminRequired: User does not have admin privileges'
[11:58:48] Oh, about the admin privs?
[11:58:55] That's from a while ago, when I was creating 'vmnet'
[11:59:02] I think -- isn't that in response to a create?
[11:59:27] what's the controller? virt1000?
[11:59:32] yep
[12:00:09] Requests to e.g. associated or disassociate should be coming in on rabbit, shouldn't they? So I don't know what nova-network's problem is. It can clearly talk to virt1000 via rabbit.
[12:00:15] But seemingly not to virt1001.
[12:00:34] it doesn't need to talk to virt1001
[12:00:45] it should always talk to it through the queue
[12:01:34] exactly! Which is why I'm so puzzled...
[12:01:46] it can talk to the queue but isn't getting any messages from compute
[12:02:28] hm. it seems this version of openstack uses conductor
[12:02:38] is conductor running? is it configured to be used?
[12:02:46] conductor is running
[12:02:51] it's running. I /think/ it's configured.
[12:03:13] that just basically forces everything through the queue, right?
[12:03:37] Yeah, I think it's to manage db calls.
[12:03:39] it makes the compute nodes get data through conductor rather than directly from the database
[12:03:43] right
[12:04:42] how are you testing instance creation?
[12:05:25] well… at the moment I've hit my quota so I'm testing instance deleting instead :)
[12:05:34] But, via 'nova boot' on virt1000.
[12:05:38] Or, in this case, 'nova delete'
[12:06:48] can you give me a command?
[12:07:02] you can just use more than one project, if you hit a limit ;)
[12:07:22] True
[12:07:35] Oh, just try 'nova delete eqiadtest10'
[12:07:45] that'll return immediately but a minute later you'll see an error in the compute log.
[12:09:35] or if you want to test creation… nova --os-tenant-name openstack boot --flavor 2 --image testimage test2
[12:10:56] hm, the internet blames this on misconfigured keystone
[12:11:19] ok? link?
[12:11:25] keystone is a massive pain in the ass to configure
[12:11:42] well, this isn't conclusive: http://www.gossamer-threads.com/lists/openstack/dev/33878
[12:12:33] It's possible, although keystone is working ok overall. RPC timeout seems to be a pretty generic 'something isn't working' error...
[12:12:40] lots of different things turn up online.
[12:13:11] If nova-network is failing to authenticate (does it authenticate, even?) I'd think that should appear in the log
[12:13:30] well, it may not be a problem of authenticating
[12:13:55] I guess I should remove the neutron service from keystone… not that that should have anything to do with this.
[12:13:59] But, let me do that now anyway
[12:14:58] ok, done
[12:15:08] whoah, no --
[12:15:18] hang on, I totally broke keystone :) will have it fixed in a minute
[12:17:09] ok, should be back now.
[12:17:11] 460f59982a7a484683cdd9b828317f5e | neutron | network | OpenStack Network Service ?
[12:17:18] neutron?
[12:17:23] yeah, that's what I was cleaning up just now
[12:17:26] ah
[12:17:27] heh
[12:17:27] it's gone now, right?
[12:17:33] yep
[12:17:41] that may have been the issue
[12:17:53] possible, although nothing was pointing to it...
[12:17:57] compute probably defaults to neutron if its configured
[12:18:00] let's see
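The leftover catalog entry pasted above is what got cleaned up here. With the python-keystoneclient CLI of that era the removal looks roughly like this; the endpoint ID is a hypothetical placeholder, while the service ID is the one from the paste:

    # Spot the stray neutron entry in the service catalog
    keystone service-list | grep -i neutron

    # Delete any endpoint pointing at it, then the service itself
    keystone endpoint-list
    keystone endpoint-delete <endpoint-id>
    keystone service-delete 460f59982a7a484683cdd9b828317f5e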
[12:19:30] nova network's logs are way less chatty than they used to be
[12:19:39] I'm not sure that's a great thing :)
[12:19:45] keystone's logs are abysmal
[12:19:59] I was assuming that nova-network just wasn't doing a single thing.
[12:20:01] Hence, quiet logs.
[12:20:11] There's no indication it's getting any requests at all, is there?
[12:20:24] yeah. none
[12:20:30] Hm, probably there's a way to monitor queue requests.
[12:21:08] there are
[12:21:28] rabbitmqctl list_queues
[12:22:04] of course when things timeout... :)
[12:22:25] I'd expect to see a request from compute, at least...
[12:22:53] Well -- is there any scenario where compute can talk to controller but not to network? The queue is all the same isn't it?
[12:23:55] well, no. compute should put a message into network's queue
[12:24:23] right, so can we get a running log of the 'network' channel to see when that happens?
[12:25:27] not sure
[12:27:00] Trying to sort out if this timeout is actually an rpc problem or if it's just the result of nova-network failing at the task.
[12:27:10] Although in the latter case I would /really/ hope to see something in the log
[12:27:43] it's almost definitely not a timeout
[12:28:13] it's most likely that the message isn't going into the right queue, or that the service isn't picking it up
[12:28:36] create an instance and run the rabbit command to see which queue gets the message
[12:29:24] Are the numbers in list_connections cumulative? So it's just a question of seeing what increments?
[12:31:22] (restarting rabbit to simplify the view...)
[12:32:12] no, it's based on what's sent in and whats timed out
[12:32:35] maybe it's possible to get cumulative numbers. my rabbit knowledge is limited
[12:33:52] So, just confirming -- when you said "see which queue gets the message," do you know how to do that? Or shall I google?
[12:34:43] hm, there is network.labnet1001 and network
[12:35:01] as though someone things that we're running a network service on each node...
[12:35:05] lemme see if that's the same in tampa
[12:35:37] well, run rabbitmqctl list_queues
[12:35:44] after you create an instance
[12:35:57] run it before too, to see which value changed, of course :)
[12:36:14] it takes a bit for a message to time out
[12:36:47] no diff :(
[12:37:17] (btw, there are the same three network queues in tampa and eqiad)
[12:37:34] (PS1) Tim Landscheidt: Tools: Install package supybot [operations/puppet] - https://gerrit.wikimedia.org/r/112202
[12:37:55] create one rather than deleting one
[12:38:04] ah, you did both
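The before/after trick suggested above (run rabbitmqctl list_queues around the operation and see which queue's depth changes) scripts naturally. A minimal sketch, run on the rabbit host; eqiadtest10 is the instance from earlier in the session:

    # Snapshot queue depths, trigger the RPC, snapshot again, diff
    rabbitmqctl list_queues name messages > /tmp/queues.before
    nova delete eqiadtest10
    sleep 10    # give the message time to land in a queue (or time out)
    rabbitmqctl list_queues name messages > /tmp/queues.after
    diff /tmp/queues.before /tmp/queues.after

The "no diff :(" above is exactly the suspicious outcome: the message never showed up in any network queue.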
[12:39:18] well I really need to sleep
[12:39:27] sorry I couldn't help you more
[12:39:56] you helped me eliminate things!
[12:39:58] thanks, sleep well.
[12:40:25] yw. ttyl
[13:14:02] (PS1) Andrew Bogott: Include the nova-api-metadata service in havana. [operations/puppet] - https://gerrit.wikimedia.org/r/112203
[13:14:05] (PS1) Andrew Bogott: Add rabbit login/password in case that helps. [operations/puppet] - https://gerrit.wikimedia.org/r/112204
[13:17:37] (PS2) Andrew Bogott: Include the nova-api-metadata service in havana. [operations/puppet] - https://gerrit.wikimedia.org/r/112203
[13:17:39] (PS2) Andrew Bogott: Add rabbit login/password in case that helps. [operations/puppet] - https://gerrit.wikimedia.org/r/112204
[13:19:05] (CR) Andrew Bogott: [C: 2] Include the nova-api-metadata service in havana. [operations/puppet] - https://gerrit.wikimedia.org/r/112203 (owner: Andrew Bogott)
[13:19:15] (CR) Andrew Bogott: [C: 2] Add rabbit login/password in case that helps. [operations/puppet] - https://gerrit.wikimedia.org/r/112204 (owner: Andrew Bogott)
[13:26:41] (PS1) Andrew Bogott: Revert "Include the nova-api-metadata service in havana." [operations/puppet] - https://gerrit.wikimedia.org/r/112205
[13:27:56] (CR) Andrew Bogott: [C: 2] Revert "Include the nova-api-metadata service in havana." [operations/puppet] - https://gerrit.wikimedia.org/r/112205 (owner: Andrew Bogott)
[15:35:09] PROBLEM - Host labnet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:05] (PS1) Andrew Bogott: Set multi_host to False. [operations/puppet] - https://gerrit.wikimedia.org/r/112237
[15:39:22] (CR) Andrew Bogott: [C: 2] Set multi_host to False. [operations/puppet] - https://gerrit.wikimedia.org/r/112237 (owner: Andrew Bogott)
[16:30:27] (PS1) Tim Landscheidt: Fix typo [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112239
[16:31:42] (CR) Hoo man: [C: 1] Fix typo [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112239 (owner: Tim Landscheidt)
[18:25:09-19:47:59] Varnishkafka Delivery Errors on cp3019, cp3020, cp3021 and cp3022 flapped between CRITICAL and OK (kafka.varnishkafka.kafka_drerr.per_second rates from 0.766667 up to a peak of 472.633331 on cp3019 at 18:37:59); all four hosts had recovered by 19:47:59
[20:41:18] !log aaron synchronized php-1.23wmf13/extensions/Math '4844f52139593f4a324bf99b74d7abb91aac2e54'
[20:41:26] Logged the message, Master
[21:00:41] !log aaron synchronized php-1.23wmf12/extensions/Math 'd96b29ca8b17f35e7068f0d3a16b5e2644e084f9'
[21:00:49] Logged the message, Master
[21:04:33] math seems less broken now
[21:05:59] Yep, looks much better.
[21:09:48] AaronSchulz, did you see this SAL entry from earlier? 00:57 Nemo_bis: Job queue rather long (400k on en.wiki), OTRS reports of password resets not being delivered (bug 43936?), almost no jobs run today according to gdash
[21:10:48] thanks AaronSchulz
[21:12:00] Krenair: a looked at gdash and saw that jobs were running...not sure what that entry was about then
[21:12:32] queues look normal to me
[21:12:43] parsoid is a bit high, but that's it
[21:12:47] at least on enwiki
[21:13:42] They look normal now, do you mean they looked normal then too?
[21:13:51] right
[21:14:23] Is it possible for you to find out what happened to mail sent to specific addresses?
[21:22:01] AaronSchulz?
[21:23:05] sent through what? EmailUser, password reset, etc?
[21:23:55] password reset
[21:27:23] hmm, you'd have to ask someone from ops
[21:28:09] if there is a some sort of bad address response or exim4 log I doubt I'd be able to see it
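Tracing what happened to a password-reset mail, as asked above, usually comes down to grepping the exim main log on the mail relay for the recipient address; that is the "exim4 log" part only ops could see. A sketch, with the log path (Debian's default exim4 layout) and the address as assumptions:

    # => delivered, == deferred, ** bounced: exim's per-attempt result flags
    sudo grep 'user@example.org' /var/log/exim4/mainlog | egrep ' (=>|==|\*\*) '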
[21:52:30] Krenair: you should ask legoktm if he had an answer, he was the one complaining :)
[21:57:36] Nemo_bis, had an answer?
[22:03:52] Krenair: from the OTRS complainers on what happened exactly :)
[22:04:31] I can deal with the people on OTRS, I just don't have much to ask them except "have you received anything since you mailed us this?"
[22:19:08] AaronSchulz: Thank you very much.
[23:52:39] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[23:53:49] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.34 ms