[00:15:57] !log added wikimedia-task-dns-auth_0.18 to the repo, to add support for zero
[00:16:00] Logged the message, Master
[00:16:16] !log installing newer wikimedia-task-dns-auth on all dns servers
[00:16:18] Logged the message, Master
[00:17:56] !log adding zero cnames
[00:17:59] Logged the message, Master
[00:34:22] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[00:43:30] New patchset: Ryan Lane; "Add sudo rights needed to manage gluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2969
[00:43:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2969
[00:44:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2969
[00:44:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2969
[00:49:47] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[00:58:47] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:47] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:47] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[01:32:41] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours
[01:37:20] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours
[01:54:44] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds
[01:54:53] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds
[01:55:36] New patchset: Bhartshorne; "adding the ability to read options from a config file and modify running behavior so we can throttle up and down the cleaner as it's running." [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[01:55:38] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/2970
[01:58:36] New patchset: Bhartshorne; "adding the ability to read options from a config file and modify running behavior so we can throttle up and down the cleaner as it's running." [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[01:58:38] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/2970
[01:59:08] New patchset: Bhartshorne; "adding the ability to read options from a config file and modify running behavior so we can throttle up and down the cleaner as it's running." [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[01:59:10] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/2970
[02:00:36] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2970
[02:00:38] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[02:38:11] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100%
[02:43:53] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Thu Mar 8 02:43:39 UTC 2012
[02:48:45] New patchset: Ryan Lane; "Ensure a run directory exists for the glustermanager" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2971
[02:48:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2971
[02:49:11] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2971
[02:49:13] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2971
[03:01:44] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Thu Mar 8 03:01:31 UTC 2012
[03:04:32] !log powercycled ms-be5; it has been unresponsive for 2 hours.
[03:04:38] Logged the message, Master
[03:07:39] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[03:20:20] super lame.
[03:59:19] maplebed: https://graphite.wikimedia.org/dashboard/temporary-4 <- cautious optimism?
[04:01:00] more likely just the regular cycle.
[04:01:06] wait a week then call it.
[04:01:08] :P
[04:01:20] yeah, we'll see
[04:52:08] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[05:32:47] New patchset: Dzahn; "nagios - move snmp stuff into its own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2972
[05:32:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2972
[05:54:43] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2972
[05:54:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2972
[06:55:39] New patchset: Dzahn; "add iptables rules - let only production network send snmp-traps to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[06:55:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2973
[07:09:39] New patchset: Dzahn; "add iptables rules - let only production network send snmp-traps to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[07:09:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2973
[07:10:47] New patchset: Dzahn; "add iptables rules - let only production network send snmp-traps to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[07:10:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2973
[07:11:51] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2973
[07:11:54] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[07:16:57] New patchset: Dzahn; "sort iptables ports and protocols alphabetically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2974
[07:17:05] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%): /var/lib/ureadahead/debugfs 0 MB (0% inode=95%):
[07:17:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2974
[07:20:26] !log ms1004 ran out of disk - caused by 17G HTCPurger.log.1, trying to gzip it now
[07:20:30] Logged the message, Master
[07:22:56] RECOVERY - Disk space on ms1004 is OK: DISK OK
[07:35:59] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[07:41:53] !log puppet on cadmium broken due to dependency Group[500] for User[catrope]
[07:41:56] Logged the message, Master
[07:52:06] New patchset: Dzahn; "add groups::wikidev to cadmium, puppet broke due to dependency Group[500] for User[catrope] without it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2975
[07:52:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2975
[07:52:56] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2974
[07:52:59] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2974
[07:53:32] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2975
[07:53:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2975
[07:54:50] RECOVERY - Puppet freshness on cadmium is OK: puppet ran at Thu Mar 8 07:54:46 UTC 2012
[07:54:52] !log cadmium fixed by adding groups::wikidev
[07:54:55] Logged the message, Master
[08:14:11] New review: Dzahn; "looking at that docroot as it is now, there are some files and dirs owned by groups "svn" and "svnad..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2888
[08:16:53] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:25] hi, is it a human rebooting cp1019 right now? or all by itself :P
[08:21:49] !log cp1019 went down, then rebooted by itself (i think) after showing "idrac-8W82BP1 Severity: Non Recoverable, SEL:CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted"
[08:21:53] Logged the message, Master
[08:22:49] !log cp1019 - Hitting F1 to continue reboot ( "Alert! System fatal error during previous boot")
[08:22:52] Logged the message, Master
[08:23:02] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[08:27:14] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[08:28:08] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused
[08:36:59] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.161 seconds
[08:38:02] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27545 bytes in 0.108 seconds
[08:42:59] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[09:11:02] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[09:13:52] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.161 seconds
[09:17:10] RECOVERY - Puppet freshness on mw1010 is OK: puppet ran at Thu Mar 8 09:16:56 UTC 2012
[09:19:52] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[09:19:52] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[09:28:54] New patchset: Hashar; "Bug 28469 - Make SVN Documentation be indexed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[09:29:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2888
[09:31:21] New review: Hashar; "Good point! I have changed the recursive declaration so it put files in the svnadm group which most..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[09:31:49] New review: Hashar; "And I also rebased the patch set :-)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2888
[09:48:41] hashar: hmm.. if svnadm needs to access subdirectories, it would need 5 (r) or 7 (rw)
[09:49:04] current permissions are sure a bit chaotic though :)
[09:49:42] hm
[09:49:43] and then there is "svn" vs. "svnadm".. hrmm.. guess we really need to find out what it really needs for each of those ;/
[09:50:10] and there is another problem with svn.pp btw
[09:50:57] the "viewvc" part in puppet is not being applied it seems
[09:50:59] I am confused, patch set 4 on change 2888 makes it: 0664 -> rw-rw-r--
[09:51:59] it would also change the permissions for the subdirectories when recursing, right? not just files
[09:52:02] I think puppet adds the +x automatically on directories
[09:52:10] and you need +x (1) to be allowed to list dirs
[09:52:18] oh..
[09:52:27] * hashar looks at puppet doc
[09:52:51] in that case, you are right of course
[09:53:54] http://docs.puppetlabs.com/references/stable/type.html#file
[09:54:08] with mode 644 and recurse: In this case all of the files underneath /some/dir will have mode 644, and all of the directories will have mode 755.
[09:54:22] "Puppet always sets the search/traverse (1) bit anywhere the read (4) bit is set."
[09:54:40] so you just have to specify the file mode
[09:54:48] alright, cool :)
[09:55:04] then just.. what about the user "svn"
[09:55:22] root is fine
[09:55:36] I think :)
[09:56:05] we will probably have to migrate every file to puppet
[09:56:14] then they will just be root:root :)
[09:58:37] migrate every file to puppet? wasn't it the point of using "recurse" not having to do that?
[09:59:09] arr, looking at that docroot again
[09:59:23] originally, I just wanted to have robots.txt deployed by puppet
[09:59:37] mark then asked to make a recursive declaration for later use
[09:59:46] something that have been made later in my opinon
[09:59:48] opinion
[10:00:06] grr
[10:00:09] I keep eating my words
[10:00:19] something that might have been made later in my opinion
[10:17:10] New review: Dzahn; "alright, convinced after we talked more about this. e.g. "everything is in svnadm group already besi..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2888
[10:17:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[10:26:27] PROBLEM - Lucene on mw1010 is CRITICAL: Connection refused
[10:42:12] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[10:50:45] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[10:59:45] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:45] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[11:35:29] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
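The recurse/mode behavior quoted from the puppet docs above is easy to verify with a throwaway run; a minimal sketch, where the path is a scratch directory and the group/mode simply mirror the 0664 from change 2888:

    # With recurse, files get the stated mode; directories additionally get
    # the traverse (1) bit wherever read (4) is set, so 0664 becomes 0775.
    puppet apply -e 'file { "/tmp/docroot":
        owner   => "root",
        group   => "svnadm",
        mode    => "0664",
        recurse => true,
    }'
    # afterwards: files under /tmp/docroot are rw-rw-r--, dirs are rwxrwxr-x

Which is why, as the discussion concludes, specifying only the file mode is enough.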
[11:37:17] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:13:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:15:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.914 seconds
[12:50:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:56:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.687 seconds
[13:30:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:36:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.888 seconds
[14:01:21] New patchset: Mark Bergsma; "Fix up squid/varnish partitioning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2976
[14:01:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2976
[14:02:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2976
[14:02:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2976
[14:11:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:15:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.222 seconds
[14:47:19] !log installed python faulthandler 2.1
[14:47:20] !log installed python faulthandler 2.1
[14:47:22] Logged the message, Master
[14:47:24] Logged the message, Master
[14:48:12] !log installed python faulthandler 2.1
[14:48:15] Logged the message, Master
[14:51:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:53:26] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[14:55:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.994 seconds
[15:14:31] New patchset: Mark Bergsma; "Make a specific partman file for varnish with xfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2977
[15:14:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2977
[15:17:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2977
[15:17:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2977
[15:31:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.191 seconds
[15:52:43] !log mw1103 finally repaired and ready for os and such
[15:52:46] Logged the message, RobH
[16:11:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:16:29] !log search1008 repaired
[16:16:31] notpeter: ^
[16:16:31] Logged the message, RobH
[16:16:38] its all yours my firend
[16:16:39] friend even
[16:17:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.128 seconds
[16:28:51] New patchset: Mark Bergsma; "Initial puppetization of varnish upload cluster in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2978
[16:29:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2978
[16:31:08] RobH: awesome
[16:31:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2978
[16:31:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2978
[16:35:45] New patchset: Mark Bergsma; "Add text/upload cache groups to ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2979
[16:35:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2979
[16:36:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2979
[16:36:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2979
[16:36:40] finishing repairs on the other search now.
[16:43:02] New patchset: Mark Bergsma; "Add empty upload varnish VCL templates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2980
[16:43:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2980
[16:44:03] ms-be5 crashed again.
[16:44:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2980
[16:44:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2980
[16:45:19] PROBLEM - Host db1033 is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:59] RECOVERY - Disk space on search10 is OK: DISK OK
[16:47:34] !log took ms-be5 out of rotation in the swift cluster - it's crashed 3 times now.
[16:47:37] Logged the message, Master
[16:47:53] maplebed: we may wanna have chris run dell hw testing on that.
[16:48:00] yes please.
[16:48:01] he has the dell diag cds.
[16:48:07] you wanna drop a ticket or should I?
[16:48:17] I created one last night.
[16:48:17] this is perfect swift testing ;)
[16:48:30] RT-2595
[16:48:47] would you update it with what needs to be there for him to do the right thing?
[16:49:02] mark: did you look at the graph I sent out with the effect of this host crashing?
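Taking a node "out of rotation", as logged at 16:47, is normally a ring edit pushed out to the cluster; a minimal sketch with swift-ring-builder, where the builder file and the device id are assumptions for illustration:

    # drain by zeroing the weight (or drop the device outright with: remove d12)
    swift-ring-builder object.builder set_weight d12 0
    swift-ring-builder object.builder rebalance
    # then distribute the regenerated object.ring.gz to all proxy and storage nodes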
[16:49:09] yes
[16:50:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:55:00] oh, hey! the new graphs I put in yesterday for the back end storage nodes show what happens when I adjust the rings and push them out: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=v&vn=swift+backend+storage
[16:56:18] I was just looking at them
[16:56:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.031 seconds
[16:57:34] !log search1014 still down per rt2483
[16:57:36] Logged the message, RobH
[17:00:17] New patchset: Mark Bergsma; "Add fstype" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2981
[17:00:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2981
[17:00:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2981
[17:00:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2981
[17:09:47] New patchset: Mark Bergsma; "Fix storage file name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2982
[17:09:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2982
[17:10:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2982
[17:10:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2982
[17:12:54] New patchset: Mark Bergsma; "Dashes are ILLEGAL!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2983
[17:13:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2983
[17:13:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2983
[17:13:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2983
[17:14:17] !log shutting down db18 for memory testing
[17:14:20] Logged the message, Master
[17:17:57] mark: there is a port in the dmarc with nothing in it, but the port/keystone jack is installed on the panel
[17:18:06] not sure if thats going to be for the exchange link or not
[17:20:32] i will update ticket with the cable number, which is 2048 if you wanna label the port on the cr2
[17:20:53] ack sorry
[17:20:55] 2648
[17:22:25] robh: ever have LOM commands not work on the x4240? reset /SYS?
[17:22:35] locked up?
[17:26:53] i have yea
[17:27:01] try reset /SP
[17:27:06] which is the service processor / lom
[17:27:14] if that doesnt fix it, the only other fix is complete power removal
[17:27:17] cmjohnson1: ^
[17:27:33] ctrl alt delete cycled it
[17:27:45] that cycles the server, but the lom is still broken yes?
[17:27:51] how will a remote person do that?
[17:27:59] long arms ?
[17:28:08] yea. please fix the lom before putting it as ok.
[17:28:20] RobH: you in the DC today ?
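For reference, the two resets discussed above are distinct targets in the Sun ILOM CLI; a minimal sketch, with the arrow shown as the ILOM's own prompt:

    -> reset /SP     # restarts only the service processor (the LOM itself)
    -> reset /SYS    # power-cycles the host; leaves a wedged LOM wedged

That difference is why ctrl-alt-delete cycling the server, as cmjohnson1 did, still leaves the LOM broken.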
[17:28:22] yep...gonna run the test first and will check on it
[17:28:23] yes
[17:28:23] yes
[17:28:35] LeslieCarr: can you work with apergos to drain traffic on the HE so i can clean the fiber
[17:28:36] sorry, thought u were talking to me
[17:28:42] i am ready to do that whenever you want
[17:28:51] LeslieCarr: ^
[17:29:01] RobH: yay
[17:29:03] hello
[17:29:04] PROBLEM - Host db18 is DOWN: PING CRITICAL - Packet loss = 100%
[17:29:32] by "work with" what he means is (yes, I'm about to say what someone else means :-P but I told him originally)
[17:29:54] LeslieCarr: so yea, in the datacenter today i have to finish this fiber run for mark, do this fiber fix for you two, and receive in some shipments, then i am gonna go home and be sick some more =P
[17:29:55] can you explain what the goal is, then how you do it, and then I'll do it and you can check it?
[17:30:07] apergos: so for draining the fiber, we're going to do the quick and dirty way. :) and yes
[17:30:12] ok
[17:30:14] RobH: you are dedicated
[17:30:15] thank you
[17:30:22] yeah. he is.
[17:30:32] well not the quickest and dirtiest, which would be just pulling the fiber
[17:30:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:30:45] so, that's my first q
[17:30:48] we're going to deactivate the bgp session
[17:30:56] since there was only like 35 mbits on there last I looked
[17:31:08] ok, so
[17:31:25] when people say they aren't going to advertise some route (and it's bgp) is that what they mean
[17:31:29] the nicer way is to block port 179 because it causes bgp to time out instead ...
[17:31:31] or is that something else?
[17:31:49] well so for bgp we sort of exchange the routes that we want the other person to talk to us on
[17:31:50] ok so my first q was, with so little traffic do we care?
[17:31:59] so we exchange our routes, and they give us all of theirs
[17:32:08] ok
[17:32:18] so they just probably put a filter on to not readvertise our routes to their bgp peers
[17:32:35] hmm ok
[17:33:30] so we are just going to deactivate the bgp peer session
[17:34:04] that is easier than blocking the port?
[17:34:13] deactivate protocols bgp group Private-Peer neighbor 216.66.30.89
[17:34:55] yes just because we don't have a filter group for port blocking - we could also just create a firewall filter group and call it something like "no port 179" and apply it on the incoming side of the interface
[17:35:26] ok I see
[17:35:29] this is indeed quicker
[17:36:46] ok, I saw this stanza earlier and at least could recognize it was theirs
[17:37:15] why is it called "private peer"?
[17:37:45] LeslieCarr: so now i should be wiping down all fiber ends when i plug in right? (its what i just did for the one i just ran for mark_)
[17:38:03] because we have a direct connection
[17:38:16] so that's commonly called private peer - public peer would be a fiber to the peering exchange
[17:38:20] RobH: yes please
[17:38:34] you guys gonna be a few minutes still? if so i am gonna go receive in the foundry stuff, which will take me about 10 minutes
[17:38:35] that might not be what the problem was, but I have when playing with scopes seen brand new fibers be all dirty
[17:38:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.050 seconds
[17:38:41] I'm going to be 10 seconds.
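In Junos terms, the two drain options Leslie describes look roughly like this. A sketch only: the deactivate line is the one quoted in the log, while the filter name and interface are hypothetical.

    # option 1: drop the session immediately (what was done here)
    deactivate protocols bgp group Private-Peer neighbor 216.66.30.89
    commit

    # option 2: block TCP port 179 so the session times out instead
    set firewall family inet filter no-bgp term block-bgp from protocol tcp
    set firewall family inet filter no-bgp term block-bgp from port 179
    set firewall family inet filter no-bgp term block-bgp then discard
    set firewall family inet filter no-bgp term accept-rest then accept
    set interfaces xe-0/0/0 unit 0 family inet filter input no-bgp
    commit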
[17:38:46] oh, then i wait =]
[17:39:07] apergos: i love the deactivate command :)
[17:39:13] it's nice
[17:39:18] saves a lot of yping
[17:39:20] typing too :-P
[17:39:30] it claims to be committed
[17:39:44] so you can monitor that port and see the traffic flow on it now
[17:40:12] Output bytes : 514087318112588 288 bps
[17:40:15] yeah I was just there
[17:40:32] the HE fiber is the only unlabeled cable in the cage. i will label if it fixes it when cleaned
[17:40:37] that's at the top level, at .500 it's the same
[17:40:43] it was in before my labeler arrived
[17:41:27] so that looks fine to me
[17:41:57] lemme know when i can pull to wipe
[17:42:11] I'm not actually monitoring (which I'm not sure how to do), just show interfaces blah extensive I guess i could | match bytes or something
[17:42:19] anyways whatever, is he good to go then?
[17:42:55] LeslieCarr: ^?
[17:42:59] good to go!
[17:43:07] sweet
[17:43:21] ok, pulling it at dmarc to confirm its the same unlabeled cable, it should be but i wanna confirm
[17:43:30] RobH: do the C2100s have hot-swappable drives?
[17:43:33] maplebed: did you get paged at night? :)
[17:43:43] so this will be both the cable to psw-2 and the psw2-> them yeah?
[17:43:47] Aaron|away: no. the ms-be hosts are not currently set to page.
[17:44:31] i am cleaning the long run dmarc to peering router first indeed
[17:45:15] ok
[17:45:58] ok thats done, i can pull the peering to cr1 right LeslieCarr ?
[17:46:04] yes you can
[17:46:09] cool, doing that now
[17:46:11] that's the one we just deactivated
[17:46:40] I guess I delete deactivate blah blah to put it back?
[17:46:54] * apergos goes to look at the stanza again
[17:47:31] activate blah blah
[17:47:33] ok
[17:47:34] or "rollback 1"
[17:47:35] :)
[17:47:35] psw2 to cr2 cleaned
[17:47:39] heh rollback!
[17:47:41] psw2 to cr1 is a sfp copper cable
[17:47:48] but no I'm going to do activate
[17:47:54] so cannot clean that ;]
[17:47:58] hmmmmmm
[17:48:01] ah okay RobH - hrm, i have some curiosity if those could have some issues
[17:48:02] you guys should be good to go and test now
[17:48:06] with the conversion and converting back
[17:48:07] yeeaahh
[17:48:20] well can't do anything about it today
[17:48:27] i can replace it.
[17:48:31] with a fiber if needed.
[17:48:36] is that doable?
[17:48:39] but i rather we test and see if this fixed first ;]
[17:48:46] yeah sure
[17:48:55] i prefer to use the copper attached sfps in rack if only cuz they are a LOT sturdier
[17:48:56] besides you're sick
[17:48:56] sadly i am about to run in 5 minutes :( so maybe tomorrow ? (RobH do you think you'll be well tomorrow?)
[17:49:09] I will totally fedex you chicken soup
[17:49:10] if i dont come back down here tomorrow i will be here on monday
[17:49:16] if thats ok?
[17:49:27] ie: if you think its the copper, i can replace it now rather than come back tomorrow ;]
[17:49:46] but those were pricy, rather use em if we can ;]
[17:50:14] that's ok
[17:50:32] let's do it one step at a time
[17:50:38] it might be completely unrelated
[17:50:42] yeah
[17:50:43] sadly
[17:50:58] ok, i am going to clean up the cage and receive in the juniper stuff
[17:51:01] all right, I'm going to bring it back up
[17:51:03] thanks RobH
[17:51:08] will be afk a bit but will be back in channel before i leave eqiad
[17:51:20] cool :)
[17:52:42] hmmmm
[17:53:06] hmmm ?
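Undoing the drain and verifying, per the exchange above; the commands are the ones discussed in the log, while the interface name is hypothetical:

    activate protocols bgp group Private-Peer neighbor 216.66.30.89
    commit
    # or, if the deactivate was the most recent commit: rollback 1, then commit

    # watching the counters come back: the check described at 17:42,
    # plus the interactive equivalent apergos was looking for
    show interfaces xe-0/0/0 extensive | match bytes
    monitor interface xe-0/0/0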
[17:53:51] Output bytes : 514089443911904 287594784 bps
[17:54:35] 287mbit looks ok to me … lemme check observium for historical data
[17:54:45] yeah it does
[17:54:48] i want to say 280-320 is normal
[17:54:48] it looks a lot better
[17:54:57] like about 10 times better
[17:55:19] rerunning the crazy mtr
[17:55:25] from ds1001
[17:55:29] ooo no ploss yet
[17:55:29] ah right
[17:55:29] :)
[17:55:38] that is a good sign....
[17:55:45] well we don't have a big download going at the same time
[17:55:49] where is our HE net guy?
[17:55:51] true true
[17:55:54] * apergos goes to aim
[17:56:23] shit, i gotta run, will be back in about an hour - but so far…. mtr -i 0.03 216.66.22.2 is way happy
[17:56:26] not there, I'll start an iperf server on there and invite him into the channel
[17:56:31] awesome
[17:56:36] I think we can do the basic tests
[17:56:40] i'm on the chan so i'll look at scrollback :)
[17:56:44] thanks for everything
[17:57:05] thank you ! it's awesome to have someone else interested in the tubes :)
[17:57:09] bbiab
[17:59:46] well learning enough networking to be useful has been on my goals for 2 years now
[17:59:56] or 1.5 I guess
[17:59:59] anyways, about time
[18:10:24] LeslieCarrafk: why do you think putting a filter to time out port 179 would be nicer?
[18:10:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:15:46] RobH: so no improvement, sadly
[18:16:12] huh, want me to replace the copper with fiber?
[18:16:14] so either friday or monday when you're there next we'll try other steps
[18:16:16] no.
[18:16:21] I want you to go home
[18:16:32] nah i can do it today, still here
[18:16:37] seriously
[18:16:42] i have a minimum of 30 minutes of work still ;]
[18:16:43] because the next round will be
[18:16:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.002 seconds
[18:16:48] work til we beat it
[18:16:58] well, if we replace the fiber in the rack
[18:17:13] then all thats left is the fiber from dmarc to psw1
[18:17:22] but i wont argue, i feel like shit.
[18:17:23] ;]
[18:17:31] no, the pic on cr1, and psw-2 itself are left
[18:17:47] that's right. don't argue, get better instead
[18:30:09] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused
[18:33:06] heh
[18:33:30] I like that these days a lot of monitoring is so automatic/implicit that you don't even realize it's there when setting up new stuff
[18:34:15] :-)
[18:34:57] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused
[18:40:28] !log deploying new squid frontend.conf to fix epic fail - all googlebot traffic was being redirected to mobile. now just if it's mobilegooglebot.
[18:40:31] Logged the message, Master
[18:42:19] nice
[18:42:26] binasher: do you agree with having a 'misc' varnish cluster?
[18:44:53] mark: in general, yeah
[18:46:40] !log purging entire mobile varnish cache - the main mobile template included robots no-follow
[18:46:43] Logged the message, Master
[18:46:49] binasher: for new varnish servers in amsterdam, I'm gonna go with fewer, more cpu cores/memory than current boxes
[18:47:06] older boxes were sorta optimized for squid still
[18:47:14] but varnish really likes the many cores and mem
[18:47:29] with fewer servers, I mean
[18:47:45] fewer servers, higher performance per server
[18:48:06] i like that, so long as a healthy amount of redundancy remains
[18:48:13] absolutely
[18:48:19] so instead of like 32 servers, I was thinking like 12-16
[18:48:21] hrm, one of the mobile varnish servers died 16 hours ago
[18:48:23] plenty ;)
[18:48:29] yep!
[18:49:52] !log power cycling cp1044
[18:49:55] Logged the message, Master
[18:51:42] mark: what about ssl for services behind misc varnish? perhaps nginx directly on the misc varnish servers?
[18:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:52:24] yeah I was thinking that
[18:52:38] seems cleanest
[18:53:15] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[18:58:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[18:59:06] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[19:01:03] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa
[19:08:15] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:27] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[19:21:27] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[19:21:27] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[19:30:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.400 seconds
[20:01:58] New patchset: Lcarr; "Modifying nagios config files for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2989
[20:02:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2989
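The "nginx directly on the misc varnish servers" idea mark and binasher settle on is plain SSL termination proxying to the local varnish; a hypothetical sketch, with the server name and certificate paths invented for illustration:

    server {
        listen 443 ssl;
        server_name misc.example.org;
        ssl_certificate     /etc/ssl/certs/misc.pem;
        ssl_certificate_key /etc/ssl/private/misc.key;
        location / {
            proxy_pass http://127.0.0.1:80;            # varnish on the same box
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto https;  # so backends know it was https
        }
    }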
[20:12:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:18:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.019 seconds
[20:26:52] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100%
[20:34:27] robh: i want to move the power for the mgmt switch on c1...just want to clear it first
[20:35:12] hrmm, in there are swift and labs stuff, so unless Ryan_Lane or maplebed are actively working on things in that via mgmt
[20:35:14] it should be fine
[20:35:34] is someone messing with ms-be3
[20:35:39] (I would give them a few minutes to reply, if they dont, then they may not be at computer, and thus not using mgmt interfaces)
[20:35:49] and adminlog when you do it of course
[20:36:41] If anyone is working on the following, you are about to be bumped off mgmt for cable cleanup: ms-be2, labstore2, labstore1, es2, and es1
[20:37:27] heh, no reply for this kind of thing is fine, you should be ok to go ahead and move its power
[20:37:56] it wouldnt actually mess up anything if they were on it, it would merely d/c them
[20:38:11] (just make sure it works when you finish)
[20:39:06] !log removing and relocating power to msw-c1-sdtpa
[20:39:10] Logged the message, Master
[20:39:22] i will check it
[20:42:10] PROBLEM - Host ps1-c1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[20:42:20] heh, thats normal
[20:42:23] its connected via mgmt
[20:42:46] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[20:42:50] !log power to msw-c1-sdtpa restored
[20:42:54] Logged the message, Master
[20:52:40] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[20:52:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:57:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.951 seconds
[21:01:40] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[21:03:39] LeslieCarr: so, what do I need to do to make the labstore boxes accessible from labs?
[21:04:23] we need to put them in the right vlan, with the right ip addresses
[21:10:40] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[21:10:40] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[21:24:08] New patchset: Lcarr; "Modifying nagios config files for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2989
[21:24:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2989
[21:30:54] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990
[21:31:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990
[21:32:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:33:21] New review: Demon; "Looks good, but let's not merge it until the core mirror is 100% in sync with svn. Then we can go ah..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990
[21:36:14] New review: Hashar; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/2990
[21:36:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.585 seconds
[21:38:11] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990
[21:38:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990
[21:38:56] New review: Reedy; "Yeah, indeed. No point breaking stuff prematurely? ;)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990
[21:43:09] LeslieCarr: ok, so, which ip addresses to switch to?
[21:43:55] New review: Hashar; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/2990
[21:46:25] Ryan_Lane: so are these supposed to be able to be nfs'ed from any labs machine ?
[21:46:29] aka then on the labs subnet
[21:46:41] well, glusterfs, not nfs
[21:46:43] but yeah
[21:46:50] close enough :)
[21:47:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2863
[21:47:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2863
[21:48:13] so the big thing is making sure that you don't hand out the ip address to someone else - is there a dhcp equivalent file on virt0 ? i'd just give it an ip in that range that you can exclude from being handed out
[21:50:03] there's dhcp
[21:50:11] can we use a different range?
[21:50:36] the cluster could grow…. so I'd prefer to not use the same exact subnet
[21:50:37] technically since it's not routed you could use any range
[21:50:40] New patchset: Hashar; "redirect some missing Swift syslog messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2820
[21:50:48] lemme check out mark's plan
[21:50:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2820
[21:50:52] * Ryan_Lane nods
[21:51:14] 10.5.0.0/16 ?
[21:51:30] works for me
[21:51:32] give it a /24 netmask
[21:51:41] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2991
[21:51:52] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2991
[21:53:26] Change abandoned: Ryan Lane; "But he said it's just a test, and he said it's just a test." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2991
[21:55:26] New patchset: Ryan Lane; "Only run this for the puppet repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2992
[21:55:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2992
[21:58:10] Ryan_Lane: which machines (or did you already open a ticket i should look for)
[21:58:18] labstore1-4
[21:58:23] I'm changing them in dns now
[21:59:08] New review: Hashar; "Clean and easy way to fix that bug. Kudos!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2992
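Leslie's "exclude from being handed out" suggestion is the usual ISC dhcpd pattern: the dynamic pool covers only part of the subnet, and statically assigned hosts live outside it. A hypothetical sketch, not virt0's actual config:

    subnet 10.5.0.0 netmask 255.255.255.0 {
        range 10.5.0.100 10.5.0.250;    # dynamic leases come only from here
        # 10.5.0.1 - 10.5.0.99 left free for statically assigned boxes
    }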
[21:59:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2992
[21:59:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2992
[21:59:58] LeslieCarr: can you review my change in the dns svn?
[22:01:04] looking
[22:01:11] heh, there are 10 root logins on sockpuppet
[22:01:23] Change abandoned: Ryan Lane; "..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2954
[22:01:26] someday we will work in sudo... ;]
[22:01:40] looks good to me
[22:01:42] s/someday/likely never/
[22:01:44] ok
[22:01:56] crap
[22:02:01] the ms-fe stuff is in there
[22:02:06] and someone's got ms-fe in that change too :)
[22:02:08] maplebed: ok for me to push out your ms-fe stuff?
[22:02:10] damn, you win :)
[22:02:16] oh shit
[22:02:17] thats me
[22:02:19] i forgot to push it
[22:02:20] ah
[22:02:21] which ms-fe stuff?
[22:02:22] I'll push it
[22:02:23] oh well ;]
[22:02:27] thx
[22:03:04] RobH: done
[22:03:13] port swaps done
[22:03:13] thx!
[22:03:53] !log oxygen coming down for reinstall
[22:03:57] Logged the message, RobH
[22:05:37] can't breathe....
[22:05:50] budump ching
[22:07:51] LeslieCarr: what gateway am I using?
[22:09:21] oh shit, for that i didn't think, if you want to have multiple vlans you need to have routing inside that little area
[22:09:44] didnt we add more public ips to row a?
[22:09:48] you'll have to either use the same vlan (so l2 connectivity)
[22:09:56] RobH: we moved things from public to private
[22:10:02] RobH: so same effect
[22:10:19] oh, freed up a ton of the existing, i see
[22:10:21] ok
[22:10:22] Ryan_Lane: let's talk about that in physical form when you get back
[22:11:42] !log updating dns for oxygen
[22:11:45] Logged the message, RobH
[22:12:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:17:10] Ryan_Lane: http://www.xenocafe.com/tutorials/linux/redhat/bind_multiple_ip_addresses_to_single_nic/index.php
[22:18:06] ewwww
[22:18:09] thats not ubuntu ;]
[22:18:40] man netboot works a lot better when you actually update dns.
[22:18:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[22:20:56] don't you be ewwwing my os :-P
[22:21:57] too late!
[22:22:30] I'm giving you a pass but only cause yer sick
[22:22:38] hehe
[22:22:47] also bedtime for me
[22:23:03] have a good rest of your day
[22:23:10] nite apergos
[22:24:46] LeslieCarr: http://gluster.org/community/documentation/index.php/Gluster_3.2:_Installing_GlusterFS_on_Debian-based_Distributions
[22:24:56] 111, 24007, 24008, 24009-(24009 + number of bricks across all volumes)
[22:25:37] installer takes forever to format =P
[22:26:47] New patchset: Asher; "syncing up with https://github.com/asher/gdash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2993
[22:26:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2993
[22:28:17] Ryan_Lane: what are labstore3 and 4 going to be ip'ed as ?
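The GlusterFS 3.2 port list Ryan quotes is what any firewall hole has to cover; a minimal iptables sketch, where the source subnet and the number of bricks are assumptions for illustration:

    # portmapper (tcp+udp 111), glusterd (24007) and management (24008)
    iptables -A INPUT -s 10.4.0.0/24 -p tcp -m multiport --dports 111,24007,24008 -j ACCEPT
    iptables -A INPUT -s 10.4.0.0/24 -p udp --dport 111 -j ACCEPT
    # one port per brick, counting up from 24009; here, room for 100 bricks
    iptables -A INPUT -s 10.4.0.0/24 -p tcp --dport 24009:24108 -j ACCEPT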
[22:28:17] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2993
[22:28:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2993
[22:30:45] LeslieCarr: 10.0.0.41-44
[22:30:49] is labstore1-4
[22:30:51] cool
[22:39:46] !log poked hole to allow labs machines to reach gluster machines in tampa
[22:39:49] Logged the message, Mistress of the network gear.
[22:39:50] New patchset: Pyoungmeister; "script cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2994
[22:40:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2994
[22:40:44] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2994
[22:40:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2994
[22:46:37] New patchset: RobH; "added other locke items to oxygen as it is the locke replacement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2995
[22:46:48] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2995
[22:47:31] ok, wtf....
[22:47:39] all kinds of crazy errors on my commit...
[22:48:43] warning: You cannot collect without storeconfigs being set on line 52 in file /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/ssh.pp warning: You cannot collect exported resources without storeconfigs being set;
[22:48:49] Ryan_Lane: this doesnt seem to be me, any ideas?
[22:49:29] that's normal
[22:49:32] look at the last line
[22:49:54] the puppet help parser validate?
[22:50:36] LeslieCarr: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Gluster#Network_filtering_and_ports_for_gluster
[22:50:37] :)
[22:50:51] RobH: file /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/webserver.pp at line 251 err: Could not parse for environment production: Syntax error at '='; expected '}' at /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/site.pp:1485 err: Try 'puppet help parser validate' for usage
[22:51:05] very last line
[22:51:31] err
[22:51:31] Could not parse for environment production: Syntax error at '='; expected '}' at /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/site.pp:1485 err: Try 'puppet help parser validate' for usage
[22:51:49] I wonder if we can tell it to ignore warnings
[22:52:02] so many ports!
[22:52:06] i still have no idea whats wrong
[22:52:14] as oxygen syntax matches locke perfectly.
[22:52:26] they are identical, and locke worked in the manifest before.
[22:52:27] RobH: dude. it's telling you that you have a syntax error in site.pp at line 1485
[22:52:38] oh, bad include, bahg
[22:52:42] i see what i did, nm
[22:52:45] sorry, im out of it.
[22:52:48] heh
[22:52:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:54:20] New patchset: RobH; "damned typo added other locke items to oxygen as it is the locke replacement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2995
[22:54:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2995
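Ryan's advice above - skip the storeconfigs warnings, read the final err: line - applies just as well to running the check locally before pushing; gerrit2's lint is the same parser validation. A minimal sketch:

    # validate a manifest without applying it
    puppet parser validate manifests/site.pp
    # the storeconfigs warnings are non-fatal noise; a non-zero exit with a
    # closing "err: Could not parse ..." line is the actual failure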
[22:56:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.585 seconds
[22:57:28] RobH: that's the sign it's time to stop touching machines and start touching bedtime
[22:57:41] then folks need to stop handing me work to do
[22:57:43] ;p
[22:58:44] !log powercycled ms-be3 - it crashed 2.5 hours ago.
[22:58:47] Logged the message, Master
[22:58:51] RobH: ^^^ bad news.
[22:59:03] ....
[22:59:09] whyyyyyy
[22:59:24] I connected to the console before the powercycle - nothing.
[22:59:27] something aint right
[22:59:36] I'm connected now, seems to be booting ok so far (but hasn't gotten to the os)
[22:59:45] should we be concerned about search machines having full /a's ?
[22:59:54] amazing how our test c2100 had none of these issues
[22:59:55] (for example search1017 )
[23:00:02] where's the damn nagios bot, thinking of that
[23:00:09] yeah, ms-be1's been up for weeks.
[23:00:15] search may not be setup to alert nagios for that
[23:00:36] notpeter (even if he doesnt like it) i our search expert ;]
[23:00:43] is even
[23:00:55] notpeter: are you around today?
[23:01:10] New review: RobH; "looks right to me, though the puppet class names are a bit servername specific to locke. checking t..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2995
[23:01:13] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2995
[23:02:02] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[23:02:40] LeslieCarr: was going to go get dinner, but what's up?
[23:03:27] search1017/1018 have full /a partitions
[23:03:34] should we be concerned and if so, what is safe to delete ?
[23:03:40] nah
[23:03:49] rob is getting them bigger disks
[23:03:54] and they're not in prod yet
[23:04:02] okay
[23:04:02] hrmm, did i order those yet...
[23:04:12] rob is, right?
[23:04:13] =P
[23:04:21] LeslieCarr: but thanks for checking!
[23:04:31] mark questions if 7200 is fast enough rpm
[23:04:37] since the search boxes now have 10k
[23:04:42] notpeter: thoughts?
[23:05:44] RobH: it only needs to be as fast as the faster of search18 and search12's disks
[23:05:59] they're both doing just fine, currently
[23:06:13] if those are all 10k, then that seems prudent
[23:06:27] if 7.2 or slower, then... whatevs
[23:06:53] ok, but srsly, dinner now. ttfn
[23:07:44] RECOVERY - Host db1033 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms
[23:08:20] New patchset: Ryan Lane; "Add gluster to instances by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2996
[23:08:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2996
[23:08:38] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2996
[23:08:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2996
[23:09:30] hrmm, trying to determine what is in search18 from cli
[23:11:40] bah
[23:11:44] they are 15k in the old servers
[23:11:48] so yea, gonna have to up the speed.
[23:11:56] PROBLEM - mysqld processes on db1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:13:53] RECOVERY - mysqld processes on db1033 is OK: PROCS OK: 1 process with command name mysqld
[23:19:23] RobH: would you mind saying something with ms-be in it, then something else with ms-be3 in it? I'm testing IRC highlighting.
[23:19:35] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 23854 seconds
[23:19:39] ms-be are c2100
[23:19:48] ms-be3 is a specific server
[23:19:57] =]
[23:19:58] lame. I did it wrong.
[23:20:02] but thanks.
[23:20:02] PROBLEM - MySQL Slave Running on db1033 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the
[23:20:08] lemme know when to do again
[23:20:54] RobH: once more pls?
[23:21:10] ms-be
[23:21:13] ms-be3
[23:21:20] cool.
[23:21:32] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay seconds
[23:21:40] New patchset: Ryan Lane; "Adding base directory for labs automounts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2998
[23:21:47] linkinus has channel-specific highlighting in addition to global highlighting, in theory. the first try was with channel-specific (failed) the second with global (succeeded.)
[23:21:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2998
[23:21:57] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2998
[23:21:59] RECOVERY - MySQL Slave Running on db1033 is OK: OK replication
[23:22:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2998
[23:24:30] !log streaming hot backup of db1017 to db1033 - no snapshots of enwiki in eqiad til db1033 is back
[23:24:33] Logged the message, Master
[23:25:53] PROBLEM - mysqld processes on db1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:27:18] Ryan_Lane: do we have deploy servers in puppet (esp in labs) for puppet ?
[23:27:25] ldap
[23:33:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:36:34] New patchset: Ryan Lane; "Adding gluster options for autofs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3000
[23:36:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3000
[23:36:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.925 seconds
[23:37:09] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3000
[23:37:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3000
[23:42:49] !log rebooting ms-be5
[23:42:52] Logged the message, Master
[23:45:41] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms
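Ryan's last two merged changes ("Adding base directory for labs automounts", "Adding gluster options for autofs") point at an automounter setup for the labs gluster volumes; a hypothetical sketch of what such a map can look like, with the server and volume names invented for illustration:

    # /etc/auto.master: hand /data over to an indirect map, unmount after 5 min idle
    /data  /etc/auto.gluster  --timeout=300
    # /etc/auto.gluster: each key mounts <server>:/<volume> on demand at /data/<key>
    project  -fstype=glusterfs  labstore1:/project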