[00:15:57] !log added wikimedia-task-dns-auth_0.18 to the repo, to add support for zero
[00:16:00] Logged the message, Master
[00:16:16] !log installing newer wikimedia-task-dns-auth on all dns servers
[00:16:18] Logged the message, Master
[00:17:56] !log adding zero cnames
[00:17:59] Logged the message, Master
[00:34:22] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[00:43:30] New patchset: Ryan Lane; "Add sudo rights needed to manage gluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2969
[00:43:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2969
[00:44:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2969
[00:44:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2969
[00:49:47] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[00:58:47] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:47] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:47] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[01:32:41] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours
[01:37:20] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours
[01:54:44] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds
[01:54:53] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds
[01:55:36] New patchset: Bhartshorne; "adding the ability to read options from a config file and modify running behavior so we can throttle up and down the cleaner as it's running." [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[01:55:38] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/2970
[01:58:36] New patchset: Bhartshorne; "adding the ability to read options from a config file and modify running behavior so we can throttle up and down the cleaner as it's running." [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[01:58:38] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/2970
[01:59:08] New patchset: Bhartshorne; "adding the ability to read options from a config file and modify running behavior so we can throttle up and down the cleaner as it's running." [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[01:59:10] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/2970
[02:00:36] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2970
[02:00:38] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/2970
[02:38:11] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100%
[02:43:53] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Thu Mar 8 02:43:39 UTC 2012
[02:48:45] New patchset: Ryan Lane; "Ensure a run directory exists for the glustermanager" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2971
[02:48:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2971
[02:49:11] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2971
[02:49:13] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2971
[03:01:44] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Thu Mar 8 03:01:31 UTC 2012
[03:04:32] !log powercycled ms-be5; it has been unresponsive for 2 hours.
[03:04:38] Logged the message, Master
[03:07:39] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[03:20:20] super lame.
[03:59:19] maplebed: https://graphite.wikimedia.org/dashboard/temporary-4 <- cautious optimism?
[04:01:00] more likely just the regular cycle.
[04:01:06] wait a week then call it.
[04:01:08] :P
[04:01:20] yeah, we'll see
[04:52:08] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[05:32:47] New patchset: Dzahn; "nagios - move snmp stuff into its own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2972
[05:32:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2972
[05:54:43] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2972
[05:54:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2972
[06:55:39] New patchset: Dzahn; "add iptables rules - let only production network send snmp-traps to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[06:55:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2973
[07:09:39] New patchset: Dzahn; "add iptables rules - let only production network send snmp-traps to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[07:09:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2973
[07:10:47] New patchset: Dzahn; "add iptables rules - let only production network send snmp-traps to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[07:10:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2973
[07:11:51] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2973
[07:11:54] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2973
[07:16:57] New patchset: Dzahn; "sort iptables ports and protocols alphabetically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2974
[07:17:05] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%): /var/lib/ureadahead/debugfs 0 MB (0% inode=95%):
[07:17:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2974
[07:20:26] !log ms1004 ran out of disk - caused by 17G HTCPurger.log.1, trying to gzip it now
[07:20:30] Logged the message, Master
[07:22:56] RECOVERY - Disk space on ms1004 is OK: DISK OK
[07:35:59] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[07:41:53] !log puppet on cadmium broken due to dependency Group[500] for User[catrope]
[07:41:56] Logged the message, Master
[07:52:06] New patchset: Dzahn; "add groups::wikidev to cadmium, puppet broke due to dependency Group[500] for User[catrope] without it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2975
[07:52:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2975
[07:52:56] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2974
[07:52:59] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2974
[07:53:32] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2975
[07:53:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2975
[07:54:50] RECOVERY - Puppet freshness on cadmium is OK: puppet ran at Thu Mar 8 07:54:46 UTC 2012
[07:54:52] !log cadmium fixed by adding groups::wikidev
[07:54:55] Logged the message, Master
[08:14:11] New review: Dzahn; "looking at that docroot as it is now, there are some files and dirs owned by groups "svn" and "svnad..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2888
[08:16:53] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:25] hi, is it a human rebooting cp1019 right now? or all by itself :P
[08:21:49] !log cp1019 went down, then rebooted by itself (i think) after showing "idrac-8W82BP1 Severity: Non Recoverable, SEL:CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted"
[08:21:53] Logged the message, Master
[08:22:49] !log cp1019 - Hitting F1 to continue reboot ( "Alert! System fatal error during previous boot")
[08:22:52] Logged the message, Master
[08:23:02] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[08:27:14] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[08:28:08] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused
[08:36:59] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.161 seconds
[08:38:02] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27545 bytes in 0.108 seconds
[08:42:59] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[09:11:02] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[09:13:52] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.161 seconds
[09:17:10] RECOVERY - Puppet freshness on mw1010 is OK: puppet ran at Thu Mar 8 09:16:56 UTC 2012
[09:19:52] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[09:19:52] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[09:28:54] New patchset: Hashar; "Bug 28469 - Make SVN Documentation be indexed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[09:29:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2888
[09:31:21] New review: Hashar; "Good point! I have changed the recursive declaration so it put files in the svnadm group which most..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[09:31:49] New review: Hashar; "And I also rebased the patch set :-)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2888
[09:48:41] hashar: hmm.. if svnadm needs to access subdirectories, it would need 5 (r) or 7 (rw)
[09:49:04] current permissions are sure a bit chaotic though :)
[09:49:42] hm
[09:49:43] and then there is "svn" vs. "svnadm".. hrmm.. guess we really need to find out what it really needs for each of those ;/
[09:50:10] and there is another problem with svn.pp btw
[09:50:57] the "viewvc" part in puppet is not being applied it seems
[09:50:59] I am confused, patch set 4 on change 2888 makes it: 0664 -> rw-rw-r--
[09:51:59] it would also change the permissions for the subdirectories when recursing, right? not just files
[09:52:02] I think puppet adds the +x automatically on directories
[09:52:10] and you need +x (1) to be allowed to list dirs
[09:52:18] oh..
[09:52:27] * hashar looks at puppet doc
[09:52:51] in that case, you are right of course
[09:53:54] http://docs.puppetlabs.com/references/stable/type.html#file
[09:54:08] with mode 644 and recurse: In this case all of the files underneath /some/dir will have mode 644, and all of the directories will have mode 755.
[09:54:22] "Puppet always sets the search/traverse (1) bit anywhere the read (4) bit is set."
[09:54:40] so you just have to specify the file mode
[09:54:48] alright, cool :)
[09:55:04] then just.. what about the user "svn"
[09:55:22] root is fine
[09:55:36] I think :)
[09:56:05] we will probably have to migrate every file to puppet
[09:56:14] then they will just be root:root :)
[09:58:37] migrate every file to puppet? wasn't it the point of using "recurse" not having to do that?
[09:59:09] arr, looking at that docroot again
[09:59:23] originally, I just wanted to have robots.txt deployed by puppet
[09:59:37] mark then asked to make a recursive declaration for later use
[09:59:46] something that have been made later in my opinon
[09:59:48] opinion
[10:00:06] grr
[10:00:09] I keep eating my words
[10:00:19] something that might have been made later in my opinion
[10:17:10] New review: Dzahn; "alright, convinced after we talked more about this. e.g. "everything is in svnadm group already besi..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2888
[10:17:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[10:26:27] PROBLEM - Lucene on mw1010 is CRITICAL: Connection refused
[10:42:12] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[10:50:45] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[10:59:45] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:45] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[11:35:29] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
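The recurse/mode behavior quoted from the puppet docs above is easy to verify with a throwaway run; a minimal sketch, where the path is a scratch directory and the group/mode simply mirror the 0664 from change 2888:

    # With recurse, files get the stated mode; directories additionally get
    # the traverse (1) bit wherever read (4) is set, so 0664 becomes 0775.
    puppet apply -e 'file { "/tmp/docroot":
        owner   => "root",
        group   => "svnadm",
        mode    => "0664",
        recurse => true,
    }'
    # afterwards: files under /tmp/docroot are rw-rw-r--, dirs are rwxrwxr-x

Which is why, as the discussion concludes, specifying only the file mode is enough.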
[11:37:17] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:13:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:15:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.914 seconds
[12:50:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:56:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.687 seconds
[13:30:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:36:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.888 seconds
[14:01:21] New patchset: Mark Bergsma; "Fix up squid/varnish partitioning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2976
[14:01:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2976
[14:02:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2976
[14:02:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2976
[14:11:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:15:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.222 seconds
[14:47:19] !log installed python faulthandler 2.1
[14:47:20] !log installed python faulthandler 2.1
[14:47:22] Logged the message, Master
[14:47:24] Logged the message, Master
[14:48:12] !log installed python faulthandler 2.1
[14:48:15] Logged the message, Master
[14:51:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:53:26] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[14:55:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.994 seconds
[15:14:31] New patchset: Mark Bergsma; "Make a specific partman file for varnish with xfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2977
[15:14:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2977
[15:17:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2977
[15:17:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2977
[15:31:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.191 seconds
[15:52:43] !log mw1103 finally repaired and ready for os and such
[15:52:46] Logged the message, RobH
[16:11:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:16:29] !log search1008 repaired
[16:16:31] notpeter: ^
[16:16:31] Logged the message, RobH
[16:16:38] its all yours my firend
[16:16:39] friend even
[16:17:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.128 seconds
[16:28:51] New patchset: Mark Bergsma; "Initial puppetization of varnish upload cluster in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2978
[16:29:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2978
[16:31:08] RobH: awesome
[16:31:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2978
[16:31:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2978
[16:35:45] New patchset: Mark Bergsma; "Add text/upload cache groups to ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2979
[16:35:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2979
[16:36:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2979
[16:36:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2979
[16:36:40] finishing repairs on the other search now.
[16:43:02] New patchset: Mark Bergsma; "Add empty upload varnish VCL templates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2980
[16:43:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2980
[16:44:03] ms-be5 crashed again.
[16:44:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2980
[16:44:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2980
[16:45:19] PROBLEM - Host db1033 is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:59] RECOVERY - Disk space on search10 is OK: DISK OK
[16:47:34] !log took ms-be5 out of rotation in the swift cluster - it's crashed 3 times now.
[16:47:37] Logged the message, Master
[16:47:53] maplebed: we may wanna have chris run dell hw testing on that.
[16:48:00] yes please.
[16:48:01] he has the dell diag cds.
[16:48:07] you wanna drop a ticket or should I?
[16:48:17] I created one last night.
[16:48:17] this is perfect swift testing ;)
[16:48:30] RT-2595
[16:48:47] would you update it with what needs to be there for him to do the right thing?
[16:49:02] mark: did you look at the graph I sent out with the effect of this host crashing?
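Taking a node "out of rotation", as logged at 16:47, is normally a ring edit pushed out to the cluster; a minimal sketch with swift-ring-builder, where the builder file and the device id are assumptions for illustration:

    # drain by zeroing the weight (or drop the device outright with: remove d12)
    swift-ring-builder object.builder set_weight d12 0
    swift-ring-builder object.builder rebalance
    # then distribute the regenerated object.ring.gz to all proxy and storage nodes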
[16:49:09] yes
[16:50:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:55:00] oh, hey! the new graphs I put in yesterday for the back end storage nodes show what happens when I adjust the rings and push them out: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=v&vn=swift+backend+storage
[16:56:18] I was just looking at them
[16:56:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.031 seconds
[16:57:34] !log search1014 still down per rt2483
[16:57:36] Logged the message, RobH
[17:00:17] New patchset: Mark Bergsma; "Add fstype" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2981
[17:00:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2981
[17:00:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2981
[17:00:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2981
[17:09:47] New patchset: Mark Bergsma; "Fix storage file name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2982
[17:09:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2982
[17:10:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2982
[17:10:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2982
[17:12:54] New patchset: Mark Bergsma; "Dashes are ILLEGAL!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2983
[17:13:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2983
[17:13:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2983
[17:13:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2983
[17:14:17] !log shutting down db18 for memory testing
[17:14:20] Logged the message, Master
[17:17:57] mark: there is a port in the dmarc with nothing in it, but the port/keystone jack is installed on the panel
[17:18:06] not sure if thats going to be for the exchange link or not
[17:20:32] i will update ticket with the cable number, which is 2048 if you wanna label the port on the cr2
[17:20:53] ack sorry
[17:20:55] 2648
[17:22:25] robh: ever have LOM commands not work on the x4240? reset /SYS?
[17:22:35] locked up?
[17:26:53] i have yea
[17:27:01] try reset /SP
[17:27:06] which is the service processor / lom
[17:27:14] if that doesnt fix it, the only other fix is complete power removal
[17:27:17] cmjohnson1: ^
[17:27:33] ctrl alt delete cycled it
[17:27:45] that cycles the server, but the lom is still broken yes?
[17:27:51] how will a remote person do that?
[17:27:59] long arms ?
[17:28:08] yea. please fix the lom before putting it as ok.
[17:28:20] RobH: you in the DC today ?
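For reference, the two resets discussed above are distinct targets in the Sun ILOM CLI; a minimal sketch, with the arrow shown as the ILOM's own prompt:

    -> reset /SP     # restarts only the service processor (the LOM itself)
    -> reset /SYS    # power-cycles the host; leaves a wedged LOM wedged

That difference is why ctrl-alt-delete cycling the server, as cmjohnson1 did, still leaves the LOM broken.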
[17:28:22] yep...gonna run the test first and will check on it
[17:28:23] yes
[17:28:23] yes
[17:28:35] LeslieCarr: can you work with apergos to drain traffic on the HE so i can clean the fiber
[17:28:36] sorry, thought u were talking to me
[17:28:42] i am ready to do that whenever you want
[17:28:51] LeslieCarr: ^
[17:29:01] RobH: yay
[17:29:03] hello
[17:29:04] PROBLEM - Host db18 is DOWN: PING CRITICAL - Packet loss = 100%
[17:29:32] by "work with" what he means is (yes, I'm about to say what someone else means :-P but I told him originally)
[17:29:54] LeslieCarr: so yea, in the datacenter today i have to finish this fiber run for mark, do this fiber fix for you two, and receive in some shipments, then i am gonna go home and be sick some more =P
[17:29:55] can you explain what the goal is, then how you do it, and then I'll do it and you can check it?
[17:30:07] apergos: so for draining the fiber, we're going to do the quick and dirty way. :) and yes
[17:30:12] ok
[17:30:14] RobH: you are dedicated
[17:30:15] thank you
[17:30:22] yeah. he is.
[17:30:32] well not the quickest and dirtiest, which would be just pulling the fiber
[17:30:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:30:45] so, that's my first q
[17:30:48] we're going to deactivate the bgp session
[17:30:56] since there was only like 35 mbits on there last I looked
[17:31:08] ok, so
[17:31:25] when people say they aren't going to advertise some route (and it's bgp) is that what they mean
[17:31:29] the nicer way is to block port 179 because it causes bgp to time out instead ...
[17:31:31] or is that something else?
[17:31:49] well so for bgp we sort of exchange the routes that we want the other person to talk to us on
[17:31:50] ok so my first q was, with so little traffic do we care?
[17:31:59] so we exchange our routes, and they give us all of theirs
[17:32:08] ok
[17:32:18] so they just probably put a filter on to not readvertise our routes to their bgp peers
[17:32:35] hmm ok
[17:33:30] so we are just going to deactivate the bgp peer session
[17:34:04] that is easier than blocking the port?
[17:34:13] deactivate protocols bgp group Private-Peer neighbor 216.66.30.89
[17:34:55] yes just because we don't have a filter group for port blocking - we could also just create a firewall filter group and call it something like "no port 179" and apply it on the incoming side of the interface
[17:35:26] ok I see
[17:35:29] this is indeed quicker
[17:36:46] ok, I saw this stanza earlier and at least could recognize it was theirs
[17:37:15] why is it called "private peer"?
[17:37:45] LeslieCarr: so now i should be wiping down all fiber ends when i plug in right? (its what i just did for the one i just ran for mark_)
[17:38:03] because we have a direct connection
[17:38:16] so that's commonly called private peer - public peer would be a fiber to the peering exchange
[17:38:20] RobH: yes please
[17:38:34] you guys gonna be a few minutes still? if so i am gonna go receive in the foundry stuff, which will take me about 10 minutes
[17:38:35] that might not be what the problem was, but I have when playing with scopes seen brand new fibers be all dirty
[17:38:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.050 seconds
[17:38:41] I'm going to be 10 seconds.
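In Junos terms, the two drain options Leslie describes look roughly like this. A sketch only: the deactivate line is the one quoted in the log, while the filter name and interface are hypothetical.

    # option 1: drop the session immediately (what was done here)
    deactivate protocols bgp group Private-Peer neighbor 216.66.30.89
    commit

    # option 2: block TCP port 179 so the session times out instead
    set firewall family inet filter no-bgp term block-bgp from protocol tcp
    set firewall family inet filter no-bgp term block-bgp from port 179
    set firewall family inet filter no-bgp term block-bgp then discard
    set firewall family inet filter no-bgp term accept-rest then accept
    set interfaces xe-0/0/0 unit 0 family inet filter input no-bgp
    commit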
[17:38:46] oh, then i wait =]
[17:39:07] apergos: i love the deactivate command :)
[17:39:13] it's nice
[17:39:18] saves a lot of yping
[17:39:20] typing too :-P
[17:39:30] it claims to be committed
[17:39:44] so you can monitor that port and see the traffic flow on it now
[17:40:12] Output bytes : 514087318112588 288 bps
[17:40:15] yeah I was just there
[17:40:32] the HE fiber is the only unlabeled cable in the cage. i will label if it fixes it when cleaned
[17:40:37] that's at the top level, at .500 it's the same
[17:40:43] it was in before my labeler arrived
[17:41:27] so that looks fine to me
[17:41:57] lemme know when i can pull to wipe
[17:42:11] I'm not actually monitoring (which I'm not sure how to do), just show interfaces blah extensive I guess i could | match bytes or something
[17:42:19] anyways whatever, is he good to go then?
[17:42:55] LeslieCarr: ^?
[17:42:59] good to go!
[17:43:07] sweet
[17:43:21] ok, pulling it at dmarc to confirm its the same unlabeled cable, it should be but i wanna confirm
[17:43:30] RobH: do the C2100s have hot-swappable drives?
[17:43:33] maplebed: did you get paged at night? :)
[17:43:43] so this will be both the cable to psw-2 and the psw2-> them yeah?
[17:43:47] Aaron|away: no. the ms-be hosts are not currently set to page.
[17:44:31] i am cleaning the long run dmarc to peering router first indeed
[17:45:15] ok
[17:45:58] ok thats done, i can pull the peering to cr1 right LeslieCarr ?
[17:46:04] yes you can
[17:46:09] cool, doing that now
[17:46:11] that's the one we just deactivated
[17:46:40] I guess I delete deactivate blah blah to put it back?
[17:46:54] * apergos goes to look at the stanza again
[17:47:31] activate blah blah
[17:47:33] ok
[17:47:34] or "rollback 1"
[17:47:35] :)
[17:47:35] psw2 to cr2 cleaned
[17:47:39] heh rollback!
[17:47:41] psw2 to cr1 is a sfp copper cable
[17:47:48] but no I'm going to do activate
[17:47:54] so cannot clean that ;]
[17:47:58] hmmmmmm
[17:48:01] ah okay RobH - hrm, i have some curiosity if those could have some issues
[17:48:02] you guys should be good to go and test now
[17:48:06] with the conversion and converting back
[17:48:07] yeeaahh
[17:48:20] well can't do anything about it today
[17:48:27] i can replace it.
[17:48:31] with a fiber if needed.
[17:48:36] is that doable?
[17:48:39] but i rather we test and see if this fixed first ;]
[17:48:46] yeah sure
[17:48:55] i prefer to use the copper attached sfps in rack if only cuz they are a LOT sturdier
[17:48:56] besides you're sick
[17:48:56] sadly i am about to run in 5 minutes :( so maybe tomorrow ? (RobH do you think you'll be well tomorrow?)
[17:49:09] I will totally fedex you chicken soup
[17:49:10] if i dont come back down here tomorrow i will be here on monday
[17:49:16] if thats ok?
[17:49:27] ie: if you think its the copper, i can replace it now rather than come back tomorrow ;]
[17:49:46] but those were pricy, rather use em if we can ;]
[17:50:14] that's ok
[17:50:32] let's do it one step at a time
[17:50:38] it might be completely unrelated
[17:50:42] yeah
[17:50:43] sadly
[17:50:58] ok, i am going to clean up the cage and receive in the juniper stuff
[17:51:01] all right, I'm going to bring it back up
[17:51:03] thanks RobH
[17:51:08] will be afk a bit but will be back in channel before i leave eqiad
[17:51:20] cool :)
[17:52:42] hmmmm
[17:53:06] hmmm ?
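Undoing the drain and verifying, per the exchange above; the commands are the ones discussed in the log, while the interface name is hypothetical:

    activate protocols bgp group Private-Peer neighbor 216.66.30.89
    commit
    # or, if the deactivate was the most recent commit: rollback 1, then commit

    # watching the counters come back: the check described at 17:42,
    # plus the interactive equivalent apergos was looking for
    show interfaces xe-0/0/0 extensive | match bytes
    monitor interface xe-0/0/0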
[17:53:51] Output bytes : 514089443911904 287594784 bps
[17:54:35] 287mbit looks ok to me … lemme check observium for historical data
[17:54:45] yeah it does
[17:54:48] i want to say 280-320 is normal
[17:54:48] it looks a lot better
[17:54:57] like about 10 times better
[17:55:19] rerunning the crazy mtr
[17:55:25] from ds1001
[17:55:29] ooo no ploss yet
[17:55:29] ah right
[17:55:29] :)
[17:55:38] that is a good sign....
[17:55:45] well we don't have a big download going at the same time
[17:55:49] where is our HE net guy?
[17:55:51] true true
[17:55:54] * apergos goes to aim
[17:56:23] shit, i gotta run, will be back in about an hour - but so far…. mtr -i 0.03 216.66.22.2 is way happy
[17:56:26] not there, I'll start an iperf server on there and invite him into the channel
[17:56:31] awesome
[17:56:36] I think we can do the basic tests
[17:56:40] i'm on the chan so i'll look at scrollback :)
[17:56:44] thanks for everything
[17:57:05] thank you ! it's awesome to have someone else interested in the tubes :)
[17:57:09] bbiab
[17:59:46] well learning enough networking to be useful has been on my goals for 2 years now
[17:59:56] or 1.5 I guess
[17:59:59] anyways, about time
[18:10:24] LeslieCarrafk: why do you think putting a filter to time out port 179 would be nicer?
[18:10:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:15:46] RobH: so no improvement, sadly
[18:16:12] huh, want me to replace the copper with fiber?
[18:16:14] so either friday or monday when you're there next we'll try other steps
[18:16:16] no.
[18:16:21] I want you to go home
[18:16:32] nah i can do it today, still here
[18:16:37] seriously
[18:16:42] i have a minimum of 30 minutes of work still ;]
[18:16:43] because the next round will be
[18:16:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.002 seconds
[18:16:48] work til we beat it
[18:16:58] well, if we replace the fiber in the rack
[18:17:13] then all thats left is the fiber from dmarc to psw1
[18:17:22] but i wont argue, i feel like shit.
[18:17:23] ;]
[18:17:31] no, the pic on cr1, and psw-2 itself are left
[18:17:47] that's right. don't argue, get better instead
[18:30:09] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused
[18:33:06] heh
[18:33:30] I like that these days a lot of monitoring is so automatic/implicit that you don't even realize it's there when setting up new stuff
[18:34:15] :-)
[18:34:57] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused
[18:40:28] !log deploying new squid frontend.conf to fix epic fail - all googlebot traffic was being redirected to mobile. now just if it's mobilegooglebot.
[18:40:31] Logged the message, Master
[18:42:19] nice
[18:42:26] binasher: do you agree with having a 'misc' varnish cluster?
[18:44:53] mark: in general, yeah
[18:46:40] !log purging entire mobile varnish cache - the main mobile template included robots no-follow
[18:46:43] Logged the message, Master
[18:46:49] binasher: for new varnish servers in amsterdam, I'm gonna go with fewer, more cpu cores/memory than current boxes
[18:47:06] older boxes were sorta optimized for squid still
[18:47:14] but varnish really likes the many cores and mem
[18:47:29] with fewer servers, I mean
[18:47:45] fewer servers, higher performance per server
[18:48:06] i like that, so long as a healthy amount of redundancy remains
[18:48:13] absolutely
[18:48:19] so instead of like 32 servers, I was thinking like 12-16
[18:48:21] hrm, one of the mobile varnish servers died 16 hours ago
[18:48:23] plenty ;)
[18:48:29] yep!
[18:49:52] !log power cycling cp1044
[18:49:55] Logged the message, Master
[18:51:42] mark: what about ssl for services behind misc varnish? perhaps nginx directly on the misc varnish servers?
[18:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:52:24] yeah I was thinking that
[18:52:38] seems cleanest
[18:53:15] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[18:58:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[18:59:06] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[19:01:03] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa
[19:08:15] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:27] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[19:21:27] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[19:21:27] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[19:30:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.400 seconds
[20:01:58] New patchset: Lcarr; "Modifying nagios config files for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2989
[20:02:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2989
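The "nginx directly on the misc varnish servers" idea mark and binasher settle on is plain SSL termination proxying to the local varnish; a hypothetical sketch, with the server name and certificate paths invented for illustration:

    server {
        listen 443 ssl;
        server_name misc.example.org;
        ssl_certificate     /etc/ssl/certs/misc.pem;
        ssl_certificate_key /etc/ssl/private/misc.key;
        location / {
            proxy_pass http://127.0.0.1:80;            # varnish on the same box
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto https;  # so backends know it was https
        }
    }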
[20:12:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:18:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.019 seconds
[20:26:52] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100%
[20:34:27] robh: i want to move the power for the mgmt switch on c1...just want to clear it first
[20:35:12] hrmm, in there are swift and labs stuff, so unless Ryan_Lane or maplebed are actively working on things in that via mgmt
[20:35:14] it should be fine
[20:35:34] is someone messing with ms-be3
[20:35:39] (I would give them a few minutes to reply, if they dont, then they may not be at computer, and thus not using mgmt interfaces)
[20:35:49] and adminlog when you do it of course
[20:36:41] If anyone is working on the following, you are about to be bumped off mgmt for cable cleanup: ms-be2, labstore2, labstore1, es2, and es1
[20:37:27] heh, no reply for this kind of thing is fine, you should be ok to go ahead and move its power
[20:37:56] it wouldnt actually mess up anything if they were on it, it would merely d/c them
[20:38:11] (just make sure it works when you finish)
[20:39:06] !log removing and relocating power to msw-c1-sdtpa
[20:39:10] Logged the message, Master
[20:39:22] i will check it
[20:42:10] PROBLEM - Host ps1-c1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[20:42:20] heh, thats normal
[20:42:23] its connected via mgmt
[20:42:46] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[20:42:50] !log power to msw-c1-sdtpa restored
[20:42:54] Logged the message, Master
[20:52:40] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[20:52:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:57:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.951 seconds
[21:01:40] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[21:03:39] LeslieCarr: so, what do I need to do to make the labstore boxes accessible from labs?
[21:04:23] we need to put them in the right vlan, with the right ip addresses
[21:10:40] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[21:10:40] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[21:24:08] New patchset: Lcarr; "Modifying nagios config files for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2989
[21:24:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2989
[21:30:54] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990
[21:31:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990
[21:32:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:33:21] New review: Demon; "Looks good, but let's not merge it until the core mirror is 100% in sync with svn. Then we can go ah..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990
[21:36:14] New review: Hashar; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/2990
[21:36:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.585 seconds
[21:38:11] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990
[21:38:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990
[21:38:56] New review: Reedy; "Yeah, indeed. No point breaking stuff prematurely? ;)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990
[21:43:09] LeslieCarr: ok, so, which ip addresses to switch to?
[21:43:55] New review: Hashar; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/2990
[21:46:25] Ryan_Lane: so are these supposed to be able to be nfs'ed from any labs machine ?
[21:46:29] aka then on the labs subnet
[21:46:41] well, glusterfs, not nfs
[21:46:43] but yeah
[21:46:50] close enough :)
[21:47:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2863
[21:47:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2863
[21:48:13] so the big thing is making sure that you don't hand out the ip address to someone else - is there a dhcp equivalent file on virt0 ? i'd just give it an ip in that range that you can exclude from being handed out
[21:50:03] there's dhcp
[21:50:11] can we use a different range?
[21:50:36] the cluster could grow…. so I'd prefer to not use the same exact subnet
[21:50:37] technically since it's not routed you could use any range
[21:50:40] New patchset: Hashar; "redirect some missing Swift syslog messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2820
[21:50:48] lemme check out mark's plan
[21:50:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2820
[21:50:52] * Ryan_Lane nods
[21:51:14] 10.5.0.0/16 ?
[21:51:30] works for me
[21:51:32] give it a /24 netmask
[21:51:41] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2991
[21:51:52] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2991
[21:53:26] Change abandoned: Ryan Lane; "But he said it's just a test, and he said it's just a test." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2991
[21:55:26] New patchset: Ryan Lane; "Only run this for the puppet repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2992
[21:55:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2992
[21:58:10] Ryan_Lane: which machines (or did you already open a ticket i should look for)
[21:58:18] labstore1-4
[21:58:23] I'm changing them in dns now
[21:59:08] New review: Hashar; "Clean and easy way to fix that bug. Kudos!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2992
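Leslie's "exclude from being handed out" suggestion is the usual ISC dhcpd pattern: the dynamic pool covers only part of the subnet, and statically assigned hosts live outside it. A hypothetical sketch, not virt0's actual config:

    subnet 10.5.0.0 netmask 255.255.255.0 {
        range 10.5.0.100 10.5.0.250;    # dynamic leases come only from here
        # 10.5.0.1 - 10.5.0.99 left free for statically assigned boxes
    }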
[21:59:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2992
[21:59:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2992
[21:59:58] LeslieCarr: can you review my change in the dns svn?
[22:01:04] looking
[22:01:11] heh, there are 10 root logins on sockpuppet
[22:01:23] Change abandoned: Ryan Lane; "..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2954
[22:01:26] someday we will work in sudo... ;]
[22:01:40] looks good to me
[22:01:42] s/someday/likely never/
[22:01:44] ok
[22:01:56] crap
[22:02:01] the ms-fe stuff is in there
[22:02:06] and someone's got ms-fe in that change too :)
[22:02:08] maplebed: ok for me to push out your ms-fe stuff?
[22:02:10] damn, you win :)
[22:02:16] oh shit
[22:02:17] thats me
[22:02:19] i forgot to push it
[22:02:20] ah
[22:02:21] which ms-fe stuff?
[22:02:22] I'll push it
[22:02:23] oh well ;]
[22:02:27] thx
[22:03:04] RobH: done
[22:03:13] port swaps done
[22:03:13] thx!
[22:03:53] !log oxygen coming down for reinstall
[22:03:57] Logged the message, RobH
[22:05:37] can't breathe....
[22:05:50] budump ching
[22:07:51] LeslieCarr: what gateway am I using?
[22:09:21] oh shit, for that i didn't think, if you want to have multiple vlans you need to have routing inside that little area
[22:09:44] didnt we add more public ips to row a?
[22:09:48] you'll have to either use the same vlan (so l2 connectivity)
[22:09:56] RobH: we moved things from public to private
[22:10:02] RobH: so same effect
[22:10:19] oh, freed up a ton of the existing, i see
[22:10:21] ok
[22:10:22] Ryan_Lane: let's talk about that in physical form when you get back
[22:11:42] !log updating dns for oxygen
[22:11:45] Logged the message, RobH
[22:12:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:17:10] Ryan_Lane: http://www.xenocafe.com/tutorials/linux/redhat/bind_multiple_ip_addresses_to_single_nic/index.php
[22:18:06] ewwww
[22:18:09] thats not ubuntu ;]
[22:18:40] man netboot works a lot better when you actually update dns.
[22:18:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[22:20:56] don't you be ewwwing my os :-P
[22:21:57] too late!
[22:22:30] I'm giving you a pass but only cause yer sick
[22:22:38] hehe
[22:22:47] also bedtime for me
[22:23:03] have a good rest of your day
[22:23:10] nite apergos
[22:24:46] LeslieCarr: http://gluster.org/community/documentation/index.php/Gluster_3.2:_Installing_GlusterFS_on_Debian-based_Distributions
[22:24:56] 111, 24007, 24008, 24009-(24009 + number of bricks across all volumes)
[22:25:37] installer takes forever to format =P
[22:26:47] New patchset: Asher; "syncing up with https://github.com/asher/gdash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2993
[22:26:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2993
[22:28:17] Ryan_Lane: what are labstore3 and 4 going to be ip'ed as ?
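The GlusterFS 3.2 port list Ryan quotes is what any firewall hole has to cover; a minimal iptables sketch, where the source subnet and the number of bricks are assumptions for illustration:

    # portmapper (tcp+udp 111), glusterd (24007) and management (24008)
    iptables -A INPUT -s 10.4.0.0/24 -p tcp -m multiport --dports 111,24007,24008 -j ACCEPT
    iptables -A INPUT -s 10.4.0.0/24 -p udp --dport 111 -j ACCEPT
    # one port per brick, counting up from 24009; here, room for 100 bricks
    iptables -A INPUT -s 10.4.0.0/24 -p tcp --dport 24009:24108 -j ACCEPT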
[22:28:17] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2993
[22:28:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2993
[22:30:45] LeslieCarr: 10.0.0.41-44
[22:30:49] is labstore1-4
[22:30:51] cool
[22:39:46] !log poked hole to allow labs machines to reach gluster machines in tampa
[22:39:49] Logged the message, Mistress of the network gear.
[22:39:50] New patchset: Pyoungmeister; "script cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2994
[22:40:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2994
[22:40:44] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2994
[22:40:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2994
[22:46:37] New patchset: RobH; "added other locke items to oxygen as it is the locke replacement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2995
[22:46:48] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2995
[22:47:31] ok, wtf....
[22:47:39] all kinds of crazy errors on my commit...
[22:48:43] warning: You cannot collect without storeconfigs being set on line 52 in file /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/ssh.pp warning: You cannot collect exported resources without storeconfigs being set;
[22:48:49] Ryan_Lane: this doesnt seem to be me, any ideas?
[22:49:29] that's normal
[22:49:32] look at the last line
[22:49:54] the puppet help parser validate?
[22:50:36] LeslieCarr: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Gluster#Network_filtering_and_ports_for_gluster
[22:50:37] :)
[22:50:51] RobH: file /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/webserver.pp at line 251 err: Could not parse for environment production: Syntax error at '='; expected '}' at /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/site.pp:1485 err: Try 'puppet help parser validate' for usage
[22:51:05] very last line
[22:51:31] err
[22:51:31] Could not parse for environment production: Syntax error at '='; expected '}' at /var/lib/gerrit2/review_site/tmp/I499352c8edf3a0cd7fa995f66c33d2f3a0e5110a/manifests/site.pp:1485 err: Try 'puppet help parser validate' for usage
[22:51:49] I wonder if we can tell it to ignore warnings
[22:52:02] so many ports!
[22:52:06] i still have no idea whats wrong
[22:52:14] as oxygen syntax matches locke perfectly.
[22:52:26] they are identical, and locke worked in the manifest before.
[22:52:27] RobH: dude. it's telling you that you have a syntax error in site.pp at line 1485
[22:52:38] oh, bad include, bahg
[22:52:42] i see what i did, nm
[22:52:45] sorry, im out of it.
[22:52:48] heh
[22:52:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:54:20] New patchset: RobH; "damned typo added other locke items to oxygen as it is the locke replacement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2995
[22:54:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2995
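Ryan's advice above - skip the storeconfigs warnings, read the final err: line - applies just as well to running the check locally before pushing; gerrit2's lint is the same parser validation. A minimal sketch:

    # validate a manifest without applying it
    puppet parser validate manifests/site.pp
    # the storeconfigs warnings are non-fatal noise; a non-zero exit with a
    # closing "err: Could not parse ..." line is the actual failure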
[22:56:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.585 seconds
[22:57:28] RobH: that's the sign it's time to stop touching machines and start touching bedtime
[22:57:41] then folks need to stop handing me work to do
[22:57:43] ;p
[22:58:44] !log powercycled ms-be3 - it crashed 2.5 hours ago.
[22:58:47] Logged the message, Master
[22:58:51] RobH: ^^^ bad news.
[22:59:03] ....
[22:59:09] whyyyyyy
[22:59:24] I connected to the console before the powercycle - nothing.
[22:59:27] something aint right
[22:59:36] I'm connected now, seems to be booting ok so far (but hasn't gotten to the os)
[22:59:45] should we be concerned about search machines having full /a's ?
[22:59:54] amazing how our test c2100 had none of these issues
[22:59:55] (for example search1017 )
[23:00:02] where's the damn nagios bot, thinking of that
[23:00:09] yeah, ms-be1's been up for weeks.
[23:00:15] search may not be setup to alert nagios for that
[23:00:36] notpeter (even if he doesnt like it) i our search expert ;]
[23:00:43] is even
[23:00:55] notpeter: are you around today?
[23:01:10] New review: RobH; "looks right to me, though the puppet class names are a bit servername specific to locke. checking t..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2995
[23:01:13] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2995
[23:02:02] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[23:02:40] LeslieCarr: was going to go get dinner, but what's up?
[23:03:27] search1017/1018 have full /a partitions
[23:03:34] should we be concerned and if so, what is safe to delete ?
[23:03:40] nah
[23:03:49] rob is getting them bigger disks
[23:03:54] and they're not in prod yet
[23:04:02] okay
[23:04:02] hrmm, did i order those yet...
[23:04:12] rob is, right?
[23:04:13] =P
[23:04:21] LeslieCarr: but thanks for checking!
[23:04:31] mark questions if 7200 is fast enough rpm
[23:04:37] since the search boxes now have 10k
[23:04:42] notpeter: thoughts?
[23:05:44] RobH: it only needs to be as fast as the faster of search18 and search12's disks
[23:05:59] they're both doing just fine, currently
[23:06:13] if those are all 10k, then that seems prudent
[23:06:27] if 7.2 or slower, then... whatevs
[23:06:53] ok, but srsly, dinner now. ttfn
[23:07:44] RECOVERY - Host db1033 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms
[23:08:20] New patchset: Ryan Lane; "Add gluster to instances by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2996
[23:08:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2996
[23:08:38] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2996
[23:08:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2996
[23:09:30] hrmm, trying to determine what is in search18 from cli
[23:11:40] bah
[23:11:44] they are 15k in the old servers
[23:11:48] so yea, gonna have to up the speed.
[23:11:56] PROBLEM - mysqld processes on db1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:13:53] RECOVERY - mysqld processes on db1033 is OK: PROCS OK: 1 process with command name mysqld
[23:19:23] RobH: would you mind saying something with ms-be in it, then something else with ms-be3 in it? I'm testing IRC highlighting.
[23:19:35] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 23854 seconds
[23:19:39] ms-be are c2100
[23:19:48] ms-be3 is a specific server
[23:19:57] =]
[23:19:58] lame. I did it wrong.
[23:20:02] but thanks.
[23:20:02] PROBLEM - MySQL Slave Running on db1033 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the
[23:20:08] lemme know when to do again
[23:20:54] RobH: once more pls?
[23:21:10] ms-be
[23:21:13] ms-be3
[23:21:20] cool.
[23:21:32] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay seconds
[23:21:40] New patchset: Ryan Lane; "Adding base directory for labs automounts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2998
[23:21:47] linkinus has channel-specific highlighting in addition to global highlighting, in theory. the first try was with channel-specific (failed) the second with global (succeeded.)
[23:21:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2998
[23:21:57] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2998
[23:21:59] RECOVERY - MySQL Slave Running on db1033 is OK: OK replication
[23:22:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2998
[23:24:30] !log streaming hot backup of db1017 to db1033 - no snapshots of enwiki in eqiad til db1033 is back
[23:24:33] Logged the message, Master
[23:25:53] PROBLEM - mysqld processes on db1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:27:18] Ryan_Lane: do we have deploy servers in puppet (esp in labs) for puppet ?
[23:27:25] ldap
[23:33:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:36:34] New patchset: Ryan Lane; "Adding gluster options for autofs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3000
[23:36:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3000
[23:36:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.925 seconds
[23:37:09] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3000
[23:37:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3000
[23:42:49] !log rebooting ms-be5
[23:42:52] Logged the message, Master
[23:45:41] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms
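Ryan's last two merged changes ("Adding base directory for labs automounts", "Adding gluster options for autofs") point at an automounter setup for the labs gluster volumes; a hypothetical sketch of what such a map can look like, with the server and volume names invented for illustration:

    # /etc/auto.master: hand /data over to an indirect map, unmount after 5 min idle
    /data  /etc/auto.gluster  --timeout=300
    # /etc/auto.gluster: each key mounts <server>:/<volume> on demand at /data/<key>
    project  -fstype=glusterfs  labstore1:/project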