[00:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.616 seconds [00:23:41] robla ..sorry was out for a moment [00:25:04] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [00:30:46] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [00:37:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.990 seconds [01:19:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.767 seconds [01:41:31] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 259 seconds [01:44:22] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:01:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.117 seconds [02:58:49] PROBLEM - Puppet freshness on gurvin is CRITICAL: Puppet has not run in the last 10 hours [03:00:38] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [03:00:38] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [03:00:38] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:00:38] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [03:21:47] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[03:26:08] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active [04:11:26] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [04:26:17] PROBLEM - Host mw1135 is DOWN: PING CRITICAL - Packet loss = 100% [04:26:44] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/telenor-montenegro.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [04:40:50] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 190 seconds [04:42:11] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [04:49:32] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [06:22:30] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:23:42] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.019 second response time on port 8123 [06:33:00] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:35:52] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [06:39:19] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:40:31] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:42:10] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [06:48:01] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:49:13] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:52:40] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [07:03:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:05:25] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:06:39] RECOVERY - LVS Lucene on 
search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:12:12] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:14:54] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:15:41] !log restarted lsearchd on search1016 [07:15:57] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [07:15:58] thanks. I restarted it too just now [07:16:02] I wonder who got to it first [07:16:20] me of course [07:16:20] :P [07:16:37] I'll leave it for you the next time then :-P [07:16:47] you mean the other way round [07:17:09] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [07:18:22] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [07:21:25] so what I have noticed the last few times is that a restart doesn't quite do the trick: the old lucene search doesn't die [07:21:48] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [07:21:49] so I wind up doing a stop, checking that there are no lucene searches running, there is one, I shoot it, and then start it up again [07:22:34] (which was true today as well, there was one from May 20 that refused to die the usual way) [07:22:54] yeah, the same old problem [07:40:24] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [07:47:36] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [11:29:11] if there are any ops around, could you please [abandon] https://gerrit.wikimedia.org/r/#/c/8710/ ? 
;-) [11:29:40] hi there [11:29:56] hello :) [11:30:10] I did some mess with bugzilla during lunch break [11:30:37] follow up in -labs [11:49:50] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:54:12] New review: Catrope; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/6489 [12:14:08] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:51:41] New patchset: Mark Bergsma; "Update TODO (from svn work space)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8750 [12:52:17] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8750 [12:52:27] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8750 [12:52:29] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8750 [12:57:03] New patchset: Mark Bergsma; "Old WIP: monitor for checking the VIP on real servers (untested)" [operations/debs/pybal] (vipping) - https://gerrit.wikimedia.org/r/8753 [12:57:28] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (vipping); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8753 [12:57:38] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (vipping); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8753 [12:57:40] Change merged: Mark Bergsma; [operations/debs/pybal] (vipping) - https://gerrit.wikimedia.org/r/8753 [12:59:52] PROBLEM - Puppet freshness on gurvin is CRITICAL: Puppet has not run in the last 10 hours [13:01:49] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [13:01:49] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [13:01:49] PROBLEM - Puppet 
freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [13:01:49] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:22:27] New patchset: Dzahn; "move more color stuff into proper external CSS, general tabbing, minor fixes to status codes.." [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/8756 [13:23:20] morning mister mark! [13:23:38] I've got the geoip stuff we talked about committed [13:23:44] ben reviewed it for me yesterday [13:23:52] but he thought you should check it out too, and I agree [13:24:06] so! here we go: https://gerrit.wikimedia.org/r/#/c/8677/ [13:24:23] I think it is good, let me know if you catch anything or think of anything you'd like me to change [13:25:28] ok [13:30:32] i would also like a comment on 6467 by hashar,which i am a reviewer for. not so much about the technical part, but more as an example for other changes in generic-definitions which install tools for labs and the general labs vs. prod issue with these. [13:30:52] this specific one installs ack-grep, but more about how to handle all these little helper tools [13:31:48] (it's right after joe :p) [13:32:56] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8756 [13:32:58] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/8756 [13:32:59] New review: Mark Bergsma; "Very nicely done!" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8677 [13:36:38] thanks mark! [13:38:18] mutante, how do I see the inline comments you've left? [13:38:25] on the ack-grep thing [13:38:46] oh i got it [13:38:51] have to go to the diff of the patch set [13:40:20] yep. well i ask more how to generally handle little helper packages that are requested on labs without conflicting with what we want in prod [13:40:39] is this commit for labs? [13:41:06] that's fine, no? 
it is a standalone class that doesn't depend on anything else [13:41:14] although i'm not so sure about the nested class for the link [13:41:21] yes, it is for labs, but it changes generic-definitions [13:41:32] that's ok, i think [13:41:39] and unless we give up on ever merging .. or keep adding lots of "if $realm"... [13:41:40] actually, that does not belong in generic-definitions [13:41:43] it is not a definition! [13:41:44] it is a class [13:41:50] why not just have a packages/ dir [13:41:53] (or module!) [13:42:08] then it further adds to those commits that will have to be picked out to get closer to a merge [13:42:08] with class packages::ack-grep [13:42:14] then anyone can include it [13:42:19] oh pshh [13:42:31] i thought I had been told they were giving up on the test/production merge [13:42:38] maybe I made that up [13:42:43] but I thought I had heard that mentioned [13:42:54] New patchset: Mark Bergsma; "Ran dh_clean" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8761 [13:43:14] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8761 [13:43:24] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8761 [13:43:26] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8761 [13:43:34] well, kind of, but on the other hand there is an initiative to go through these changes that currently keep you from merging [13:43:49] that's why i ask, using this change as an example [13:43:54] mark, re misc:: [13:44:03] why are these 'misc'? [13:44:08] vs. 'generic' [13:44:10] or whatever [13:44:14] is there a reason to namespace them as such? [13:44:16] yes these are more generic [13:44:20] is there some guiding principle I don't know about? 
[13:44:26] it's not entirely clear [13:44:33] but I'd say this belongs in generic-definitions [13:44:36] or, a generic/ subdir [13:44:38] would be nicer [13:44:39] mutante, aye ok [13:44:47] or a modules/geoip folder? [13:44:54] heheh [13:44:54] misc/ is for "misc servers", generally we have only one or two of those [13:45:01] hmmm, ok [13:45:01] or a module, yes [13:45:10] we currently don't have any modules [13:45:14] indeed [13:45:16] this belongs in one [13:45:19] would it be crazy to start that now? [13:45:23] the java class I did does as well [13:45:28] as does the mysql instance thing [13:45:30] no i'd be fine with that [13:45:34] hmmmm ok [13:45:38] is just more work [13:45:46] i'm also fine with you committing this now [13:45:51] and then doing it as a subsequent step [13:45:55] maybe for now [13:45:56] but perhaps take it out of misc first ;) [13:45:59] i just move geoip to manifests? [13:46:01] yeah [13:46:02] yeah [13:46:03] fine with me [13:46:03] manifests/geoip? [13:46:04] ok [13:46:16] i'll do the same with java.pp [13:46:26] uhhh, but in a separate commit [13:46:27] k [13:50:46] New patchset: Ottomata; "Rewrote geoip.pp to be more modular and to use the licensed Maxmind GeoIP data files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8677 [13:51:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8677 [13:51:47] we need diffs between diffs ;) [13:52:07] heh, yeah, could do separate commits! 
[13:52:09] actually, this was good [13:52:16] I had unsaved files in textmate that didn't get committed [13:52:23] from my s/misc::geop/geoip/g [13:52:26] mark: That feature kind of exists in Gerrit actually [13:52:27] so those just got included too [13:52:51] brb [13:54:05] oh yes I see [13:54:41] nice [13:55:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8677 [13:55:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8677 [13:59:15] New patchset: Mark Bergsma; "Import geoip.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8763 [13:59:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8763 [13:59:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8763 [13:59:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8763 [14:01:43] New patchset: Ottomata; "check_udp2log_log_age - Adding all Wikipedia Zero filters to slow log list. Change-Id: Ie829fae652692bbf076443d405cecd0a7e086f61" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8456 [14:02:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8456 [14:02:08] would appreciate approval of that one [14:02:21] it is to cut down on false positive nagios alerts [14:02:23] first merging your geoip changes [14:02:28] ok danke [14:02:45] yeah, excited to make sure that works! [14:02:50] i bet there will be SOMETHING that doesn't work right [14:02:52] there always is [14:02:59] there was already ;) [14:03:03] needed to import geoip.pp now [14:03:05] out for food and later meeting peter. cya [14:03:12] oh! manifests aren't automatically imported? [14:03:17] no [14:03:17] manifests/*.pp? 
[14:03:22] yeah we don't do that [14:03:29] hm, should we? [14:03:37] let's not [14:03:41] haha ok [14:03:46] notice: /Stage[main]/Geoip::Data::Download/File[/var/lib/puppet/volatile/GeoIP]/ensure: created [14:03:46] you don't want to break everything!? [14:03:47] err: /Stage[main]/Geoip::Data::Download/Exec[geoipupdate]: Failed to call refresh: /usr/bin/geoipupdate -f /etc/GeoIP.conf -d /var/lib/puppet/volatile/GeoIP returned 242 instead of one of [0] at /var/lib/git/operations/puppet/manifests/geoip.pp:182 [14:03:49] no fun. [14:04:12] is /etc/GeoIP.conf in place? [14:04:30] yes [14:04:47] run [14:05:06] sudo /usr/bin/geoipupdate -f /etc/GeoIP.conf -d /var/lib/puppet/volatile/GeoIP -v [14:05:07] what's it say? [14:05:38] ah hehe [14:05:42] it doesn't have internet access [14:05:45] ah! [14:05:48] uhhh [14:05:53] really? should it? [14:05:58] oh! [14:05:59] that's right [14:06:00] no, internal servers [14:06:06] leslie had done this with a proxy for the wget [14:06:06] hmm [14:06:11] yeah [14:06:27] environment => "http_proxy=http://brewster.wikimedia.org:8080", [14:06:28] hmmmm [14:06:31] crapo [14:08:05] oh is that a shell thing? or a wget thing? [14:08:06] hmm [14:08:19] that's a general environment variable, also used by wget [14:08:22] lemme try if that works [14:08:23] oh cooooool [14:08:25] ok [14:08:33] i assumed it was for wget [14:08:38] awesome, didn't know that existed [14:08:42] yes that works [14:08:45] can you add that? [14:08:54] yup [14:11:56] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [14:13:53] hmm, mark, this is kind of hard for me to test [14:13:57] I know what to do [14:14:10] since you are with me now, mind if I commit this and you can just try it? [14:14:15] absolutely [14:14:26] let's do that [14:17:19] New patchset: Ottomata; "geoip.pp - passing environment parameter to cron and exec for geoipupdate command." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8767 [14:17:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8767 [14:18:47] mmk, there we go [14:19:17] that may not be optimal for the labs puppetmaster [14:19:23] but we'll see about that later [14:19:36] hm, ok [14:19:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8767 [14:19:41] good thing is it is a parameter [14:19:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8767 [14:19:46] so we could conditionally pass it [14:19:59] i think other places in puppetmaster.pp do that [14:20:04] if $labs or something [14:21:08] better now :) [14:22:40] yay, it worked on puppetmaster? [14:22:43] if so I will try on stat1 [14:22:45] and on another host [14:22:50] coool [14:24:26] Change abandoned: Mark Bergsma; "Redone in a much more awesome way by ottomata" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6240 [14:24:40] perfect! [14:24:44] works on stat1 too [14:25:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8456 [14:25:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8456 [14:26:09] danke! 
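Note on the http_proxy fix discussed above: the failure came from geoipupdate running on a host without direct internet access, and the log reports that exporting http_proxy for the command made it work. The Puppet change itself is not shown in the log, so what follows is only a minimal Python sketch of the same idea, namely handing a proxy to one child process through its environment. The proxy URL is the one quoted in the log; the helper names proxy_env and run_with_proxy are hypothetical, and the child command is a stand-in rather than a real geoipupdate call.

```python
import os
import subprocess
import sys

# Proxy URL quoted in the log; everything else in this sketch is illustrative.
PROXY_URL = "http://brewster.wikimedia.org:8080"


def proxy_env(proxy_url, base_env=None):
    """Return a copy of the environment with http_proxy set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    return env


def run_with_proxy(cmd, proxy_url=PROXY_URL):
    """Run cmd with http_proxy set for the child process only (hypothetical helper)."""
    return subprocess.run(
        cmd,
        env=proxy_env(proxy_url),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Stand-in child: print the variable it inherited instead of calling
    # geoipupdate, which would need a licence file and network access.
    child = [sys.executable, "-c", "import os; print(os.environ['http_proxy'])"]
    print(run_with_proxy(child).stdout.strip())
```

Building the environment as a copy keeps the proxy scoped to the one command, which is roughly what a per-resource environment parameter achieves; as noted in the chat, making it a parameter also leaves room to pass it conditionally, for instance on hosts where a different proxy (or none) applies.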
[14:26:49] mark, if you don't mind, I have two more little ones for you [14:26:53] this is nothing [14:26:53] https://gerrit.wikimedia.org/r/#/c/8488/ [14:26:55] just a comment change [14:27:10] this is so we can sudo -u on the new analytics cluster [14:27:10] https://gerrit.wikimedia.org/r/#/c/8713/ [14:29:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8488 [14:29:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8488 [14:30:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8713 [14:30:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8713 [14:30:13] awesome, thanks [14:30:26] ok, i'm gonna run some errands [14:30:28] back in an hour or so [14:30:28] i like it when my gerrit approval queue is low [14:30:29] great [14:30:41] i like it when the gerrit approval queue is low... which is not now [14:30:49] heh, yeah [14:31:09] thanks again so much! i def needed your help for making sure the puppetmaster changes worked [14:31:11] yeehaw [14:31:19] thanks for those changes [14:31:24] yup :) [15:11:08] New review: Nikerabbit; "Andrew: this script would run a system which can run maintenance scripts and send emails. I don't kn..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5783 [15:22:08] Hi, I have posted a bug report regarding the banishment of my IP address https://bugzilla.wikimedia.org/show_bug.cgi?id=37089 [15:45:05] !log changing ram distribution on search1015 and search1016 [16:24:44] hey folks, question about gerrit search, i am trying to run the query 'NOT is:merged' using the web interface but it throws an application error. is this a config error on our side, or a bug in gerrit? 
[16:37:12] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [16:53:15] hey folks, question about gerrit search, i am trying to run the query 'NOT is:merged' using the web interface but it throws an application error. is this a config error on our side, or a bug in gerrit? [16:55:39] PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:55:52] drdee2: Known issue in Gerrit [16:55:59] You can't apply any logic to is:foo statements [16:56:14] Even "is:merged or is:open" doesn't work [16:56:15] New patchset: Jgreen; "fixed dupe uid, added maxsem and mmullie" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8795 [16:56:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8795 [16:57:00] RECOVERY - NTP peers on linne is OK: NTP OK: Offset 0.010721 secs [16:57:19] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8795 [16:57:22] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8795 [17:01:37] robh: the memory on cp1017....same dimm? [17:02:50] as before? [17:05:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:05:44] RoanKattouw: thanks, but it is annoying :( [17:06:27] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:03] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:10:30] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [17:14:02] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6005 [17:20:42] cmjohnson1_: yep, same dimms [17:20:54] maplebed: you wanted the dual 160 GB ssds in the swift hosts right? [17:21:01] i think i may have the parts on site to do it to all the ones in ashburn. 
[17:21:06] tampa we may have to order some stuff [17:21:29] RobH: yeah. [17:21:35] * AaronSchulz yays [17:21:43] though I actually don't care about the size (64G would be plenty) but we have them, so... [17:22:21] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [17:23:15] there are two 80G ssds in eiximenis [17:23:25] (or were?) [17:23:51] CT says there are far more SSDs there than I'll use; I'd rather not start juggling things just for the sake of it. [17:24:16] not in tampa [17:26:37] maplebed: so we have spare 160s in both locations [17:26:42] i just confirmed with chris [17:26:57] we also have the sata cables (came with another batch of SSDs and we have plenty of spare cables) [17:27:06] what we dont have is the triple length hdd screws needed [17:27:21] the 2.5" mounts on top of the cooling duct in the c2100, has large rubber mounting washers [17:27:25] that require a longer screw [17:27:36] so im going to check online for a source on them, if i can order just a bunch of screws we are set [17:27:44] otherwise have to order useless kit of cables and screws from dell =P [17:31:06] RobH: ossm. [17:31:08] thanks. [17:32:41] yea the chassis has the place to mount, but not the screws or sata cables [17:40:38] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/8804 [17:40:54] maplebed or Ryan_Lane: can one of you merge this https://gerrit.wikimedia.org/r/#/c/8804/ for me? [17:41:08] yeah. [17:42:21] mark: around? [17:43:11] preilly: you don't want to make the same change to requests for zero. for orange tunisia and ivory coast? 
[17:43:34] maplebed: no [17:44:14] line 71 is redundant [17:44:28] (it's caught by the /22 two lines below) [17:44:40] maplebed: So I just ordered a few screws from 4mm to 8mm increments, they will come in next monday [17:44:51] so on tuesday or so i will try them out on the swift hardware and see if they work [17:44:53] RobH: nice and fast! [17:44:56] if they do, i drop half in mail to chris [17:45:20] heh, spare SSDs, spare sata cables [17:45:27] maplebed: cheapest hdd upgrade ever [17:46:21] preilly: do you want me to merge anyways or are you pulling that line? [17:48:02] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [17:48:12] maplebed: 71 is removed in patch set 2 [17:48:20] ok. merging. [17:48:42] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8804 [17:48:44] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8804 [17:49:27] mark: I've done quite a few changes to pybal's git; I wonder how to push it to gerrit though [17:49:51] maplebed: i have msfe1001 and msfe1002 as old misc servers [17:49:54] and the new ms-fe servers [17:49:54] mark: I saw that your commits didn't have a Change-Id; should mine do? [17:49:58] !log pushing out gerrit change 8804 for mobile [17:50:01] will we be reclaiming and renaming msfe1001? [17:50:10] or should i name the new stuff 1003+ ? [17:50:35] (i forgot they existed and called the new ones ms-fe1001+ ;P ) [17:50:43] RobH: I think they should be reclaimed. [17:50:50] the new ones are different hardware, right? [17:50:53] awesome, then i dont have to relabel the new ms-fe1001 [17:51:03] no, all the swift hosts should have the hyphen. [17:51:06] maplebed: the order was messed up and i only ordered 4 not 5 [17:51:06] ms-fe and ms-be. [17:51:13] so you need 5 front ends right? 
[17:51:20] i just need to do an additional order is all [17:51:21] cool [17:51:29] I wanted 5; I think mark squashed it down to 4. [17:51:35] ahh, that was it then [17:51:42] cool, was afraid i fubared the order ;] [17:52:17] preilly: the change is live. do you need a cache flush too? [17:52:38] maplebed: I don't think so [17:52:45] ok. [17:59:14] New patchset: RobH; "added ms-fe1001-1004 entries" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8807 [17:59:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8807 [18:01:44] New review: RobH; "simple adding hosts to dhcp and netboot, all good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8807 [18:01:46] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8807 [18:04:23] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:25] maplebed: the front end swift hosts are still lucid yes? [18:04:30] yes. [18:04:32] cool [18:04:45] showing chris how to do installs, we are going to go ahead and install ms-fe1001-1004 [18:04:52] cool. [18:05:35] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:09:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:09:27] srv278 again [18:09:31] should we just power it down? [18:09:50] (there's an RT ticket about it, it's been flapping for months) [18:10:41] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.326 second response time [18:10:42] then we'll really ignore it forever ;) [18:11:12] occasionally the spam annoys people enough to take action [18:13:35] an easy merge for ops : https://gerrit.wikimedia.org/r/#/c/7877/ ;-D [18:14:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [18:18:31] RoanKattouw: can you +1 ^^ for me, if you are ok with the changes? 
[18:18:38] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [18:18:54] Ryan_Lane: Do you realize I *submitted* that change? [18:18:59] hahahahahaha [18:19:05] no. I didn't notice that [18:19:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7877 [18:19:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7877 [18:19:22] I'll shut up and merge then :) [18:19:43] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/8732/ [18:19:56] Ha! [18:19:58] Nice [18:19:58] I have that pretty manually set up right now in labs [18:20:30] I won't have time to review that soon but I do want to review it [18:20:35] * Ryan_Lane nods [18:20:36] Maybe we can sit down for that in Berlin or something? [18:20:40] I have more work to do on it [18:20:40] yeah [18:20:42] OK [18:20:46] that's what I hoped for [18:21:05] but, it works [18:21:18] it's basic and missing some things, but it deploys code perfectly fine :) [18:41:01] New patchset: Jeremyb; "dedupe code: foreachwiki vs. foreachwikiindblist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434 [18:41:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8434 [18:42:21] * jeremyb wonders why he can't see 8732. it's in the private repo? [18:43:21] New review: Jeremyb; "rebased for I7116a708f8b0b92a6d5f30c018fee0eb06f6c4db" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8434 [18:43:35] * jeremyb runs away [18:59:10] !log killing puppet daemon on brewster for local install hacking to test something [19:07:39] New patchset: RobH; "stupid logic, updated netboot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8815 [19:07:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8815 [19:08:42] hmmmmm, is there any way in gerrit to query for changes that touch a given file or dir? [19:08:46] CR had that [19:09:12] (specifically i want unmerged stuff) [19:09:14] New review: RobH; "changing the ms-fe netboot logic to do ms-fe*, looks fine" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8815 [19:09:17] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8815 [19:09:18] hi, it seems that the ops team should take a look at my bug : https://bugzilla.wikimedia.org/show_bug.cgi?id=37089 [19:10:21] !rt 3024 [19:10:21] https://rt.wikimedia.org/Ticket/Display.html?id=3024 [19:10:26] that's rguillebert's RT [19:12:30] is that the issue that caused the enwiki slowdown the other day? [19:12:35] i've not a clue [19:12:44] (i can't see the RT of course) [19:12:46] (someone overused the api the other day im not sure who) [19:12:55] the RT just points to the ticket ;] [19:13:04] just ops doesnt watch bz like we do RT is all [19:13:10] right [19:13:14] if it's 2 or 3 days ago then yes it's me [19:13:31] i thought maybe it would be in a changelog somewhere? or ops@ [19:13:39] hrmm, i dunno who would make that call [19:13:48] mark if he is about would be the network admin, thus prolly his call [19:14:13] right, mark was my first thought. although i was thinking we could just ask whoever implemented it to begin with [19:14:18] lemme see if admin log says who did the ban [19:14:19] idk how to figure out who that is [19:15:30] rguillebert: So I do not have a good answer, but I can do this. I will email our internal ops list with the ticket info to see if someone who did the ban can chime in [19:15:38] i dont want whoever did it to undo my undoing it and all ;] [19:15:52] ok :) [19:16:53] it really caused a slowdown on wikipedia ? 
[19:17:15] on the api calls i think [19:17:34] but i am honestly not entirely in the know [19:17:46] i have been in the bowels of the datacenter proper for a few weeks now [19:18:20] mark did the ban on the router side [19:18:24] I did it at the squid level [19:18:43] right, so even if we unban the squid we need mark or leslie to undo the null route on the router [19:18:48] we were working independently of each other and both went into effect around the same time :-D [19:19:07] rguillebert: you are infamous ;] [19:19:37] I could also do it. (but not tonight, I am off the clock) [19:19:46] I honestly didn't know I could cause so much mess [19:20:30] I'm ok with lifting it, with the understanding that if it were to happen again the ban would come back and stay back for a loooooong time [19:20:39] let's see what mark says also. [19:23:20] I'm going to go wander off and read, see folks later [19:35:32] hi guys [19:35:36] this is kinda important [19:35:36] https://rt.wikimedia.org/Ticket/Display.html?id=3025 [19:35:45] we're a bit twiddly until that is fixed [19:37:38] ottomata: uhh, you mean they need public ips? [19:37:47] apt-get should work [19:37:58] we have our own apt repo, so you can use that for apt [19:38:11] but they are indeed internal vlan systems, so they wont be internet accessible [19:38:24] ottomata: reinstall and repuppetize to bring them to public vlan... =[ [19:39:57] added comments to ticket [19:44:45] RobH: did you ever get a quote from dell for the potential ssd db server type we talked about that takes 12 x 2.5" drives, but with just 2 sas drives installed? 
[19:45:15] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:50:57] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:51:53] RobH, sorry was not looking in this room for a sec [19:51:56] saw your note [19:52:06] they don't need public IPs [19:52:09] just internet access [19:52:17] is there a gateway somewhere they can reach? [19:52:20] we can just set default route? [19:52:59] right now route is default vrrp-gw-1021.eq 0.0.0.0 [19:53:07] dunno what vrrp-gw-1021.eq is [19:53:20] can't reach it [20:04:13] maplebed and/or ryan_lane: I have a question about ryan's favorite patch, https://gerrit.wikimedia.org/r/#/c/5783/ [20:04:42] I'd imagine it should run on hume or fenari [20:04:46] RoanKattouw: ^^ ? [20:04:54] Hey, that was my question! [20:05:01] ;) [20:05:13] I think I know how to do that, even. [20:05:18] Oh is this about Niklas's patch? [20:05:24] Um... once you or roan pick a system. [20:05:41] I say run it on hume [20:05:46] Yes, if Nikerabbit is the same as Niklas. [20:05:54] yep [20:05:56] same [20:06:40] New review: Andrew Bogott; "OK, now I'm convinced that this is fine as it is." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5783 [20:06:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5783 [20:08:37] New patchset: Andrew Bogott; "Turn on the translationnotifications class on Hume." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8820 [20:08:54] Ryan_Lane: ^^ should take you about 20 seconds to review [20:08:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8820 [20:09:20] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8820 [20:09:22] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8820 [20:10:28] thx [20:17:18] yw [20:43:18] New patchset: Hashar; "remove old commented out snippets" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/8863 [20:51:42] !log power cycling es1003 which has been unresponsive for 27 hours [20:54:16] RECOVERY - MySQL disk space on es1003 is OK: DISK OK [20:54:25] RECOVERY - MySQL Slave Running on es1003 is OK: OK replication [20:54:52] RECOVERY - SSH on es1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:55:51] New patchset: Hashar; "cleanup comments in httpd.conf" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/8866 [21:16:01] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 188 seconds [21:16:37] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 193 seconds [21:18:25] RECOVERY - mysqld processes on es1003 is OK: PROCS OK: 1 process with command name mysqld [21:36:07] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [21:36:34] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [22:02:04] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:09:07] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:21:16] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/8871 [22:21:37] Ryan_Lane: can you merge ^^ [22:22:01] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:22:14] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8871 [22:22:22] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8871 [22:22:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8871 [22:22:42] Ryan_Lane: thanks! [22:22:47] yw [22:23:41] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:40:28] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:53:13] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:00:16] !log stopped replication on es1002 in order to rsync cluster23 to es1003 [23:00:19] Logged the message, Master [23:01:10] PROBLEM - Puppet freshness on gurvin is CRITICAL: Puppet has not run in the last 10 hours [23:03:07] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [23:03:07] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [23:03:07] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:03:07] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [23:16:00] New patchset: Ryan Lane; "Initial commit of the new deployment system" [operations/deployment] (master) - https://gerrit.wikimedia.org/r/8732 [23:28:10] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [23:28:29] anybody interested in reviewing ^^^? [23:28:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [23:29:12] * Ryan_Lane twitches [23:29:27] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [23:29:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [23:30:11] maplebed: there's no way to make it fetch all metrics, then get each metric from a hash or something? [23:30:24] it seems really inefficient to fetch every one individually [23:30:25] there is in the docs, but it doesn't work. [23:30:30] ugh [23:30:31] there's a 'get-all' [23:30:36] gives me nothing. [23:30:40] lame [23:30:43] \o/ [23:31:00] that's 10's of commands a minute [23:31:02] but, efficiency-wise, it runs in 0.00000000 seconds and only once/minute. [23:31:08] so ... meh. [23:31:12] ah [23:31:13] ok [23:31:36] i thought we were all about python gmond plugins instead of wrappers around gmetric [23:32:07] in general they are better, yes [23:32:10] binasher: we are. except that pdns doesn't want nobody to talk to it. so rather than fighting the permissions I asked root to make the calls and wrapped gmetric. [23:32:31] have gmond call sudo, and make a sudo policy for it? [23:32:52] I still have the code to use a module, but I dislike gmetric less than I dislike having nobody call sudo for rec_control. [23:32:52] get-all is available as of 3.2 [23:33:01] ah. we're not even close to 3.2 [23:33:07] 2.9.22 [23:33:08] we're running 2.9.22-1ubuntu1 (wtf) [23:33:09] even precise doesn't have 3.2 [23:33:25] unless they updated it before release [23:33:39] nope. 3.0 [23:34:13] maplebed: if pdns runs with 'setgid nobody' the control socket will be in the nobody group with perms of 660 [23:34:53] i don't actually have any problem with the script as is, just fyi [23:35:13] k. 
[23:36:20] https://launchpad.net/ubuntu/+source/pdns [23:36:38] New review: Ryan Lane; "only one comment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [23:37:04] * Ryan_Lane sighs [23:37:10] the hooks are all screwed up in gerrit again [23:37:21] every single time I merge changes in for them [23:37:32] obviously I'm not code reviewing well enough [23:37:46] New review: Ryan Lane; "same on the new patchset." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8876 [23:38:00] screwed hooks..i think there's a movie about that [23:38:13] :D [23:38:22] does it play out like "very bad things"? [23:39:09] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [23:39:12] ugh [23:39:14] I'm going to stab [23:39:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [23:40:35] maplebed: if you're running as root, and you're using python, why not talk to the pdns control socket directly instead of invoking a shell command to do it 20 times? [23:41:13] no good reason. one bad one - I played with that a bit and couldn't make it work (probably because I don't understand sockets correctly). [23:41:55] but also, it runs in 0.006s; I'm not too concerned that I'm going to overload the host. [23:42:12] it would be better though. [23:47:33] maplebed: isn't that script sort of a vector for a linux local privilege escalation? [23:47:46] preilly: I don't think so. how? [23:48:11] maplebed: well subprocess.Popen as root [23:49:06] it runs the subprocess as whomever ran the parent process. 
[23:49:40] all of the commands its running are predefined [23:50:01] also yeah, as ryan says - no variable or user input [23:50:22] import socket / import socket / s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) / s.connect("/var/run/pdns.controlsocket") / s.send("yourcommand\n") / response = s.recv(8192) [23:50:38] binasher: I think that's what I tried... [23:50:44] did you just test that and have it work? [23:51:02] (perfectly possible that I just screwed that up, though it was what I was trying to do.) [23:53:10] maplebed: the state file could be a problem [23:53:20] is it owned by root? [23:53:24] yeah. [23:53:37] mode 644 [23:53:53] the directory isn't [23:54:09] it's owned by ganglia [23:54:24] hm. [23:54:30] the state file could be used to inject shell injection attacks [23:54:45] redundant ones at that [23:55:03] what do you mean redundant ones? [23:55:19] was a joke. my statement was redundant :) [23:55:25] oh. [23:55:38] but that's a valid bug; do you think checking file ownership as root will be sufficient? [23:55:43] no [23:55:54] maplebed: the main pds control socket is a stream type as in the example i gave, and yes it works [23:56:04] how about moving the state file /var/lib/? [23:56:04] the recursor uses a dgram socket [23:56:14] you should only run the rec_control process as root [23:56:21] everything else should run as a less privileged user [23:56:58] ganglia user would likely be good for that [23:57:35] can a gmond plugin run the rec_control stuff via sudo? [23:57:41] then you don't need gmetric [23:57:46] and don't have the worry of shell escapes [23:57:57] binasher: it doesn't work for me. the s.recv() line just hangs. [23:58:04] can I look over your shoulder? [23:58:07] though it means you need to install a sudo policy with this [23:58:29] Ryan_Lane: you don't think moving the state file to /var/lib/ is sufficient? [23:58:41] I can also scrub the input - all the values I read are integers. 
[23:58:59] matching against [0-9] will purge any shell escapes. [23:59:06] it's safer to do least privilege
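A minimal sketch of the input scrubbing discussed above, not the script under review. It assumes a hypothetical state file with one `name value` pair per line, and the helper names `parse_state` and `gmetric_argv` are made up for illustration. Values are accepted only if they consist entirely of the digits 0-9, and the gmetric call is built as an argument list so no shell is involved.

```python
import re

# Assumption: the state file holds one "metric_name value" pair per line.
_INT_RE = re.compile(r"[0-9]+")
_NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")


def parse_state(text):
    """Return {name: int}, keeping only lines whose value is all digits."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # malformed line: skip rather than guess
        name, value = parts
        if _NAME_RE.fullmatch(name) and _INT_RE.fullmatch(value):
            metrics[name] = int(value)
    return metrics


def gmetric_argv(name, value):
    """Build an argument list for gmetric; meant to be run without a shell."""
    return ["gmetric", "--name", name, "--value", str(int(value)), "--type", "uint32"]
```

Passing the resulting list to `subprocess.Popen` without `shell=True` means the values are never interpreted by a shell. This narrows the injection risk but does not replace the least-privilege approach suggested in the channel.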