[00:02:41] Ryan_Lane: So I wrote a script to do a graceful restart of the job runners, wanna review it before I run it?
[00:03:14] http://pastebin.com/NcR9CcnM
[00:03:52] AaronSchulz: --^^
[00:04:14] Of course this functionality should be written into the job runners' init script eventually
[00:04:22] * AaronSchulz looks
[00:10:10] RoanKattouw: what are the 'grep -v grep's for?
[00:10:41] oh, nvm
[00:10:45] just removes cruft
[00:11:41] !log nagios down
[00:11:43] Logged the message, Mistress of the network gear.
[00:11:44] nagios is down
[00:11:44] fyi
[00:13:26] RoanKattouw: seems mostly sane
[00:13:56] * AaronSchulz was staring at the awk/sort/head stuff
[00:15:54] Yeah it's a quickie so it's undocumented, sorry :)
[00:36:03] explosion?
[00:36:46] * AaronSchulz wonders what RoanKattouw is doing
[00:38:16] heh
[00:38:30] The "What?! No!!" thing?
[00:38:55] I was reviewing code and it had WHERE foo REGEXP CONCAT('/', bar, '/', baz, '$')
[00:43:03] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds
[00:43:21] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds
[00:43:57] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1031 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:58] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:58] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[00:45:09] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[00:47:06] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[00:53:42] RECOVERY - MySQL Slave Running on db1007 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:55:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:56:06] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 374675 seconds
[00:56:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.705 seconds
[00:57:09] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[01:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:38:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.440 seconds
[02:10:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:13:00] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:03] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:03] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:57] PROBLEM - Puppet freshness on search1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:00] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:00] PROBLEM - Puppet freshness on search1005 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:03] PROBLEM - Puppet freshness on search1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.063 seconds
[02:17:57] PROBLEM - Puppet freshness on search1008 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1009 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1010 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1011 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1012 is CRITICAL: Puppet has not run in the last 10 hours
[02:22:21] PROBLEM - Puppet freshness on search1013 is CRITICAL: Puppet has not run in the last 10 hours
[02:23:24] PROBLEM - Puppet freshness on search1014 is CRITICAL: Puppet has not run in the last 10 hours
[02:23:24] PROBLEM - Puppet freshness on search1015 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:27] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[02:25:21] PROBLEM - Puppet freshness on search1017 is CRITICAL: Puppet has not run in the last 10 hours
[02:25:21] PROBLEM - Puppet freshness on search1018 is CRITICAL: Puppet has not run in the last 10 hours
[02:26:24] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours
[02:26:24] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:27] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours
[02:28:21] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours
[02:28:21] PROBLEM - Puppet freshness on search1023 is CRITICAL: Puppet has not run in the last 10 hours
[02:29:24] PROBLEM - Puppet freshness on search1024 is CRITICAL: Puppet has not run in the last 10 hours
[02:39:18] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Wed Apr 11 02:38:52 UTC 2012
[02:39:18] RECOVERY - Puppet freshness on search1014 is OK: puppet ran at Wed Apr 11 02:39:06 UTC 2012
[02:41:24] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Wed Apr 11 02:40:55 UTC 2012
[02:41:51] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Wed Apr 11 02:41:38 UTC 2012
[02:42:18] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Wed Apr 11 02:42:08 UTC 2012
[02:42:54] RECOVERY - Puppet freshness on search1012 is OK: puppet ran at Wed Apr 11 02:42:32 UTC 2012
[02:43:21] RECOVERY - Puppet freshness on search1017 is OK: puppet ran at Wed Apr 11 02:43:12 UTC 2012
[02:43:48] RECOVERY - Puppet freshness on search1024 is OK: puppet ran at Wed Apr 11 02:43:43 UTC 2012
[02:44:24] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Wed Apr 11 02:43:58 UTC 2012
[02:46:21] RECOVERY - Puppet freshness on search1023 is OK: puppet ran at Wed Apr 11 02:46:07 UTC 2012
[02:46:21] RECOVERY - Puppet freshness on search1004 is OK: puppet ran at Wed Apr 11 02:46:10 UTC 2012
[02:47:51] PROBLEM - Host db44 is DOWN: PING CRITICAL - Packet loss = 100%
[02:48:18] RECOVERY - Puppet freshness on cp1033 is OK: puppet ran at Wed Apr 11 02:48:17 UTC 2012
[02:50:24] RECOVERY - Puppet freshness on search1008 is OK: puppet ran at Wed Apr 11 02:50:12 UTC 2012
[02:52:12] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Wed Apr 11 02:51:53 UTC 2012
[02:52:48] RECOVERY - Puppet freshness on search1007 is OK: puppet ran at Wed Apr 11 02:52:43 UTC 2012
[02:54:18] RECOVERY - Puppet freshness on cp1035 is OK: puppet ran at Wed Apr 11 02:53:48 UTC 2012
[02:55:21] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Wed Apr 11 02:55:12 UTC 2012
[02:55:48] RECOVERY - Puppet freshness on search1010 is OK: puppet ran at Wed Apr 11 02:55:27 UTC 2012
[02:56:24] RECOVERY - Puppet freshness on search1011 is OK: puppet ran at Wed Apr 11 02:56:02 UTC 2012
[02:56:51] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Wed Apr 11 02:56:43 UTC 2012
[02:57:45] RECOVERY - Puppet freshness on cp1032 is OK: puppet ran at Wed Apr 11 02:57:25 UTC 2012
[02:58:48] RECOVERY - Puppet freshness on cp1031 is OK: puppet ran at Wed Apr 11 02:58:36 UTC 2012
[02:58:48] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Wed Apr 11 02:58:43 UTC 2012
[02:59:15] RECOVERY - Puppet freshness on search1022 is OK: puppet ran at Wed Apr 11 02:58:54 UTC 2012
[03:01:21] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Wed Apr 11 03:01:11 UTC 2012
[03:02:24] RECOVERY - Puppet freshness on search1013 is OK: puppet ran at Wed Apr 11 03:02:12 UTC 2012
[03:02:33] RECOVERY - Varnish HTTP upload-frontend on cp1029 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds
[03:04:21] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Wed Apr 11 03:03:59 UTC 2012
[03:06:18] RECOVERY - Puppet freshness on search1018 is OK: puppet ran at Wed Apr 11 03:05:59 UTC 2012
[03:06:54] RECOVERY - Puppet freshness on search1009 is OK: puppet ran at Wed Apr 11 03:06:37 UTC 2012
[03:07:21] RECOVERY - Puppet freshness on search1021 is OK: puppet ran at Wed Apr 11 03:07:07 UTC 2012
[03:07:21] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Wed Apr 11 03:07:13 UTC 2012
[03:07:21] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Wed Apr 11 03:07:13 UTC 2012
[03:27:59] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:30:14] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Wed Apr 11 03:29:53 UTC 2012
[05:44:41] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[05:46:56] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[06:10:38] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[06:11:41] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[07:02:13] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[07:02:13] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[07:08:31] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[07:10:01] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[07:18:07] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[07:18:07] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[09:39:21] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[09:40:33] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[09:43:24] New patchset: Mark Bergsma; "Don't start Varnish automatically, Puppet will do this" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4685
[09:43:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4685
[09:44:18] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4685
[09:44:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4685
[09:59:29] New patchset: Mark Bergsma; "Make purging cluster-specific, to allow tweaking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4686
[09:59:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4686
[10:00:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4686
[10:00:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4686
[10:07:53] New patchset: Mark Bergsma; "Implement purging using a separate subroutine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4687
[10:08:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4687
[10:08:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4687
[10:08:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4687
[10:25:43] New patchset: Mark Bergsma; "Purge only http://upload.wikimedia.org URLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4688
[10:25:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4688
[10:26:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4688
[10:26:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4688
[10:27:55] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[10:28:13] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[10:45:01] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[10:49:22] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[10:50:34] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[10:54:28] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[10:57:55] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:17] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 183 seconds
[11:08:44] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 195 seconds
[11:09:47] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 209 seconds
[11:10:05] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 217 seconds
[11:15:29] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 183 seconds
[11:15:56] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 183 seconds
[11:15:56] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 187 seconds
[11:16:50] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 197 seconds
[11:17:17] PROBLEM - Host mw6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:18:29] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[11:20:08] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 23 seconds
[11:20:08] RECOVERY - Host mw6 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[11:20:17] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[11:21:11] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[11:21:11] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[11:23:44] PROBLEM - Apache HTTP on mw6 is CRITICAL: Connection refused
[11:25:05] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time
[11:34:08] New patchset: Dzahn; "another mail forward and url redirect for renamed list - museum-l -> glam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4693
[11:34:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4693
[11:35:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4693
[11:35:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4693
[11:49:46] New patchset: Mark Bergsma; "Switch to the persistent storage backend on cp1036 for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4696
[11:50:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4696
[11:51:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4696
[11:51:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4696
[11:58:14] !log Setup cp1036 with the persistent storage backend
[11:58:17] Logged the message, Master
[12:03:06] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[12:04:09] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[12:07:44] !log moved another list: museum-l -> glam (http://lists.wikimedia.org/pipermail/glam/2012-April/000000.html)
[12:07:45] Logged the message, Master
[12:08:09] * mutante inserts Template:PITA into the docs page he is writing
[12:26:12] PROBLEM - Varnish HTTP upload-frontend on cp1036 is CRITICAL: Connection refused
[12:33:24] RECOVERY - Varnish HTTP upload-frontend on cp1036 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[12:38:31] New patchset: Mark Bergsma; "Run varnish as user varnish instead of nobody" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4699
[12:38:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4699
[12:39:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4699
[12:39:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4699
[12:59:03] New patchset: Mark Bergsma; "Automatically restart gmond if varnish was started later" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4701
[12:59:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4701
[12:59:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4701
[12:59:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4701
[13:02:43] New patchset: Mark Bergsma; "Make restart gmond names unique" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4702
[13:02:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4702
[13:03:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4702
[13:03:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4702
[13:15:06] New patchset: Mark Bergsma; "Slightly nicer naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4703
[13:15:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4703
[13:15:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4703
[13:15:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4703
[13:35:25] !log applied patch-RT-2804.diff to bugzilla per [[BZ::731219]] re: XMLRPC content-type verification
[13:35:27] Logged the message, Master
[13:54:22] s/BZ/bugzilla.mozilla/g
[13:55:01] New review: Hashar; "Added Mark & Tim as reviewers since the symbolics links like /h/w/c/p are some old stuff :-)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4261
[13:55:54] what is the entry URL for gitweb to see all the active repositories so you can browse them?
[14:01:21] Jeff_Green: hey, you want to shoot off a quick round of tests at eqiad search just to be extra super safe before deploy?
[14:02:05] yawp. ready?
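(Editor's note on the `grep -v grep` question near the top of the log: when you grep ps output for a pattern, the grep process itself shows up and matches too, so a second `grep -v grep` strips that line back out. A minimal sketch, using canned text in place of live ps output; the job-runner command line shown is a stand-in, not the actual script's:)

```shell
# Simulated `ps` output: one real job-runner process plus the grep
# process that is searching for it.
ps_output='root  1234  php MWScript.php runJobs.php
user  5678  grep runJobs'

# Without the filter, the grep process matches its own pattern:
printf '%s\n' "$ps_output" | grep runJobs
# With `grep -v grep`, only the job-runner line survives:
printf '%s\n' "$ps_output" | grep runJobs | grep -v grep
```

A common alternative is a bracketed pattern like `grep '[r]unJobs'`: in real ps output the grep process's command line shows the bracketed pattern as typed, which the pattern itself never matches, so no second grep is needed.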
[14:02:12] should be
[14:02:18] k
[14:02:20] haven't changed anything since yesterday :)
[14:02:27] just really paranoid :)
[14:02:27] ha
[14:02:30] ok then!
[14:03:03] launched
[14:03:08] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:26] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:31] uh....
[14:03:50] hey mark, are you doin' stuff with lvs in esams?
[14:04:29] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 42980 bytes in 0.774 seconds
[14:04:47] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 51913 bytes in 0.775 seconds
[14:04:57] eh, solved
[14:05:40] notpeter: search comparison looks clean
[14:05:51] Jeff_Green: sweet! thank you
[14:05:55] np
[14:06:14] well . . . actually
[14:06:18] ja?
[14:06:32] hrm. enwiki is only 80% matches
[14:06:40] lemme look at the raw logs a bit
[14:06:53] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[14:07:08] kk
[14:08:32] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[14:10:09] the individual result scores seem pretty different for the two things I've looked at so far
[14:10:10] like
[14:10:27] sec
[14:11:07] 10.0.3.9: 3542.88 0 DVD
[14:11:07] 10.64.0.95: 849.06 0 DVD
[14:11:26] it's probably just timing, but it's a little curious that the relevance scores are so different
[14:11:45] hhhmmm
[14:11:49] that is a little worrying
[14:11:56] I'll curl around a little too
[14:11:58] ok
[14:12:02] link to some results?
[14:12:19] iron:~$ /opt/searchqa/bin/analyse_test_results /tmp/fire_in_the_hole-20120411-140257
[14:12:24] and . . .
[14:12:49] i really gotta mod this script to print the summary page to a file too
[14:13:32] http://trouser.org/searchqa.txt
[14:14:06] for the most part it's clean, but I do think it's a little odd to see that much variation on enwiki
[14:14:34] yeah
[14:14:37] that's a little odd
[14:15:28] I've been meaning to mod the script that checks the filesystem for indexes on each machine individually, such that it's trivial to see who has what indexes and how recent the files are
[14:15:52] my vote: if you curl around and are generally not surprised with the results, do the cut
[14:16:26] and meanwhile I can tweak that script to help expose issues now and going forward
[14:16:27] yeah
[14:16:32] I'm not hella worried
[14:16:38] me either
[14:16:39] but I think this does warrant some further investigation
[14:16:54] no timeouts at all, that's great
[14:32:37] Jeff_Green: these are, in fact, quite different...
[14:33:02] yeah, i'm arriving at interesting preliminary conclusions about things too
[14:33:43] almost done haxoring this script, but it looks like there's a big time difference between when the eqiad and pmtpa hosts last indexed
[14:35:44] huh
[14:37:13] oh, yes
[14:37:13] see iron:/tmp/funk
[14:37:24] the pmtpa search indexer looks pretty idle....
[14:37:43] enwiki.nspart1.sub1 20120411051509 search1003.eqiad.wmnet
[14:37:44] enwiki.nspart1.sub1 20120411061941 search3.pmtpa.wmnet
[14:38:24] oh wait, i misread that
[14:38:34] that's only an hour
[14:38:35] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=&c=Search&h=searchidx2.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[14:38:53] searchidx2 seems to have shat bed about an hour ago
[14:39:25] boo
[14:40:19] yes, it hasn't done anything for 1.25 hrs
[14:40:25] !log restarting indexer on searchidx2
[14:40:27] Logged the message, notpeter
[14:41:01] so hey! seems like a great time to switch to a newer set of indexes :)
[14:41:08] hah
[14:41:18] it does
[14:41:28] well, I'll wait my 20 minutes
[14:41:34] see what happens
[14:41:41] but, I still think that things will be good to go
[14:41:52] I mean, it seems unlikely that it all magically broke in the last 24 hours
[14:42:09] unless "it" = old infrastructure :)
[14:43:14] i'm a little puzzled why this made such a big difference on enwiki
[14:43:30] the indexes appear to be all there and at most ~2h apart
[14:43:43] how often are they supposed to generate?
[14:44:14] the regular indexes are updating all the time
[14:44:22] the numbers being so different is weird...
[14:44:24] enwiki.prefix should be borked by all accounts, that's ~1d apart
[14:44:30] but that might be part of the failure mode
[14:44:34] yeah
[14:44:58] so, everything other than the regular search index nspart[12].sub.blah
[14:45:05] is made by crom
[14:45:07] cron
[14:45:11] oic
[14:45:11] so those won't be different
[14:45:22] it's just the search rankings that are being updated all the time
[14:45:22] ok then it makes sense-ish
[14:45:29] heh
[14:45:39] so prefix will all be the same
[14:45:45] but search results will be different
[14:45:54] well
[14:46:11] interestingly, though, the .prefix index files are far apart in time:
[14:46:21] enwiki.prefix 20120411115947 search1018.eqiad.wmnet
[14:46:21] enwiki.prefix 20120410175632 search18.pmtpa.wmnet
[14:47:13] unless the apparently date-named file is not using a meaningful date
[14:48:29] hhmmmm
[14:48:30] weird
[14:48:51] well... as long as eqiad is newer! :)
[14:49:27] woot
[14:50:16] I mean, searchidx2 may have been slowly failing over the last 24 hours
[14:50:29] so it may not have successfully finished its cron
[14:50:42] yeah
[14:50:43] yeah
[14:50:49] last time that cron ran was yesterday
[14:51:02] New patchset: Jgreen; "modified search qa script fetch_search_cluster_sharding_info to report most recent lucene indexes for each search host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4722
[14:51:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4722
[14:51:41] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4722
[14:51:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4722
[14:54:46] Jeff_Green: and logs are rotating properly!
[14:54:58] rlly
[14:55:05] ja
[14:55:13] thank you for finding that thing
[14:55:19] this is much better (probably)
[14:55:42] sure. have you been able to tell whether it helped the OMGDUMPTHEDISKCACHE?!??!>>!>>! problem?
[14:56:17] nope. I didn't do it on pmtpa
[14:56:22] as pmtpa was mildly stable
[14:56:25] ah
[14:56:40] and I didn't want to mess with something that was almost falling to pieces
[14:56:50] but hey, we'll know when those logs rotate tonight! :)
[14:57:41] yes
[14:58:06] alright, I'm gonna start
[14:58:15] hey, so if you ever want a quick view of what indexes are on which machine run /opt/searchqa/bin/fetch_search_cluster_sharding_info on iron
[14:58:27] seems to run better now
[14:58:31] oh, cool!
[14:58:51] that's what created /tmp/funk
[14:59:08] it *should* be showing you the most current file found on each machine
[14:59:15] for each index it finds
[14:59:27] oh, that's super useful!
[14:59:35] ya
[14:59:42] how do you think I should do this?
[14:59:47] pool2 first?
[15:00:00] well
[15:00:06] what are the concerns?
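(Editor's note: the index-staleness comparison being discussed can be sketched directly from the "index timestamp host" lines shown above. This is not the actual fetch_search_cluster_sharding_info script, just a minimal awk pass over sample data copied from the log: flag any index whose oldest and newest copies carry different YYYYMMDD date prefixes.)

```shell
# Sample report in the format seen in /tmp/funk: index, YYYYMMDDHHMMSS
# timestamp, host. The four lines are taken verbatim from the log above.
cat > /tmp/index_report <<'EOF'
enwiki.nspart1.sub1 20120411051509 search1003.eqiad.wmnet
enwiki.nspart1.sub1 20120411061941 search3.pmtpa.wmnet
enwiki.prefix 20120411115947 search1018.eqiad.wmnet
enwiki.prefix 20120410175632 search18.pmtpa.wmnet
EOF

# Fixed-width digit timestamps sort correctly as strings, so plain
# string comparison finds the newest and oldest copy of each index.
awk '{
    if (max[$1] == "" || $2 > max[$1]) max[$1] = $2
    if (min[$1] == "" || $2 < min[$1]) min[$1] = $2
}
END {
    for (i in min)
        if (substr(max[i], 1, 8) != substr(min[i], 1, 8))
            print i, "stale copy:", min[i], "vs", max[i]
}' /tmp/index_report
# prints: enwiki.prefix stale copy: 20120410175632 vs 20120411115947
```

With this data the regular `enwiki.nspart1.sub1` copies are about an hour apart (same day, no warning), while `enwiki.prefix` is ~1 day apart, matching the diagnosis in the conversation.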
[15:00:31] if you're concerned about heating up eqiad caches we can do that beforehand
[15:00:40] well, mostly that I'm actually going to be structurally changing the conf this time
[15:00:43] to clean out some cruft
[15:00:48] and me no speak php too goodlike
[15:01:00] oic
[15:01:04] hrm
[15:01:12] I mean, I think I can do this
[15:01:17] Reedy: you around?
[15:01:18] you want to mock up the new one and I can help review it?
[15:01:26] notpeter: yup
[15:01:31] oh even better. a php non-noob
[15:01:35] hehehe
[15:01:37] PROBLEM - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 68068 MB (3% inode=99%):
[15:01:39] we have so many php wizards!
[15:01:51] can you syntax check a conf before I push it out?
[15:01:58] You mean it's magic we can get a PHP application to work? ;)
[15:01:59] sure
[15:02:05] ahahaha
[15:02:08] something like that ;)
[15:02:24] Reedy: yes, and do it without a 15 page string of expletives :-P
[15:02:30] New review: Dzahn; ""make it easier"-part sounds good. puppet looks straight-forward / just removes files and approved b..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4364
[15:02:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4364
[15:02:36] so when these go live, what's the impact on production?
[15:02:59] apergos: should be none
[15:03:10] should be the same as the test last week, which was transparent
[15:03:30] Reedy: ok, I changed lucene.php, want to look at the diff?
[15:03:31] it'll make search requests cross colos, that's an interesting change
[15:03:54] so it changes our failure modes a bit
[15:04:03] true
[15:04:16] yes, that's the sort of thing I was thinking of
[15:04:26] but so much is already cross colo, that if transit goes down... we'll have bigger problems...
[15:04:28] but people won't notice the lag we think?
notpeter: sure
[15:04:52] apergos: check this:
[15:04:55] http://trouser.org/searchqa.txt
[15:05:22] bottom section of that page has the response times for search api tests run from host iron
[15:05:43] i've been assuming iron is at eqiad, but now I'm not even sure
[15:05:50] it is
[15:06:05] ok
[15:06:36] apparently it is
[15:06:40] so it seems to me that the machines and lucene account for most of the latency
[15:07:01] yeah, it's just one http request, so not lots and lots of round trips
[15:07:09] ya
[15:07:51] we'll see soon enough
[15:07:57] heh
[15:08:56] another complicating factor in the timing is whether or not subrequests are needed within lucene
[15:09:28] I have no idea about that. zero.
[15:10:16] i have about half again that, it's somewhat opaque until you follow the logic of the routing of a particular request
[15:11:30] notpeter: where's the diff?
[15:12:17] Reedy: svn di /home/w/common/wmf-config
[15:12:22] didn't output it
[15:13:39] FYI, you can use php -l /home/w/c/wmf-config/lucene.php to do a lint check on it
[15:13:51] oh, that's smart
[15:14:25] but looks ok?
[15:14:30] Yup, looks good
[15:14:48] awesome! thank you. I'll probably ask for one more syntax/sanity check in a bit
[15:18:36] well, that appears to be working
[15:18:39] as far as I can tell
[15:25:19] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.08177476562 (gt 8.0)
[15:30:17] well, that looks good. going for pool2
[15:33:12] diederik: is bugzilla working now?
[15:33:42] hexmode: don't know, has the patch been applied?
[15:33:51] yes
[15:33:58] mutante did it
[15:34:05] cool, let me check
[15:34:16] yep diederik, applied
[15:34:26] thx!
[15:34:38] looks good?
[15:35:08] it works!
[15:35:14] thanx so much
[15:36:10] :) yw!
[15:36:47] so we're live?
[15:39:59] apergos: it fixes xmlrpc requests to bugzilla
[15:40:45] I meant this:
[15:40:47] py synchronized wmf-config/lucene.php 'pushing search pool 2 to eqiad. for realz this time!'
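(Editor's note: the `php -l` pre-push check mentioned above scripts easily into a loop that lints every config file and refuses to sync on failure. A minimal sketch: the directory and the `$wgLucenePort` setting are stand-ins written by the sketch itself, not the live wmf-config tree.)

```shell
# Write a stand-in config file so the sketch is self-contained.
mkdir -p /tmp/wmf-config
cat > /tmp/wmf-config/lucene.php <<'EOF'
<?php
$wgLucenePort = 8123; // hypothetical setting, for illustration only
EOF

# Lint every PHP config file before pushing; `php -l` exits non-zero
# on a parse error, which aborts the sync here.
if command -v php >/dev/null 2>&1; then
    for f in /tmp/wmf-config/*.php; do
        php -l "$f" || { echo "syntax error in $f; aborting sync" >&2; exit 1; }
    done
else
    echo "php CLI not installed; skipping lint"
fi
```

On success `php -l` prints "No syntax errors detected in <file>", so this catches exactly the class of mistake the pre-deploy review above was guarding against.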
[15:41:08] heh, ok, i was wondering
[15:41:10] apergos: on all of the non-english major languages
[15:41:30] guess I should go do some el pedia searches
[15:41:31] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Search+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[15:45:55] ok, en as well, no
[15:45:57] *now
[15:46:05] that ought to spike things up
[15:46:21] and the prefix (autocomplete) indexes
[15:46:27] New review: Dzahn; "based on "The fix is already in production, this" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4395
[15:48:34] yep, that's looking good
[15:54:57] Reedy: will you double check lucene.php again, plox?
[15:56:00] Looks fine
[15:56:40] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.713349370079
[15:58:33] thanks!
[16:00:18] well, ok then
[16:00:24] search is now in eqiad
[16:04:10] New patchset: Mark Bergsma; "Apply patch varnishncsa-udplog" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4728
[16:04:11] New patchset: Mark Bergsma; "Implement multiple log lines per udp log packet" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4729
[16:04:41] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4728
[16:04:52] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4728
[16:04:54] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4728
[16:16:25] who is an expert on testing puppet?
[16:16:41] * hexmode looks for Leslie
[16:17:25] mutante: got time to give me some tips on puppet?
[16:18:26] hexmode: ok, was about to reply in labs channel, i can add a class to your project i think. just got a few minutes though
[16:19:37] that is "labsconsole" stuff rather than puppet itself right
[16:20:13] mutante: but I was wanting to test w/o committing ... is there a way to do that?
[16:20:39] start my own puppet server and point to it... ?
[16:20:40] hexmode: there is something brand new coming up for that but i haven't seen it yet
[16:20:48] heh
[16:20:57] need to hear from Ryan about it
[16:21:05] ah, k.
[16:21:25] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.74042195312 (gt 8.0)
[16:27:23] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.592768031496
[16:30:25] hexmode: do you want to validate your puppet changes?
[16:30:44] lint would do that, right, hashar?
[16:31:02] !log Sending Canadian upload traffic to the eqiad varnish upload cluster
[16:31:04] Logged the message, Master
[16:31:05] gem install puppet /// then command is: puppet parser validate somepuppetfile
[16:31:28] hexmode: which is what the gerrit hook does when you submit a patchset
[16:31:45] hexmode: I prefer doing it locally to have the nice color output and avoid the waiting time
[16:50:11] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.57039875 (gt 8.0)
[16:52:26] RECOVERY - Varnish HTTP upload-backend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds
[16:54:54] !log enabling notifications for eqiad lucene vips
[16:54:56] Logged the message, notpeter
[16:56:11] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.664240944882
[17:02:47] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[17:02:47] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[17:18:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[17:18:41] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[17:41:50] New patchset: MarkAHershberger; "lint warnings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4734
[17:42:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4734
[17:43:53] apergos: have you seen http://mosh.mit.edu/?
[17:44:07] looking
[17:44:16] apergos: I've been using it to ssh back home and it's making a big difference
[17:44:24] huh good to know
[17:44:40] I'd guess it'll make a difference for the other way around too :-)
[17:44:46] how long are you there?
[17:44:55] the predictive thing is what really rocks
[17:45:00] monday
[17:45:12] it's in precise
[17:45:13] I mean, when do you leave?
[17:45:24] let's see if fedora has it
[17:45:36] so our servers will have it soon
[17:46:42] I take it it uses ssh-agent? all the usual stuff?
[17:47:16] it uses ssh to set up the connection I believe
[17:47:19] but not later
[17:47:26] yes
[17:47:38] but I haven't tried it yet, decided it wasn't worth it until it's conveniently in our distros ;)
[17:47:48] my latency isn't an issue
[17:48:05] roaming not so much either
[17:48:10] it's usually ok for me but not always
[17:48:47] yup it's in the distro (says yum installing it)
[17:50:20] nice usenix peer review :-D
[17:51:25] omg to never have to type stty sane and reset again...
[17:51:45] heh
[17:55:20] https://github.com/keithw/mosh/issues/120
[17:55:22] rats
[17:55:53] agent forwarding is evil
[17:56:22] ryan told me yesterday that I need to forward my agent for some things to work (like scripts that ssh to other machines)
[17:56:33] it's true
[17:56:58] so, evil or not, there we are
[17:58:42] <^demon> If we end up rigging some kind of deployment system with git, we might be able to remove that necessity for deployment purposes at least.
[17:59:15] that would be one positive step
[17:59:54] whoever is root on a system that you're forwarding your agent to can log in to all systems that you have access to, as you
[18:00:10] ssh-add -c helps
[18:00:13] but not much
[18:00:25] (and it's not supported by all agents, like gnome-keyring-daemon)
[18:00:41] <^demon> This is all known, but has never been high enough on anyone's priority list.
[18:00:47] <^demon> Not breaking what works, and all ;-)
[18:01:14] we figure that at the point where folks have root on the cluster, with the current setup, if they want to screw us we are screwed
[18:01:17] yeah don't forward your agent to all systems
[18:01:37] <^demon> I only forward it if I need to.
[18:01:42] me too
[18:01:47] or i just type the root pass ;)
[18:01:48] well me three
[18:01:56] but that's still not much better
[18:02:04] apergos: that means that I have to run separate agents per groups of machines
[18:02:10] yep
[18:02:13] and we do
[18:02:27] it's not the most convenient thing in the world but that's how it goes
[18:02:56] at least for now
[18:03:05] is the little script in the wiki or was it only in email ?
[18:03:23] which, ben's thing for switching?
[18:03:27] it's on wiki someplace
[18:03:28] paravoid: someone wrote a nice little script so now when i want to ssh to labs i just have it aliased and it switches my ssh key and everything
[18:03:29] yep
[18:03:36] ah yeah, ben is good about that
[18:04:02] yep
[18:04:32] ah warning, I am now officially done for the day
[18:04:39] ok
[18:04:40] :)
[18:05:07] nobody cares unless you disappear, apergos ;-)
[18:05:12] :-P
[18:05:27] it just means that when I say "no" don't be surprised :-D
[18:05:39] but we all know that you won't
[18:05:43] no means no!
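The advice in this stretch of the conversation — don't enable agent forwarding globally, forward it only to the hosts that genuinely need it — can be expressed as a client config. An illustrative sketch; the hostname is hypothetical, not the actual bastion:

```
# Illustrative ~/.ssh/config fragment: agent forwarding off by default,
# enabled only for the one bastion host that actually needs it.
Host bastion.example.org
    ForwardAgent yes

Host *
    ForwardAgent no
```

The `ssh-add -c` mitigation mentioned above complements this: keys added with `-c` require a local confirmation prompt on every use of the agent, which limits (but, as noted, does not eliminate) what a root user on the remote end can do with a forwarded socket.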
[18:06:25] in fact I have been known to say no and mean it
[18:06:37] :-P
[18:07:01] nimsoft is nowhere near as cute of a name as watchmouse
[18:08:33] you are not known to say that apergos
[18:08:36] not at all :P
[18:09:10] don't give the new guy false expectations :-P
[18:10:03] i'm totally telling the truth here
[18:10:29] I'll have you know I say no to CT after hours on a regular basis
[18:10:31] new guy: expect apergos to be online during european work hours, AND evening, and forget to eat/sleep when there's any problem reported by anyone for any severity level
[18:10:35] otherwise I would not have an after hours
[18:10:54] * apergos growls in mark's general direction
[18:14:02] paravoid: well, you shouldn't enable it in your config file
[18:14:10] paravoid: but should explicitly forward it where needed
[18:14:26] there's not too many you need to forward it to, but you do need to forward it to the bastion host
[18:17:24] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483
[18:17:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483
[18:29:32] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483
[18:29:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483
[18:30:16] * andrewbogott waves hello to paravoid
[18:30:19] welcome!
[18:44:30] andrewbogott: thanks! :-)
[18:49:53] New patchset: Lcarr; "replacing statically defined nagios nrpe checks with $USER1$ defined in resource.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4738
[18:50:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4738
[18:50:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4738
[18:50:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4738
[18:51:36] New patchset: Ryan Lane; "Adding Faidon as root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4739
[18:51:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4739
[19:03:25] diederik: what's the call-in number for this?
[19:04:19] skype?
[19:04:25] sure.
[19:04:31] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4739
[19:04:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4739
[19:05:18] hi woosters: can you join me, ben and dario on skype?
[19:06:21] sure
[19:07:13] woosters: i booked r35
[19:33:09] New patchset: Hashar; "testswarm: publish MediaWiki clone to a new dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4743
[19:33:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4743
[19:36:57] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 192 seconds
[19:37:15] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 195 seconds
[19:37:15] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 202 seconds
[19:37:51] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 211 seconds
[19:39:02] yes
[19:39:05] mistype
[19:41:00] !log reimaging bellin and blondel
[19:41:02] Logged the message, notpeter
[20:14:09] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[20:14:27] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[20:14:45] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[20:14:54] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[20:16:12] notpeter: do you know what icinga@neon.wikimedia.org is, and possibly why it wants to send me nagios alerts? (does that mean nagios@spence isn't doing it anymore?)
[20:22:54] nimish_g: icinga is a branch of nagios
[20:23:02] leslie is spinning it up on neon
[20:23:06] but I believe it's not done yet
[20:23:20] although you would have to ask her for the exact status of that project
[20:23:48] ok, will do! thanks
[20:36:06] notpeter: do you know offhand if SMS is expected to be going out from nagios or from icinga?
[20:36:35] in other words, do we care that the icinga outbound mail is bouncing internally?
[20:37:11] I'm getting the icinga mail
[20:37:29] I think that eventually it will all be switched to icinga
[20:37:31] but I'm not sure
[20:37:35] so sms would go with it
[20:38:45] you're probably getting the stuff destined to wikimedia.org b/c that's not failing the authorized-relay test, but the stuff to outside servers (i.e. @att.com) is bouncing at our outbound relays
[20:39:08] ah, gotcha
[20:46:08] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[20:59:11] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[21:00:40] New review: Hashar; "To test the results:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4743
[21:33:22] !log restarted puppet on mw1110
[21:33:24] Logged the message, Mistress of the network gear.
[21:33:30] !log restarted puppet on db30
[21:33:32] Logged the message, Mistress of the network gear.
[21:33:35] !log db1004 puppet is fubar
[21:33:38] Logged the message, Mistress of the network gear.
[21:35:45] !log restarted nrpe on db10
[21:35:47] Logged the message, Mistress of the network gear.
[21:36:53] LeslieCarr: so, there's this really weird thing about mw1110...
[21:36:58] it thinks that it's a search node
[21:37:03] and I can't figure out why
[21:37:11] heh
[21:37:15] interesting
[21:37:17] but it went and added itself to ganglia, nagios, etc
[21:37:21] yeah
[21:37:54] hrm
[21:38:08] did you already try a puppetstoredconfigsclean
[21:38:15] and adding it to the decom list then removing it ?
[21:38:20] those are my two usual quick fixes
[21:38:26] (my equivalent of turning it off and on)
[21:38:35] heh, I did not
[21:38:51] I was just assuming that once it got a real role assigned to it, it would pick that up
[21:39:08] as those boxxies are not currently in site.pp
[21:39:11] ah
[21:39:37] could just assign them boring "standard" and have them not really do anything
[21:40:06] or we could finish spinning them up :)
[21:40:20] then it would just be down to no memcache in eqiad...
[21:40:51] oh yes
[21:40:53] that would be better
[21:40:55] much better
[21:59:53] New patchset: Bhartshorne; "craete a class for swift clients" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4756
[22:00:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4756
[22:00:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4756
[22:00:29] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4756
[22:21:47] New patchset: Diederik; "Wikipedia Zero filters for Orange Uganda, Orange Tunesia and Telenor Montenegro" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4758
[22:22:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4758
[22:22:27] hey maplebed: i am pushing my luck :) but i've got three more wikipedia zero filters
[22:23:17] pusher.
[22:23:56] fortunately, they are not super computationally intensive
[22:24:32] I'll say again what I said before - the trigger for overwhelming the host is not the individual filter but the number.
[22:25:07] the reason for that is that udp2log must hand off packets to all filters that need them for every packet that comes in.
[22:25:27] that loop is what leads to dropped packets, not the computational intensity of crunching those packets once they're handed off to the filter.
[22:25:45] so yay not computationally intensive, but... it doesn't actually matter.
[22:25:51] :)
[22:26:01] why is diederik smiling?
[22:26:14] i am always smiling
[22:26:26] ok then
[22:26:31] but thanks for the explanation maplebed
[22:27:16] yeah, that changeset looks ok.
[22:27:32] diederik: will you watch ganglia for me for the next hour and make sure emery doesn't fall over?
[22:27:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4758
[22:27:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4758
[22:28:03] sure (if i can find emery)
[22:28:18] I dropped the link in our skype chat earlier.
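maplebed's point about why the filter *count* matters more than each filter's cost can be sketched as a toy model. This is not the real udp2log code, just an illustration of the per-packet fan-out he describes:

```python
# Toy model of the udp2log dispatch loop described above: every
# incoming packet must be handed to every configured filter, so the
# total handoff work scales with the number of filters, no matter
# how cheap each individual filter is.
def dispatch(packets, filters):
    handoffs = 0
    for pkt in packets:
        for f in filters:  # this per-packet fan-out is the bottleneck
            f(pkt)
            handoffs += 1
    return handoffs

cheap_filter = lambda pkt: None  # "not computationally intensive"

# 1000 packets through 10 cheap filters is still 10000 handoffs;
# making each filter cheaper does not shrink this loop.
total = dispatch(range(1000), [cheap_filter] * 10)
```

When the fan-out loop cannot keep up with the incoming packet rate, packets are dropped regardless of how little work each filter does once it receives them — which is why adding filters, not the filters' contents, is what risks overwhelming the host.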
[22:29:48] http://ganglia.wikimedia.org/search/ gives a 404
[22:30:56] don't hit enter.
[22:30:59] it's a weird interface.
[22:31:09] enter your search term then wait a sec and click on the result.
[22:31:37] okay, thanks i got the charts
[22:37:27] !log deployed more log filters to emery: gerrit/r4758
[22:37:30] Logged the message, Master
[22:41:23] New patchset: Lcarr; "Fixing mysql fundraising checks to use nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4759
[22:41:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4759
[22:41:47] notpeter: do you know about powercycling older dbs?
[22:42:03] Jeff_Green: https://gerrit.wikimedia.org/r/#change,4759 ?
[22:42:09] I need to kick db44 and the instructions at http://wikitech.wikimedia.org/view/Sun_Fire_X4240 are failing me.
[22:42:10] fixing/updating mysql checks
[22:42:13] or maybe Ryan_Lane
[22:43:25] those should work
[22:43:27] what's failing?
[22:44:06] status_tag : COMMAND PROCESSING FAILED
[22:44:06] error : 246
[22:44:07] error_tag : INVALID TARGET
[22:44:25] super helpful message.
[22:47:40] platform set power state cycle ?
[22:47:46] without the ?
[22:48:03] command not recognized.
[22:48:14] no clue, then.
[22:48:30] set power state cycle -> syntax error.
[22:50:42] would you mind logging in to the mgmt interface for db44 for a moment and see if at least the prompt etc. look normal?
[22:50:55] ( Ryan_Lane )
[22:51:11] * Ryan_Lane doesn't know what a normal prompt looks like
[22:51:19] I've logged into a sun maybe twice
[22:51:33] that's twice more than me.
[22:51:39] ok
[22:51:40] sec
[22:51:49] and our docs only give the command, not the prompt,
[22:51:55] so you can't see what it's supposed to look like.
[22:51:56] :(
[22:52:14] this is a dell...
[22:52:47] racadm serveraction powercycle
[22:52:49] grrr...
[22:53:30] I thought the dells gave you a $ for the prompt.
[22:53:36] ::sigh::
[22:53:41] depends on the version of drac
[22:56:04] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[22:56:11] well there goes my 'it's not the dell prompt I know so it must be a sun' logic.
[22:57:07] heh
[22:59:49] New review: Tim Starling; "I think Jeronim added them." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4261
[23:01:37] RECOVERY - Host db44 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[23:05:40] PROBLEM - mysqld processes on db44 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:07:10] RECOVERY - mysqld processes on db44 is OK: PROCS OK: 1 process with command name mysqld
[23:09:34] PROBLEM - jenkins_service_running on gilman is CRITICAL: Connection refused by host
[23:09:34] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 65968 seconds
[23:10:26] hrm … so for some reason nrpe-server on gilman keeps reloading with the old version of the config file
[23:10:34] even though the config file has changed a while ago
[23:10:37] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 63898 seconds
[23:10:39] any thoughts why this could happen ?
[23:10:55] RECOVERY - jenkins_service_running on gilman is OK: PROCS OK: 3 processes with args jenkins
[23:21:36] Has puppet actually updated it? And is it actually loading the config file that you think it is?
[23:26:05] it has been updated and even loading it manually it fails
[23:26:13] i tried moving the file and making puppet put it back there again