[00:02:41] Ryan_Lane: So I wrote a script to do a graceful restart of the job runners, wanna review it before I run it?
[00:03:14] http://pastebin.com/NcR9CcnM
[00:03:52] AaronSchulz: --^^
[00:04:14] Of course this functionality should be written into the job runners' init script eventually
[00:04:22] * AaronSchulz looks
[00:10:10] RoanKattouw: what are the 'grep -v grep's for?
[00:10:41] oh, nvm
[00:10:45] just removes cruft
[00:11:41] !log nagios down
[00:11:43] Logged the message, Mistress of the network gear.
[00:11:44] nagios is down
[00:11:44] fyi
[00:13:26] RoanKattouw: seems mostly sane
[00:13:56] * AaronSchulz was staring at the awk/sort/head stuff
[00:15:54] Yeah it's a quickie so it's undocumented, sorry :)
[00:36:03] explosion?
[00:36:46] * AaronSchulz wonders what RoanKattouw is doing
[00:38:16] heh
[00:38:30] The "What?! No!!" thing?
[00:38:55] I was reviewing code and it had WHERE foo REGEXP CONCAT('/', bar, '/', baz, '$')
[00:43:03] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds
[00:43:21] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds
[00:43:57] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1031 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:57] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:58] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[00:43:58] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[00:45:09] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[00:47:06] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[00:53:42] RECOVERY - MySQL Slave Running on db1007 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:55:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:56:06] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 374675 seconds
[00:56:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.705 seconds
[00:57:09] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[01:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:38:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.440 seconds
[02:10:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:13:00] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:03] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:03] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:57] PROBLEM - Puppet freshness on search1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:00] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:00] PROBLEM - Puppet freshness on search1005 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:03] PROBLEM - Puppet freshness on search1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.063 seconds
[02:17:57] PROBLEM - Puppet freshness on search1008 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1009 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1010 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1011 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:09] PROBLEM - Puppet freshness on search1012 is CRITICAL: Puppet has not run in the last 10 hours
[02:22:21] PROBLEM - Puppet freshness on search1013 is CRITICAL: Puppet has not run in the last 10 hours
[02:23:24] PROBLEM - Puppet freshness on search1014 is CRITICAL: Puppet has not run in the last 10 hours
[02:23:24] PROBLEM - Puppet freshness on search1015 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:27] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[02:25:21] PROBLEM - Puppet freshness on search1017 is CRITICAL: Puppet has not run in the last 10 hours
[02:25:21] PROBLEM - Puppet freshness on search1018 is CRITICAL: Puppet has not run in the last 10 hours
[02:26:24] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours
[02:26:24] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:27] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours
[02:28:21] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours
[02:28:21] PROBLEM - Puppet freshness on search1023 is CRITICAL: Puppet has not run in the last 10 hours
[02:29:24] PROBLEM - Puppet freshness on search1024 is CRITICAL: Puppet has not run in the last 10 hours
[02:39:18] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Wed Apr 11 02:38:52 UTC 2012
[02:39:18] RECOVERY - Puppet freshness on search1014 is OK: puppet ran at Wed Apr 11 02:39:06 UTC 2012
[02:41:24] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Wed Apr 11 02:40:55 UTC 2012
[02:41:51] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Wed Apr 11 02:41:38 UTC 2012
[02:42:18] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Wed Apr 11 02:42:08 UTC 2012
[02:42:54] RECOVERY - Puppet freshness on search1012 is OK: puppet ran at Wed Apr 11 02:42:32 UTC 2012
[02:43:21] RECOVERY - Puppet freshness on search1017 is OK: puppet ran at Wed Apr 11 02:43:12 UTC 2012
[02:43:48] RECOVERY - Puppet freshness on search1024 is OK: puppet ran at Wed Apr 11 02:43:43 UTC 2012
[02:44:24] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Wed Apr 11 02:43:58 UTC 2012
[02:46:21] RECOVERY - Puppet freshness on search1023 is OK: puppet ran at Wed Apr 11 02:46:07 UTC 2012
[02:46:21] RECOVERY - Puppet freshness on search1004 is OK: puppet ran at Wed Apr 11 02:46:10 UTC 2012
[02:47:51] PROBLEM - Host db44 is DOWN: PING CRITICAL - Packet loss = 100%
[02:48:18] RECOVERY - Puppet freshness on cp1033 is OK: puppet ran at Wed Apr 11 02:48:17 UTC 2012
[02:50:24] RECOVERY - Puppet freshness on search1008 is OK: puppet ran at Wed Apr 11 02:50:12 UTC 2012
[02:52:12] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Wed Apr 11 02:51:53 UTC 2012
[02:52:48] RECOVERY - Puppet freshness on search1007 is OK: puppet ran at Wed Apr 11 02:52:43 UTC 2012
[02:54:18] RECOVERY - Puppet freshness on cp1035 is OK: puppet ran at Wed Apr 11 02:53:48 UTC 2012
[02:55:21] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Wed Apr 11 02:55:12 UTC 2012
[02:55:48] RECOVERY - Puppet freshness on search1010 is OK: puppet ran at Wed Apr 11 02:55:27 UTC 2012
[02:56:24] RECOVERY - Puppet freshness on search1011 is OK: puppet ran at Wed Apr 11 02:56:02 UTC 2012
[02:56:51] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Wed Apr 11 02:56:43 UTC 2012
[02:57:45] RECOVERY - Puppet freshness on cp1032 is OK: puppet ran at Wed Apr 11 02:57:25 UTC 2012
[02:58:48] RECOVERY - Puppet freshness on cp1031 is OK: puppet ran at Wed Apr 11 02:58:36 UTC 2012
[02:58:48] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Wed Apr 11 02:58:43 UTC 2012
[02:59:15] RECOVERY - Puppet freshness on search1022 is OK: puppet ran at Wed Apr 11 02:58:54 UTC 2012
[03:01:21] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Wed Apr 11 03:01:11 UTC 2012
[03:02:24] RECOVERY - Puppet freshness on search1013 is OK: puppet ran at Wed Apr 11 03:02:12 UTC 2012
[03:02:33] RECOVERY - Varnish HTTP upload-frontend on cp1029 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds
[03:04:21] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Wed Apr 11 03:03:59 UTC 2012
[03:06:18] RECOVERY - Puppet freshness on search1018 is OK: puppet ran at Wed Apr 11 03:05:59 UTC 2012
[03:06:54] RECOVERY - Puppet freshness on search1009 is OK: puppet ran at Wed Apr 11 03:06:37 UTC 2012
[03:07:21] RECOVERY - Puppet freshness on search1021 is OK: puppet ran at Wed Apr 11 03:07:07 UTC 2012
[03:07:21] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Wed Apr 11 03:07:13 UTC 2012
[03:07:21] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Wed Apr 11 03:07:13 UTC 2012
[03:27:59] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:30:14] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Wed Apr 11 03:29:53 UTC 2012
[05:44:41] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[05:46:56] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[06:10:38] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[06:11:41] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[07:02:13] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[07:02:13] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[07:08:31] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[07:10:01] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[07:18:07] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[07:18:07] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[09:39:21] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[09:40:33] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[09:43:24] New patchset: Mark Bergsma; "Don't start Varnish automatically, Puppet will do this" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4685
[09:43:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4685
[09:44:18] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4685
[09:44:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4685
[09:59:29] New patchset: Mark Bergsma; "Make purging cluster-specific, to allow tweaking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4686
[09:59:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4686
[10:00:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4686
[10:00:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4686
[10:07:53] New patchset: Mark Bergsma; "Implement purging using a separate subroutine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4687
[10:08:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4687
[10:08:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4687
[10:08:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4687
[10:25:43] New patchset: Mark Bergsma; "Purge only http://upload.wikimedia.org URLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4688
[10:25:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4688
[10:26:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4688
[10:26:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4688
[10:27:55] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[10:28:13] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[10:45:01] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[10:49:22] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[10:50:34] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[10:54:28] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[10:57:55] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:17] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 183 seconds
[11:08:44] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 195 seconds
[11:09:47] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 209 seconds
[11:10:05] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 217 seconds
[11:15:29] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 183 seconds
[11:15:56] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 183 seconds
[11:15:56] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 187 seconds
[11:16:50] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 197 seconds
[11:17:17] PROBLEM - Host mw6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:18:29] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[11:20:08] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 23 seconds
[11:20:08] RECOVERY - Host mw6 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[11:20:17] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[11:21:11] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[11:21:11] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[11:23:44] PROBLEM - Apache HTTP on mw6 is CRITICAL: Connection refused
[11:25:05] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time
[11:34:08] New patchset: Dzahn; "another mail forward and url redirect for renamed list - museum-l -> glam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4693
[11:34:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4693
[11:35:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4693
[11:35:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4693
[11:49:46] New patchset: Mark Bergsma; "Switch to the persistent storage backend on cp1036 for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4696
[11:50:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4696
[11:51:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4696
[11:51:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4696
[11:58:14] !log Setup cp1036 with the persistent storage backend
[11:58:17] Logged the message, Master
[12:03:06] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[12:04:09] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[12:07:44] !log moved another list: museum-l -> glam (http://lists.wikimedia.org/pipermail/glam/2012-April/000000.html)
[12:07:45] Logged the message, Master
[12:08:09] * mutante inserts Template:PITA into the docs page he is writing
[12:26:12] PROBLEM - Varnish HTTP upload-frontend on cp1036 is CRITICAL: Connection refused
[12:33:24] RECOVERY - Varnish HTTP upload-frontend on cp1036 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[12:38:31] New patchset: Mark Bergsma; "Run varnish as user varnish instead of nobody" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4699
[12:38:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4699
[12:39:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4699
[12:39:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4699
[12:59:03] New patchset: Mark Bergsma; "Automatically restart gmond if varnish was started later" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4701
[12:59:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4701
[12:59:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4701
[12:59:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4701
[13:02:43] New patchset: Mark Bergsma; "Make restart gmond names unique" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4702
[13:02:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4702
[13:03:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4702
[13:03:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4702
[13:15:06] New patchset: Mark Bergsma; "Slightly nicer naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4703
[13:15:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4703
[13:15:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4703
[13:15:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4703
[13:35:25] !log applied patch-RT-2804.diff to bugzilla per [[BZ::731219]] re: XMLRPC content-type verification
[13:35:27] Logged the message, Master
[13:54:22] s/BZ/bugzilla.mozilla/g
[13:55:01] New review: Hashar; "Added Mark & Tim as reviewers since the symbolics links like /h/w/c/p are some old stuff :-)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4261
[13:55:54] what is the entry URL for gitweb to see all the active repositories so you can browse them?
[14:01:21] Jeff_Green: hey, you want to shoot off a quick round of tests at eqiad search just to be extra super safe before deploy?
[14:02:05] yawp. ready?
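(Editor's note on the `grep -v grep` question near the top of the log: when you grep ps output for a pattern, the grep process itself shows up and matches too, so a second `grep -v grep` strips that line back out. A minimal sketch, using canned text in place of live ps output; the job-runner command line shown is a stand-in, not the actual script's:)

```shell
# Simulated `ps` output: one real job-runner process plus the grep
# process that is searching for it.
ps_output='root  1234  php MWScript.php runJobs.php
user  5678  grep runJobs'

# Without the filter, the grep process matches its own pattern:
printf '%s\n' "$ps_output" | grep runJobs
# With `grep -v grep`, only the job-runner line survives:
printf '%s\n' "$ps_output" | grep runJobs | grep -v grep
```

A common alternative is a bracketed pattern like `grep '[r]unJobs'`: in real ps output the grep process's command line shows the bracketed pattern as typed, which the pattern itself never matches, so no second grep is needed.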
[14:02:12] should be
[14:02:18] k
[14:02:20] haven't changed anything since yesterday :)
[14:02:27] just really paranoid :)
[14:02:27] ha
[14:02:30] ok then!
[14:03:03] launched
[14:03:08] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:26] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:31] uh....
[14:03:50] hey mark, are you doin' stuff with lvs in esams?
[14:04:29] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 42980 bytes in 0.774 seconds
[14:04:47] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 51913 bytes in 0.775 seconds
[14:04:57] eh, solved
[14:05:40] notpeter: search comparison looks clean
[14:05:51] Jeff_Green: sweet! thank you
[14:05:55] np
[14:06:14] well . . . actually
[14:06:18] ja?
[14:06:32] hrm. enwiki is only 80% matches
[14:06:40] lemme look at the raw logs a bit
[14:06:53] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[14:07:08] kk
[14:08:32] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[14:10:09] the individual result scores seem pretty different for the two things I've looked at so far
[14:10:10] like
[14:10:27] sec
[14:11:07] 10.0.3.9: 3542.88 0 DVD
[14:11:07] 10.64.0.95: 849.06 0 DVD
[14:11:26] it's probably just timing, but it's a little curious that the relevance scores are so different
[14:11:45] hhhmmm
[14:11:49] that is a little worrying
[14:11:56] I'll curl around a little too
[14:11:58] ok
[14:12:02] link to some results?
[14:12:19] iron:~$ /opt/searchqa/bin/analyse_test_results /tmp/fire_in_the_hole-20120411-140257
[14:12:24] and . . .
[14:12:49] i really gotta mod this script to print the summary page to a file too
[14:13:32] http://trouser.org/searchqa.txt
[14:14:06] for the most part it's clean, but I do think it's a little odd to see that much variation on enwiki
[14:14:34] yeah
[14:14:37] that's a little odd
[14:15:28] I've been meaning to mod the script that checks the filesystem for indexes on each machine individually, such that it's trivial to see who has what indexes and how recent the files are
[14:15:52] my vote: if you curl around and are generally not surprised with the results, do the cut
[14:16:26] and meanwhile I can tweak that script to help expose issues now and going forward
[14:16:27] yeah
[14:16:32] I'm not hella worried
[14:16:38] me either
[14:16:39] but I think this does warrant some further investigation
[14:16:54] no timeouts at all, that's great
[14:32:37] Jeff_Green: these are, in fact, quite different...
[14:33:02] yeah, i'm arriving at interesting preliminary conclusions about things too
[14:33:43] almost done haxoring this script, but it looks like there's a big time difference between when the eqiad and pmtpa hosts last indexed
[14:35:44] huh
[14:37:13] oh, yes
[14:37:13] see iron:/tmp/funk
[14:37:24] the pmtpa search indexer looks pretty idle....
[14:37:43] enwiki.nspart1.sub1 20120411051509 search1003.eqiad.wmnet
[14:37:44] enwiki.nspart1.sub1 20120411061941 search3.pmtpa.wmnet
[14:38:24] oh wait, i misread that
[14:38:34] that's only an hour
[14:38:35] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=&c=Search&h=searchidx2.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[14:38:53] searchidx2 seems to have shat bed about an hour ago
[14:39:25] boo
[14:40:19] yes, it hasn't done anything for 1.25 hrs
[14:40:25] !log restarting indexer on searchidx2
[14:40:27] Logged the message, notpeter
[14:41:01] so hey! seems like a great time to switch to a newer set of indexes :)
[14:41:08] hah
[14:41:18] it does
[14:41:28] well, I'll wait my 20 minutes
[14:41:34] see what happens
[14:41:41] but, I still think that things will be good to go
[14:41:52] I mean, it seems unlikely that it all magically broke in the last 24 hours
[14:42:09] unless "it" = old infrastructure :)
[14:43:14] i'm a little puzzled why this made such a big difference on enwiki
[14:43:30] the indexes appear to be all there and at most ~2h apart
[14:43:43] how often are they supposed to generate?
[14:44:14] the regular indexes are updating all the time
[14:44:22] the numbers being so different is weird...
[14:44:24] enwiki.prefix should be borked by all accounts, that's ~1d apart
[14:44:30] but that might be part of the failure mode
[14:44:34] yeah
[14:44:58] so, everything other than the regular search index nspart[12].sub.blah
[14:45:05] is made by crom
[14:45:07] cron
[14:45:11] oic
[14:45:11] so those won't be different
[14:45:22] it's just the search rankings that are being updated all the time
[14:45:22] ok then it makes sense-ish
[14:45:29] heh
[14:45:39] so prefix will all be the same
[14:45:45] but search results will be different
[14:45:54] well
[14:46:11] interestingly, though, the .prefix index files are far apart in time:
[14:46:21] enwiki.prefix 20120411115947 search1018.eqiad.wmnet
[14:46:21] enwiki.prefix 20120410175632 search18.pmtpa.wmnet
[14:47:13] unless the apparently date-named file is not using a meaningful date
[14:48:29] hhmmmm
[14:48:30] weird
[14:48:51] well... as long as eqiad is newer! :)
[14:49:27] woot
[14:50:16] I mean, searchidx2 may have been slowly failing over the last 24 hours
[14:50:29] so it may not have successfully finished its cron
[14:50:42] yeah
[14:50:43] yeah
[14:50:49] last time that cron ran was yesterday
[14:51:02] New patchset: Jgreen; "modified search qa script fetch_search_cluster_sharding_info to report most recent lucene indexes for each search host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4722
[14:51:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4722
[14:51:41] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4722
[14:51:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4722
[14:54:46] Jeff_Green: and logs are rotating properly!
[14:54:58] rlly
[14:55:05] ja
[14:55:13] thank you for finding that thing
[14:55:19] this is much better (probably)
[14:55:42] sure. have you been able to tell whether it helped the OMGDUMPTHEDISKCACHE?!??!>>!>>! problem?
[14:56:17] nope. I didn't do it on pmtpa
[14:56:22] as pmtpa was mildly stable
[14:56:25] ah
[14:56:40] and I didn't want to mess with something that was almost falling to pieces
[14:56:50] but hey, we'll know when those logs rotate tonight! :)
[14:57:41] yes
[14:58:06] alright, I'm gonna start
[14:58:15] hey, so if you ever want a quick view of what indexes are on which machine run /opt/searchqa/bin/fetch_search_cluster_sharding_info on iron
[14:58:27] seems to run better now
[14:58:31] oh, cool!
[14:58:51] that's what created /tmp/funk
[14:59:08] it *should* be showing you the most current file found on each machine
[14:59:15] for each index it finds
[14:59:27] oh, that's super useful!
[14:59:35] ya
[14:59:42] how do you think I should do this?
[14:59:47] pool2 first?
[15:00:00] well
[15:00:06] what are the concerns?
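(Editor's note: the index-staleness comparison being discussed can be sketched directly from the "index timestamp host" lines shown above. This is not the actual fetch_search_cluster_sharding_info script, just a minimal awk pass over sample data copied from the log: flag any index whose oldest and newest copies carry different YYYYMMDD date prefixes.)

```shell
# Sample report in the format seen in /tmp/funk: index, YYYYMMDDHHMMSS
# timestamp, host. The four lines are taken verbatim from the log above.
cat > /tmp/index_report <<'EOF'
enwiki.nspart1.sub1 20120411051509 search1003.eqiad.wmnet
enwiki.nspart1.sub1 20120411061941 search3.pmtpa.wmnet
enwiki.prefix 20120411115947 search1018.eqiad.wmnet
enwiki.prefix 20120410175632 search18.pmtpa.wmnet
EOF

# Fixed-width digit timestamps sort correctly as strings, so plain
# string comparison finds the newest and oldest copy of each index.
awk '{
    if (max[$1] == "" || $2 > max[$1]) max[$1] = $2
    if (min[$1] == "" || $2 < min[$1]) min[$1] = $2
}
END {
    for (i in min)
        if (substr(max[i], 1, 8) != substr(min[i], 1, 8))
            print i, "stale copy:", min[i], "vs", max[i]
}' /tmp/index_report
# prints: enwiki.prefix stale copy: 20120410175632 vs 20120411115947
```

With this data the regular `enwiki.nspart1.sub1` copies are about an hour apart (same day, no warning), while `enwiki.prefix` is ~1 day apart, matching the diagnosis in the conversation.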
[15:00:31] if you're concerned about heating up eqiad caches we can do that beforehand
[15:00:40] well, mostly that I'm actually going to be structurally changing the conf this time
[15:00:43] to clean out some cruft
[15:00:48] and me no speak php too goodlike
[15:01:00] oic
[15:01:04] hrm
[15:01:12] I mean, I think I can do this
[15:01:17] Reedy: you around?
[15:01:18] you want to mock up the new one and I can help review it?
[15:01:26] notpeter: yup
[15:01:31] oh even better. a php non-noob
[15:01:35] hehehe
[15:01:37] PROBLEM - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 68068 MB (3% inode=99%):
[15:01:39] we have so many php wizards!
[15:01:51] can you syntax check a conf before I push it out?
[15:01:58] You mean it's magic we can get a PHP application to work? ;)
[15:01:59] sure
[15:02:05] ahahaha
[15:02:08] something like that ;)
[15:02:24] Reedy: yes, and do it without a 15 page string of expletives :-P
[15:02:30] New review: Dzahn; ""make it easier"-part sounds good. puppet looks straight-forward / just removes files and approved b..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4364
[15:02:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4364
[15:02:36] so when these go live, what's the impact on production?
[15:02:59] apergos: should be none
[15:03:10] should be the same as the test last week, which was transparent
[15:03:30] Reedy: ok, I changed lucene.php, want to look at the diff?
[15:03:31] it'll make search requests cross colos, that's an interesting change
[15:03:54] so it changes our failure modes a bit
[15:04:03] true
[15:04:16] yes, that's the sort of thing I was thinking of
[15:04:26] but so much is already cross colo, that if transit goes down... we'll have bigger problems...
[15:04:28] but people won't notice the lag we think?
notpeter: sure
[15:04:52] apergos: check this:
[15:04:55] http://trouser.org/searchqa.txt
[15:05:22] bottom section of that page has the response times for search api tests run from host iron
[15:05:43] i've been assuming iron is at eqiad, but now I'm not even sure
[15:05:50] it is
[15:06:05] ok
[15:06:36] apparently it is
[15:06:40] so it seems to me that the machines and lucene account for most of the latency
[15:07:01] yeah, it's just one http request, so not lots and lots of round trips
[15:07:09] ya
[15:07:51] we'll see soon enough
[15:07:57] heh
[15:08:56] another complicating factor in the timing is whether or not subrequests are needed within lucene
[15:09:28] I have no idea about that. zero.
[15:10:16] i have about half again that, it's somewhat opaque until you follow the logic of the routing of a particular request
[15:11:30] notpeter: where's the diff?
[15:12:17] Reedy: svn di /home/w/common/wmf-config
[15:12:22] didn't output it
[15:13:39] FYI, you can use php -l /home/w/c/wmf-config/lucene.php to do a lint check on it
[15:13:51] oh, that's smart
[15:14:25] but looks ok?
[15:14:30] Yup, looks good
[15:14:48] awesome! thank you. I'll probably ask for one more syntax/sanity check in a bit
[15:18:36] well, that appears to be working
[15:18:39] as far as I can tell
[15:25:19] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.08177476562 (gt 8.0)
[15:30:17] well, that looks good. going for pool2
[15:33:12] diederik: is bugzilla working now?
[15:33:42] hexmode: don't know, has the patch been applied?
[15:33:51] yes
[15:33:58] mutante did it
[15:34:05] cool, let me check
[15:34:16] yep diederik, applied
[15:34:26] thx!
[15:34:38] looks good?
[15:35:08] it works!
[15:35:14] thanx so much
[15:36:10] :) yw!
[15:36:47] so we're live?
[15:39:59] apergos: it fixes xmlrpc requests to bugzilla
[15:40:45] I meant this:
[15:40:47] py synchronized wmf-config/lucene.php 'pushing search pool 2 to eqiad. for realz this time!'
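(Editor's note: the `php -l` pre-push check mentioned above scripts easily into a loop that lints every config file and refuses to sync on failure. A minimal sketch: the directory and the `$wgLucenePort` setting are stand-ins written by the sketch itself, not the live wmf-config tree.)

```shell
# Write a stand-in config file so the sketch is self-contained.
mkdir -p /tmp/wmf-config
cat > /tmp/wmf-config/lucene.php <<'EOF'
<?php
$wgLucenePort = 8123; // hypothetical setting, for illustration only
EOF

# Lint every PHP config file before pushing; `php -l` exits non-zero
# on a parse error, which aborts the sync here.
if command -v php >/dev/null 2>&1; then
    for f in /tmp/wmf-config/*.php; do
        php -l "$f" || { echo "syntax error in $f; aborting sync" >&2; exit 1; }
    done
else
    echo "php CLI not installed; skipping lint"
fi
```

On success `php -l` prints "No syntax errors detected in <file>", so this catches exactly the class of mistake the pre-deploy review above was guarding against.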
[15:41:08] heh, ok, i was wondering
[15:41:10] apergos: on all of the non-english major languages
[15:41:30] guess I should go do some el pedia searches
[15:41:31] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Search+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[15:45:55] ok, en as well, no
[15:45:57] *now
[15:46:05] that ought to spike things up
[15:46:21] and the prefix (autocomplete) indexes
[15:46:27] New review: Dzahn; "based on "The fix is already in production, this" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4395
[15:48:34] yep, that's looking good
[15:54:57] Reedy: will you double check lucene.php again, plox?
[15:56:00] Looks fine
[15:56:40] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.713349370079
[15:58:33] thanks!
[16:00:18] well, ok then
[16:00:24] search is now in eqiad
[16:04:10] New patchset: Mark Bergsma; "Apply patch varnishncsa-udplog" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4728
[16:04:11] New patchset: Mark Bergsma; "Implement multiple log lines per udp log packet" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4729
[16:04:41] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4728
[16:04:52] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4728
[16:04:54] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4728
[16:16:25] who is an expert on testing puppet?
[16:16:41] * hexmode looks for Leslie
[16:17:25] mutante: got time to give me some tips on puppet?
[16:18:26] hexmode: ok, was about to reply in labs channel, i can add a class to your project i think. just got a few minutes though
[16:19:37] that is "labsconsole" stuff rather than puppet itself right
[16:20:13] mutante: but I was wanting to test w/o committing ... is there a way to do that?
[16:20:39] start my own puppet server and point to it... ?
[16:20:40] hexmode: there is something brand new coming up for that but i haven't seen it yet
[16:20:48] heh
[16:20:57] need to hear from Ryan about it
[16:21:05] ah, k.
[16:21:25] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.74042195312 (gt 8.0)
[16:27:23] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.592768031496
[16:30:25] hexmode: do you want to validate your puppet changes?
[16:30:44] lint would do that, right, hashar?
[16:31:02] !log Sending Canadian upload traffic to the eqiad varnish upload cluster
[16:31:04] Logged the message, Master
[16:31:05] gem install puppet /// then command is: puppet parser validate somepuppetfile
[16:31:28] hexmode: which is what the gerrit hook does when you submit a patchset
[16:31:45] hexmode: I prefer doing it locally to have the nice color output and avoid the waiting time
[16:50:11] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.57039875 (gt 8.0)
[16:52:26] RECOVERY - Varnish HTTP upload-backend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds
[16:54:54] !log enabling notifications for eqiad lucene vips
[16:54:56] Logged the message, notpeter
[16:56:11] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.664240944882
[17:02:47] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[17:02:47] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[17:18:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[17:18:41] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[17:41:50] New patchset: MarkAHershberger; "lint warnings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4734
[17:42:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4734
[17:43:53] apergos: have you seen http://mosh.mit.edu/?
[17:44:07] looking
[17:44:16] apergos: I've been using it to ssh back home and it's making a big difference
[17:44:24] huh good to know
[17:44:40] I'd guess it'll make a difference for the other way around too :-)
[17:44:46] how long are you there?
[17:44:55] the predictive thing is what really rocks
[17:45:00] monday
[17:45:12] it's in precise
[17:45:13] I mean, when do you leave?
[17:45:24] let's see if fedora has it
[17:45:36] so our servers will have it soon
[17:46:42] I take it it uses ssh-agent? all the usual stuff?
[17:47:16] it uses ssh to set up the connection I believe
[17:47:19] but not later
[17:47:26] yes
[17:47:38] but I haven't tried it yet, decided it wasn't worth it until it's conveniently in our distros ;)
[17:47:48] my latency isn't an issue
[17:48:05] roaming not so much either
[17:48:10] it's usually ok for me but not always
[17:48:47] yup it's in the distro (says yum installing it)
[17:50:20] nice usenix peer review :-D
[17:51:25] omg to never have to type stty sane and reset again...
[17:51:45] heh
[17:55:20] https://github.com/keithw/mosh/issues/120
[17:55:22] rats
[17:55:53] agent forwarding is evil
[17:56:22] ryan told me yesterday that I need to forward my agent for some things to work (like scripts that ssh to other machines)
[17:56:33] it's true
[17:56:58] so, evil or not, there we are
[17:58:42] <^demon> If we end up rigging some kind of deployment system with git, we might be able to remove that necessity for deployment purposes at least.
[17:59:15] that would be one positive step
[17:59:54] whoever is root on a system that you're forwarding your agent to can log in to all systems that you have access to, as you
[18:00:10] ssh-add -c helps
[18:00:13] but not much
[18:00:25] (and it's not supported by all agents, like gnome-keyring-daemon)
[18:00:41] <^demon> This is all known, but has never been high enough on anyone's priority list.
[18:00:47] <^demon> Not breaking what works, and all ;-)
[18:01:14] we figure that at the point where folks have root on the cluster, with the current setup, if they want to screw us we are screwed
[18:01:17] yeah don't forward your agent to all systems
[18:01:37] <^demon> I only forward it if I need to.
[18:01:42] me too
[18:01:47] or i just type the root pass ;)
[18:01:48] well me three
[18:01:56] but that's still not much better
[18:02:04] apergos: that means that I have to run separate agents per groups of machines
[18:02:10] yep
[18:02:13] and we do
[18:02:27] it's not the most convenient thing in the world but that's how it goes
[18:02:56] at least for now
[18:03:05] is the little script in the wiki or was it only in email ?
[18:03:23] which, ben's thing for switching?
[18:03:27] it's on wiki someplace
[18:03:28] paravoid: someone wrote a nice little script so now when i want to ssh to labs i just have it aliased and it switches my ssh key and everything
[18:03:29] yep
[18:03:36] ah yeah, ben is good about that
[18:04:02] yep
[18:04:32] ah warning, I am now officially done for the day
[18:04:39] ok
[18:04:40] :)
[18:05:07] nobody cares unless you disappear, apergos ;-)
[18:05:12] :-P
[18:05:27] it just means that when I say "no" don't be surprised :-D
[18:05:39] but we all know that you won't
[18:05:43] no means no!
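The advice in this stretch of the conversation — don't enable agent forwarding globally, forward it only to the hosts that genuinely need it — can be expressed as a client config. An illustrative sketch; the hostname is hypothetical, not the actual bastion:

```
# Illustrative ~/.ssh/config fragment: agent forwarding off by default,
# enabled only for the one bastion host that actually needs it.
Host bastion.example.org
    ForwardAgent yes

Host *
    ForwardAgent no
```

The `ssh-add -c` mitigation mentioned above complements this: keys added with `-c` require a local confirmation prompt on every use of the agent, which limits (but, as noted, does not eliminate) what a root user on the remote end can do with a forwarded socket.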
[18:06:25] in fact I have been known to say no and mean it
[18:06:37] :-P
[18:07:01] nimsoft is nowhere near as cute of a name as watchmouse
[18:08:33] you are not known to say that apergos
[18:08:36] not at all :P
[18:09:10] don't give the new guy false expectations :-P
[18:10:03] i'm totally telling the truth here
[18:10:29] I'll have you know I say no to CT after hours on a regular basis
[18:10:31] new guy: expect apergos to be online during european work hours, AND evening, and forget to eat/sleep when there's any problem reported by anyone for any severity level
[18:10:35] otherwise I would not have an after hours
[18:10:54] * apergos growls in mark's general direction
[18:14:02] paravoid: well, you shouldn't enable it in your config file
[18:14:10] paravoid: but should explicitly forward it where needed
[18:14:26] there's not too many you need to forward it to, but you do need to forward it to the bastion host
[18:17:24] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483
[18:17:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483
[18:29:32] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483
[18:29:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483
[18:30:16] * andrewbogott waves hello to paravoid
[18:30:19] welcome!
[18:44:30] andrewbogott: thanks! :-)
[18:49:53] New patchset: Lcarr; "replacing statically defined nagios nrpe checks with $USER1$ defined in resource.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4738
[18:50:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4738
[18:50:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4738
[18:50:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4738
[18:51:36] New patchset: Ryan Lane; "Adding Faidon as root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4739
[18:51:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4739
[19:03:25] diederik: what's the call-in number for this?
[19:04:19] skype?
[19:04:25] sure.
[19:04:31] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4739
[19:04:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4739
[19:05:18] hi woosters: can you join me, ben and dario on skype?
[19:06:21] sure
[19:07:13] woosters: i booked r35
[19:33:09] New patchset: Hashar; "testswarm: publish MediaWiki clone to a new dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4743
[19:33:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4743
[19:36:57] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 192 seconds
[19:37:15] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 195 seconds
[19:37:15] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 202 seconds
[19:37:51] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 211 seconds
[19:39:02] yes
[19:39:05] mistype
[19:41:00] !log reimaging bellin and blondel
[19:41:02] Logged the message, notpeter
[20:14:09] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[20:14:27] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[20:14:45] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[20:14:54] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[20:16:12] notpeter: do you know what icinga@neon.wikimedia.org is, and possibly why it wants to send me nagios alerts? (does that mean nagios@spence isn't doing it anymore?)
[20:22:54] nimish_g: icinga is a branch of nagios
[20:23:02] leslie is spinning it up on neon
[20:23:06] but I believe it's not done yet
[20:23:20] although you would have to ask her for the exact status of that project
[20:23:48] ok, will do! thanks
[20:36:06] notpeter: do you know offhand if SMS is expected to be going out from nagios or from icinga?
[20:36:35] in other words, do we care that the icinga outbound mail is bouncing internally?
[20:37:11] I'm getting the icinga mail
[20:37:29] I think that eventually it will all be switched to icinga
[20:37:31] but I'm not sure
[20:37:35] so sms would go with it
[20:38:45] you're probably getting the stuff destined to wikimedia.org b/c that's not failing the authorized-relay test, but the stuff to outside servers (i.e. @att.com) is bouncing at our outbound relays
[20:39:08] ah, gotcha
[20:46:08] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[20:59:11] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[21:00:40] New review: Hashar; "To test the results:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4743
[21:33:22] !log restarted puppet on mw1110
[21:33:24] Logged the message, Mistress of the network gear.
[21:33:30] !log restarted puppet on db30
[21:33:32] Logged the message, Mistress of the network gear.
[21:33:35] !log db1004 puppet is fubar
[21:33:38] Logged the message, Mistress of the network gear.
[21:35:45] !log restarted nrpe on db10
[21:35:47] Logged the message, Mistress of the network gear.
[21:36:53] LeslieCarr: so, there's this really weird thing about mw1110...
[21:36:58] it thinks that it's a search node
[21:37:03] and I can't figure out why
[21:37:11] heh
[21:37:15] interesting
[21:37:17] but it went and added itself to ganglia, nagios, etc
[21:37:21] yeah
[21:37:54] hrm
[21:38:08] did you already try a puppetstoredconfigsclean
[21:38:15] and adding it to the decom list then removing it ?
[21:38:20] those are my two usual quick fixes
[21:38:26] (my equivalent of turning it off and on)
[21:38:35] heh, I did not
[21:38:51] I was just assuming that once it got a real role assigned to it, it would pick that up
[21:39:08] as those boxxies are not currently in site.pp
[21:39:11] ah
[21:39:37] could just assign them boring "standard" and have them not really do anything
[21:40:06] or we could finish spinning them up :)
[21:40:20] then it would just be down to no memcache in eqiad...
[21:40:51] oh yes
[21:40:53] that would be better
[21:40:55] much better
[21:59:53] New patchset: Bhartshorne; "craete a class for swift clients" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4756
[22:00:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4756
[22:00:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4756
[22:00:29] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4756
[22:21:47] New patchset: Diederik; "Wikipedia Zero filters for Orange Uganda, Orange Tunesia and Telenor Montenegro" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4758
[22:22:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4758
[22:22:27] hey maplebed: i am pushing my luck :) but i've got three more wikipedia zero filters
[22:23:17] pusher.
[22:23:56] fortunately, they are not super computationally intensive
[22:24:32] I'll say again what I said before - the trigger for overwhelming the host is not the individual filter but the number.
[22:25:07] the reason for that is that udp2log must hand off packets to all filters that need them for every packet that comes in.
[22:25:27] that loop is what leads to dropped packets, not the computational intensity of crunching those packets once they're handed off to the filter.
[22:25:45] so yay not computationally intensive, but... it doesn't actually matter.
[22:25:51] :)
[22:26:01] why is diederik smiling?
[22:26:14] i am always smiling
[22:26:26] ok then
[22:26:31] but thanks for the explanation maplebed
[22:27:16] yeah, that changeset looks ok.
[22:27:32] diederik: will you watch ganglia for me for the next hour and make sure emery doesn't fall over?
[22:27:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4758
[22:27:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4758
[22:28:03] sure (if i can find emery)
[22:28:18] I dropped the link in our skype chat earlier.
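maplebed's point about why the filter *count* matters more than each filter's cost can be sketched as a toy model. This is not the real udp2log code, just an illustration of the per-packet fan-out he describes:

```python
# Toy model of the udp2log dispatch loop described above: every
# incoming packet must be handed to every configured filter, so the
# total handoff work scales with the number of filters, no matter
# how cheap each individual filter is.
def dispatch(packets, filters):
    handoffs = 0
    for pkt in packets:
        for f in filters:  # this per-packet fan-out is the bottleneck
            f(pkt)
            handoffs += 1
    return handoffs

cheap_filter = lambda pkt: None  # "not computationally intensive"

# 1000 packets through 10 cheap filters is still 10000 handoffs;
# making each filter cheaper does not shrink this loop.
total = dispatch(range(1000), [cheap_filter] * 10)
```

When the fan-out loop cannot keep up with the incoming packet rate, packets are dropped regardless of how little work each filter does once it receives them — which is why adding filters, not the filters' contents, is what risks overwhelming the host.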
[22:29:48] http://ganglia.wikimedia.org/search/ gives a 404
[22:30:56] don't hit enter.
[22:30:59] it's a weird interface.
[22:31:09] enter your search term then wait a sec and click on the result.
[22:31:37] okay, thanks i got the charts
[22:37:27] !log deployed more log filters to emery: gerrit/r4758
[22:37:30] Logged the message, Master
[22:41:23] New patchset: Lcarr; "Fixing mysql fundraising checks to use nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4759
[22:41:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4759
[22:41:47] notpeter: do you know about powercycling older dbs?
[22:42:03] Jeff_Green: https://gerrit.wikimedia.org/r/#change,4759 ?
[22:42:09] I need to kick db44 and the instructions at http://wikitech.wikimedia.org/view/Sun_Fire_X4240 are failing me.
[22:42:10] fixing/updating mysql checks
[22:42:13] or maybe Ryan_Lane
[22:43:25] those should work
[22:43:27] what's failing?
[22:44:06] status_tag : COMMAND PROCESSING FAILED
[22:44:06] error : 246
[22:44:07] error_tag : INVALID TARGET
[22:44:25] super helpful message.
[22:47:40] platform set power state cycle ?
[22:47:46] without the ?
[22:48:03] command not recognized.
[22:48:14] no clue, then.
[22:48:30] set power state cycle -> syntax error.
[22:50:42] would you mind logging in to the mgmt interface for db44 for a moment and see if at least the prompt etc. look normal?
[22:50:55] ( Ryan_Lane )
[22:51:11] * Ryan_Lane doesn't know what a normal prompt looks like
[22:51:19] I've logged into a sun maybe twice
[22:51:33] that's twice more than me.
[22:51:39] ok
[22:51:40] sec
[22:51:49] and our docs only give the command, not the prompt,
[22:51:55] so you can't see what it's supposed to look like.
[22:51:56] :(
[22:52:14] this is a dell...
[22:52:47] racadm serveraction powercycle
[22:52:49] grrr...
[22:53:30] I thought the dells gave you a $ for the prompt.
[22:53:36] ::sigh::
[22:53:41] depends on the version of drac
[22:56:04] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[22:56:11] well there goes my 'it's not the dell prompt I know so it must be a sun' logic.
[22:57:07] heh
[22:59:49] New review: Tim Starling; "I think Jeronim added them." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4261
[23:01:37] RECOVERY - Host db44 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[23:05:40] PROBLEM - mysqld processes on db44 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:07:10] RECOVERY - mysqld processes on db44 is OK: PROCS OK: 1 process with command name mysqld
[23:09:34] PROBLEM - jenkins_service_running on gilman is CRITICAL: Connection refused by host
[23:09:34] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 65968 seconds
[23:10:26] hrm … so for some reason nrpe-server on gilman keeps reloading with the old version of the config file
[23:10:34] even though the config file has changed a while ago
[23:10:37] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 63898 seconds
[23:10:39] any thoughts why this could happen ?
[23:10:55] RECOVERY - jenkins_service_running on gilman is OK: PROCS OK: 3 processes with args jenkins
[23:21:36] Has puppet actually updated it? And is it actually loading the config file that you think it is?
[23:26:05] it has been updated and even loading it manually it fails
[23:26:13] i tried moving the file and making puppet put it back there again