[00:01:01] I'd try 0.67.x if I were you [00:01:19] the stable ones are buggy; the intermediate releases are really crappy [00:01:30] * AaronSchulz is on .69 [00:01:52] I noticed [00:02:47] (03PS1) 10Jdlrobson: Update MobileWebClickTracking schema revision [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87485 [00:03:04] Dinner time. [00:03:04] * paravoid looks at the 112M flat line and smiles [00:03:16] * bd808 waves goodnight [00:03:31] paravoid: hm? [00:03:59] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=copper.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [00:04:15] it's gonna go like this for 4-5 days [00:33:51] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 333 seconds [00:34:11] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 352 seconds [00:57:30] (03PS1) 10Faidon Liambotis: swift: remove swiftcleaner [operations/puppet] - 10https://gerrit.wikimedia.org/r/87494 [00:57:31] (03PS1) 10Faidon Liambotis: swift: remove swift::iptables [operations/puppet] - 10https://gerrit.wikimedia.org/r/87495 [00:57:32] (03PS1) 10Faidon Liambotis: swift: remove conditional for lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/87496 [00:57:33] (03PS1) 10Faidon Liambotis: swift: remove swift::utilities [operations/puppet] - 10https://gerrit.wikimedia.org/r/87497 [00:59:45] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swiftcleaner [operations/puppet] - 10https://gerrit.wikimedia.org/r/87494 (owner: 10Faidon Liambotis) [01:00:20] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swift::iptables [operations/puppet] - 10https://gerrit.wikimedia.org/r/87495 (owner: 10Faidon Liambotis) [01:01:02] (03CR) 10Faidon Liambotis: [C: 032] swift: remove conditional for lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/87496 (owner: 10Faidon Liambotis) [01:01:24] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swift::utilities [operations/puppet] - 10https://gerrit.wikimedia.org/r/87497 (owner: 10Faidon Liambotis) [01:02:02] 9 files changed, 6 insertions(+), 1156 deletions(-) [01:02:05] this is a good day [01:02:09] :) [01:11:47] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 24 seconds [01:12:07] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [01:28:28] !log catrope synchronized php-1.22wmf20/extensions/VisualEditor/ 'Fix SyntaxHighlight icon' [01:28:45] Logged the message, Master [01:30:39] (03PS1) 10Springle: warm up db1038 in s3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87500 [01:31:25] (03CR) 10Springle: [C: 032] warm up db1038 in s3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87500 (owner: 10Springle) [01:32:34] !log springle synchronized wmf-config/db-eqiad.php 'db1038 to s3' [01:32:44] Logged the message, Master [01:53:48] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [02:06:48] (03PS1) 10Springle: depool db1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87501 [02:07:15] (03CR) 10Springle: [C: 032] depool db1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87501 (owner: 10Springle) [02:08:00] !log springle synchronized wmf-config/db-eqiad.php 'depool db1003' [02:08:16] Logged the message, Master [02:11:18] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:15:40] !log upgrading db1003 to 
precise + mariadb [02:15:51] Logged the message, Master [02:16:10] !log LocalisationUpdate completed (1.22wmf19) at Fri Oct 4 02:16:10 UTC 2013 [02:16:22] Logged the message, Master [02:31:49] !log springle synchronized wmf-config/db-eqiad.php 'refresh mw1072 db-eqiad.php' [02:32:00] Logged the message, Master [02:34:08] PROBLEM - Puppet freshness on maerlant is CRITICAL: No successful Puppet run in the last 10 hours [02:34:33] my sync-file db-eqiad.php isn't getting to mw1072 ^ ... where do sync-file / mwdeploy error messages go? [02:34:38] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 02:34:36 UTC 2013 [02:35:00] !log LocalisationUpdate completed (1.22wmf20) at Fri Oct 4 02:35:00 UTC 2013 [02:35:15] Logged the message, Master [02:35:18] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:41:36] (03PS1) 10Springle: db1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87502 [02:41:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 4 02:41:40 UTC 2013 [02:42:29] Logged the message, Master [02:42:57] (03CR) 10Springle: [C: 032] db1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87502 (owner: 10Springle) [02:46:36] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [02:49:46] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:10:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [03:19:49] !log started xtrabackup clone from db1035 to db1003 [03:20:00] Logged the message, Master [03:32:40] !log manual sync-common on mw1072 [03:32:51] Logged the message, Master [03:34:42] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 03:34:35 UTC 2013 [03:35:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:06:47] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:34:07] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 04:34:04 UTC 2013 [04:34:47] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:08:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:14:02] !log on fenari: running an analysis script for bug 53687 against tampa slave DB servers [06:14:16] Logged the message, Master [06:21:19] (03PS1) 10Springle: warm up db1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87511 [06:22:00] (03CR) 10Springle: [C: 032] warm up db1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87511 (owner: 10Springle) [06:23:09] !log springle synchronized wmf-config/db-eqiad.php 'repool db1003 after upgrade' [06:23:24] Logged the message, Master [06:34:45] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 06:34:36 UTC 2013 [06:35:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:03:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 07:03:49 UTC 2013 [07:04:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:00] (03PS1) 10Springle: repool db1035 after upgrade, db1003 to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87514 [07:08:49] (03CR) 
10Springle: [C: 032] repool db1035 after upgrade, db1003 to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87514 (owner: 10Springle) [07:09:37] !log springle synchronized wmf-config/db-eqiad.php 'repool db1035 after upgrade, db1003 to full steam' [07:09:57] Logged the message, Master [07:11:36] !log powercycling maerlant, load > 100, couldn't log in via mgmt console [07:11:47] Logged the message, Master [07:14:15] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri Oct 4 07:14:10 UTC 2013 [07:15:05] RECOVERY - SSH on maerlant is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:15:15] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Fri Oct 4 07:15:05 UTC 2013 [07:35:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 07:35:25 UTC 2013 [07:36:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:49:18] (03PS1) 10Dzahn: add ironholds to role analytics (RT #5831) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 [07:49:45] (03CR) 10jenkins-bot: [V: 04-1] add ironholds to role analytics (RT #5831) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 (owner: 10Dzahn) [07:50:41] (03PS2) 10Dzahn: add ironholds to role analytics (RT #5831) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 [07:52:25] (03PS1) 10Springle: repool db1042 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87517 [07:53:41] (03CR) 10Springle: [C: 032] repool db1042 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87517 (owner: 10Springle) [07:54:21] !log springle synchronized wmf-config/db-eqiad.php 'repool db1042' [07:54:32] Logged the message, Master [07:54:46] (03CR) 10Dzahn: [C: 032] "approved by Howie and Toby and the role memberships enough for now per Otto" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 (owner: 10Dzahn) [08:00:49] !log upgrading db1039 to precise + mariadb [08:01:02] Logged the message, Master [08:04:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 08:04:30 UTC 2013 [08:05:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:05:54] (03PS1) 10Springle: db1039 to s6, plus switch to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87518 [08:07:00] (03CR) 10Springle: [C: 032] db1039 to s6, plus switch to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87518 (owner: 10Springle) [08:18:04] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:28:54] (03CR) 10Dzahn: [C: 032] "removes fundraising jenkins. not having an additional install from 3rd party repo is preferable and Jeff requested it because fr is moving" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya) [08:31:16] (03PS1) 10Dzahn: remove misc::fundraising::jenkins from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/87520 [08:32:51] (03CR) 10Dzahn: "just have to also remove it from site.pp or there will be an unknown class. 
change 87520" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya) [08:33:38] good point [08:33:44] (03PS2) 10Dzahn: remove misc::fundraising::jenkins from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/87520 [08:34:28] (03CR) 10Dzahn: [C: 032] "removed in change 86818" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87520 (owner: 10Dzahn) [08:34:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 08:34:26 UTC 2013 [08:35:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:42:27] (03PS1) 10Ori.livneh: Use line (rather than stack) graph type for static assets [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 [08:42:33] ^ siebrand [08:42:55] * siebrand cheers ori-l on! [08:43:18] (03CR) 10Siebrand: [C: 031] "Should be much better :)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 (owner: 10Ori.livneh) [09:04:45] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 09:04:34 UTC 2013 [09:05:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:18:10] (03PS1) 10Dzahn: load mod_expires on planet webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/87523 [09:18:40] But stacks are so colourful! :) [09:20:57] (03CR) 10Dzahn: [C: 032] load mod_expires on planet webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/87523 (owner: 10Dzahn) [09:23:27] (03PS1) 10Dzahn: retab from tabs to 4 spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 [09:23:33] (03CR) 10jenkins-bot: [V: 04-1] retab from tabs to 4 spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 (owner: 10Dzahn) [09:34:33] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 09:34:27 UTC 2013 [09:35:23] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:44:00] (03PS2) 10Dzahn: planet, retab from tabs to 4 spaces, align =>'s, quoting.. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 [09:44:38] \O/ [09:50:20] (03CR) 10Dzahn: [C: 032] planet, retab from tabs to 4 spaces, align =>'s, quoting.. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 (owner: 10Dzahn) [10:02:28] (03CR) 10Dzahn: [C: 031] "agreed, lines would look better here: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=Static+assets line is default, righ" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 (owner: 10Ori.livneh) [10:06:58] (03PS1) 10Hashar: doc how to run rspec tests in the rakefile [operations/puppet] - 10https://gerrit.wikimedia.org/r/87531 [10:09:28] (03CR) 10Dzahn: [C: 032] "docs :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87531 (owner: 10Hashar) [10:10:08] mutante: thx [10:10:16] yw [10:25:24] (03CR) 10Dzahn: [C: 032] "fixes RAID checks on several ms-be hosts." [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 (owner: 10ArielGlenn) [10:27:42] path conflict on that? hrmm [10:29:35] (03PS2) 10Dzahn: account for hosts where every disk is raid 0 (e.g. the ms-be hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 (owner: 10ArielGlenn) [10:30:52] (03CR) 10Dzahn: [C: 032] account for hosts where every disk is raid 0 (e.g. 
the ms-be hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 (owner: 10ArielGlenn) [10:32:09] ah, files/icinga -> modules/base/files/monitoring/ [10:33:17] waits for Icinga recoveries on quite a few now .. [10:38:07] RECOVERY - RAID on analytics1012 is OK: OK: No disks configured for RAID [10:38:14] (03CR) 10ArielGlenn: [C: 032] Use line (rather than stack) graph type for static assets [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 (owner: 10Ori.livneh) [10:38:17] there it starts :) [10:38:34] apergos: see RECOVERY, thanks for that fix, there will be more soon [10:38:43] yw [10:39:06] RECOVERY - RAID on ms-be8 is OK: OK: No disks configured for RAID [10:39:10] glad to see those warnings go [10:39:20] yes !:) [10:40:31] siebrand: ori-l ^ merged [10:40:42] running puppet now [10:41:49] mutante: yay. Better graphs :) [10:42:05] * Nemo_bis waits for puppet [10:43:35] of course we wouldn't need this fix if we didn't have h310s anymore ;) [10:44:03] but nice fix nevertheless, mutante [10:44:18] if you're feeling up to it, the check is also buggy in other ways [10:44:41] for the rest of the ms-be boxes it says "1 logical drive" while they have 14 [10:44:45] paravoid: i have another check in monitoring we could fix or remove ..:) [10:45:04] Swift HTTP on ms-fe* is HTTP WARNING: HTTP/1.1 401 Unauthorized [10:45:22] ms-fe10xx? [10:45:26] still? [10:45:32] yes, 1001 to 1004 [10:45:42] they are WARN though, not crit [10:45:46] hm, this should have been fixed yesterday [10:45:49] lemme look [10:46:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:46:35] command_name check_http_swift [10:46:35] command_line $USER1$/check_http -H ms-fe.pmtpa.wmnet -I $HOSTADDRESS$ -u /wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/80px-Little_kitten_.jpg [10:46:37] paravoid: ariel wrote the actual fix :) [10:46:38] lol [10:46:40] seriously [10:46:41] for check_raid [10:46:46] hehehe, kittens [10:46:48] hardcoded pmtpa [10:46:52] also the kitten, yes [10:47:11] i see, pmtpa yea [10:47:16] RECOVERY - RAID on ms-be3 is OK: OK: No disks configured for RAID [10:47:36] will fix in a moment [10:47:38] thanks for the ping [10:47:46] RECOVERY - RAID on analytics1014 is OK: OK: No disks configured for RAID [10:48:13] cool, sure [10:49:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.634 second response time [10:49:51] hmm I guess that file should not be renamed, worth protecting? [10:50:09] no [10:50:11] I'll change the URL [10:50:18] I have a better way than this [10:50:25] :) [10:50:32] better so [10:51:05] /monitoring/backend to be exact [10:52:12] and then we have those disk space warnings on some cp10xx, but those are large varnish.bigobj2/varnish.main2 using 95% on /srv/sdb3 so they don't change size [10:54:16] RECOVERY - RAID on analytics1011 is OK: OK: No disks configured for RAID [10:55:27] siebrand: reload the graphs :) [10:55:56] RECOVERY - RAID on ms-be10 is OK: OK: No disks configured for RAID [10:56:36] mutante: right. Now impact is visible much more clearly. The stacking didn't make sense. 
[11:00:36] RECOVERY - RAID on analytics1021 is OK: OK: No disks configured for RAID [11:01:16] RECOVERY - RAID on ms-be7 is OK: OK: No disks configured for RAID [11:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [11:05:00] sysctl::parameters is so pretty [11:05:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.350 second response time [11:06:26] RECOVERY - RAID on ms-be6 is OK: OK: No disks configured for RAID [11:10:34] mutante: are you familiar with lucene at all? [11:11:21] [FAILED] jawiki.spell [11:11:23] and all ja* [11:11:28] [FAILED] zhwiki.spell [11:11:30] and all zh* [11:16:04] paravoid: no, i just added a check to find the string FAILED on those status pages so it tells us when indexing didnt work, but i dont know why it fails :p [11:16:27] ha [11:16:30] ok :) [11:16:34] I guess noone does [11:17:00] sigh, yea [11:17:06] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=stafford.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+pmtpa [11:17:09] puppet!~ [11:17:20] i was just wondering where check_dpkg actually gets installed from now that monitoring is in base module [11:17:31] i see check_raid , but where is check_dpkg [11:17:52] /puppet/modules/base/files/monitoring [11:18:53] looks like we use it but dont install it anymore [11:19:27] mutante: look at modules/nrpe/files and modules/nrpe/manifests/init.pp [11:20:14] akosiaris: thanks, got it. would you expect check-raid.py to be in the same place then ? [11:20:36] confused by base/files/monitoring and nrpe/files/ [11:20:57] I am working on clearing all that up [11:21:00] (03PS1) 10Ori.livneh: Parsoid frontends VCL: strip X-Parsoid-Performance from cache hits [operations/puppet] - 10https://gerrit.wikimedia.org/r/87535 [11:21:06] for now place it alongside check_dpkg please [11:21:08] for some reason host erzurumi never got this file .. but has the nrpe command [11:21:19] -bash: /usr/local/lib/nagios/plugins/check_dpkg: No such file or directory [11:21:22] that's why i was looking [11:21:45] akosiaris: alright :) [11:22:49] mutante: thanks for the ganglia merge [11:24:45] sure, thank ariel [11:25:25] mutante: hardy??? erzurumi is hardy... sigh... [11:25:37] oooh :) [11:25:39] kiiill it [11:25:42] well, then .. [11:25:58] immediately stops trying to fix then:) [11:25:59] err: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate: odd number of arguments for Hash [11:26:19] erzurumi is fundraising leftover, i guess it moves to frack [11:26:19] and it hasn't even asked for the catalog yet... 
and we should ping [11:26:32] jeff [11:26:49] hardy boxes do that but it's harmless [11:28:50] RT #5896 - kill erzurumi :p [11:34:29] (03PS1) 10Faidon Liambotis: miredo: fix typo in source => caused by conversion [operations/puppet] - 10https://gerrit.wikimedia.org/r/87537 [11:34:55] (03CR) 10Faidon Liambotis: [C: 032 V: 032] miredo: fix typo in source => caused by conversion [operations/puppet] - 10https://gerrit.wikimedia.org/r/87537 (owner: 10Faidon Liambotis) [11:39:32] (03PS1) 10Dzahn: move check-raid.py from base/files/monitoring/ to nrpe/plugins/ so that it's in the same place with check_dpkg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 [11:40:21] (03CR) 10jenkins-bot: [V: 04-1] move check-raid.py from base/files/monitoring/ to nrpe/plugins/ so that it's in the same place with check_dpkg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 (owner: 10Dzahn) [11:40:50] oh, heh [11:40:53] https://integration.wikimedia.org/ci/job/operations-puppet-pep8/3787/violations/ [11:41:13] that check didn't run in the former place [11:42:22] root@wtp1001:~# /usr/bin/MegaCli64 -LDInfo -LALL -aALL [11:42:23] Exit Code: 0x00 [11:42:27] empty output [11:42:31] so the new code doesn't catch this [11:43:31] this is so broken [11:43:48] yeah it has that on the rdbs too [11:43:50] ah, yea that's what apergos said earlier today [11:43:55] it doesn't see the controllers at all there [11:44:21] maybe they don't have controllers at all [11:44:33] controller count: 0 [11:44:35] root@ms-be1001:~# check-raid.py [11:44:36] OK: State is Optimal, checked 1 logical device(s) [11:44:39] that's also ridiculous :) [11:44:44] the rdb ones claim to have h310s [11:45:18] print 'OK: State is %s, checked %d logical device(s)' % (state, numDrives) [11:45:20] root@rdb1001:~# lspci -vv | grep -i perc [11:45:20] Subsystem: Dell PERC H310 Mini Monolithics [11:45:21] lol [11:45:23] so [11:45:31] numDrives is actually the number of *physical* disks [11:45:43] in the last logical device [11:46:27] (03CR) 10Dzahn: "the jenkins fail is just "line too long" in the python script and it didn't run before the move" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 (owner: 10Dzahn) [11:53:46] I think I fixed it [11:54:36] OK: State is Optimal, checked 14 logical drive(s), 14 physical drive(s) [11:54:57] :) [11:59:32] (03PS1) 10Dzahn: add a .pep8 file to suppress warnings and jenkins fails due to stuff like long lines, this is just copied from modules/base/files/monitoring where it was done before [operations/puppet] - 10https://gerrit.wikimedia.org/r/87541 [12:00:40] (03CR) 10Dzahn: "reason: jenkins fail in Change-Id: I5bad1070ff40261a09b6bee696355ba17731dc14" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87541 (owner: 10Dzahn) [12:07:32] mutante: are you moving the check? [12:08:16] eh, sure, so far just added Alex to review because he's in the middle of moving it [12:08:28] it's V -1 [12:08:40] that would be fixed by the next change above there [12:08:45] oh [12:08:53] module/nrpe doesnt ignore pep8 long lines yet [12:09:01] but module/base/files/monitoring does [12:09:07] so it didn't show up before [12:09:28] the long lines? 
if you wish:) [12:09:36] I'm hacking on it anyway [12:09:55] ok, cool, just copied behaviour from the base module without questioning it much [12:12:21] !log Inserted 2x 10G MIC into cr1-esams FPC 0, brought online [12:12:32] Logged the message, Master [12:14:31] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:18:44] paravoid: want me to merge the change that moves it, overriding jenkins for now to make it easier? [12:21:51] mutante: puppet on erzurumi fails so badly on 4-5 things (/etc/sysctl.d, sudo, nrpe, httpy, apt::pin) that I am inclined to say "better to kill it than fix it" [12:23:14] akosiaris: oh yea, i already created an RT to kill it, just need to ask Jeff to confirm [12:24:31] alternative: do-release-upgrade or edit apt sources and upgrade without reinstall ..shrug [12:24:53] yeah....... [12:25:03] reinstall seems faster [12:26:08] yea, just no idea about downtime and if it's all puppetized .. role "ActiveMQ instance" [12:26:51] i expect it's most likely that Jeff will say this will be setup freshly in the frack rack [12:27:04] and that we just need an OK and then remove it completely [12:30:27] akosiaris: move check-raid: change 87538 (i know it uses the file from another module then, but you are changing that anyways) and the reason it fails would be because it doesnt ignore pep8 yet, 87541 , though Faidon said he'll fix the long lines anyways now .. [12:30:55] so dunno if you wanna abandon the second one, fine with me too [12:33:31] mutante: the pedantic person in me says don't suppress warnings... fix them :-) [12:34:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 12:33:51 UTC 2013 [12:34:10] so let's wait for Faido [12:34:13] Faidon* [12:34:24] hey [12:34:31] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:34:33] (03Abandoned) 10Dzahn: add a .pep8 file to suppress warnings and jenkins fails due to stuff like long lines, this is just copied from modules/base/files/monitoring where it was done before [operations/puppet] - 10https://gerrit.wikimedia.org/r/87541 (owner: 10Dzahn) [12:34:37] ok:) [12:35:31] tbh, I'm not too excited to see all kinds of random checks inside the nrpe module [12:35:47] I'd like to see the checks be near the modules that use them, rather than in a central place [12:35:53] for raid this means base I guess [12:36:07] thought it would be modules/icinga/plugins/ [12:36:16] yeah or even that [12:36:20] they are just executed via nrpe [12:36:25] but they are still normal plugins [12:36:36] mark, hi, did you have any luck with ESI issue? [12:36:37] they should be virtual and collected by nrpe [12:36:38] so, for example, check-raid may have a dependency on some python library and accompanied by a package resource [12:36:42] and an addition to the plugins from the distro package [12:36:52] and labs doesn't need check-raid at all [12:36:53] realized might be a better name [12:37:06] having a $::realm check inside the icinga or nrpe modules sounds wrong to me [12:37:44] so... a define and nrpe realizing these things ? [12:38:00] you mean a nrpe::plugin or something like that? [12:38:07] yes [12:38:22] yup, that sounds reasonable to me [12:38:38] ok [12:38:43] will do [12:38:49] but not now [12:38:49] why separate nagios plugins by location they're being run [12:38:56] ? 
[12:39:06] some plugins are executed on the monitoring host [12:39:15] and some are on the remote hosts, via nrpe [12:39:30] some maybe both [12:39:34] but besides that, they are all just nagios/icinga plugins [12:39:43] that is also true [12:40:00] so, maybe icinga::plugin, sure [12:45:01] so the latest megacli from lsi (which has to be manually extracted from the rpm) sees the controller on the rdb boxes [12:45:13] we need an updated megacli anyway [12:45:19] to come with libsysfs too [12:45:24] same applies to the ms* boxes [12:45:30] where can I put this? [12:45:46] we have a package called wikimedia-raid-utils [12:45:47] but [12:46:09] maybe we should just use http://hwraid.le-vert.net/wiki/DebianPackages [12:46:34] what's the latest version you found? [12:46:56] I have 8.04.53 on my disk [12:46:57] 8.05.06 [12:47:03] that site has 8.04.07-1 [12:48:44] I can't imagine a minor version change would make the difference, and it would be nice to have the packaging and updating handled for us [12:48:59] it does make a difference sometimes [12:49:00] happy to test it [12:49:03] ergh [12:49:29] I remember having a segfault when trying to discard some preserved cache [12:49:42] and the version we have in wikimedia-raid-utils didn't have the command at all [12:49:48] the latest version segfaulted [12:49:55] and I went 3 or 4 versions behind and it worked [12:49:58] it was a fun ride :) [12:50:01] great [12:50:07] so here's what I propose [12:50:45] add the repo & key above to files/misc/reprepro/updates [12:51:38] use the grep construct to pick just megacli and perhaps megarc & megamgr (also shipped by wikimedia-raid-utils, although I'm unsure if it's still used anywhere) [12:51:59] grep for wikimedia-raid-utils and ensure => absent it, and add megacli ensure => present [12:52:02] makes sense? [12:53:09] yes but I am going to test the specific version first before proceeding [12:53:34] also want to see what else might be in the wikimedia-raid-utils package if anything [12:54:03] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [12:54:54] there's megarc, megamgr, arcconf and tw_cli [12:57:01] (03PS1) 10Faidon Liambotis: base: fix check-raid to handle no or multiple LDs [operations/puppet] - 10https://gerrit.wikimedia.org/r/87548 [12:57:13] apergos: a review would be most welcome :) [12:57:13] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:32] ok, gimme two secs [12:58:14] 5.3T in swift eqiad already [12:58:16] sweet [13:04:49] meh the packaged megacli64 wants libsysfs2.0.1 (or maybe 2.0.2), precise has 2.1.0 [13:04:52] so that's a fail [13:06:35] the version from lsi's web site did not need libsysfs [13:09:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:10:41] no [13:10:48] it embeds 2.0.1 [13:10:58] and it's needed [13:11:10] it dlopens() it and it's needed for some configurations only [13:11:52] dpkg -L it, you'll find it installs libsysfs as well and a wrapper to preload it [13:12:50] the lsi version did not need it for MegaCli64 -LDInfo -LALL -aALL but the debian package version did [13:13:16] same box? [13:13:24] is WAP dead yet or not [13:13:27] same box. [13:13:36] hm [13:13:40] but the package worked as is, right? 
[13:13:57] I have not installed the package as a deb yet [13:14:04] I just put the extracted binary up first [13:14:05] root@ms-be1001:~# dpkg --contents megacli_8.04.07-1_amd64.deb |grep libsys [13:14:08] -rw-r--r-- root/root 93370 2012-08-21 21:16 ./usr/lib/megacli/libsysfs.so.2.0.2 [13:14:43] -rw-r--r-- root/root 93370 2012-08-21 21:16 ./usr/lib/megacli/libsysfs.so.2.0.2 [13:14:46] ser [13:14:49] er [13:14:50] opt/MegaRAID/MegaCli [13:14:51] opt/MegaRAID/MegaCli/MegaCli [13:14:51] opt/MegaRAID/MegaCli/MegaCli64 [13:14:51] opt/MegaRAID/MegaCli/libstorelibir-2.so.13.05-0 [13:14:53] opt/opt/lsi/3rdpartylibs/x86_64/libsysfs.so.2.0.2 [13:14:57] opt/opt/lsi/3rdpartylibs/LGPLLicenseV2.txt [13:14:58] opt/opt/lsi/3rdpartylibs/libsysfs.so.2.0.2 [13:15:08] is what I see on the 8.04.53 rpm [13:15:20] yes, the rpm I have has this library [13:15:22] however [13:15:49] anyway, it doesn't matter [13:15:50] that library is not needed for this command, for whatever reason [13:15:53] doesn't matter what precise has [13:16:01] the deb embeds the version that the binary requires [13:16:04] and provides a wrapper to load it [13:16:16] the deb from hwraid's sid works on precise, I tested it [13:16:17] so that's fine [13:16:43] all righty then [13:18:04] iirc, LSI progressively uses more libsysfs functionality with each version [13:18:33] shall I merge 87548 ? [13:19:47] (03CR) 10Faidon Liambotis: [C: 032] base: fix check-raid to handle no or multiple LDs [operations/puppet] - 10https://gerrit.wikimedia.org/r/87548 (owner: 10Faidon Liambotis) [13:20:08] in the case where all it prints is the exit code [13:20:17] this will accept that as 'ok'.... right? [13:20:28] right [13:21:27] if we had that in place we would never have known about the issue with the version of megacli we have not working with the rdb and other hosts [13:22:13] so I would prefer that blank lines followed by the exit code give us a warning [13:22:27] hm [13:22:34] I wonder if there are cases where this is valid [13:23:14] so, if you run megacli on a box that has no LSI controller [13:23:18] this is what you get [13:23:40] RECOVERY - RAID on terbium is OK: OK: No disks configured for RAID [13:23:54] the check is so broken [13:24:09] it assumes you only have one raid variant... [13:24:16] so ms-be boxes have both mdadm & megacli [13:24:41] getLinuxUtility should at least know not to run megacli on a non lsi controller host [13:25:23] so that would, if it is not buggy, address the one concern [13:27:38] is this the one we're using ? http://exchange.nagios.org/directory/Plugins/System-Metrics/Storage-Subsystem/Raid-Check-using-megaclisas/details [13:28:13] eh, "remotely check" with HOSTNAME/USERNAME? really [13:28:56] (03PS1) 10Faidon Liambotis: base: warn on megacli unknown controllers [operations/puppet] - 10https://gerrit.wikimedia.org/r/87554 [13:29:30] RECOVERY - RAID on rdb1001 is OK: OK: No disks configured for RAID [13:29:35] nevermind, just a random attempt to search nagios exchange for megacli if there is newer stuff [13:29:40] apergos: ^ ? 
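A minimal sketch of the guard apergos suggests above: probe for an LSI controller (the same lspci check quoted from rdb1001) before invoking MegaCli at all. The real fix landed inside check-raid.py itself (change 87554, seen below); this standalone shell version is illustrative only, not the deployed code.

    # Hypothetical pre-check; exit codes follow the Nagios plugin
    # convention (0=OK, 1=WARNING, 2=CRITICAL).
    if ! lspci | grep -qi -e megaraid -e perc; then
        echo "WARNING: No known controller found"
        exit 1
    fi
    /usr/bin/MegaCli64 -LDInfo -LALL -aALL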
[13:30:04] yep [13:30:07] lgtm [13:30:23] root@rdb1001:~# python check-raid.py [13:30:24] WARNING: No known controller found [13:30:43] as it should be [13:31:01] (03CR) 10Faidon Liambotis: [C: 032] "(spotted by apergos)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87554 (owner: 10Faidon Liambotis) [13:31:21] cool [13:31:23] let's fix megacli now :) [13:31:37] will do in 5 mins, multitasking [13:31:49] I can do it too if you don't mind [13:32:47] up to you, happy to do it [13:35:20] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 13:35:10 UTC 2013 [13:36:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:56:18] (03PS1) 10Faidon Liambotis: Replace wikimedia-raid-utils by up-to-date megacli [operations/puppet] - 10https://gerrit.wikimedia.org/r/87558 [13:56:37] anyone wants to have a look? [13:57:36] you were much faster than me [13:57:51] I was still adding the repo information to the updates file [13:58:06] oh sorry, I thought I took the lock :/ [13:58:16] no worries [14:00:13] are you reviewing it? [14:00:31] I am looking at it right now [14:00:34] yep [14:00:50] k [14:02:54] my only concern is that the other utils tw_cli arcconf etc are referenced in the check_raid script and we don't replace them when we remove the wikimedia-raid-utils package [14:03:05] do we actually use them? [14:03:34] if check-raid warns, we can install them [14:03:53] we can also use a fact for this and install them on demand [14:04:06] but I'd prefer not pulling all kinds of non-free crap into our repo unless we actually use them [14:04:18] if it was regular free software I might not have minded that much [14:04:33] but since it's binaries made by who knows who and where, let's not fetch them for no reason [14:04:35] I like the idea of the fact [14:04:52] yeah I already hate that we have this random lsi binary on there [14:05:09] other than that? [14:06:00] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 14:05:59 UTC 2013 [14:06:11] think it's ok [14:06:32] C+1? [14:06:40] ah sure, sorry :-D [14:07:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:07:48] (03CR) 10ArielGlenn: [C: 031] "we'll see if the other utils from the wikimedia-raid-utils packages are actually needed, if so they can be added later" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87558 (owner: 10Faidon Liambotis) [14:07:49] (03CR) 10Faidon Liambotis: [C: 032] Replace wikimedia-raid-utils by up-to-date megacli [operations/puppet] - 10https://gerrit.wikimedia.org/r/87558 (owner: 10Faidon Liambotis) [14:08:30] if puppet wasn't at > 100%, we might have that in 30' across the infra [14:08:53] eventually [14:21:51] ok, I think it'll fail for the first run now [14:22:01] and we're going to have check raid nagios fun for 30+ minutes [14:23:58] (03PS1) 10Faidon Liambotis: check-raid.py: switch to new megacli binary name [operations/puppet] - 10https://gerrit.wikimedia.org/r/87564 [14:24:10] (03CR) 10Faidon Liambotis: [C: 032] check-raid.py: switch to new megacli binary name [operations/puppet] - 10https://gerrit.wikimedia.org/r/87564 (owner: 10Faidon Liambotis) [14:24:45] ah that's why [14:24:51] (03CR) 10Faidon Liambotis: [V: 032] check-raid.py: switch to new megacli binary name [operations/puppet] - 10https://gerrit.wikimedia.org/r/87564 (owner: 10Faidon Liambotis) [14:24:59] I'm not sure if I hate puppet or jenkins more [14:25:07] why choose? 
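A sketch of the "use a fact for this and install them on demand" idea floated above: facter (1.7+) picks up executable external facts from /etc/facter/facts.d/, so a short script can report which RAID utility a host would need and puppet can key the package installs off the resulting fact. The fact name, path and detection heuristics here are assumptions for illustration, not what was deployed.

    #!/bin/sh
    # /etc/facter/facts.d/raid_utility.sh (hypothetical name and path)
    # Prints key=value; facter then exposes it as $::raid_utility.
    if lspci | grep -qi megaraid; then
        echo "raid_utility=megacli"
    elif lspci | grep -qi adaptec; then
        echo "raid_utility=arcconf"
    elif lspci | grep -qi 3ware; then
        echo "raid_utility=tw_cli"
    elif grep -qs '^md' /proc/mdstat; then
        echo "raid_utility=mdadm"
    fi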
:-D [14:26:30] RECOVERY - RAID on wtp1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:26:36] yay [14:27:10] good [14:27:17] indeed [14:27:43] ok, so let's wait an hour for everything to settle down [14:28:07] root@ms-be6:~# check-raid.py [14:28:07] OK: No disks configured for RAID [14:29:15] btw, I don' [14:29:36] bah nevermind [14:29:43] ? [14:30:59] WARNING: arcconf returned exit status 1 [14:31:01] professor! [14:31:03] look at that [14:31:04] bahhh [14:34:13] Number Of Drives per span:2 [14:34:13] Span Depth : 6 [14:34:16] does that mean 12 disks? [14:34:18] probably [14:34:21] so I'm miscounting disks [14:34:26] root@ms1002:~# check-raid.py [14:34:26] OK: State is Optimal, checked 5 logical drive(s), 10 physical drive(s) [14:34:38] it counts LD properly, but PDs not [14:34:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 14:34:44 UTC 2013 [14:34:55] it works on both adapters though :) [14:35:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:35:20] still much better than it was [14:37:15] if email body contains "pls do needful" then mail is from India [14:37:22] notice: Finished catalog run in 202.45 seconds [14:37:22] root@rdb1001:~# check-raid.py [14:37:22] OK: State is Optimal, checked 2 logical drive(s), 4 physical drive(s) [14:37:28] good [14:41:20] RECOVERY - RAID on solr2 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:41:40] RECOVERY - RAID on wtp1003 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:41:40] RECOVERY - RAID on solr3 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:41:49] here they come [14:42:10] RECOVERY - RAID on wtp1004 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:42:17] yup [14:42:20] RECOVERY - RAID on solr1003 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:43:00] RECOVERY - RAID on rdb1002 is OK: OK: State is Optimal, checked 2 logical drive(s), 4 physical drive(s) [14:47:00] RECOVERY - RAID on solr1002 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:47:30] RECOVERY - RAID on solr1 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:50:10] RECOVERY - RAID on wtp1002 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:50:11] RECOVERY - RAID on erbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:51:00] (03PS1) 10Faidon Liambotis: check-raid: readd support for arcconf (Adaptec) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87570 [14:51:22] !request help Thehelpfulone [14:51:40] RECOVERY - RAID on labsdb1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:51:43] (03CR) 10Faidon Liambotis: [C: 032] check-raid: readd support for arcconf (Adaptec) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87570 (owner: 10Faidon Liambotis) [15:03:53] !log Completely disconnected csw1-esams from the network [15:04:05] Logged the message, Master [15:04:10] (03PS1) 10Faidon Liambotis: base: cleanup sudo definitions for check-raid.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/87574 [15:04:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 15:04:38 UTC 2013 [15:04:48] (03CR) 10Faidon Liambotis: [C: 032] base: cleanup sudo definitions for check-raid.py [operations/puppet] - 
10https://gerrit.wikimedia.org/r/87574 (owner: 10Faidon Liambotis) [15:05:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:05:43] omg how unbelievably shitty [15:06:12] (03PS2) 10Faidon Liambotis: base: cleanup sudo definitions for check-raid.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/87574 [15:06:21] (03CR) 10Faidon Liambotis: [C: 032 V: 032] base: cleanup sudo definitions for check-raid.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/87574 (owner: 10Faidon Liambotis) [15:06:30] RECOVERY - RAID on labsdb1003 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:08:00] RECOVERY - RAID on labsdb1002 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:09:42] anyone knows why wtp1008 doesn't run parsoid [15:12:41] paravoid: My shell access works now. Thanks for your help with that. [15:19:00] RECOVERY - RAID on solr1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:25:30] bd808: np [15:27:50] I love how ocwiki has 190k jobs in the jobqueue [15:28:02] and, well, enwiki 3.8 million [15:28:06] with our threshold being 10k [15:30:18] we need domas to slap oc.wiki again [15:32:18] It will probably peak even worse soon, as with last VisualEditor deployement? [15:34:25] are we at the point when a single runner can't handle enwiki? [15:35:13] !log Added another 10G link to cr1-esams:ae1 <--> csw2-esams:ae1, now 40Gbps total [15:35:14] Single? Wasn't it just doubled [15:35:28] Logged the message, Master [15:35:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 15:35:30 UTC 2013 [15:36:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:40:10] RECOVERY - RAID on stafford is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:41:30] PROBLEM - DPKG on labstore4 is CRITICAL: Connection refused by host [15:43:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:44:30] RECOVERY - DPKG on labstore4 is OK: All packages OK [15:44:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.675 second response time [15:44:31] PROBLEM - Disk space on cp1061 is CRITICAL: Connection refused by host [15:46:09] (03PS1) 10Faidon Liambotis: Remove labstore4 & labstore1002 from decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/87582 [15:46:28] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Remove labstore4 & labstore1002 from decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/87582 (owner: 10Faidon Liambotis) [15:48:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:48:53] ok, no RAID warn/crit at all anymore :) [15:49:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 17.702 second response time [15:54:35] (03CR) 10Ori.livneh: "I've argued before that I think git-deploy's Puppet resources should allow software projects to have their deployment configuration alongs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane) [15:57:30] (03PS4) 10Umherirrender: $wgCaptchaWhitelist: whitelist also links with query or anchor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 [16:00:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds 
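The physical-drive miscount chased above (a span of 2 drives with span depth 6 is 12 physical drives, not 2) boils down to multiplying two fields of the -LDInfo output per logical drive. A rough awk rendering of the corrected arithmetic, using only the field names quoted in the log; MegaCli output varies between firmware revisions, so treat this as a sketch rather than the actual check-raid.py logic.

    /usr/bin/MegaCli64 -LDInfo -LALL -aALL | awk -F: '
        /^Virtual Drive/   { lds++ }           # one block per logical drive
        /Number Of Drives/ { span = $2 }       # per-span count (or total, single-span)
        /^Span Depth/      { pds += span * $2 }
        END { printf "checked %d logical drive(s), %d physical drive(s)\n", lds, pds }'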
[16:07:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 16:07:47 UTC 2013 [16:08:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:09:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.971 second response time [16:13:08] scapping... [16:15:11] !log maxsem Started syncing Wikimedia installation... : https://gerrit.wikimedia.org/r/87584 [16:15:23] Logged the message, Master [16:25:10] RECOVERY - Host mw1125 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [16:27:10] PROBLEM - twemproxy process on mw1125 is CRITICAL: Connection refused by host [16:27:20] PROBLEM - DPKG on mw1125 is CRITICAL: Connection refused by host [16:27:20] PROBLEM - Disk space on mw1125 is CRITICAL: Connection refused by host [16:27:30] PROBLEM - RAID on mw1125 is CRITICAL: Connection refused by host [16:29:25] !log aaron synchronized php-1.22wmf20/includes/filerepo/file/LocalFile.php 'b48403885d5ac993cb3b3ce7ac7580002ab12e1f' [16:29:39] Logged the message, Master [16:30:09] mhm, my scap is hanging [16:30:45] !log maxsem Finished syncing Wikimedia installation... : https://gerrit.wikimedia.org/r/87584 [16:30:55] phew [16:30:57] Logged the message, Master [16:33:35] MaxSem: you made me hold my breath for a second while reading scrollback [16:33:50] PROBLEM - Host mw1125 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:10] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 16:36:06 UTC 2013 [16:37:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:39:20] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 90.19 ms [16:39:30] RECOVERY - Varnish HTTP upload-backend on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.181 second response time [16:40:02] MaxSem: thanks again; fix confirmed [16:41:20] PROBLEM - Varnish HTTP upload-frontend on cp3004 is CRITICAL: Connection refused [16:41:30] PROBLEM - Varnish traffic logger on cp3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [16:42:14] ori-l: g'morning [16:42:44] greg-g: morning [16:42:54] have a good nap? [16:42:54] ori-l: you're supposed to go crash! [16:42:58] *were [16:43:05] maybe he power nap'd? [16:43:19] more like tried to nap for 20m, then say 'fuck it', and get back on IRC? [16:44:07] * bd808 believes that ori-l can go 2 weeks without sleep [16:44:43] ok, enough [16:44:49] this is a puppet freshness channel [16:44:52] keep things topical [16:44:53] [16:45:06] ignoring that bot was like, the 95th best decision of my life [16:45:09] * Nemo_bis feels tropical [16:45:21] 95th or 95th percentile? [16:45:29] 95th [16:45:33] hm [16:45:43] you made lots of good decisions in your life, congrats [16:45:49] 94th was ignoring one particular person, and so was 93rd :P [16:46:22] not very helpful if you still remember them :P [16:47:03] Nemo_bis: only as part of a long list of good decisions :) [16:47:20] RECOVERY - Varnish HTTP upload-frontend on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.182 second response time [16:47:23] (03CR) 10Anomie: [C: 031] "Tested now, works as advertised. Maybe Reedy can deploy it on Monday." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 (owner: 10Umherirrender) [16:47:30] RECOVERY - Varnish traffic logger on cp3004 is OK: PROCS OK: 2 processes with command name varnishncsa [16:47:42] ugh, forgot to confirm with sean pringle what his plans are re db migrations next week [16:47:47] anyone in here happen to know? [16:48:00] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [16:48:43] * bd808 follows YuviPanda's lead and /ignores icinga-wm [16:49:32] censors! [16:51:10] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:44] * YuviPanda ignores icinga-wm for greg-g [17:03:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 17:03:43 UTC 2013 [17:04:01] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:06:48] ignore icinga-wm why? [17:14:04] who fixed puppet and how? :) [17:15:35] greg-g: might know, depends what migration [17:15:50] apergos: master db rotation [17:16:37] hm nope, I know of two other things but [17:16:43] not that one [17:17:27] heh, what are those two other things, apergos ? [17:18:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection refused [17:18:10] one is some things moving to mariadb [17:18:30] and another is continued slow conversion to file-per-table [17:18:44] not in the loop about masterdb rot though [17:19:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:20:16] RECOVERY - DPKG on db9 is OK: All packages OK [17:21:50] apergos: hmm, either of those tow you mention require read-only on the msaters? [17:22:58] nope, slaves only just now [17:23:05] RECOVERY - DPKG on erzurumi is OK: All packages OK [17:23:32] (03PS1) 10Akosiaris: Cleanup swift monitor_service entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/87598 [17:26:37] also, wowie typos [17:30:09] (03CR) 10Marco: [C: 031] Allow Commons admins self-adding translationadmin group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86366 (owner: 10Rillke) [17:34:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 17:34:32 UTC 2013 [18:34:23] (03CR) 10GWicke: "I don't think this will work as the front-ends do not cache at all. IMO it is better to do this client-side." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/87535 (owner: 10Ori.livneh) [18:40:10] (03PS1) 10Faidon Liambotis: autoinstall: add erzurumi, raid1-1partition.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87609 [18:40:56] (03CR) 10Faidon Liambotis: [C: 032 V: 032] autoinstall: add erzurumi, raid1-1partition.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87609 (owner: 10Faidon Liambotis) [18:41:45] (03Abandoned) 10Ori.livneh: Parsoid frontends VCL: strip X-Parsoid-Performance from cache hits [operations/puppet] - 10https://gerrit.wikimedia.org/r/87535 (owner: 10Ori.livneh) [18:44:44] !log reinstalling erzurumi [18:45:02] Logged the message, Master [18:46:03] PROBLEM - Host erzurumi is DOWN: PING CRITICAL - Packet loss = 100% [18:51:12] RECOVERY - Host erzurumi is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [18:53:22] PROBLEM - RAID on erzurumi is CRITICAL: Connection refused by host [18:53:42] PROBLEM - Disk space on erzurumi is CRITICAL: Connection refused by host [18:53:52] PROBLEM - SSH on erzurumi is CRITICAL: Connection refused [18:54:02] PROBLEM - DPKG on erzurumi is CRITICAL: Connection refused by host [18:55:33] * johnbender waves at Krinkle [18:55:43] johnbender: so roughly, it's a case of the musical chairs game [18:55:56] depool, upgrade, repool, for each of the slaves [18:56:00] and eventually master as well [18:56:02] and then repeat [18:56:05] and repeat [18:56:13] until the replag is small enough to do it live [18:56:42] Krinkle: I'm curious about what specifically is done to migrate the schema at the syntactic or tooling level [18:56:47] the first run could take days depending on the kind of schema change [18:56:55] Krinkle: yah [18:57:04] Krinkle: that's still very useful info [18:57:04] paravoid: AaronSchulz: Fill in here :) johnbender wants to know how we do schema changes [18:57:44] RobH: Reedy: also, maybe. not sure who knows it. y'all know it better than me for sure. [18:57:52] I'm curious about how you guys write/build your migrations [18:58:17] johnbender: it depends on the schema we're modifying [18:58:25] in production we do it by hand [18:58:32] Ryan_Lane: so SQL DDL and DML? [18:58:35] some schema migrations can take a very long time [18:58:44] you probably want to talk to springle-away [18:58:45] I can imagine [18:58:55] * johnbender makes note [18:58:56] RoanKattouw_away may be able to help, too [18:59:28] johnbender: example of a schema change https://github.com/wikimedia/mediawiki-core/commit/9c40037b0077b772ddd5691824de4a26fe8fc29a [18:59:49] note, we don't use the Updater.php in production though, but we do use the patch.sql files [19:00:17] Krinkle: Ryan_Lane: what about alters or drops [19:01:09] afaik the same [19:02:01] software upgrade (e.g. the web app) is usually first one to go, and by design has to be compatible with the current (previous) schema. [19:02:19] Krinkle: alright so I should be able to root around in that patches directory to find interesting things! [19:02:41] though I think for some changes where we can't or don't want to be backcompat in software in the past, we'd do the schema change first and then deploy the software update. though that is rare I think [19:03:00] Krinkle: this is useful context [19:03:08] johnbender: Yep, that should be a good starting point. Feel free to ask back here anytime. 
As well as the wikitech-l mailing list [19:03:34] https://lists.wikimedia.org/mailman/listinfo/wikitech-l [19:03:38] https://www.mediawiki.org/wiki/Mailing_lists * [19:04:02] Krinkle: Ryan_Lane: thank you both so much [19:04:17] Krinkle: do you know how a given migration knows which patche scripts to use? [19:04:21] *patch [19:04:52] RECOVERY - SSH on erzurumi is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:05:52] PROBLEM - NTP on erzurumi is CRITICAL: NTP CRITICAL: No response from NTP server [19:10:03] johnbender: okay, this is grey area for me, but I would guess we use the core/maintenance/sql.php script, which we pass --wiki (one of the 800+ wiki database names, of which en.wikipedia.org is 1), and the path to the sql patch [19:10:07] I haven't done or seen anyone do that, but that's my guess given the puzzle pieces I have [19:10:50] I suppose we'd use at least some scripting as to not have to do it 800+ times and for each of the db slaves. [19:11:34] probably in groups of N number of wikis at a time. e.g. by db cluster segment (all wikis are in one of 7 db cluster segments) [19:11:42] hm.. I guess I know it better than I thought [19:12:48] all wikis: https://github.com/wikimedia/operations-mediawiki-config/blob/master/all.dblist db config: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/db-eqiad.php [19:14:43] johnbender: So which migration tool is it you talked about in the e-mail about the UCLA paper? [19:18:38] johnbender: we usually use OSC by percona [19:19:08] unless the table has no primary key, in which case you need the musical chairs [19:20:17] Krinkle: the unfortunately named PRISM [19:21:08] AaronSchulz: do you do any verification beforehand other than testing with a staging cluster or similar? [19:21:29] Krinkle: per usual it's not in wide use but the paper is pretty impressive [19:21:44] it includes nice things like query rewriting etc [19:24:40] cool, so we don't use the musical chairs always. that's good to know. [19:24:58] AaronSchulz: Do you know how far back our use of that dates? [19:25:03] (OSC) [19:27:01] over a year, asher was the one who started it [19:27:40] AaronSchulz: Did we have something else before, or was it all manual before that (musical chairs or just live query) [19:31:58] all musical chairs except for the logging table PK addition, which was tim manually doing something kind of like OSC [19:32:19] since you had to ensure consistent IDs across slaves [19:32:28] that was fun... 
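Krinkle flags the per-wiki mechanics above as a guess, so the following is doubly hedged: a sketch of what "some scripting" over all.dblist could look like, using the mwscript wrapper that runs a core maintenance script against one wiki. The patch filename is hypothetical and this is not a confirmed production procedure.

    # Apply one patch.sql to every wiki in all.dblist, one db at a time;
    # in practice this would likely be batched per db cluster segment.
    for db in $(cat all.dblist); do
        mwscript sql.php --wiki="$db" maintenance/archives/patch-example.sql
    done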
[19:32:57] though a dump ALTER might have worked [19:33:03] *dumb [19:40:01] (03CR) 10Brion VIBBER: [C: 032] "Here's the schema version change:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87485 (owner: 10Jdlrobson) [19:40:37] (03Merged) 10jenkins-bot: Update MobileWebClickTracking schema revision [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87485 (owner: 10Jdlrobson) [19:44:10] PROBLEM - Host srv291 is DOWN: PING CRITICAL - Packet loss = 100% [20:20:29] (03PS1) 10Faidon Liambotis: swift: extend token validity 1d -> 7d [operations/puppet] - 10https://gerrit.wikimedia.org/r/87621 [20:20:50] (03CR) 10Faidon Liambotis: [C: 032 V: 032] swift: extend token validity 1d -> 7d [operations/puppet] - 10https://gerrit.wikimedia.org/r/87621 (owner: 10Faidon Liambotis) [20:50:03] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 6 copy to table, 165 statistics [20:51:03] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 6 copy to table, 4 statistics [21:08:13] paravoid: just checking, do you know anything about sean's db rotation work next week? [21:08:21] I'm sorry, no [21:08:59] no worries [21:09:20] dang saturdays in the future [21:11:03] ksnider: so, since it is 2:10pm Pacific on Friday, and I don't know exactly when these things should be scheduled, I'm going to have to suggest we delay it one week so we can send out the right amount of communication :/ [21:19:54] greg-g: Understood - I've reached out to Sean for details [21:21:29] ksnider: thank you [21:21:50] I don't want us to get flack for even 10 minutes of read-only without warning :/; [21:21:53] -; [21:22:02] +adequate [21:43:18] (03PS6) 10Krinkle: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [21:43:22] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [21:55:45] (03CR) 10Faidon Liambotis: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87598 (owner: 10Akosiaris) [21:55:50] (03PS7) 10Krinkle: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [21:56:42] (03CR) 10Krinkle: "* "Unstuck" this revision, it had the Change-Id footer of another change (the one it depends on that has been merged)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [22:08:01] (03PS1) 10Faidon Liambotis: swift: fix swift::proxy::monitoring for eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87633 [22:09:47] (03CR) 10Faidon Liambotis: [C: 032] swift: fix swift::proxy::monitoring for eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87633 (owner: 10Faidon Liambotis) [22:18:17] RoanKattouw: around? [22:18:23] Yes [22:18:27] For a little bit [22:18:37] I have to pack up and go to the airport in ~10 [22:18:39] do you know what's the deal with wtp1008? [22:18:49] I don't know the latest on that [22:18:53] icinga complains that parsoid isn't running there [22:18:58] what's the latest-1? 
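For the "OSC by percona" AaronSchulz mentions a little further up: assuming this refers to Percona Toolkit's pt-online-schema-change, the tool copies the table into an altered shadow copy in chunks while triggers keep it in sync, then swaps it in. That chunked copy needs a primary key, which is why PK-less tables fall back to the depool/alter/repool musical chairs. The table and column below are made up for illustration.

    # Dry run first, then the real online copy-and-swap.
    pt-online-schema-change --alter "ADD COLUMN example_col INT NOT NULL DEFAULT 0" \
        --dry-run D=enwiki,t=example_table
    pt-online-schema-change --alter "ADD COLUMN example_col INT NOT NULL DEFAULT 0" \
        --execute D=enwiki,t=example_table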
[22:19:06] At some point Chris had stolen it for testing re some CPU issue [22:19:16] I recently noticed that the pybal lists show it as pooled again [22:19:43] And I think he said something about how he was done with it, but I'm not sure [22:19:58] but it should have an up-to-date parsoid copy? [22:20:12] So, please check with Chris whether wtp1008 is ready for me to take back (or see if he said so in RT), and then email me and I'll get it fixed up [22:20:15] i.e. was it part of your deployments? [22:20:19] I don't know [22:20:22] I'll have to look into that [22:20:25] okay [22:20:27] thank you [22:20:29] have a nice flight :) [22:20:47] But please check with Chris first, because there's no point in me fixing that box back up if he's just gonna take it apart again [22:20:56] I *think* he's done with it, but I'm not totally sure [22:20:56] nod [22:21:02] I'll check with him, don't worry [22:26:51] <^d> RoanKattouw: Can I get +Aiotv on #wikimedia-dev? [22:27:01] <^d> Had it on #mediawiki forever, forgot I didn't on -dev. [22:27:14] ^d: I have to run, sorry [22:27:21] <^d> No worries, not important [22:56:10] (03PS1) 10Ori.livneh: Add role::statsd; provision on tungsten; grant self access [operations/puppet] - 10https://gerrit.wikimedia.org/r/87636 [22:57:47] (03CR) 10Faidon Liambotis: [C: 032] Add role::statsd; provision on tungsten; grant self access [operations/puppet] - 10https://gerrit.wikimedia.org/r/87636 (owner: 10Ori.livneh) [23:00:11] thanks [23:00:26] I'm force-running puppet on tungsten for a while [23:00:46] done [23:00:47] it's easy to saturate ganglia with too many metrics, since statsd computes a bunch of aggregates by default [23:00:53] which is why i didn't include it as a backend [23:01:33] (in case you were wondering) [23:02:04] include what? [23:02:42] ok, I can modify by hand a ms-fe node, the puppet manifests for swift aren't very good and it's too late to do anything about it [23:02:59] I've been evaluating the puppetlabs swift module and it's not too bad, although ironically has no support for statsd :) [23:04:10] ack? [23:04:22] hrm? [23:04:38] shall I start pushing metrics from ms-fe1001? [23:04:57] oh! it's done-done. that was fast. let me look for a second. [23:05:41] log_statsd_host = tungsten.eqiad.wmnet [23:05:42] log_statsd_port = 8125 [23:05:42] log_statsd_metric_prefix = swift.eqiad.ms-fe1001 [23:06:05] I wonder if I should call it "ms" or "mediastorage" instead of swift [23:06:12] the metrics only make sense for swift though [23:06:33] bah, let's just use that for now and we can reevaluate later [23:07:26] * paravoid is about to hit enter [23:08:29] hang on another moment [23:10:04] ok, hit it [23:10:23] they're coming [23:10:32] they = metrics [23:10:36] I see them with tcpdump [23:10:39] heh [23:12:09] statsd tries to connect to professor's port 0 [23:13:00] I'm guessing we need to define graphitePort [23:13:11] yes, I thought it was set by default, but it's not [23:13:25] patch coming [23:14:19] heh I was half-way into fixing it [23:15:28] (03PS1) 10Ori.livneh: statsd on tungsten: explicitly specify graphitePort [operations/puppet] - 10https://gerrit.wikimedia.org/r/87638 [23:15:41] actually [23:15:43] wait [23:15:54] (03CR) 10Faidon Liambotis: [C: 032 V: 032] statsd on tungsten: explicitly specify graphitePort [operations/puppet] - 10https://gerrit.wikimedia.org/r/87638 (owner: 10Ori.livneh) [23:16:03] well, all right. [23:16:09] er [23:16:10] 2004? 
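The log_statsd_* settings pasted above make the swift proxy emit one UDP datagram per request to tungsten on port 8125, in the plain statsd wire format, so the stream paravoid watched with tcpdump can be reproduced by hand. The metric name below is invented to match the configured prefix.

    # statsd wire format is "<name>:<value>|<type>" over UDP, where the
    # type is c (counter), ms (timer) or g (gauge).
    echo "swift.eqiad.ms-fe1001.proxy-server.GET.200.timing:42|ms" \
        | nc -u -w1 tungsten.eqiad.wmnet 8125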
[23:17:18] 2003 [23:17:45] yeah I figured it out [23:17:47] 2004 is pickle [23:19:00] alright, applied [23:21:06] ok, it pushed to carbon [23:22:15] poor firefox [23:23:50] so, where do I look for the metrics? [23:24:03] which category? [23:24:10] stats [23:24:12] i see it [23:24:14] 'swift' [23:24:24] ah I was looking under statsd [23:24:24] metric type: stats, then find swift in the tree nav [23:24:24] so close [23:25:15] you probably won't have anything much to look at for a little bit [23:25:33] yep, that's what I see [23:25:36] but that's fine [23:25:42] that's nice [23:25:48] that was very easy [23:25:52] shhh [23:26:08] it has to maintain an aura of complexity and mystique [23:26:16] haha [23:26:44] the more interesting data are under timers > swift > ... [23:27:16] timing data [23:28:47] that's kinda weird [23:28:50] the double hierarchy [23:29:16] yes [23:29:19] * ^d fumes at people who make .debs with no source packages. [23:29:32] there's a cottage industry of front-ends to graphite because the default one is so clunky [23:29:34] ^d: case in point? [23:29:49] and each of those front-ends is great in one specific way [23:30:07] <^d> paravoid: hhvm. someone built some 12.04 packages and hosted them in a public apt mirror, but only binary packages, no source. [23:30:41] heh [23:30:44] so the other weird part is [23:31:08] there's almost no object GETs in eqiad [23:31:13] but a shit ton of PUTs [23:32:04] how do you make sense of that? [23:32:13] ;) [23:32:40] no, I mean that's the traffic, I know that [23:32:43] ori-l: building a general purpose metric display system that can actually be used may be unpossible [23:32:45] but the statsd say otherwise [23:33:03] tcpdump on 8125 shows that swift indeed only pushes stats about GET.200 [23:34:15] So no 404s or puts? Lame. [23:35:37] that's not what the config says :) [23:36:44] If you were seeing it in tcpdump but not graphite I'd blame new metric creation rate limiting [23:37:34] carbon tries to keep from swamping disk io by throttling the number of new metrics it creates [23:38:37] When we'd turn up a new app server host at $DAYJOB-1 it was common not to see all the stats in graphite for an hour [23:41:43] weird [23:43:49] Our core app tracked a *lot* of timers. [23:43:58] no, what's happening is weird :) [23:44:00] here [23:50:22] RECOVERY - MySQL disk space on es1003 is OK: DISK OK [23:50:32] RECOVERY - Disk space on es1003 is OK: DISK OK
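Two footnotes on the exchange above. The 2003/2004 distinction: carbon's plaintext listener takes "<metric path> <value> <unix timestamp>" lines on TCP 2003, while 2004 speaks the pickle protocol, so plaintext lines aimed at 2004 (or at port 0 from an unset graphitePort) won't parse. And the new-metric throttling bd808 describes is, in stock carbon, the MAX_CREATES_PER_MINUTE setting in carbon.conf. A hand test of the plaintext port, with a made-up metric name:

    # One plaintext datapoint into carbon (TCP port 2003).
    echo "test.deleteme 1 $(date +%s)" | nc -w1 tungsten.eqiad.wmnet 2003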