[00:01:01] I'd try 0.67.x if I were you [00:01:19] the stable ones are buggy; the intermediate releases are really crappy [00:01:30] * AaronSchulz is on .69 [00:01:52] I noticed [00:02:47] (03PS1) 10Jdlrobson: Update MobileWebClickTracking schema revision [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87485 [00:03:04] Dinner time. [00:03:04] * paravoid looks at the 112M flat line and smiles [00:03:16] * bd808 waves goodnight [00:03:31] paravoid: hm? [00:03:59] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=copper.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [00:04:15] it's gonna go like this for 4-5 days [00:33:51] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 333 seconds [00:34:11] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 352 seconds [00:57:30] (03PS1) 10Faidon Liambotis: swift: remove swiftcleaner [operations/puppet] - 10https://gerrit.wikimedia.org/r/87494 [00:57:31] (03PS1) 10Faidon Liambotis: swift: remove swift::iptables [operations/puppet] - 10https://gerrit.wikimedia.org/r/87495 [00:57:32] (03PS1) 10Faidon Liambotis: swift: remove conditional for lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/87496 [00:57:33] (03PS1) 10Faidon Liambotis: swift: remove swift::utilities [operations/puppet] - 10https://gerrit.wikimedia.org/r/87497 [00:59:45] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swiftcleaner [operations/puppet] - 10https://gerrit.wikimedia.org/r/87494 (owner: 10Faidon Liambotis) [01:00:20] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swift::iptables [operations/puppet] - 10https://gerrit.wikimedia.org/r/87495 (owner: 10Faidon Liambotis) [01:01:02] (03CR) 10Faidon Liambotis: [C: 032] swift: remove conditional for lucid [operations/puppet] - 10https://gerrit.wikimedia.org/r/87496 (owner: 10Faidon Liambotis) [01:01:24] (03CR) 10Faidon Liambotis: [C: 032] swift: remove swift::utilities [operations/puppet] - 10https://gerrit.wikimedia.org/r/87497 (owner: 10Faidon Liambotis) [01:02:02] 9 files changed, 6 insertions(+), 1156 deletions(-) [01:02:05] this is a good day [01:02:09] :) [01:11:47] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 24 seconds [01:12:07] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [01:28:28] !log catrope synchronized php-1.22wmf20/extensions/VisualEditor/ 'Fix SyntaxHighlight icon' [01:28:45] Logged the message, Master [01:30:39] (03PS1) 10Springle: warm up db1038 in s3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87500 [01:31:25] (03CR) 10Springle: [C: 032] warm up db1038 in s3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87500 (owner: 10Springle) [01:32:34] !log springle synchronized wmf-config/db-eqiad.php 'db1038 to s3' [01:32:44] Logged the message, Master [01:53:48] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [02:06:48] (03PS1) 10Springle: depool db1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87501 [02:07:15] (03CR) 10Springle: [C: 032] depool db1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87501 (owner: 10Springle) [02:08:00] !log springle synchronized wmf-config/db-eqiad.php 'depool db1003' [02:08:16] Logged the message, Master [02:11:18] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:15:40] !log upgrading db1003 to 
precise + mariadb [02:15:51] Logged the message, Master [02:16:10] !log LocalisationUpdate completed (1.22wmf19) at Fri Oct 4 02:16:10 UTC 2013 [02:16:22] Logged the message, Master [02:31:49] !log springle synchronized wmf-config/db-eqiad.php 'refresh mw1072 db-eqiad.php' [02:32:00] Logged the message, Master [02:34:08] PROBLEM - Puppet freshness on maerlant is CRITICAL: No successful Puppet run in the last 10 hours [02:34:33] my sync-file db-eqiad.php isn't getting to mw1072 ^ ... where do sync-file / mwdeploy error messages go? [02:34:38] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 02:34:36 UTC 2013 [02:35:00] !log LocalisationUpdate completed (1.22wmf20) at Fri Oct 4 02:35:00 UTC 2013 [02:35:15] Logged the message, Master [02:35:18] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:41:36] (03PS1) 10Springle: db1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87502 [02:41:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 4 02:41:40 UTC 2013 [02:42:29] Logged the message, Master [02:42:57] (03CR) 10Springle: [C: 032] db1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87502 (owner: 10Springle) [02:46:36] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [02:49:46] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:10:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [03:19:49] !log started xtrabackup clone from db1035 to db1003 [03:20:00] Logged the message, Master [03:32:40] !log manual sync-common on mw1072 [03:32:51] Logged the message, Master [03:34:42] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 03:34:35 UTC 2013 [03:35:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:06:47] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:34:07] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 04:34:04 UTC 2013 [04:34:47] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:08:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:14:02] !log on fenari: running an analysis script for bug 53687 against tampa slave DB servers [06:14:16] Logged the message, Master [06:21:19] (03PS1) 10Springle: warm up db1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87511 [06:22:00] (03CR) 10Springle: [C: 032] warm up db1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87511 (owner: 10Springle) [06:23:09] !log springle synchronized wmf-config/db-eqiad.php 'repool db1003 after upgrade' [06:23:24] Logged the message, Master [06:34:45] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 06:34:36 UTC 2013 [06:35:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:03:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 07:03:49 UTC 2013 [07:04:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:00] (03PS1) 10Springle: repool db1035 after upgrade, db1003 to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87514 [07:08:49] (03CR) 
10Springle: [C: 032] repool db1035 after upgrade, db1003 to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87514 (owner: 10Springle) [07:09:37] !log springle synchronized wmf-config/db-eqiad.php 'repool db1035 after upgrade, db1003 to full steam' [07:09:57] Logged the message, Master [07:11:36] !log powercycling maerlant, load > 100, couldn't log in via mgmt console [07:11:47] Logged the message, Master [07:14:15] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri Oct 4 07:14:10 UTC 2013 [07:15:05] RECOVERY - SSH on maerlant is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:15:15] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Fri Oct 4 07:15:05 UTC 2013 [07:35:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 07:35:25 UTC 2013 [07:36:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:49:18] (03PS1) 10Dzahn: add ironholds to role analytics (RT #5831) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 [07:49:45] (03CR) 10jenkins-bot: [V: 04-1] add ironholds to role analytics (RT #5831) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 (owner: 10Dzahn) [07:50:41] (03PS2) 10Dzahn: add ironholds to role analytics (RT #5831) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 [07:52:25] (03PS1) 10Springle: repool db1042 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87517 [07:53:41] (03CR) 10Springle: [C: 032] repool db1042 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87517 (owner: 10Springle) [07:54:21] !log springle synchronized wmf-config/db-eqiad.php 'repool db1042' [07:54:32] Logged the message, Master [07:54:46] (03CR) 10Dzahn: [C: 032] "approved by Howie and Toby and the role memberships enough for now per Otto" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87516 (owner: 10Dzahn) [08:00:49] !log upgrading db1039 to precise + mariadb [08:01:02] Logged the message, Master [08:04:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 08:04:30 UTC 2013 [08:05:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:05:54] (03PS1) 10Springle: db1039 to s6, plus switch to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87518 [08:07:00] (03CR) 10Springle: [C: 032] db1039 to s6, plus switch to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/87518 (owner: 10Springle) [08:18:04] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/85669 (owner: 10Hashar) [08:28:54] (03CR) 10Dzahn: [C: 032] "removes fundraising jenkins. not having an additional install from 3rd party repo is preferable and Jeff requested it because fr is moving" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya) [08:31:16] (03PS1) 10Dzahn: remove misc::fundraising::jenkins from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/87520 [08:32:51] (03CR) 10Dzahn: "just have to also remove it from site.pp or there will be an unknown class. 
change 87520" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya) [08:33:38] good point [08:33:44] (03PS2) 10Dzahn: remove misc::fundraising::jenkins from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/87520 [08:34:28] (03CR) 10Dzahn: [C: 032] "removed in change 86818" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87520 (owner: 10Dzahn) [08:34:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 08:34:26 UTC 2013 [08:35:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:42:27] (03PS1) 10Ori.livneh: Use line (rather than stack) graph type for static assets [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 [08:42:33] ^ siebrand [08:42:55] * siebrand cheers ori-l on! [08:43:18] (03CR) 10Siebrand: [C: 031] "Should be much better :)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 (owner: 10Ori.livneh) [09:04:45] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 09:04:34 UTC 2013 [09:05:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:18:10] (03PS1) 10Dzahn: load mod_expires on planet webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/87523 [09:18:40] But stacks are so colourful! :) [09:20:57] (03CR) 10Dzahn: [C: 032] load mod_expires on planet webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/87523 (owner: 10Dzahn) [09:23:27] (03PS1) 10Dzahn: retab from tabs to 4 spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 [09:23:33] (03CR) 10jenkins-bot: [V: 04-1] retab from tabs to 4 spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 (owner: 10Dzahn) [09:34:33] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 09:34:27 UTC 2013 [09:35:23] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:44:00] (03PS2) 10Dzahn: planet, retab from tabs to 4 spaces, align =>'s, quoting.. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 [09:44:38] \O/ [09:50:20] (03CR) 10Dzahn: [C: 032] planet, retab from tabs to 4 spaces, align =>'s, quoting.. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87524 (owner: 10Dzahn) [10:02:28] (03CR) 10Dzahn: [C: 031] "agreed, lines would look better here: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=Static+assets line is default, righ" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 (owner: 10Ori.livneh) [10:06:58] (03PS1) 10Hashar: doc how to run rspec tests in the rakefile [operations/puppet] - 10https://gerrit.wikimedia.org/r/87531 [10:09:28] (03CR) 10Dzahn: [C: 032] "docs :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87531 (owner: 10Hashar) [10:10:08] mutante: thx [10:10:16] yw [10:25:24] (03CR) 10Dzahn: [C: 032] "fixes RAID checks on several ms-be hosts." [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 (owner: 10ArielGlenn) [10:27:42] path conflict on that? hrmm [10:29:35] (03PS2) 10Dzahn: account for hosts where every disk is raid 0 (e.g. the ms-be hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 (owner: 10ArielGlenn) [10:30:52] (03CR) 10Dzahn: [C: 032] account for hosts where every disk is raid 0 (e.g. 
the ms-be hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 (owner: 10ArielGlenn) [10:32:09] ah, files/icinga -> modules/base/files/monitoring/ [10:33:17] waits for Icinga recoveries on quite a few now .. [10:38:07] RECOVERY - RAID on analytics1012 is OK: OK: No disks configured for RAID [10:38:14] (03CR) 10ArielGlenn: [C: 032] Use line (rather than stack) graph type for static assets [operations/puppet] - 10https://gerrit.wikimedia.org/r/87521 (owner: 10Ori.livneh) [10:38:17] there it starts :) [10:38:34] apergos: see RECOVERY, thanks for that fix, there will be more soon [10:38:43] yw [10:39:06] RECOVERY - RAID on ms-be8 is OK: OK: No disks configured for RAID [10:39:10] glad to see those warnings go [10:39:20] yes !:) [10:40:31] siebrand: ori-l ^ merged [10:40:42] running puppet now [10:41:49] mutante: yay. Better graphs :) [10:42:05] * Nemo_bis waits for puppet [10:43:35] of course we wouldn't need this fix if we didn't have h310s anymore ;) [10:44:03] but nice fix nevertheless, mutante [10:44:18] if you're feeling up to it, the check is also buggy in other ways [10:44:41] for the rest of the ms-be boxes it says "1 logical drive" while they have 14 [10:44:45] paravoid: i have another check in monitoring we could fix or remove ..:) [10:45:04] Swift HTTP on ms-fe* is HTTP WARNING: HTTP/1.1 401 Unauthorized [10:45:22] ms-fe10xx? [10:45:26] still? [10:45:32] yes, 1001 to 1004 [10:45:42] they are WARN though, not crit [10:45:46] hm, this should have been fixed yesterday [10:45:49] lemme look [10:46:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:46:35] command_name check_http_swift [10:46:35] command_line $USER1$/check_http -H ms-fe.pmtpa.wmnet -I $HOSTADDRESS$ -u /wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/80px-Little_kitten_.jpg [10:46:37] paravoid: ariel wrote the actual fix :) [10:46:38] lol [10:46:40] seriously [10:46:41] for check_raid [10:46:46] hehehe, kittens [10:46:48] hardcoded pmtpa [10:46:52] also the kitten, yes [10:47:11] i see, pmtpa yea [10:47:16] RECOVERY - RAID on ms-be3 is OK: OK: No disks configured for RAID [10:47:36] will fix in a moment [10:47:38] thanks for the ping [10:47:46] RECOVERY - RAID on analytics1014 is OK: OK: No disks configured for RAID [10:48:13] cool, sure [10:49:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.634 second response time [10:49:51] hmm I guess that file should not be renamed, worth protecting? [10:50:09] no [10:50:11] I'll change the URL [10:50:18] I have a better way than this [10:50:25] :) [10:50:32] better so [10:51:05] /monitoring/backend to be exact [10:52:12] and then we have those disk space warnings on some cp10xx, but those are large varnish.bigobj2/varnish.main2 using 95% on /srv/sdb3 so they don't change size [10:54:16] RECOVERY - RAID on analytics1011 is OK: OK: No disks configured for RAID [10:55:27] siebrand: reload the graphs :) [10:55:56] RECOVERY - RAID on ms-be10 is OK: OK: No disks configured for RAID [10:56:36] mutante: right. Now impact is visible much more clearly. The stacking didn't make sense. 
[11:00:36] RECOVERY - RAID on analytics1021 is OK: OK: No disks configured for RAID [11:01:16] RECOVERY - RAID on ms-be7 is OK: OK: No disks configured for RAID [11:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [11:05:00] sysctl::parameters is so pretty [11:05:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.350 second response time [11:06:26] RECOVERY - RAID on ms-be6 is OK: OK: No disks configured for RAID [11:10:34] mutante: are you familiar with lucene at all? [11:11:21] [FAILED] jawiki.spell [11:11:23] and all ja* [11:11:28] [FAILED] zhwiki.spell [11:11:30] and all zh* [11:16:04] paravoid: no, i just added a check to find the string FAILED on those status pages so it tells us when indexing didnt work, but i dont know why it fails :p [11:16:27] ha [11:16:30] ok :) [11:16:34] I guess noone does [11:17:00] sigh, yea [11:17:06] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=stafford.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+pmtpa [11:17:09] puppet!~ [11:17:20] i was just wondering where check_dpkg actually gets installed from now that monitoring is in base module [11:17:31] i see check_raid , but where is check_dpkg [11:17:52] /puppet/modules/base/files/monitoring [11:18:53] looks like we use it but dont install it anymore [11:19:27] mutante: look at modules/nrpe/files and modules/nrpe/manifests/init.pp [11:20:14] akosiaris: thanks, got it. would you expect check-raid.py to be in the same place then ? [11:20:36] confused by base/files/monitoring and nrpe/files/ [11:20:57] I am working on clearing all that up [11:21:00] (03PS1) 10Ori.livneh: Parsoid frontends VCL: strip X-Parsoid-Performance from cache hits [operations/puppet] - 10https://gerrit.wikimedia.org/r/87535 [11:21:06] for now place it alongside check_dpkg please [11:21:08] for some reason host erzurumi never got this file .. but has the nrpe command [11:21:19] -bash: /usr/local/lib/nagios/plugins/check_dpkg: No such file or directory [11:21:22] that's why i was looking [11:21:45] akosiaris: alright :) [11:22:49] mutante: thanks for the ganglia merge [11:24:45] sure, thank ariel [11:25:25] mutante: hardy??? erzurumi is hardy... sigh... [11:25:37] oooh :) [11:25:39] kiiill it [11:25:42] well, then .. [11:25:58] immediately stops trying to fix then:) [11:25:59] err: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate: odd number of arguments for Hash [11:26:19] erzurumi is fundraising leftover, i guess it moves to frack [11:26:19] and it hasn't even asked for the catalog yet... 
and we should ping [11:26:32] jeff [11:26:49] hardy boxes do that but it's harmless [11:28:50] RT #5896 - kill erzurumi :p [11:34:29] (03PS1) 10Faidon Liambotis: miredo: fix typo in source => caused by conversion [operations/puppet] - 10https://gerrit.wikimedia.org/r/87537 [11:34:55] (03CR) 10Faidon Liambotis: [C: 032 V: 032] miredo: fix typo in source => caused by conversion [operations/puppet] - 10https://gerrit.wikimedia.org/r/87537 (owner: 10Faidon Liambotis) [11:39:32] (03PS1) 10Dzahn: move check-raid.py from base/files/monitoring/ to nrpe/plugins/ so that it's in the same place with check_dpkg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 [11:40:21] (03CR) 10jenkins-bot: [V: 04-1] move check-raid.py from base/files/monitoring/ to nrpe/plugins/ so that it's in the same place with check_dpkg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 (owner: 10Dzahn) [11:40:50] oh, heh [11:40:53] https://integration.wikimedia.org/ci/job/operations-puppet-pep8/3787/violations/ [11:41:13] that check didn't run in the former place [11:42:22] root@wtp1001:~# /usr/bin/MegaCli64 -LDInfo -LALL -aALL [11:42:23] Exit Code: 0x00 [11:42:27] empty output [11:42:31] so the new code doesn't catch this [11:43:31] this is so broken [11:43:48] yeah it has that on the rdbs too [11:43:50] ah, yea that's what apergos said earlier today [11:43:55] it doesn't see the controllers at all there [11:44:21] maybe they don't have controllers at all [11:44:33] controller count: 0 [11:44:35] root@ms-be1001:~# check-raid.py [11:44:36] OK: State is Optimal, checked 1 logical device(s) [11:44:39] that's also ridiculous :) [11:44:44] the rdb ones claim to have h310s [11:45:18] print 'OK: State is %s, checked %d logical device(s)' % (state, numDrives) [11:45:20] root@rdb1001:~# lspci -vv | grep -i perc [11:45:20] Subsystem: Dell PERC H310 Mini Monolithics [11:45:21] lol [11:45:23] so [11:45:31] numDrives is actually the number of *physical* disks [11:45:43] in the last logical device [11:46:27] (03CR) 10Dzahn: "the jenkins fail is just "line too long" in the python script and it didn't run before the move" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87538 (owner: 10Dzahn) [11:53:46] I think I fixed it [11:54:36] OK: State is Optimal, checked 14 logical drive(s), 14 physical drive(s) [11:54:57] :) [11:59:32] (03PS1) 10Dzahn: add a .pep8 file to suppress warnings and jenkins fails due to stuff like long lines, this is just copied from modules/base/files/monitoring where it was done before [operations/puppet] - 10https://gerrit.wikimedia.org/r/87541 [12:00:40] (03CR) 10Dzahn: "reason: jenkins fail in Change-Id: I5bad1070ff40261a09b6bee696355ba17731dc14" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87541 (owner: 10Dzahn) [12:07:32] mutante: are you moving the check? [12:08:16] eh, sure, so far just added Alex to review because he's in the middle of moving it [12:08:28] it's V -1 [12:08:40] that would be fixed by the next change above there [12:08:45] oh [12:08:53] module/nrpe doesnt ignore pep8 long lines yet [12:09:01] but module/base/files/monitoring does [12:09:07] so it didn't show up before [12:09:28] the long lines? 
if you wish:) [12:09:36] I'm hacking on it anyway [12:09:55] ok, cool, just copied behaviour from the base module without questioning it much [12:12:21] !log Inserted 2x 10G MIC into cr1-esams FPC 0, brought online [12:12:32] Logged the message, Master [12:14:31] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:18:44] paravoid: want me to merge the change that moves it, overriding jenkins for now to make it easier? [12:21:51] mutante: puppet on erzurumi fails so badly on 4-5 things (/etc/sysctl.d, sudo, nrpe, httpy, apt::pin) that I am inclined to say "better to kill it than fix it" [12:23:14] akosiaris: oh yea, i already created an RT to kill it, just need to ask Jeff to confirm [12:24:31] alternative: do-release-upgrade or edit apt sources and upgrade without reinstall ..shrug [12:24:53] yeah....... [12:25:03] reinstall seems faster [12:26:08] yea, just no idea about downtime and if it's all puppetized .. role "ActiveMQ instance" [12:26:51] i expect it's most likely that Jeff will say this will be setup freshly in the frack rack [12:27:04] and that we just need an OK and then remove it completely [12:30:27] akosiaris: move check-raid: change 87538 (i know it uses the file from another module then, but you are changing that anyways) and the reason it fails would be because it doesnt ignore pep8 yet, 87541 , though Faidon said he'll fix the long lines anyways now .. [12:30:55] so dunno if you wanna abandon the second one, fine with me too [12:33:31] mutante: the pedantic person in me says don't suppress warnings... fix them :-) [12:34:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 12:33:51 UTC 2013 [12:34:10] so let's wait for Faido [12:34:13] Faidon* [12:34:24] hey [12:34:31] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:34:33] (03Abandoned) 10Dzahn: add a .pep8 file to suppress warnings and jenkins fails due to stuff like long lines, this is just copied from modules/base/files/monitoring where it was done before [operations/puppet] - 10https://gerrit.wikimedia.org/r/87541 (owner: 10Dzahn) [12:34:37] ok:) [12:35:31] tbh, I'm not too excited to see all kinds of random checks inside the nrpe module [12:35:47] I'd like to see the checks be near the modules that use them, rather than in a central place [12:35:53] for raid this means base I guess [12:36:07] thought it would be modules/icinga/plugins/ [12:36:16] yeah or even that [12:36:20] they are just executed via nrpe [12:36:25] but they are still normal plugins [12:36:36] mark, hi, did you have any luck with ESI issue? [12:36:37] they should be virtual and collected by nrpe [12:36:38] so, for example, check-raid may have a dependency on some python library and accompanied by a package resource [12:36:42] and an addition to the plugins from the distro package [12:36:52] and labs doesn't need check-raid at all [12:36:53] realized might be a better name [12:37:06] having a $::realm check inside the icinga or nrpe modules sounds wrong to me [12:37:44] so... a define and nrpe realizing these things ? [12:38:00] you mean a nrpe::plugin or something like that? [12:38:07] yes [12:38:22] yup, that sounds reasonable to me [12:38:38] ok [12:38:43] will do [12:38:49] but not now [12:38:49] why separate nagios plugins by location they're being run [12:38:56] ? 
[12:39:06] some plugins are executed on the monitoring host [12:39:15] and some are on the remote hosts, via nrpe [12:39:30] some maybe both [12:39:34] but besides that, they are all just nagios/icinga plugins [12:39:43] that is also true [12:40:00] so, maybe icinga::plugin, sure [12:45:01] so the latest megacli from lsi (which has to be manually extracted from the rpm) sees the controller on the rdb boxes [12:45:13] we need an updated megacli anyway [12:45:19] to come with libsysfs too [12:45:24] same applies to the ms* boxes [12:45:30] where can I put this? [12:45:46] we have a package called wikimedia-raid-utils [12:45:47] but [12:46:09] maybe we should just use http://hwraid.le-vert.net/wiki/DebianPackages [12:46:34] what's the latest version you found? [12:46:56] I have 8.04.53 on my disk [12:46:57] 8.05.06 [12:47:03] that site has 8.04.07-1 [12:48:44] I can't imagine a minor version change would make the difference, and it would be nice to have the packaging and updating handled for us [12:48:59] it does make a difference sometimes [12:49:00] happy to test it [12:49:03] ergh [12:49:29] I remember having a segfault when trying to discard some preserved cache [12:49:42] and the version we have in wikimedia-raid-utils didn't have the command at all [12:49:48] the latest version segfaulted [12:49:55] and I went 3 or 4 versions behind and it worked [12:49:58] it was a fun ride :) [12:50:01] great [12:50:07] so here's what I propose [12:50:45] add the repo & key above to files/misc/reprepro/updates [12:51:38] use the grep construct to pick just megacli and perhaps megarc & megamgr (also shipped by wikimedia-raid-utils, although I'm unsure if it's still used anywhere) [12:51:59] grep for wikimedia-raid-utils and ensure => absent it, and add megacli ensure => present [12:52:02] makes sense? [12:53:09] yes but I am going to test the specific version first before proceeding [12:53:34] also want to see what else might be in the wikimedia-raid-utils package if anything [12:54:03] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [12:54:54] there's megarc, megamgr, arcconf and tw_cli [12:57:01] (03PS1) 10Faidon Liambotis: base: fix check-raid to handle no or multiple LDs [operations/puppet] - 10https://gerrit.wikimedia.org/r/87548 [12:57:13] apergos: a review would be most welcome :) [12:57:13] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:57:32] ok, gimme two secs [12:58:14] 5.3T in swift eqiad already [12:58:16] sweet [13:04:49] meh the packaged megacli64 wants libsysfs2.0.1 (or maybe 2.0.2), precise has 2.1.0 [13:04:52] so that's a fail [13:06:35] the version from lsi's web site did not need libsysfs [13:09:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:10:41] no [13:10:48] it embeds 2.0.1 [13:10:58] and it's needed [13:11:10] it dlopens() it and it's needed for some configurations only [13:11:52] dpkg -L it, you'll find it installs libsysfs as well and a wrapper to preload it [13:12:50] the lsi version did not need it for MegaCli64 -LDInfo -LALL -aALL but the debian package version did [13:13:16] same box? [13:13:24] is WAP dead yet or not [13:13:27] same box. [13:13:36] hm [13:13:40] but the package worked as is, right? 
[13:13:57] I have not installed the package as a deb yet [13:14:04] I just put the extracted binary up first [13:14:05] root@ms-be1001:~# dpkg --contents megacli_8.04.07-1_amd64.deb |grep libsys [13:14:08] -rw-r--r-- root/root 93370 2012-08-21 21:16 ./usr/lib/megacli/libsysfs.so.2.0.2 [13:14:43] -rw-r--r-- root/root 93370 2012-08-21 21:16 ./usr/lib/megacli/libsysfs.so.2.0.2 [13:14:46] ser [13:14:49] er [13:14:50] opt/MegaRAID/MegaCli [13:14:51] opt/MegaRAID/MegaCli/MegaCli [13:14:51] opt/MegaRAID/MegaCli/MegaCli64 [13:14:51] opt/MegaRAID/MegaCli/libstorelibir-2.so.13.05-0 [13:14:53] opt/opt/lsi/3rdpartylibs/x86_64/libsysfs.so.2.0.2 [13:14:57] opt/opt/lsi/3rdpartylibs/LGPLLicenseV2.txt [13:14:58] opt/opt/lsi/3rdpartylibs/libsysfs.so.2.0.2 [13:15:08] is what I see on the 8.04.53 rpm [13:15:20] yes, the rpm I have has this library [13:15:22] however [13:15:49] anyway, it doesn't matter [13:15:50] that library is not needed for this command, for whatever reason [13:15:53] doesn't matter what precise has [13:16:01] the deb embeds the version that the binary requires [13:16:04] and provides a wrapper to load it [13:16:16] the deb from hwraid's sid works on precise, I tested it [13:16:17] so that's fine [13:16:43] all righty then [13:18:04] iirc, LSI progressively uses more libsysfs functionality with each version [13:18:33] shall I merge 87548 ? [13:19:47] (03CR) 10Faidon Liambotis: [C: 032] base: fix check-raid to handle no or multiple LDs [operations/puppet] - 10https://gerrit.wikimedia.org/r/87548 (owner: 10Faidon Liambotis) [13:20:08] in the case where all it prints is the exit code [13:20:17] this will accept that as 'ok'.... right? [13:20:28] right [13:21:27] if we had that in place we would never have known about the issue with the version of megacli we have not working with the rdb and other hosts [13:22:13] so I would prefer that blank lines followed by the exit code give us a warning [13:22:27] hm [13:22:34] I wonder if there are cases where this is valid [13:23:14] so, if you run megacli on a box that has no LSI controller [13:23:18] this is what you get [13:23:40] RECOVERY - RAID on terbium is OK: OK: No disks configured for RAID [13:23:54] the check is so broken [13:24:09] it assumes you only have one raid variant... [13:24:16] so ms-be boxes have both mdadm & megacli [13:24:41] getLinuxUtility should at least know not to run megacli on a non lsi controller host [13:25:23] so that would, if it is not buggy, address the one concern [13:27:38] is this the one we're using ? http://exchange.nagios.org/directory/Plugins/System-Metrics/Storage-Subsystem/Raid-Check-using-megaclisas/details [13:28:13] eh, "remotely check" with HOSTNAME/USERNAME? really [13:28:56] (03PS1) 10Faidon Liambotis: base: warn on megacli unknown controllers [operations/puppet] - 10https://gerrit.wikimedia.org/r/87554 [13:29:30] RECOVERY - RAID on rdb1001 is OK: OK: No disks configured for RAID [13:29:35] nevermind, just a random attempt to search nagios exchange for megacli if there is newer stuff [13:29:40] apergos: ^ ? 
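A minimal sketch of the guard apergos suggests above: probe for an LSI controller (the same lspci check quoted from rdb1001) before invoking MegaCli at all. The real fix landed inside check-raid.py itself (change 87554, seen below); this standalone shell version is illustrative only, not the deployed code.

    # Hypothetical pre-check; exit codes follow the Nagios plugin
    # convention (0=OK, 1=WARNING, 2=CRITICAL).
    if ! lspci | grep -qi -e megaraid -e perc; then
        echo "WARNING: No known controller found"
        exit 1
    fi
    /usr/bin/MegaCli64 -LDInfo -LALL -aALL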
[13:30:04] yep [13:30:07] lgtm [13:30:23] root@rdb1001:~# python check-raid.py [13:30:24] WARNING: No known controller found [13:30:43] as it should be [13:31:01] (03CR) 10Faidon Liambotis: [C: 032] "(spotted by apergos)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87554 (owner: 10Faidon Liambotis) [13:31:21] cool [13:31:23] let's fix megacli now :) [13:31:37] will do in 5 mins, multitasking [13:31:49] I can do it too if you don't mind [13:32:47] up to you, happy to do it [13:35:20] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 13:35:10 UTC 2013 [13:36:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:56:18] (03PS1) 10Faidon Liambotis: Replace wikimedia-raid-utils by up-to-date megacli [operations/puppet] - 10https://gerrit.wikimedia.org/r/87558 [13:56:37] anyone wants to have a look? [13:57:36] you were much faster than me [13:57:51] I was still adding the repo information to the updates file [13:58:06] oh sorry, I thought I took the lock :/ [13:58:16] no worries [14:00:13] are you reviewing it? [14:00:31] I am looking at it right now [14:00:34] yep [14:00:50] k [14:02:54] my only concern is that the other utils tw_cli arcconf etc are referenced in the check_raid script and we don't replace them when we remove the wikimedia-raid-utils package [14:03:05] do we actually use them? [14:03:34] if check-raid warns, we can install them [14:03:53] we can also use a fact for this and install them on demand [14:04:06] but I'd prefer not pulling all kinds of non-free crap into our repo unless we actually use them [14:04:18] if it was regular free software I might not have minded that much [14:04:33] but since it's binaries made by who knows who and where, let's not fetch them for no reason [14:04:35] I like the idea of the fact [14:04:52] yeah I already hate that we have this random lsi binary on there [14:05:09] other than that? [14:06:00] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 14:05:59 UTC 2013 [14:06:11] think it's ok [14:06:32] C+1? [14:06:40] ah sure, sorry :-D [14:07:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:07:48] (03CR) 10ArielGlenn: [C: 031] "we'll see if the other utils from the wikimedia-raid-utils packages are actually needed, if so they can be added later" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87558 (owner: 10Faidon Liambotis) [14:07:49] (03CR) 10Faidon Liambotis: [C: 032] Replace wikimedia-raid-utils by up-to-date megacli [operations/puppet] - 10https://gerrit.wikimedia.org/r/87558 (owner: 10Faidon Liambotis) [14:08:30] if puppet wasn't at > 100%, we might have that in 30' across the infra [14:08:53] eventually [14:21:51] ok, I think it'll fail for the first run now [14:22:01] and we're going to have check raid nagios fun for 30+ minutes [14:23:58] (03PS1) 10Faidon Liambotis: check-raid.py: switch to new megacli binary name [operations/puppet] - 10https://gerrit.wikimedia.org/r/87564 [14:24:10] (03CR) 10Faidon Liambotis: [C: 032] check-raid.py: switch to new megacli binary name [operations/puppet] - 10https://gerrit.wikimedia.org/r/87564 (owner: 10Faidon Liambotis) [14:24:45] ah that's why [14:24:51] (03CR) 10Faidon Liambotis: [V: 032] check-raid.py: switch to new megacli binary name [operations/puppet] - 10https://gerrit.wikimedia.org/r/87564 (owner: 10Faidon Liambotis) [14:24:59] I'm not sure if I hate puppet or jenkins more [14:25:07] why choose? 
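A sketch of the "use a fact for this and install them on demand" idea floated above: facter (1.7+) picks up executable external facts from /etc/facter/facts.d/, so a short script can report which RAID utility a host would need and puppet can key the package installs off the resulting fact. The fact name, path and detection heuristics here are assumptions for illustration, not what was deployed.

    #!/bin/sh
    # /etc/facter/facts.d/raid_utility.sh (hypothetical name and path)
    # Prints key=value; facter then exposes it as $::raid_utility.
    if lspci | grep -qi megaraid; then
        echo "raid_utility=megacli"
    elif lspci | grep -qi adaptec; then
        echo "raid_utility=arcconf"
    elif lspci | grep -qi 3ware; then
        echo "raid_utility=tw_cli"
    elif grep -qs '^md' /proc/mdstat; then
        echo "raid_utility=mdadm"
    fi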
:-D [14:26:30] RECOVERY - RAID on wtp1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:26:36] yay [14:27:10] good [14:27:17] indeed [14:27:43] ok, so let's wait an hour for everything to settle down [14:28:07] root@ms-be6:~# check-raid.py [14:28:07] OK: No disks configured for RAID [14:29:15] btw, I don' [14:29:36] bah nevermind [14:29:43] ? [14:30:59] WARNING: arcconf returned exit status 1 [14:31:01] professor! [14:31:03] look at that [14:31:04] bahhh [14:34:13] Number Of Drives per span:2 [14:34:13] Span Depth : 6 [14:34:16] does that mean 12 disks? [14:34:18] probably [14:34:21] so I'm miscounting disks [14:34:26] root@ms1002:~# check-raid.py [14:34:26] OK: State is Optimal, checked 5 logical drive(s), 10 physical drive(s) [14:34:38] it counts LD properly, but PDs not [14:34:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 14:34:44 UTC 2013 [14:34:55] it works on both adapters though :) [14:35:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:35:20] still much better than it was [14:37:15] if email body contains "pls do needful" then mail is from India [14:37:22] notice: Finished catalog run in 202.45 seconds [14:37:22] root@rdb1001:~# check-raid.py [14:37:22] OK: State is Optimal, checked 2 logical drive(s), 4 physical drive(s) [14:37:28] good [14:41:20] RECOVERY - RAID on solr2 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:41:40] RECOVERY - RAID on wtp1003 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:41:40] RECOVERY - RAID on solr3 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:41:49] here they come [14:42:10] RECOVERY - RAID on wtp1004 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:42:17] yup [14:42:20] RECOVERY - RAID on solr1003 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:43:00] RECOVERY - RAID on rdb1002 is OK: OK: State is Optimal, checked 2 logical drive(s), 4 physical drive(s) [14:47:00] RECOVERY - RAID on solr1002 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:47:30] RECOVERY - RAID on solr1 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:50:10] RECOVERY - RAID on wtp1002 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:50:11] RECOVERY - RAID on erbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:51:00] (03PS1) 10Faidon Liambotis: check-raid: readd support for arcconf (Adaptec) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87570 [14:51:22] !request help Thehelpfulone [14:51:40] RECOVERY - RAID on labsdb1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [14:51:43] (03CR) 10Faidon Liambotis: [C: 032] check-raid: readd support for arcconf (Adaptec) [operations/puppet] - 10https://gerrit.wikimedia.org/r/87570 (owner: 10Faidon Liambotis) [15:03:53] !log Completely disconnected csw1-esams from the network [15:04:05] Logged the message, Master [15:04:10] (03PS1) 10Faidon Liambotis: base: cleanup sudo definitions for check-raid.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/87574 [15:04:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 15:04:38 UTC 2013 [15:04:48] (03CR) 10Faidon Liambotis: [C: 032] base: cleanup sudo definitions for check-raid.py [operations/puppet] - 
10https://gerrit.wikimedia.org/r/87574 (owner: 10Faidon Liambotis) [15:05:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:05:43] omg how unbelievably shitty [15:06:12] (03PS2) 10Faidon Liambotis: base: cleanup sudo definitions for check-raid.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/87574 [15:06:21] (03CR) 10Faidon Liambotis: [C: 032 V: 032] base: cleanup sudo definitions for check-raid.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/87574 (owner: 10Faidon Liambotis) [15:06:30] RECOVERY - RAID on labsdb1003 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:08:00] RECOVERY - RAID on labsdb1002 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:09:42] anyone knows why wtp1008 doesn't run parsoid [15:12:41] paravoid: My shell access works now. Thanks for your help with that. [15:19:00] RECOVERY - RAID on solr1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:25:30] bd808: np [15:27:50] I love how ocwiki has 190k jobs in the jobqueue [15:28:02] and, well, enwiki 3.8 million [15:28:06] with our threshold being 10k [15:30:18] we need domas to slap oc.wiki again [15:32:18] It will probably peak even worse soon, as with last VisualEditor deployement? [15:34:25] are we at the point when a single runner can't handle enwiki? [15:35:13] !log Added another 10G link to cr1-esams:ae1 <--> csw2-esams:ae1, now 40Gbps total [15:35:14] Single? Wasn't it just doubled [15:35:28] Logged the message, Master [15:35:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 15:35:30 UTC 2013 [15:36:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:40:10] RECOVERY - RAID on stafford is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [15:41:30] PROBLEM - DPKG on labstore4 is CRITICAL: Connection refused by host [15:43:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:44:30] RECOVERY - DPKG on labstore4 is OK: All packages OK [15:44:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.675 second response time [15:44:31] PROBLEM - Disk space on cp1061 is CRITICAL: Connection refused by host [15:46:09] (03PS1) 10Faidon Liambotis: Remove labstore4 & labstore1002 from decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/87582 [15:46:28] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Remove labstore4 & labstore1002 from decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/87582 (owner: 10Faidon Liambotis) [15:48:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:48:53] ok, no RAID warn/crit at all anymore :) [15:49:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 17.702 second response time [15:54:35] (03CR) 10Ori.livneh: "I've argued before that I think git-deploy's Puppet resources should allow software projects to have their deployment configuration alongs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane) [15:57:30] (03PS4) 10Umherirrender: $wgCaptchaWhitelist: whitelist also links with query or anchor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 [16:00:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds 
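The physical-drive miscount chased above (a span of 2 drives with span depth 6 is 12 physical drives, not 2) boils down to multiplying two fields of the -LDInfo output per logical drive. A rough awk rendering of the corrected arithmetic, using only the field names quoted in the log; MegaCli output varies between firmware revisions, so treat this as a sketch rather than the actual check-raid.py logic.

    /usr/bin/MegaCli64 -LDInfo -LALL -aALL | awk -F: '
        /^Virtual Drive/   { lds++ }           # one block per logical drive
        /Number Of Drives/ { span = $2 }       # per-span count (or total, single-span)
        /^Span Depth/      { pds += span * $2 }
        END { printf "checked %d logical drive(s), %d physical drive(s)\n", lds, pds }'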
[16:07:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 16:07:47 UTC 2013 [16:08:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:09:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.971 second response time [16:13:08] scapping... [16:15:11] !log maxsem Started syncing Wikimedia installation... : https://gerrit.wikimedia.org/r/87584 [16:15:23] Logged the message, Master [16:25:10] RECOVERY - Host mw1125 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [16:27:10] PROBLEM - twemproxy process on mw1125 is CRITICAL: Connection refused by host [16:27:20] PROBLEM - DPKG on mw1125 is CRITICAL: Connection refused by host [16:27:20] PROBLEM - Disk space on mw1125 is CRITICAL: Connection refused by host [16:27:30] PROBLEM - RAID on mw1125 is CRITICAL: Connection refused by host [16:29:25] !log aaron synchronized php-1.22wmf20/includes/filerepo/file/LocalFile.php 'b48403885d5ac993cb3b3ce7ac7580002ab12e1f' [16:29:39] Logged the message, Master [16:30:09] mhm, my scap is hanging [16:30:45] !log maxsem Finished syncing Wikimedia installation... : https://gerrit.wikimedia.org/r/87584 [16:30:55] phew [16:30:57] Logged the message, Master [16:33:35] MaxSem: you made me hold my breath for a second while reading scrollback [16:33:50] PROBLEM - Host mw1125 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:10] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 16:36:06 UTC 2013 [16:37:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:39:20] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 90.19 ms [16:39:30] RECOVERY - Varnish HTTP upload-backend on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.181 second response time [16:40:02] MaxSem: thanks again; fix confirmed [16:41:20] PROBLEM - Varnish HTTP upload-frontend on cp3004 is CRITICAL: Connection refused [16:41:30] PROBLEM - Varnish traffic logger on cp3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [16:42:14] ori-l: g'morning [16:42:44] greg-g: morning [16:42:54] have a good nap? [16:42:54] ori-l: you're supposed to go crash! [16:42:58] *were [16:43:05] maybe he power nap'd? [16:43:19] more like tried to nap for 20m, then say 'fuck it', and get back on IRC? [16:44:07] * bd808 believes that ori-l can go 2 weeks without sleep [16:44:43] ok, enough [16:44:49] this is a puppet freshness channel [16:44:52] keep things topical [16:44:53] [16:45:06] ignoring that bot was like, the 95th best decision of my life [16:45:09] * Nemo_bis feels tropical [16:45:21] 95th or 95th percentile? [16:45:29] 95th [16:45:33] hm [16:45:43] you made lots of good decisions in your life, congrats [16:45:49] 94th was ignoring one particular person, and so was 93rd :P [16:46:22] not very helpful if you still remember them :P [16:47:03] Nemo_bis: only as part of a long list of good decisions :) [16:47:20] RECOVERY - Varnish HTTP upload-frontend on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.182 second response time [16:47:23] (03CR) 10Anomie: [C: 031] "Tested now, works as advertised. Maybe Reedy can deploy it on Monday." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83225 (owner: 10Umherirrender) [16:47:30] RECOVERY - Varnish traffic logger on cp3004 is OK: PROCS OK: 2 processes with command name varnishncsa [16:47:42] ugh, forgot to confirm with sean pringle what his plans are re db migrations next week [16:47:47] anyone in here happen to know? [16:48:00] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [16:48:43] * bd808 follows YuviPanda's lead and /ignores icinga-wm [16:49:32] censors! [16:51:10] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:44] * YuviPanda ignores icinga-wm for greg-g [17:03:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 17:03:43 UTC 2013 [17:04:01] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:06:48] ignore icinga-wm why? [17:14:04] who fixed puppet and how? :) [17:15:35] greg-g: might know, depends what migration [17:15:50] apergos: master db rotation [17:16:37] hm nope, I know of two other things but [17:16:43] not that one [17:17:27] heh, what are those two other things, apergos ? [17:18:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection refused [17:18:10] one is some things moving to mariadb [17:18:30] and another is continued slow conversion to file-per-table [17:18:44] not in the loop about masterdb rot though [17:19:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:20:16] RECOVERY - DPKG on db9 is OK: All packages OK [17:21:50] apergos: hmm, either of those tow you mention require read-only on the msaters? [17:22:58] nope, slaves only just now [17:23:05] RECOVERY - DPKG on erzurumi is OK: All packages OK [17:23:32] (03PS1) 10Akosiaris: Cleanup swift monitor_service entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/87598 [17:26:37] also, wowie typos [17:30:09] (03CR) 10Marco: [C: 031] Allow Commons admins self-adding translationadmin group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86366 (owner: 10Rillke) [17:34:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Oct 4 17:34:32 UTC 2013 [18:34:23] (03CR) 10GWicke: "I don't think this will work as the front-ends do not cache at all. IMO it is better to do this client-side." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/87535 (owner: 10Ori.livneh) [18:40:10] (03PS1) 10Faidon Liambotis: autoinstall: add erzurumi, raid1-1partition.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87609 [18:40:56] (03CR) 10Faidon Liambotis: [C: 032 V: 032] autoinstall: add erzurumi, raid1-1partition.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/87609 (owner: 10Faidon Liambotis) [18:41:45] (03Abandoned) 10Ori.livneh: Parsoid frontends VCL: strip X-Parsoid-Performance from cache hits [operations/puppet] - 10https://gerrit.wikimedia.org/r/87535 (owner: 10Ori.livneh) [18:44:44] !log reinstalling erzurumi [18:45:02] Logged the message, Master [18:46:03] PROBLEM - Host erzurumi is DOWN: PING CRITICAL - Packet loss = 100% [18:51:12] RECOVERY - Host erzurumi is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [18:53:22] PROBLEM - RAID on erzurumi is CRITICAL: Connection refused by host [18:53:42] PROBLEM - Disk space on erzurumi is CRITICAL: Connection refused by host [18:53:52] PROBLEM - SSH on erzurumi is CRITICAL: Connection refused [18:54:02] PROBLEM - DPKG on erzurumi is CRITICAL: Connection refused by host [18:55:33] * johnbender waves at Krinkle [18:55:43] johnbender: so roughly, it's a case of the musical chairs game [18:55:56] depool, upgrade, repool, for each of the slaves [18:56:00] and eventually master as well [18:56:02] and then repeat [18:56:05] and repeat [18:56:13] until the replag is small enough to do it live [18:56:42] Krinkle: I'm curious about what specifically is done to migrate the schema at the syntactic or tooling level [18:56:47] the first run could take days depending on the kind of schema change [18:56:55] Krinkle: yah [18:57:04] Krinkle: that's still very useful info [18:57:04] paravoid: AaronSchulz: Fill in here :) johnbender wants to know how we do schema changes [18:57:44] RobH: Reedy: also, maybe. not sure who knows it. y'all know it better than me for sure. [18:57:52] I'm curious about how you guys write/build your migrations [18:58:17] johnbender: it depends on the schema we're modifying [18:58:25] in production we do it by hand [18:58:32] Ryan_Lane: so SQL DDL and DML? [18:58:35] some schema migrations can take a very long time [18:58:44] you probably want to talk to springle-away [18:58:45] I can imagine [18:58:55] * johnbender makes note [18:58:56] RoanKattouw_away may be able to help, too [18:59:28] johnbender: example of a schema change https://github.com/wikimedia/mediawiki-core/commit/9c40037b0077b772ddd5691824de4a26fe8fc29a [18:59:49] note, we don't use the Updater.php in production though, but we do use the patch.sql files [19:00:17] Krinkle: Ryan_Lane: what about alters or drops [19:01:09] afaik the same [19:02:01] software upgrade (e.g. the web app) is usually first one to go, and by design has to be compatible with the current (previous) schema. [19:02:19] Krinkle: alright so I should be able to root around in that patches directory to find interesting things! [19:02:41] though I think for some changes where we can't or don't want to be backcompat in software in the past, we'd do the schema change first and then deploy the software update. though that is rare I think [19:03:00] Krinkle: this is useful context [19:03:08] johnbender: Yep, that should be a good starting point. Feel free to ask back here anytime. 
As well as the wikitech-l mailing list [19:03:34] https://lists.wikimedia.org/mailman/listinfo/wikitech-l [19:03:38] https://www.mediawiki.org/wiki/Mailing_lists * [19:04:02] Krinkle: Ryan_Lane: thank you both so much [19:04:17] Krinkle: do you know how a given migration knows which patche scripts to use? [19:04:21] *patch [19:04:52] RECOVERY - SSH on erzurumi is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:05:52] PROBLEM - NTP on erzurumi is CRITICAL: NTP CRITICAL: No response from NTP server [19:10:03] johnbender: okay, this is grey area for me, but I would guess we use the core/maintenance/sql.php script, which we pass --wiki (one of the 800+ wiki database names, of which en.wikipedia.org is 1), and the path to the sql patch [19:10:07] I haven't done or seen anyone do that, but that's my guess given the puzzle pieces I have [19:10:50] I suppose we'd use at least some scripting as to not have to do it 800+ times and for each of the db slaves. [19:11:34] probably in groups of N number of wikis at a time. e.g. by db cluster segment (all wikis are in one of 7 db cluster segments) [19:11:42] hm.. I guess I know it better than I thought [19:12:48] all wikis: https://github.com/wikimedia/operations-mediawiki-config/blob/master/all.dblist db config: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/db-eqiad.php [19:14:43] johnbender: So which migration tool is it you talked about in the e-mail about the UCLA paper? [19:18:38] johnbender: we usually use OSC by percona [19:19:08] unless the table has no primary key, in which case you need the musical chairs [19:20:17] Krinkle: the unfortunately named PRISM [19:21:08] AaronSchulz: do you do any verification beforehand other than testing with a staging cluster or similar? [19:21:29] Krinkle: per usual it's not in wide use but the paper is pretty impressive [19:21:44] it includes nice things like query rewriting etc [19:24:40] cool, so we don't use the musical chairs always. that's good to know. [19:24:58] AaronSchulz: Do you know how far back our use of that dates? [19:25:03] (OSC) [19:27:01] over a year, asher was the one who started it [19:27:40] AaronSchulz: Did we have something else before, or was it all manual before that (musical chairs or just live query) [19:31:58] all musical chairs except for the logging table PK addition, which was tim manually doing something kind of like OSC [19:32:19] since you had to ensure consistent IDs across slaves [19:32:28] that was fun... 
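Krinkle flags the per-wiki mechanics above as a guess, so the following is doubly hedged: a sketch of what "some scripting" over all.dblist could look like, using the mwscript wrapper that runs a core maintenance script against one wiki. The patch filename is hypothetical and this is not a confirmed production procedure.

    # Apply one patch.sql to every wiki in all.dblist, one db at a time;
    # in practice this would likely be batched per db cluster segment.
    for db in $(cat all.dblist); do
        mwscript sql.php --wiki="$db" maintenance/archives/patch-example.sql
    done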
[19:32:57] though a dump ALTER might have worked [19:33:03] *dumb [19:40:01] (03CR) 10Brion VIBBER: [C: 032] "Here's the schema version change:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87485 (owner: 10Jdlrobson) [19:40:37] (03Merged) 10jenkins-bot: Update MobileWebClickTracking schema revision [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87485 (owner: 10Jdlrobson) [19:44:10] PROBLEM - Host srv291 is DOWN: PING CRITICAL - Packet loss = 100% [20:20:29] (03PS1) 10Faidon Liambotis: swift: extend token validity 1d -> 7d [operations/puppet] - 10https://gerrit.wikimedia.org/r/87621 [20:20:50] (03CR) 10Faidon Liambotis: [C: 032 V: 032] swift: extend token validity 1d -> 7d [operations/puppet] - 10https://gerrit.wikimedia.org/r/87621 (owner: 10Faidon Liambotis) [20:50:03] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 6 copy to table, 165 statistics [20:51:03] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 6 copy to table, 4 statistics [21:08:13] paravoid: just checking, do you know anything about sean's db rotation work next week? [21:08:21] I'm sorry, no [21:08:59] no worries [21:09:20] dang saturdays in the future [21:11:03] ksnider: so, since it is 2:10pm Pacific on Friday, and I don't know exactly when these things should be scheduled, I'm going to have to suggest we delay it one week so we can send out the right amount of communication :/ [21:19:54] greg-g: Understood - I've reached out to Sean for details [21:21:29] ksnider: thank you [21:21:50] I don't want us to get flack for even 10 minutes of read-only without warning :/; [21:21:53] -; [21:22:02] +adequate [21:43:18] (03PS6) 10Krinkle: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [21:43:22] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [21:55:45] (03CR) 10Faidon Liambotis: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87598 (owner: 10Akosiaris) [21:55:50] (03PS7) 10Krinkle: Enable VisualEditor on "phase 2" Wikipedias (all users) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [21:56:42] (03CR) 10Krinkle: "* "Unstuck" this revision, it had the Change-Id footer of another change (the one it depends on that has been merged)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/84370 (owner: 10Jforrester) [22:08:01] (03PS1) 10Faidon Liambotis: swift: fix swift::proxy::monitoring for eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87633 [22:09:47] (03CR) 10Faidon Liambotis: [C: 032] swift: fix swift::proxy::monitoring for eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87633 (owner: 10Faidon Liambotis) [22:18:17] RoanKattouw: around? [22:18:23] Yes [22:18:27] For a little bit [22:18:37] I have to pack up and go to the airport in ~10 [22:18:39] do you know what's the deal with wtp1008? [22:18:49] I don't know the latest on that [22:18:53] icinga complains that parsoid isn't running there [22:18:58] what's the latest-1? 
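For the "OSC by percona" AaronSchulz mentions a little further up: assuming this refers to Percona Toolkit's pt-online-schema-change, the tool copies the table into an altered shadow copy in chunks while triggers keep it in sync, then swaps it in. That chunked copy needs a primary key, which is why PK-less tables fall back to the depool/alter/repool musical chairs. The table and column below are made up for illustration.

    # Dry run first, then the real online copy-and-swap.
    pt-online-schema-change --alter "ADD COLUMN example_col INT NOT NULL DEFAULT 0" \
        --dry-run D=enwiki,t=example_table
    pt-online-schema-change --alter "ADD COLUMN example_col INT NOT NULL DEFAULT 0" \
        --execute D=enwiki,t=example_table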
[22:19:06] At some point Chris had stolen it for testing re some CPU issue [22:19:16] I recently noticed that the pybal lists show it as pooled again [22:19:43] And I think he said something about how he was done with it, but I'm not sure [22:19:58] but it should have an up-to-date parsoid copy? [22:20:12] So, please check with Chris whether wtp1008 is ready for me to take back (or see if he said so in RT), and then email me and I'll get it fixed up [22:20:15] i.e. was it part of your deployments? [22:20:19] I don't know [22:20:22] I'll have to look into that [22:20:25] okay [22:20:27] thank you [22:20:29] have a nice flight :) [22:20:47] But please check with Chris first, because there's no point in me fixing that box back up if he's just gonna take it apart again [22:20:56] I *think* he's done with it, but I'm not totally sure [22:20:56] nod [22:21:02] I'll check with him, don't worry [22:26:51] <^d> RoanKattouw: Can I get +Aiotv on #wikimedia-dev? [22:27:01] <^d> Had it on #mediawiki forever, forgot I didn't on -dev. [22:27:14] ^d: I have to run, sorry [22:27:21] <^d> No worries, not important [22:56:10] (03PS1) 10Ori.livneh: Add role::statsd; provision on tungsten; grant self access [operations/puppet] - 10https://gerrit.wikimedia.org/r/87636 [22:57:47] (03CR) 10Faidon Liambotis: [C: 032] Add role::statsd; provision on tungsten; grant self access [operations/puppet] - 10https://gerrit.wikimedia.org/r/87636 (owner: 10Ori.livneh) [23:00:11] thanks [23:00:26] I'm force-running puppet on tungsten for a while [23:00:46] done [23:00:47] it's easy to saturate ganglia with too many metrics, since statsd computes a bunch of aggregates by default [23:00:53] which is why i didn't include it as a backend [23:01:33] (in case you were wondering) [23:02:04] include what? [23:02:42] ok, I can modify by hand a ms-fe node, the puppet manifests for swift aren't very good and it's too late to do anything about it [23:02:59] I've been evaluating the puppetlabs swift module and it's not too bad, although ironically has no support for statsd :) [23:04:10] ack? [23:04:22] hrm? [23:04:38] shall I start pushing metrics from ms-fe1001? [23:04:57] oh! it's done-done. that was fast. let me look for a second. [23:05:41] log_statsd_host = tungsten.eqiad.wmnet [23:05:42] log_statsd_port = 8125 [23:05:42] log_statsd_metric_prefix = swift.eqiad.ms-fe1001 [23:06:05] I wonder if I should call it "ms" or "mediastorage" instead of swift [23:06:12] the metrics only make sense for swift though [23:06:33] bah, let's just use that for now and we can reevaluate later [23:07:26] * paravoid is about to hit enter [23:08:29] hang on another moment [23:10:04] ok, hit it [23:10:23] they're coming [23:10:32] they = metrics [23:10:36] I see them with tcpdump [23:10:39] heh [23:12:09] statsd tries to connect to professor's port 0 [23:13:00] I'm guessing we need to define graphitePort [23:13:11] yes, I thought it was set by default, but it's not [23:13:25] patch coming [23:14:19] heh I was half-way into fixing it [23:15:28] (03PS1) 10Ori.livneh: statsd on tungsten: explicitly specify graphitePort [operations/puppet] - 10https://gerrit.wikimedia.org/r/87638 [23:15:41] actually [23:15:43] wait [23:15:54] (03CR) 10Faidon Liambotis: [C: 032 V: 032] statsd on tungsten: explicitly specify graphitePort [operations/puppet] - 10https://gerrit.wikimedia.org/r/87638 (owner: 10Ori.livneh) [23:16:03] well, all right. [23:16:09] er [23:16:10] 2004? 
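The log_statsd_* settings pasted above make the swift proxy emit one UDP datagram per request to tungsten on port 8125, in the plain statsd wire format, so the stream paravoid watched with tcpdump can be reproduced by hand. The metric name below is invented to match the configured prefix.

    # statsd wire format is "<name>:<value>|<type>" over UDP, where the
    # type is c (counter), ms (timer) or g (gauge).
    echo "swift.eqiad.ms-fe1001.proxy-server.GET.200.timing:42|ms" \
        | nc -u -w1 tungsten.eqiad.wmnet 8125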
[23:17:18] 2003 [23:17:45] yeah I figured it out [23:17:47] 2004 is pickle [23:19:00] alright, applied [23:21:06] ok, it pushed to carbon [23:22:15] poor firefox [23:23:50] so, where do I look for the metrics? [23:24:03] which category? [23:24:10] stats [23:24:12] i see it [23:24:14] 'swift' [23:24:24] ah I was looking under statsd [23:24:24] metric type: stats, then find swift in the tree nav [23:24:24] so close [23:25:15] you probably won't have anything much to look at for a little bit [23:25:33] yep, that's what I see [23:25:36] but that's fine [23:25:42] that's nice [23:25:48] that was very easy [23:25:52] shhh [23:26:08] it has to maintain an aura of complexity and mystique [23:26:16] haha [23:26:44] the more interesting data are under timers > swift > ... [23:27:16] timing data [23:28:47] that's kinda weird [23:28:50] the double hierarchy [23:29:16] yes [23:29:19] * ^d fumes at people who make .debs with no source packages. [23:29:32] there's a cottage industry of front-ends to graphite because the default one is so clunky [23:29:34] ^d: case in point? [23:29:49] and each of those front-ends is great in one specific way [23:30:07] <^d> paravoid: hhvm. someone built some 12.04 packages and hosted them in a public apt mirror, but only binary packages, no source. [23:30:41] heh [23:30:44] so the other weird part is [23:31:08] there's almost no object GETs in eqiad [23:31:13] but a shit ton of PUTs [23:32:04] how do you make sense of that? [23:32:13] ;) [23:32:40] no, I mean that's the traffic, I know that [23:32:43] ori-l: building a general purpose metric display system that can actually be used may be unpossible [23:32:45] but the statsd say otherwise [23:33:03] tcpdump on 8125 shows that swift indeed only pushes stats about GET.200 [23:34:15] So no 404s or puts? Lame. [23:35:37] that's not what the config says :) [23:36:44] If you were seeing it in tcpdump but not graphite I'd blame new metric creation rate limiting [23:37:34] carbon tries to keep from swamping disk io by throttling the number of new metrics it creates [23:38:37] When we'd turn up a new app server host at $DAYJOB-1 it was common not to see all the stats in graphite for an hour [23:41:43] weird [23:43:49] Our core app tracked a *lot* of timers. [23:43:58] no, what's happening is weird :) [23:44:00] here [23:50:22] RECOVERY - MySQL disk space on es1003 is OK: DISK OK [23:50:32] RECOVERY - Disk space on es1003 is OK: DISK OK
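Two footnotes on the exchange above. The 2003/2004 distinction: carbon's plaintext listener takes "<metric path> <value> <unix timestamp>" lines on TCP 2003, while 2004 speaks the pickle protocol, so plaintext lines aimed at 2004 (or at port 0 from an unset graphitePort) won't parse. And the new-metric throttling bd808 describes is, in stock carbon, the MAX_CREATES_PER_MINUTE setting in carbon.conf. A hand test of the plaintext port, with a made-up metric name:

    # One plaintext datapoint into carbon (TCP port 2003).
    echo "test.deleteme 1 $(date +%s)" | nc -w1 tungsten.eqiad.wmnet 2003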