[00:00:03] good point; no [00:00:14] it's the same with respect to multimediaviwer [00:00:32] We may wind up reverting after all...fflorin is saying he has more issues [00:00:37] Whatever you put there will be blown away by the scap [00:00:42] makes sense [00:00:49] Standby [00:02:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 8.941 second response time [00:02:25] bd808, the file generated by hand did cause changes though [00:02:28] the sha's are different [00:02:32] mwalker: I'm so sorryyyy [00:02:37] the diff is too messy for me to tell you what exactly changed [00:02:46] but it wasn't what I cared about in this particular case [00:02:54] marktraceur, heh; iz ok [00:02:59] The order of things changes based on the wikidb used :/ [00:03:08] ah; ok; so I used enwiki [00:03:16] I have a patch for that but I haven't submitted it [00:03:36] scap will use the the wiki returned by `mwversionsinuse --with-db` [00:03:43] marktraceur, so; you want me to revert what should've been your no op patch? [00:04:02] 1.23wmf21=cawikibooks 1.23wmf22=test2wiki [00:04:19] No [00:04:34] mwalker: We might revert to wmf20 for the wmf21 mw branch... [00:04:40] I so hate that idea [00:05:07] Wait, no, that might not work because oojs had changes in the meantime [00:05:12] bd808, still a different sha when I generate with cawikibooks; but I'll take your word for it [00:05:26] mwalker: Never mind, we appear to be sticking to our guns [00:05:36] Hopefully it gets magically fixed but we're fine with the status quo [00:05:52] marktraceur, OK [00:06:04] you want to do a !log to explain your status [00:06:30] marktraceur: confirm/deny that the only places that will see this through clicking alone (no messing with urls) are testwikis+mw.org. [00:06:45] !log leaving MultimediaViewer slightly broken on enwiki based on the fact that logged-in users seem mostly unaffected and other wikis aren't seeing issues, will investigate more tomorrow and fix on Monday [00:06:50] Logged the message, Master [00:07:00] greg-g: I don't think wmf22 is broken at all [00:07:07] Minor bugs but nothing so critical as this [00:07:11] k [00:07:16] enwiki seems to be the only thing affected [00:07:16] misunderstood (again) [00:07:27] And then usually just when following a link as a logged-out user [00:08:24] marktraceur, ohhoho: https://en.wikipedia.org/wiki/MediaWiki:Multimediaviewer-download-tab [00:08:27] the message exists on wiki [00:08:28] greg-g: only way to get this error is to click on a link shared by someone else who had betafeatures enabled on enwiki [00:08:32] logged in users on enwiki will see it all the time when clicking on an image then click "use this" [00:08:42] We know this, mwalker...hence confusion [00:08:50] marktraceur, are you guys actually pushing that message to resourceloader? [00:08:55] greg-g: or having betafeatures enabled yourself [00:09:01] Pretty sure, since locally it's working [00:09:50] oh, fine, who cares then, just a BF :) [00:10:01] mwalker: https://de.wikipedia.org/wiki/Wikipedia:Hauptseite?uselang=en#mediaviewer/Datei:Rhombodera basalis 2 Luc Viatour cropped.jpg [00:10:02] I think based on that decision I'm going to leave tgr in charge for like five minutes [00:10:03] Hey! :-) [00:10:11] same branch, same language, different wiki, works fine [00:10:13] LIKE SERIOUSLY FIVE MINUTES [00:10:18] Try not to mess things up [00:13:31] hmm... I dunno why RL is screwing up; but it's not pushed to JS for me on enwiki... but it is on dewiki [00:13:41] this is probably a question for Roan; I've not seen RL do this before [00:14:54] You could force a RL refresh? scap doesn't do that (I don't know why) but l10nupdate does [00:15:05] https://en.wikipedia.org/wiki/Even_If_and_Especially_When?uselang=en-xxx#mediaviewer/File:EvenIfAndEspeciallyWhen.jpg [00:15:12] this works too [00:15:14] mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php --wiki="$wiki" [00:15:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 453694 bytes in 9.526 second response time [00:15:29] haha [00:15:52] * bd808 likes that a screaming trees album cover was chosen to test [00:16:02] :) [00:16:05] ditto [00:17:09] ok guys, I have to go, if you want to do that last suggestion from bd808, go for it, but after that, call it a day and go have a beer/kool-aid/glass of water [00:17:20] *nods* [00:17:30] refreshMessageBlobs.php is taking forecter [00:17:39] * mwalker is skeptical that it's working... I got no output [00:17:41] it always does [00:17:50] (03CR) 10Tim Landscheidt: "So the problem lies not with this change, but apparently is present before when it tries to install openjdk-7-jre-headless. I'll file a b" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [00:18:04] Upto 45 minutes isn't unknown... [00:18:12] *whistles* [00:18:15] well; it is now finished [00:18:23] scap it? or can I sync-dir somethign? [00:18:42] nope and nope [00:19:02] it's a load of db queries [00:19:47] seems fixed [00:20:19] interesting; well; I now know of a new script [00:20:44] thanks a bunch [00:20:52] should i file an RL bug? [00:21:04] I don't know what you would put in it... [00:22:23] in any case; thanks bd808 and Reedy! [00:22:40] yw [00:34:24] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:35] PROBLEM - check google safe browsing for wikiversity.org on google is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:14] RECOVERY - check google safe browsing for mediawiki.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3917 bytes in 0.084 second response time [00:35:24] RECOVERY - check google safe browsing for wikiversity.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3925 bytes in 0.085 second response time [00:37:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:38:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 460444 bytes in 9.831 second response time [00:41:14] PROBLEM - MySQL InnoDB on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:24] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:04] RECOVERY - MySQL InnoDB on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:42:14] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:46:32] (03CR) 10Helder.wiki: Enhanced recent changes: explicitly disable by default (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (owner: 10Nemo bis) [02:07:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [02:12:15] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 2873 MB (3% inode=99%): [02:17:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [02:18:11] (03PS1) 10Ori.livneh: uWSGI: strip .ini suffix from instance names [operations/puppet] - 10https://gerrit.wikimedia.org/r/125367 [02:18:52] (03CR) 10Ori.livneh: [C: 032 V: 032] uWSGI: strip .ini suffix from instance names [operations/puppet] - 10https://gerrit.wikimedia.org/r/125367 (owner: 10Ori.livneh) [02:19:15] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3669 MB (3% inode=99%): [02:24:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:24:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:24:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:24:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:27:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [02:29:25] !log graphite: carbon instance 'f' saturates a cpu core. it's the instance that mediawiki profiling data gets hashed to. collector should probably emit to statsd and have statsd compute per-minute rollups [02:29:32] Logged the message, Master [02:30:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [02:31:37] ori_: Make sure to find a statsd that works. I'm fairly convinced that txstatsd is slightly broken based on the behavior I see for scap metrics [02:33:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [02:41:35] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-11 02:41:33+00:00 [02:41:40] Logged the message, Master [02:47:27] (03PS4) 10Aude: Add wikidata tables for fullview [operations/software] - 10https://gerrit.wikimedia.org/r/118582 [02:47:29] (03PS1) 10Aude: Add rc_source recentchanges column to labs replica databases [operations/software] - 10https://gerrit.wikimedia.org/r/125369 [02:53:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [02:56:49] (03PS2) 10Aude: Add rc_source recentchanges column to labs replica databases [operations/software] - 10https://gerrit.wikimedia.org/r/125369 [03:00:14] RECOVERY - Disk space on virt0 is OK: DISK OK [03:05:34] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:34] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 460195 bytes in 9.908 second response time [03:09:34] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:34] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 460195 bytes in 9.346 second response time [03:26:39] bd808: can you recommend one? i tried etsy's; it was also broken. i fixed a few bugs in txstatsd already. i'd love to find one that is really robust. [03:29:30] I don't have a great suggestion. I think I told you before that at $DAYJOB-1 I wrote my own. [03:29:45] I have a version of that one that I've been cleaning up to release [03:29:57] But it's never seen high volume usage [03:30:13] We aggregated on each server [03:30:38] So I had a copy running on each of our application servers [03:31:04] And our peak volume was a dribble compared to the prod farm [03:33:38] bd808: well, we also aggregate on each server; i've suggested that before. based on the commit activity and what i've heard, ops are converging on having a diamond metrics instance on each host [03:33:48] bd808: we *should* also, that is [03:36:10] ori_: My version is at https://github.com/bd808/yastatsd. I need to test it. As you'll see the commit history there is pretty short :) [03:37:00] It's the guts of something larger that I wrote under contract. [03:38:40] bd808: that looks nice and clean [03:39:10] Simple things are best :) [03:40:01] bd808: seriously, any reason not to use it? [03:40:57] No. I actually cleaned it up and posted it there so I could justify testing it in beta to replace txstatsd :) [03:47:04] !log LocalisationUpdate completed (1.23wmf22) at 2014-04-11 03:47:01+00:00 [03:47:10] Logged the message, Master [03:48:36] bd808: test it on pypy [03:49:01] it'll get a nice speedup; pypy + pickle performs better than cpython with cPickle [03:49:43] Cool. I really haven't ever use pypy [03:50:18] I'll probably have some time tomorrow to mess with it in beta [04:08:34] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:34] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 457927 bytes in 9.583 second response time [04:26:45] ok well this sucks moose: https://www.eff.org/deeplinks/2014/04/wild-heart-were-intelligence-agencies-using-heartbleed-november-2013 [04:27:00] http://arstechnica.com/security/2014/04/heartbleed-vulnerability-may-have-been-exploited-months-before-patch/ [04:31:44] I wonder what we would have thought pre-snowden [04:33:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 11 04:33:40 UTC 2014 (duration 33m 39s) [04:33:50] Logged the message, Master [04:39:55] (03PS1) 10BryanDavis: Use valid hostnames in $wgLBFactoryConf [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125373 [05:25:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:25:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:25:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:25:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:28:44] PROBLEM - Host db1016 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3554.47 ms [07:29:04] RECOVERY - Host db1016 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [07:49:22] (03CR) 10Hashar: [C: 032] Use valid hostnames in $wgLBFactoryConf [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125373 (owner: 10BryanDavis) [07:49:31] (03Merged) 10jenkins-bot: Use valid hostnames in $wgLBFactoryConf [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125373 (owner: 10BryanDavis) [08:26:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:26:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:26:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:26:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:30:39] (03CR) 10Nuria: Allowing wikimetrics $debug parameter to be set from labsconsole (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125274 (owner: 10Ottomata) [09:18:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 453924 bytes in 9.531 second response time [09:28:42] anybody wants to merge https://gerrit.wikimedia.org/r/125267 ?:) [09:39:59] hashar_: i was first with the underscore fashion trend [09:40:24] i see you acquired one too, now [09:41:16] (03CR) 10Ori.livneh: [C: 032] Revoke my key, I'm relocating to SF [operations/puppet] - 10https://gerrit.wikimedia.org/r/125267 (owner: 10MaxSem) [09:41:18] lets pretend we are leet! [09:41:57] * H4|- waves at MaxSem [09:42:16] thanks ori_ [09:42:17] we *are* leet [09:42:47] our twelve year old selves would be impressed [09:43:28] unfortunately we are late :P no one gets to be cool when they are twelve [09:44:11] i'm raving. good night! [09:46:55] (03CR) 10Ori.livneh: "purged on all hosts via salt as well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125267 (owner: 10MaxSem) [09:48:16] (03CR) 10Hashar: [C: 04-1] "Duplicates the packages defined contint::android-sdk . You might want to create a common class for tools labs and contint :]" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [10:46:44] PROBLEM - MySQL Recent Restart on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:48:34] RECOVERY - MySQL Recent Restart on db1016 is OK: OK 9404397 seconds since restart [11:27:12] (03CR) 10Milimetric: Allowing wikimetrics $debug parameter to be set from labsconsole (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125274 (owner: 10Ottomata) [11:27:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:27:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:27:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:27:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:58:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:55:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [13:03:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:23:14] PROBLEM - Check rp_filter disabled on lvs3003 is CRITICAL: Timeout while attempting connection [13:39:02] manybubbles: mornin! [13:39:06] shall I start on 1005? [13:50:30] ottomata: sure@ [13:50:57] <^d> Morning ottomata, manybubbles [13:51:06] * ^d yawns, rubs sleepy out of eyes [13:51:30] mornin [13:51:34] ok! [13:51:42] moving shards [13:53:06] cool [14:24:14] ok manybubbles, taking 1005 down [14:24:29] cool. so long as shards off of it [14:24:44] yup [14:24:52] at least, nothing moving [14:25:42] {"length":0,"node":"elastic1005"} [14:25:43] cool [14:27:35] !log reinstalling elastic1005 [14:27:41] Logged the message, Master [14:28:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:28:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:28:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:28:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:28:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:29:24] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:57] ottomata: sweet [14:34:34] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:36:44] PROBLEM - DPKG on elastic1005 is CRITICAL: Connection refused by host [14:37:14] PROBLEM - Disk space on elastic1005 is CRITICAL: Connection refused by host [14:37:14] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [14:37:24] PROBLEM - RAID on elastic1005 is CRITICAL: Connection refused by host [14:37:24] PROBLEM - SSH on elastic1005 is CRITICAL: Connection refused [14:37:24] PROBLEM - check configured eth on elastic1005 is CRITICAL: Connection refused by host [14:37:34] PROBLEM - puppet disabled on elastic1005 is CRITICAL: Connection refused by host [14:41:20] (03PS1) 10Ottomata: Adding hdfs user to stats group [operations/puppet] - 10https://gerrit.wikimedia.org/r/125403 [14:43:24] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 951.990627963 [14:48:44] PROBLEM - mysqld processes on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:49:04] PROBLEM - NTP on elastic1005 is CRITICAL: NTP CRITICAL: No response from NTP server [14:50:35] RECOVERY - mysqld processes on db1016 is OK: PROCS OK: 1 process with command name mysqld [14:51:24] RECOVERY - SSH on elastic1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [14:53:24] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2732.54399653 [14:56:44] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:25] (03PS2) 10Ottomata: Adding hdfs user to stats group [operations/puppet] - 10https://gerrit.wikimedia.org/r/125403 [15:01:54] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:02:59] (03CR) 10Ottomata: [C: 032 V: 032] Adding hdfs user to stats group [operations/puppet] - 10https://gerrit.wikimedia.org/r/125403 (owner: 10Ottomata) [15:04:24] PROBLEM - SSH on elastic1005 is CRITICAL: Connection refused [15:05:54] (03PS1) 10Ottomata: Need to fully qualify all executables in unless command [operations/puppet] - 10https://gerrit.wikimedia.org/r/125404 [15:06:11] (03CR) 10Ottomata: [C: 032 V: 032] Need to fully qualify all executables in unless command [operations/puppet] - 10https://gerrit.wikimedia.org/r/125404 (owner: 10Ottomata) [15:09:39] (03PS1) 10Ottomata: Using exec path rather than fully qualifying [operations/puppet] - 10https://gerrit.wikimedia.org/r/125406 [15:10:13] (03CR) 10Ottomata: [C: 032 V: 032] Using exec path rather than fully qualifying [operations/puppet] - 10https://gerrit.wikimedia.org/r/125406 (owner: 10Ottomata) [15:17:49] manybubbles: hmm, I rebooted 1005 after I had installed and added the new partition, but before running puppet [15:18:00] i wanted to make sure it would come back online with the new partition mounted (it should) [15:18:09] ottomata: cool [15:18:16] its responding to pings, but hasn't done anyting for like 15 minutes or more [15:18:22] oh [15:18:24] thats not great [15:18:34] yeah, nothing on console output... [15:18:41] i'm going to try powercycling [15:18:45] sure [15:26:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:28:24] RECOVERY - SSH on elastic1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [15:40:44] manybubbles: I am learning some things I did know know about mdadm [15:40:59] like, newer versions (one we use) prefer if I give logical names to arrays [15:41:12] quit [15:41:34] unless I am very specific, it will assign its own numbers to the arrays whenever they are assembled, like on boot [15:41:43] which means they won't mount properly [15:41:58] i'm testing with elastic1005 now [15:42:07] I'll get this one all fixed up the way it should be [15:42:09] and document [15:42:20] but, i think we shoudl reformat 1003 and 1004 again, just to keep everything consistent [15:42:24] we don't have to reinstall [15:42:39] just move shards, stop elasticsearch, reformat, start, move shards back [15:43:14] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:05] (03PS1) 10Manybubbles: Switch Cirrus to a faster query type ` [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125410 [15:47:34] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [15:50:29] ottomata: ok [15:52:14] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:34] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [16:03:44] PROBLEM - Host db1016 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2768.79 ms [16:04:04] RECOVERY - Host db1016 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:05:14] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:34] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:08:46] ok, this setup is looking better manybubbles, running puppet on 1005 now, will deploy and ask you to check, then start moving shards to it [16:10:37] ottomata: sweet [16:11:14] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [16:16:14] RECOVERY - Disk space on elastic1005 is OK: DISK OK [16:16:24] RECOVERY - RAID on elastic1005 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:16:24] RECOVERY - check configured eth on elastic1005 is OK: NRPE: Unable to read output [16:16:34] RECOVERY - puppet disabled on elastic1005 is OK: OK [16:16:44] RECOVERY - DPKG on elastic1005 is OK: All packages OK [16:22:27] ottomata: looks good but missing the jar files from the plugins [16:22:40] doing that now [16:22:54] sweet [16:23:06] ok check now [16:23:08] is it more complicated then the rest of the git deploy stuff once git-fat is installed? [16:23:22] no, salt just needs to be done properly [16:23:28] keys acppeted, then puppet run again, etc. [16:23:51] ok, s'ok? can I start the shard wranglin? [16:24:09] manybubbles: ? [16:24:29] ottomata: looks gret [16:24:31] great [16:25:15] ok shards moving [16:28:54] RECOVERY - NTP on elastic1005 is OK: NTP OK: Offset -0.01147723198 secs [16:30:02] (03PS2) 10Chad: Switch Cirrus to a faster query type [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125410 (owner: 10Manybubbles) [16:32:08] (03CR) 10Tim Landscheidt: [C: 031] "I'm now confident that the change as is is good(TM), and that the Puppet failure is unrelated (cf. bug #63823 for that)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [16:38:33] ottomata: sweet [16:40:12] <^d> manybubbles, ottomata: So just 15 more to go now? :) [16:40:27] ottomata: hehe, it would have been fewer but mdadm [16:40:30] ^d: ^ [16:41:00] <^d> Hmm? [16:54:30] yeah, i want to reformat 1003 and 1004, which we did yesterday, i learned of a better way to do it than I was [16:54:35] don't have to reinstall everything though [16:54:37] just reformat [16:54:45] which just means move shards off, reformat, move them back on [16:54:46] i think [17:02:22] <^d> ottomata: Cool beans :) [17:19:33] (03PS1) 10Ottomata: Enabling base::firewall on stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/125425 [17:21:51] (03CR) 10Ottomata: [C: 032 V: 032] Enabling base::firewall on stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/125425 (owner: 10Ottomata) [17:29:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:29:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:29:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:29:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:41:29] !log git-deploy: Deploying integration/slave-scripts 'Ia9ee438fa2675170' [17:41:35] Logged the message, Master [17:46:00] (03CR) 10Tim Landscheidt: ""Change as is" = patch set #2." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [17:47:34] manybubbles: can I start moving shards off of 1003? [17:48:30] ottomata: absultely [17:48:33] yes [17:48:46] probably could have started earlier if you'd have liked [17:48:48] but not big deal [17:49:46] k doint [17:49:47] doing [17:56:48] (03CR) 10Ottomata: Make Elasticsearch more reliable in beta (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125331 (owner: 10Manybubbles) [17:57:36] (03CR) 10Ottomata: [C: 032] "Shall I merge?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124903 (owner: 10Manybubbles) [17:57:57] (03CR) 10Manybubbles: "Sure!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124903 (owner: 10Manybubbles) [17:59:23] (03PS2) 10Ottomata: Elasticsearch config to better spread shards [operations/puppet] - 10https://gerrit.wikimedia.org/r/124903 (owner: 10Manybubbles) [17:59:30] (03CR) 10Ottomata: [C: 032 V: 032] Elasticsearch config to better spread shards [operations/puppet] - 10https://gerrit.wikimedia.org/r/124903 (owner: 10Manybubbles) [18:01:14] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 1414 MB (2% inode=86%): [18:05:46] (03PS4) 10Ottomata: Adding logster module and using it to monitor CirrusSearch-slow.log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123466 [18:05:53] (03CR) 10Ottomata: [C: 032 V: 032] Adding logster module and using it to monitor CirrusSearch-slow.log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123466 (owner: 10Ottomata) [18:08:42] !log git-deploy: Deploying integration/slave-scripts I04d8e308daedb3ccb8 [18:08:48] Logged the message, Master [18:14:39] !log git-deploy: Deploying integration/slave-scripts I38b90e8c08d7cb [18:14:43] (03PS10) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [18:14:45] Logged the message, Master [18:16:05] bblack: ping [18:17:00] ok shards off of 1003 [18:17:09] turning off elastic search there for a bit [18:19:22] !log rebooting elastic1003 [18:19:28] Logged the message, Master [18:19:54] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [18:21:14] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:23] woot, manybubbles: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=fluorine.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1397240521&v=0.0&m=CirrusSearch-slow.log_line_rate&vl=lines%20per%20sec&ti=&z=large [18:22:24] :) [18:22:58] yay [18:23:39] icinga should check that too, but puppet takes a while for that stuff [18:23:47] i'll look for it there in a bit [18:26:24] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [18:28:35] PROBLEM - puppet disabled on elastic1003 is CRITICAL: Connection refused by host [18:28:35] PROBLEM - DPKG on elastic1003 is CRITICAL: Connection refused by host [18:28:44] PROBLEM - Disk space on elastic1003 is CRITICAL: Connection refused by host [18:28:54] PROBLEM - RAID on elastic1003 is CRITICAL: Connection refused by host [18:28:54] PROBLEM - check configured eth on elastic1003 is CRITICAL: Connection refused by host [18:28:54] PROBLEM - SSH on elastic1003 is CRITICAL: Connection refused [18:33:15] ottomata: 1003 is empty [18:33:26] naw its confused [18:33:32] oh shards? [18:33:35] RECOVERY - puppet disabled on elastic1003 is OK: OK [18:33:35] RECOVERY - DPKG on elastic1003 is OK: All packages OK [18:33:44] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:33:44] RECOVERY - Disk space on elastic1003 is OK: DISK OK [18:33:58] RECOVERY - check configured eth on elastic1003 is OK: NRPE: Unable to read output [18:33:58] RECOVERY - RAID on elastic1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:33:58] RECOVERY - SSH on elastic1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [18:41:28] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:05] shhh its ok [18:43:48] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:44:15] ottomata: You're starting to talk to the servers like they're your children, that seems like not a great sign [18:44:38] it comforts them [18:44:57] ottomata: Ops summit can't be over soon enough, can it [18:45:38] the servers talk to me, they are my friends [18:45:55] the servers and I are going on our own summit someday [18:45:56] i just know nit [18:45:59] it [18:46:16] :/ [18:49:18] PROBLEM - MySQL InnoDB on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:49:19] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:49:48] PROBLEM - mysqld processes on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:50:38] PROBLEM - Full LVS Snapshot on db1016 is CRITICAL: Timeout while attempting connection [18:51:08] RECOVERY - MySQL InnoDB on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds [18:51:08] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds [18:51:19] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2441 MB (3% inode=85%): [18:51:28] RECOVERY - Full LVS Snapshot on db1016 is OK: OK no full LVM snapshot volumes [18:51:38] RECOVERY - mysqld processes on db1016 is OK: PROCS OK: 1 process with command name mysqld [18:52:16] ok elastic1003 looks better [18:52:43] ok manybubbles: moving shards back to 1003 [18:53:12] manybubbles: can I go ahead and start moving them off of 1004, or shoudl I wait? [19:21:08] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [19:26:08] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [19:26:32] iiinteresting [19:31:08] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00334448160535 [19:35:32] ottomata: hmmmm [19:35:41] ottomata: sure, you can more them around, I think [19:36:08] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [19:36:45] ottomata: it looks like we had an issue around 14:28:18 where we logged a bunch [19:37:13] logged slow logs? [19:37:17] ottomata: but we've logged three in the past two house [19:37:19] hours [19:37:22] which isn't too bad [19:37:29] yeah [19:37:45] seems a little sensiive to me, no? [19:38:54] a bit, yeah [19:40:54] hm, yeah but it isn't really per/hour [19:41:02] so yeah [19:41:11] i could make it per hour i think [19:41:15] it would be a little jumpy in ganglia mabye [19:41:20] and it would take an hour to recover [19:41:25] right now it is just checked every 5 minutes [19:42:32] oh manybubbles, I see that rolling-restarts has a new fast way? [19:42:36] shoudl I use that for 1004? [19:42:48] the downtime is fast [19:42:52] ah [19:42:53]