[00:00:12] <ori>	 *nod*
[00:00:36] <ori>	 so i ran sync-common manually on mw1151 so it's OK now, but no one should deploy until the patch you +1'd above is merged and puppet runs
[00:02:48] <grrrit-wm>	 (CR) Dzahn: [C: 2] "doing this now because otherwise deployments break, i can confirm and show in git log how mortals were included on appservers before" [operations/puppet] - https://gerrit.wikimedia.org/r/139089 (owner: Giuseppe Lavagetto)
[00:03:02] <grrrit-wm>	 (PS2) Dzahn: mediawiki: re-include deployment users [operations/puppet] - https://gerrit.wikimedia.org/r/139089 (owner: Giuseppe Lavagetto)
[00:03:13] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds  
[00:03:45] <ori>	 blergh, what?
[00:03:51] <ori>	 re: error rate
[00:04:27] <ori>	 i certainly hope that this is wrong: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)
[00:04:43] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0]  
[00:04:46] <ori>	 no, exception spike too: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1
[00:05:30] <ori>	 springle: are you doing anything with the databases?
[00:08:28] <ori>	 OK, the math extension is still broken
[00:08:50] <ori>	 i'm going to revert it
[00:10:34] <ori>	 i'm also going to run scap in the background because i suspect things are in an inconsistent state because of the issue you flagged above mutante
[00:10:42] <mutante>	 ori: +1
[00:10:55] <springle>	 ori: nope
[00:10:57] <mutante>	 ori: let me merge that fix
[00:11:03] <mutante>	 the one adding mortals
[00:11:22] <ori>	 i have root, so i should be able to scap
[00:11:37] <ori>	 but yeah, imo it should be merged
[00:11:50] <grrrit-wm>	 (CR) Dzahn: [C: 2] mediawiki: re-include deployment users [operations/puppet] - https://gerrit.wikimedia.org/r/139089 (owner: Giuseppe Lavagetto)
[00:11:53] <logmsgbot>	 !log ori Started scap: fix any lingering inconsistencies in the state of the app servers (see https://gerrit.wikimedia.org/r/139089)
[00:11:57] <morebots>	 Logged the message, Master
[00:12:18] <ori>	 now, math
[00:13:58] <mutante>	 ori: i see users being created on random host (1209)
[00:16:04] <grrrit-wm>	 (CR) Dzahn: [C: -1] "after Change-Id: I49bcbbdfbbc , then it's fine, but marking -1 for clarity" [operations/puppet] - https://gerrit.wikimedia.org/r/139072 (owner: Matanya)
[00:16:51] <mutante>	 https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22  looks better again
[00:17:43] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]  
[00:19:07] <ori>	 [fluorine:/a/mw-log] $ grep -Po 'Exception from line [^:]+' exception.log  | sort | uniq -c | sort -rn
[00:19:07] <ori>	    3027 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf8/includes/db/Database.php
[00:19:07] <ori>	    1132 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf7/includes/db/Database.php
[00:19:09] <ori>	     394 Exception from line 50 of /usr/local/apache/common-local/php-1.24wmf8/extensions/Math/MathSource.php
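The tally above comes from a grep | sort | uniq -c | sort -rn pipeline. A minimal Python sketch of the same counting follows, for cases where the log is processed in a script instead. The function name and the sample line layout are assumptions for illustration; only the regular expression is taken from the command in the log.

```python
import re
from collections import Counter

# Same pattern as the grep -Po call: everything from "Exception from line"
# up to (but not including) the next colon.
PATTERN = re.compile(r"Exception from line [^:]+")


def count_exception_sites(lines):
    """Count each distinct 'Exception from line N of FILE' match.

    Returns (match, count) pairs, most frequent first, similar to
    `sort | uniq -c | sort -rn`.
    """
    counts = Counter()
    for line in lines:
        # grep -o prints every match on a line, so collect all of them.
        for match in PATTERN.finditer(line):
            counts[match.group(0)] += 1
    return counts.most_common()
```

To run it over a file, pass the open file object, for example `count_exception_sites(open("exception.log", errors="replace"))`.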
[00:22:13] <ori>	 exceptions have subsided too
[00:22:51] <mutante>	 ori: yes, the graph looks normal again
[00:22:55] <mutante>	 thx
[00:23:29] <ori>	 thank you. i still suspect the database errors have to do with the math extension, and i worry that they'll spike again
[00:23:41] <ori>	 but i'm a bit burnt out and need a break
[00:24:07] <springle>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=66492#c9
[00:24:33] <springle>	 (wild guess)
[00:25:27] <ori>	 yeah. https://gerrit.wikimedia.org/r/#/c/139068/ didn't help, evidently
[00:25:48] <ori>	 "To be honest I don't understand why this table has to be created manually. "
[00:27:18] <ori>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=65793 wtf.
[00:27:25] <mutante>	 should we restore https://gerrit.wikimedia.org/r/#/c/138993/  ?
[00:27:33] <mutante>	 and do the temp. disable thing?
[00:29:52] <ori>	 well, $wgMathValidModes isn't set to MW_MATH_MATHML presently
[00:29:56] <ori>	 or doesn't include it, rather
[00:30:06] <ori>	 so i don't see how that patch would have an effect, but i could use a second pair of eyes
[00:32:51] <ori>	 change I75f24cb762609d6728247e3758fcc18f2ebfc6e6
[00:33:04] <ori>	 "Invalid settings for math rendering mode will default to MathMathML."
[00:33:07] <ori>	 fun
[00:33:40] <mutante>	 uhm...
[00:34:12] <legoktm>	 wait https://gerrit.wikimedia.org/r/#/c/138572/16/MathRenderer.php,cm
[00:34:19] <legoktm>	 change the default: case
[00:34:27] <legoktm>	 default should be PNG
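The suggestion above is that an unrecognised or disabled rendering mode should fall back to PNG, not to MathML. A minimal sketch of that fallback logic, in Python and not the extension's PHP, might look like this. The mode names and the function are hypothetical stand-ins, not the extension's MW_MATH_* constants or its MathRenderer code.

```python
# Hypothetical mode names; the real extension uses MW_MATH_* constants.
MODE_PNG = "png"
MODE_SOURCE = "source"
MODE_MATHML = "mathml"


def pick_render_mode(requested, valid_modes, fallback=MODE_PNG):
    """Return the requested mode if it is enabled, otherwise the fallback.

    Choosing a conservative fallback means an invalid user preference
    cannot silently select a renderer that is not configured.
    """
    if requested in valid_modes:
        return requested
    return fallback
```

With `valid_modes = {MODE_PNG, MODE_SOURCE}`, a stored preference of `MODE_MATHML` would resolve to `MODE_PNG` under this sketch.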
[00:35:23] <ori>	 i'd prefer to revert, but i'm having a hard time identifying a safe point in the past
[00:35:33] <ori>	 so your suggestion may be the best one, legoktm
[00:35:39] <ori>	 could you submit a patch?
[00:35:45] <ori>	 i'll continue looking meanwhile
[00:36:12] <legoktm>	 on it
[00:36:29] <legoktm>	 also I hope that code wasn't contacting a wmflabs domain from prod
[00:36:54] <legoktm>	 ori: [05:36:46 PM]  <grrrit-wm>	 (PS1) Legoktm: Set default fallback rendering option to MW_MATH_PNG [extensions/Math] - https://gerrit.wikimedia.org/r/139301
[00:37:10] <legoktm>	 for some reason MathTexvc is marked as deprecated
[00:37:13] <legoktm>	  * @deprecated will be deleted in one of the next versions without further notice
[00:37:16] <legoktm>	 :|
[00:37:56] <ori>	 i believe the "without further notice" piece
[00:38:14] <legoktm>	 hm
[00:38:19] <legoktm>	 I'm going to have to revert the tests too
[00:38:26] <ori>	 and yes it does
[00:38:30] <ori>	 connect to labs i mean
[00:38:33] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 18:36:54 UTC  
[00:38:37] <ori>	 i think it might be best to revert to 1bb3bfa3b5656af5ee57784578996e9513600a4d
[00:38:52] <logmsgbot>	 !log ori Finished scap: fix any lingering inconsistencies in the state of the app servers (see https://gerrit.wikimedia.org/r/139089) (duration: 26m 59s)
[00:38:54] <mutante>	 on tin.. sigh
[00:38:57] <morebots>	 Logged the message, Master
[00:39:06] <ori>	 mutante: tin's my apache change from earlier, i ack'd but it must have expired
[00:39:07] <mutante>	 i bet that is related to the restoring mortals
[00:39:10] <legoktm>	 uh, https://gerrit.wikimedia.org/r/#/c/137549/2 how will that help?
[00:39:11] <mutante>	 oh, it's not
[00:39:17] <mutante>	 ori: alright
[00:39:18] <legoktm>	 oh
[00:39:20] <legoktm>	 revert to there
[00:39:35] <legoktm>	 ori: yes, that's probably the safest choice
[00:40:38] <mutante>	 Sponsored by https://www.xsede.org/  heh
[00:41:51] <ori>	 !log removed Physikerwelt and Frédéric Wang from extension-Math group in Gerrit pending further inquiry into recent changes
[00:41:56] <morebots>	 Logged the message, Master
[00:44:45] <logmsgbot>	 !log ori Synchronized php-1.24wmf9/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 06s)
[00:44:49] <morebots>	 Logged the message, Master
[00:45:48] <logmsgbot>	 !log ori Synchronized php-1.24wmf8/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 05s)
[00:45:52] <morebots>	 Logged the message, Master
[00:46:21] <ori>	 exceptions subsiding
[00:46:57] <legoktm>	 I'll start looking into a proper revert
[00:47:10] <ori>	 thanks
[00:47:37] <ori>	 yeah, no more exceptions.
[00:51:13] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected  
[00:52:49] <grrrit-wm>	 (PS1) Ori.livneh: apache::vhost: replace generic params with literal values; remove ref to A2mod [operations/puppet] - https://gerrit.wikimedia.org/r/139307
[00:54:15] <vvv>	 I am trying to open an article and I get "[799823e8] 2014-06-13 00:53:18: Fatal exception of type MWException"
[00:54:37] <vvv>	 It seems to be an error which I cannot reproduce while logged out, so probably preferences-specific or something else?
[00:55:01] <legoktm>	 vvv: is it a math-related article?
[00:55:13] <vvv>	 It might contain math symbols
[00:55:25] <icinga-wm>	 ACKNOWLEDGEMENT - Puppet freshness on tin is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 18:36:54 UTC ori.livneh Broken by https://gerrit.wikimedia.org/r/#/c/138769/ , fixed in https://gerrit.wikimedia.org/r/#/c/139307/
[00:56:09] <vvv>	 It does, in fact, contain <math> tag
[00:56:40] <legoktm>	 ugh
[00:56:59] <vvv>	 It looks like it gets back when I switch to PNG
[00:57:02] <legoktm>	 vvv: if you go to preferences and change your math type to PNG, it should work
[00:57:23] <legoktm>	 vvv: what's the article? and what was your math preference set to before?
[00:57:48] <vvv>	 https://en.wikipedia.org/wiki/Omega -- they were set to "show plain TeX + MathJax"
[00:58:32] <vvv>	 I assume this is a known bug?
[00:58:53] <legoktm>	 yeah
[00:59:24] <vvv>	 Could you give me the bug#?
[01:00:11] <legoktm>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=65793
[01:02:12] <legoktm>	 I'll post the traceback in a sec
[01:26:33] <icinga-wm>	 PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 19:24:26 UTC  
[01:36:43] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data exceeded the critical threshold [500.0]  
[01:50:43] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]  
[02:22:50] <bblack>	 ori: the submodule thing fixed what looked like a varnish issue, or not?
[02:35:44] <logmsgbot>	 !log LocalisationUpdate completed (1.24wmf8) at 2014-06-13 02:34:41+00:00
[02:35:49] <morebots>	 Logged the message, Master
[02:38:03] <icinga-wm>	 PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:38:53] <icinga-wm>	 PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:03] <icinga-wm>	 PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:03] <icinga-wm>	 PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:03] <icinga-wm>	 PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:03] <icinga-wm>	 PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:03] <icinga-wm>	 PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:13] <icinga-wm>	 PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:13] <icinga-wm>	 PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:23] <icinga-wm>	 PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:24] <icinga-wm>	 PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:39:43] <icinga-wm>	 PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100%  
[02:40:43] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0]  
[02:43:47] <springle>	 wtf
[02:44:13] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds  
[02:44:33] <icinga-wm>	 RECOVERY - Host lvs3003 is UP: PING WARNING - Packet loss = 61%, RTA = 96.00 ms  
[02:44:33] <icinga-wm>	 RECOVERY - Host cp3013 is UP: PING WARNING - Packet loss = 61%, RTA = 95.32 ms  
[02:44:33] <icinga-wm>	 RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 95.16 ms  
[02:44:33] <icinga-wm>	 RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 95.38 ms  
[02:44:33] <icinga-wm>	 RECOVERY - Host lvs3004 is UP: PING OK - Packet loss = 0%, RTA = 96.33 ms  
[02:44:34] <icinga-wm>	 RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 95.45 ms  
[02:44:34] <icinga-wm>	 RECOVERY - Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 96.34 ms  
[02:44:35] <icinga-wm>	 RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 96.36 ms  
[02:44:43] <icinga-wm>	 RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 96.78 ms  
[02:44:53] <icinga-wm>	 RECOVERY - Host lvs3002 is UP: PING OK - Packet loss = 0%, RTA = 95.29 ms  
[02:45:03] <icinga-wm>	 RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 96.16 ms  
[02:45:03] <icinga-wm>	 RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 95.30 ms  
[02:52:07] <springle>	 whatever that was, it caused a massive 5xx spike
[02:52:53] <icinga-wm>	 PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 40.4921116667  
[02:56:53] <icinga-wm>	 RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.282772272727  
[03:00:43] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]  
[03:12:31] <logmsgbot>	 !log LocalisationUpdate completed (1.24wmf9) at 2014-06-13 03:11:28+00:00
[03:12:36] <morebots>	 Logged the message, Master
[03:14:13] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected  
[03:54:23] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 13 03:53:17 UTC 2014 (duration 53m 16s)
[03:54:28] <morebots>	 Logged the message, Master
[04:27:33] <icinga-wm>	 PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 19:24:26 UTC  
[05:18:13] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 83 data above and 9 below the confidence bounds  
[05:37:42] <grrrit-wm>	 (CR) Faidon Liambotis: "Should probably be part of Ori's puppet mediawiki refactor?" [operations/puppet] - https://gerrit.wikimedia.org/r/115133 (https://bugzilla.wikimedia.org/61090) (owner: Hashar)
[05:57:37] <grrrit-wm>	 (PS1) Springle: repool db1051 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139320
[05:58:06] <grrrit-wm>	 (CR) Springle: [C: 2] repool db1051 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139320 (owner: Springle)
[05:58:12] <grrrit-wm>	 (Merged) jenkins-bot: repool db1051 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139320 (owner: Springle)
[05:58:53] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1051 (duration: 00m 14s)
[05:58:58] <morebots>	 Logged the message, Master
[06:07:57] <grrrit-wm>	 (PS1) Springle: Depool db1062 for schema changes. Move s1 vslow/api/dump back to db1051. Raise db1061 load back to normal. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139321
[06:08:39] <grrrit-wm>	 (CR) Springle: [C: 2] Depool db1062 for schema changes. Move s1 vslow/api/dump back to db1051. Raise db1061 load back to normal. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139321 (owner: Springle)
[06:08:45] <grrrit-wm>	 (Merged) jenkins-bot: Depool db1062 for schema changes. Move s1 vslow/api/dump back to db1051. Raise db1061 load back to normal. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139321 (owner: Springle)
[06:09:33] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: depool db1062 (duration: 00m 12s)
[06:09:37] <morebots>	 Logged the message, Master
[06:19:57] <grrrit-wm>	 (PS4) Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737)
[06:23:10] <grrrit-wm>	 (PS6) Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618)
[06:49:38] <grrrit-wm>	 (PS1) Springle: repool db1062, warm up. depool db1065 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139324
[06:50:48] <grrrit-wm>	 (CR) Springle: [C: 2] repool db1062, warm up. depool db1065 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139324 (owner: Springle)
[06:50:53] <grrrit-wm>	 (Merged) jenkins-bot: repool db1062, warm up. depool db1065 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139324 (owner: Springle)
[06:51:51] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1062, depool db1065 (duration: 00m 09s)
[06:51:56] <morebots>	 Logged the message, Master
[07:08:53] <icinga-wm>	 PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 325 seconds  
[07:09:53] <icinga-wm>	 RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds  
[07:11:03] <icinga-wm>	 PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 303 seconds  
[07:11:26] <paravoid>	 Krinkle: ping?
[07:11:36] <Krinkle>	 last call
[07:11:39] <Krinkle>	 what's up
[07:11:40] <paravoid>	 hah
[07:11:48] <paravoid>	 is cvn-app4 yours?
[07:11:53] <Krinkle>	 It is
[07:11:59] <Krinkle>	 Coren told me it is eating labs NFS
[07:12:03] <paravoid>	 ah, heh
[07:12:18] <paravoid>	 40MB/s+
[07:12:23] <icinga-wm>	 PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 322 seconds  
[07:12:24] <Krinkle>	 I'm assuming that's the matter?
[07:12:28] <paravoid>	 yeah
[07:12:39] <Krinkle>	 paravoid: It's running a cluster of irc bots that feed off of irc.wikimedia.org
[07:12:43] <icinga-wm>	 PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 341 seconds  
[07:12:47] <paravoid>	 why does it need so much I/O?
[07:12:55] <Krinkle>	 the horrible creatures that I so nicely inherited from another maintainer are written in C# and use an SQLite database
[07:13:11] <Krinkle>	 and they open/close the file (which, since 2007, has grown to about 12MB) on each sql query
[07:13:19] <paravoid>	 lol
[07:13:27] <Krinkle>	 x 12 bots
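The I/O pattern described above (open the SQLite file, run one query, close it, repeated by each bot) can be contrasted with holding a single connection open. The sketch below uses Python's sqlite3 module purely for illustration; the bots themselves are written in C#, and the function and class names here are hypothetical. Reusing a connection may reduce repeated reads of the file, since a fresh connection starts with an empty page cache, but the actual saving on NFS is not measured in this log.

```python
import sqlite3


def query_reopening(path, sql, params=()):
    """Open the file, run one query, close again: the pattern described above."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


class ReusedConnection:
    """Hold one connection open and run every query through it."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)

    def query(self, sql, params=()):
        # Reuses the already-open connection and its page cache.
        return self._conn.execute(sql, params).fetchall()

    def close(self):
        self._conn.close()
```

Both variants return the same rows; they differ only in how often the database file is opened and closed.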
[07:13:28] <paravoid>	 does it have to be in NFS?
[07:13:33] <paravoid>	 can't it be local?
[07:13:43] <icinga-wm>	 RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay -1 seconds  
[07:13:54] <Krinkle>	 well, so as horrible as it is even off NFS, it doesn't have to be on NFS
[07:13:58] <Krinkle>	 I just didn't know that /data/project was NFS
[07:14:03] <icinga-wm>	 RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds  
[07:14:15] <Krinkle>	 all I knew was 1) /data/project is persistent, 2) anything on the instance is temporary
[07:14:23] <icinga-wm>	 RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds  
[07:14:24] <paravoid>	 yeah that's partially true
[07:14:34] <Krinkle>	 I figured it'd all be virtual, but I guess because there is more than one labs virt host, it is real NFS underneath.
[07:14:47] <Krinkle>	 and because I share them between instances and don't want to lose data, I have them on NFS for now
[07:14:57] <Krinkle>	 I talked with Coren and I'll move them to local store, no worries.
[07:15:08] <paravoid>	 any ETA on that?
[07:15:28] <Krinkle>	 I'll set up a cron to create read-only copies for the API servers (which are serving from cvn.wmflabs.org/api.php) to read from.
[07:16:04] <Krinkle>	 those would still be read live from NFS, but Coren doesn't think that'll be an issue. And if needed I can set up another cron that will pull it from there
[07:16:21] <Krinkle>	 Well, last week I've been shifting from one fix to the next in wmf work. This is volunteer time.
[07:16:34] <Krinkle>	 I expect to get to it this monday.
[07:16:40] <paravoid>	 it's killing labs
[07:16:51] <Krinkle>	 Coren said he can bottleneck it
[07:16:56] <paravoid>	 yeah but he didn't
[07:17:00] <paravoid>	 I might try to force the port to 100mbps
[07:18:13] <grrrit-wm>	 (PS1) Springle: repool db1065, warm up. depool db1066 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139325
[07:18:20] <Krinkle>	 This is the first time since 2010 that the cvn bots have run on a proper server and that they're all running from the same infrastructure and actually maintained to some degree. I've been hoping to get graphs on memory, network, cpu and disk stats
[07:18:23] <Krinkle>	 but ganglia is being flaky
[07:18:45] <Krinkle>	 I'm kind of running blind (short of running top/ps all the time on individual nodes)
[07:19:08] <grrrit-wm>	 (CR) Springle: [C: 2] repool db1065, warm up. depool db1066 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139325 (owner: Springle)
[07:19:13] <Krinkle>	 Actually, why don't I give it a shot now
[07:19:14] <grrrit-wm>	 (Merged) jenkins-bot: repool db1065, warm up. depool db1066 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139325 (owner: Springle)
[07:19:25] <paravoid>	 is it very hard to just restart the bots with a non-/data/project path?
[07:19:46] <Krinkle>	 There's quite a lot of infrastructure, yes.
[07:19:54] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1065, depool db1066 (duration: 00m 13s)
[07:19:58] <morebots>	 Logged the message, Master
[07:20:01] <Krinkle>	 especially because it's interconnected.
[07:20:33] <Krinkle>	 the bots communicate and there is a web API that should be near real-time because people click links on the irc feed and then interact further on-wiki where javascript fetches more info from the API. It should be in sync.
[07:24:10] <Krinkle>	 paravoid: I enabled labs role mnt by accident yesterday and want to switch to srv
[07:24:21] <Krinkle>	 There's nothing on it yet, but it seems it doesn't want to switch
[07:24:42] <Krinkle>	 I disabled the role but the mount remains (as puppet naturally doesn't unmount it)
[07:25:41] <Krinkle>	 I enabled the other role in a separate run, but it didn't come up. I guess it doesn't automatically unmount when trying to mount elsewhere
[07:28:06] <Krinkle>	 Hm.. I guess I can just sudo umount /mnt, and then rerun puppet
[07:28:19] <_joe_>	 Krinkle: not if you don't tell it to puppet, and yes that would solve it
[07:28:33] <icinga-wm>	 PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 19:24:26 UTC  
[07:28:39] <Krinkle>	 that felt scary and low level linuxy.
[07:28:42] <_joe_>	 (puppet does not manage removed properties at all)
[07:28:43] <Krinkle>	 never touched stuff like that
[07:28:51] <_joe_>	 mount(1)? 
[07:28:57] <_joe_>	 wow.
[07:29:19] <Krinkle>	 yeah, I know that about puppet (the infamous ensure=>absent to stay forever in our source code)
[07:31:10] <grrrit-wm>	 (PS1) Withoutaname: Enable Echo on Wikimedia wikis by default [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139326
[07:33:39] <Krinkle>	 _joe_: Hm.. it seems puppet is stuck on cvn-app4
[07:33:49] <Krinkle>	 the lock file has been there since 1:30 UTC yesterday
[07:34:09] <Krinkle>	 (I tried to run puppet manually and it said the lock file exists)
[07:34:13] <_joe_>	 Krinkle: labs? do I have access? if so, let me take a look
[07:34:17] <Krinkle>	 checking syslog tells me the automated one was skipped as well
[07:34:24] <Krinkle>	 Jun 13 07:23:01 cvn-app5 CRON[27536]: (root) CMD (timeout  -k 300 1800 puppet agent --onetime --verbose --no-daemonize --splay --splaylimit 60 --show_diff >> /var/log/puppet.log 2>&1)
[07:34:25] <Krinkle>	 Jun 13 07:23:02 cvn-app5 puppet-agent[27538]: Run of Puppet configuration client already in progress; skipping  (/var/lib/puppet/state/agent_catalog_run.lock exists)
[07:34:28] <Krinkle>	 yeah, labs, sorry.
[07:34:41] <Krinkle>	 app5, not app4.
[07:34:44] <_joe_>	 Krinkle: I'd say a run of puppet died
[07:34:50] <_joe_>	 and left the lock there
[07:35:00] <_joe_>	 ps -ef | fgrep puppet
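The diagnosis above is that a Puppet run died and left its lock file behind, and the suggested check is to look for a live puppet process. A minimal sketch of automating that check follows. It assumes the lock file contains the PID of the agent run that created it, which is not stated in this log; the function names are hypothetical, and the lock path is the one shown in the syslog line.

```python
import os

# Path taken from the syslog line quoted in the log.
LOCK_PATH = "/var/lib/puppet/state/agent_catalog_run.lock"


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_stale(lock_path=LOCK_PATH, pid_alive=pid_is_alive):
    """Return True if the lock exists but its recorded PID is not running.

    Assumption: the file holds a single integer PID. A file with other
    content raises ValueError instead of being guessed at.
    """
    if not os.path.exists(lock_path):
        return False  # no lock at all, so nothing is stale
    with open(lock_path) as handle:
        pid = int(handle.read().strip())
    return not pid_alive(pid)
```

If this reports a stale lock, the manual equivalent is the `ps -ef | fgrep puppet` check above, followed by removing the lock only once no agent process is found.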
[07:35:13] <_joe_>	 and, we may continue this in query maybe
[07:41:33] <icinga-wm>	 PROBLEM - MySQL Slave Running on db1021 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table dewiki._page_new doesnt exist on query. Default data  
[07:41:53] <springle>	 grr
[07:42:33] <icinga-wm>	 RECOVERY - MySQL Slave Running on db1021 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:  
[07:46:51] <grrrit-wm>	 (PS1) Springle: repool db1066, warm up. depool db1070 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139327
[07:47:32] <grrrit-wm>	 (PS1) Faidon Liambotis: mail: use real booleans rather than quoted [operations/puppet] - https://gerrit.wikimedia.org/r/139328
[07:47:50] <grrrit-wm>	 (CR) Springle: [C: 2] repool db1066, warm up. depool db1070 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139327 (owner: Springle)
[07:48:52] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1066, depool db1070 (duration: 00m 07s)
[07:48:58] <morebots>	 Logged the message, Master
[07:49:58] <Krinkle>	 paravoid: Can you see how cvn-apache5 does in comparison to cvn-app4? (assume the other cvn- instances are not notable in their NFS I/O)
[07:50:30] <paravoid>	 Krinkle: apt-get install iotop; iotop ;)
[07:52:22] <grrrit-wm>	 (PS2) Faidon Liambotis: mail: use real booleans rather than quoted [operations/puppet] - https://gerrit.wikimedia.org/r/139328
[07:52:28] <Krinkle>	 paravoid: thx
[07:52:48] <Krinkle>	 cvn-app4 shows 10 to 40 MB/s for each of the bots
[07:53:06] <Krinkle>	 cvn-apache shows mostly 0 and spikes of 5-20 MB/s presumably during a web request
[07:53:18] <Krinkle>	 so I guess that one open/closes on each query as well
[07:53:20] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2] mail: use real booleans rather than quoted [operations/puppet] - https://gerrit.wikimedia.org/r/139328 (owner: Faidon Liambotis)
[07:53:23] <Krinkle>	 I'll keep them both local
[07:53:31] <paravoid>	 nod
[07:53:32] <Krinkle>	 push from app to nfs, and pull from apache
[08:01:51] <grrrit-wm>	 (PS1) Springle: Remove db1011 from s4; no longer in the shard. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139329
[08:02:11] <grrrit-wm>	 (CR) Springle: [C: 2] Remove db1011 from s4; no longer in the shard. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139329 (owner: Springle)
[08:02:17] <grrrit-wm>	 (Merged) jenkins-bot: Remove db1011 from s4; no longer in the shard. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139329 (owner: Springle)
[08:05:14] <grrrit-wm>	 (PS1) Springle: repool db1070, warm up. depool db1071 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139330
[08:05:39] <grrrit-wm>	 (CR) Springle: [C: 2] repool db1070, warm up. depool db1071 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139330 (owner: Springle)
[08:05:45] <grrrit-wm>	 (Merged) jenkins-bot: repool db1070, warm up. depool db1071 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139330 (owner: Springle)
[08:06:13] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, depool db1071 (duration: 00m 06s)
[08:06:19] <morebots>	 Logged the message, Master
[08:10:20] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, depool db1071 (duration: 00m 12s)
[08:10:24] <morebots>	 Logged the message, Master
[08:17:40] <akosiaris>	 aptitude why mpt-status
[08:17:40] <akosiaris>	 i   nagios-plugins-extra   Suggests nagios-plugins-contrib
[08:17:40] <akosiaris>	 p   nagios-plugins-contrib Suggests mpt-status            
[08:17:53] <akosiaris>	 somehow I doubt this makes sense in a labs machine
[08:18:03] <akosiaris>	 well VM to be more precise
[08:19:12] <akosiaris>	 disregard
[08:25:15] <grrrit-wm>	 (PS1) Alexandros Kosiaris: Minor lint base::monitoring::host [operations/puppet] - https://gerrit.wikimedia.org/r/139332
[08:25:19] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] Removed reference to unused -v option in jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/139208 (owner: Aaron Schulz)
[08:26:53] <grrrit-wm>	 (PS2) Filippo Giunchedi: Removed unused "forkcount" stuff from jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/139191 (owner: Aaron Schulz)
[08:27:04] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] Removed unused "forkcount" stuff from jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/139191 (owner: Aaron Schulz)
[08:27:14] <_joe_>	 godog: you're brave enough to tackle jobs-loop?
[08:27:18] <_joe_>	 wow.
[08:28:29] <godog>	 not my work :)) merely an observer
[08:28:41] <grrrit-wm>	 (PS4) Nikerabbit: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095
[08:30:05] <godog>	 it'd make for some nice entries for http://seeninproduction.tumblr.com though