[00:00:12] *nod*
[00:00:36] so i ran sync-common manually on mw1151 so it's OK now, but no one should deploy until the patch you +1'd above is merged and puppet runs
[00:02:48] (CR) Dzahn: [C: 2] "doing this now because otherwise deployments break, i can confirm and show in git log how mortals were included on appservers before" [operations/puppet] - https://gerrit.wikimedia.org/r/139089 (owner: Giuseppe Lavagetto)
[00:03:02] (PS2) Dzahn: mediawiki: re-include deployment users [operations/puppet] - https://gerrit.wikimedia.org/r/139089 (owner: Giuseppe Lavagetto)
[00:03:13] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[00:03:45] blergh, what?
[00:03:51] re: error rate
[00:04:27] i certainly hope that this is wrong: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)
[00:04:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0]
[00:04:46] no, exception spike too: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1
[00:05:30] springle: are you doing anything with the databases?
[00:08:28] OK, the math extension is still broken
[00:08:50] i'm going to revert it
[00:10:34] i'm also going to run scap in the background because i suspect things are in an inconsistent state because of the issue you flagged above mutante
[00:10:42] ori: +1
[00:10:55] ori: nope
[00:10:57] ori: let me merge that fix
[00:11:03] the one adding mortals
[00:11:22] i have root, so i should be able to scap
[00:11:37] but yeah, imo it should be merged
[00:11:50] (CR) Dzahn: [C: 2] mediawiki: re-include deployment users [operations/puppet] - https://gerrit.wikimedia.org/r/139089 (owner: Giuseppe Lavagetto)
[00:11:53] !log ori Started scap: fix any lingering inconsistencies in the state of the app servers (see https://gerrit.wikimedia.org/r/139089)
[00:11:57] Logged the message, Master
[00:12:18] now, math
[00:13:58] ori: i see users being created on random host (1209)
[00:16:04] (CR) Dzahn: [C: -1] "after Change-Id: I49bcbbdfbbc , then it's fine, but marking -1 for clarity" [operations/puppet] - https://gerrit.wikimedia.org/r/139072 (owner: Matanya)
[00:16:51] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22 looks better again
[00:17:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[00:19:07] [fluorine:/a/mw-log] $ grep -Po 'Exception from line [^:]+' exception.log | sort | uniq -c | sort -rn
[00:19:07] 3027 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf8/includes/db/Database.php
[00:19:07] 1132 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf7/includes/db/Database.php
[00:19:09] 394 Exception from line 50 of /usr/local/apache/common-local/php-1.24wmf8/extensions/Math/MathSource.php
[00:22:13] exceptions have subsided too
[00:22:51] ori: yes, the graph looks normal again
[00:22:55] thx
[00:23:29] thank you. i still suspect the database errors have to do with the math extension, and i worry that they'll spike again
[00:23:41] but i'm a bit burnt out and need a break
[00:24:07] https://bugzilla.wikimedia.org/show_bug.cgi?id=66492#c9
[00:24:33] (wild guess)
[00:25:27] yeah. https://gerrit.wikimedia.org/r/#/c/139068/ didn't help, evidently
[00:25:48] "To be honest I don't understand why this table has to be created manually. "
[00:27:18] https://bugzilla.wikimedia.org/show_bug.cgi?id=65793 wtf.
[00:27:25] should we restore https://gerrit.wikimedia.org/r/#/c/138993/ ?
[00:27:33] and do the temp. disable thing?
[00:29:52] well, $wgMathValidModes isn't set to MW_MATH_MATHML presently
[00:29:56] or doesn't include it, rather
[00:30:06] so i don't see how that patch would have an effect, but i could use a second pair of eyes
[00:32:51] change I75f24cb762609d6728247e3758fcc18f2ebfc6e6
[00:33:04] "Invalid settings for math rendering mode will default to MathMathML."
[00:33:07] fun
[00:33:40] uhm...
[00:34:12] wait https://gerrit.wikimedia.org/r/#/c/138572/16/MathRenderer.php,cm
[00:34:19] change the default: case
[00:34:27] default should be PNG
[00:35:23] i'd prefer to revert, but i'm having a hard time identifying a safe point in the past
[00:35:33] so your suggestion may be the best one, legoktm
[00:35:39] could you submit a patch?
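Editor's sketch: the exception tally run on fluorine at 00:19:07 (`grep -Po ... | sort | uniq -c | sort -rn`) could be approximated in Python roughly as follows. The regex is the one from the log, with `\n` added to the excluded characters so a match never runs past the end of a line (grep works line by line and gets this for free). The function name `tally_exceptions` and the sample lines are hypothetical; the real layout of `exception.log` is not shown in the log.

```python
import re
from collections import Counter

# Pattern from the grep -Po call in the log; \n is excluded so a match
# stays within one line, as it would under grep.
PATTERN = re.compile(r"Exception from line [^:\n]+")


def tally_exceptions(lines):
    """Count each distinct 'Exception from line N of FILE' match.

    Returns (text, count) pairs ordered by descending count, which is
    the Python counterpart of `sort | uniq -c | sort -rn`.
    """
    counts = Counter()
    for line in lines:
        counts.update(PATTERN.findall(line))
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "a Exception from line 1168 of /x/Database.php: query failed",
        "b Exception from line 1168 of /x/Database.php: query failed",
        "c Exception from line 50 of /x/MathSource.php: bad input",
    ]
    for text, n in tally_exceptions(sample):
        print(f"{n:7d} {text}")
```

Passing an open file object as `lines` streams the log without loading it all into memory.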
[00:35:45] i'll continue looking meanwhile
[00:36:12] on it
[00:36:29] also I hope that code wasn't contacting a wmflabs domain from prod
[00:36:54] ori: [05:36:46 PM] (PS1) Legoktm: Set default fallback rendering option to MW_MATH_PNG [extensions/Math] - https://gerrit.wikimedia.org/r/139301
[00:37:10] for some reason MathTexvc is marked as deprecated
[00:37:13] * @deprecated will be deleted in one of the next versions without further notice
[00:37:16] :|
[00:37:56] i believe the "without further notice" piece
[00:38:14] hm
[00:38:19] I'm going to have to revert the tests too
[00:38:26] and yes it does
[00:38:30] connect to labs i mean
[00:38:33] PROBLEM - Puppet freshness on tin is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 18:36:54 UTC
[00:38:37] i think it might be best to revert to 1bb3bfa3b5656af5ee57784578996e9513600a4d
[00:38:52] !log ori Finished scap: fix any lingering inconsistencies in the state of the app servers (see https://gerrit.wikimedia.org/r/139089) (duration: 26m 59s)
[00:38:54] on tin.. sigh
[00:38:57] Logged the message, Master
[00:39:06] mutante: tin's my apache change from earlier, i ack'd but it must have expired
[00:39:07] i bet that is related to the restoring mortals
[00:39:10] uh, https://gerrit.wikimedia.org/r/#/c/137549/2 how will that help?
[00:39:11] oh, it's not
[00:39:17] ori: alright
[00:39:18] oh
[00:39:20] revert to there
[00:39:35] ori: yes, that's probably the safest choice
[00:40:38] Sponsored by https://www.xsede.org/ heh
[00:41:51] !log removed Physikerwelt and Frédéric Wang from extension-Math group in Gerrit pending further inquiry into recent changes
[00:41:56] Logged the message, Master
[00:44:45] !log ori Synchronized php-1.24wmf9/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 06s)
[00:44:49] Logged the message, Master
[00:45:48] !log ori Synchronized php-1.24wmf8/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 05s)
[00:45:52] Logged the message, Master
[00:46:21] exceptions subsiding
[00:46:57] I'll start looking into a proper revert
[00:47:10] thanks
[00:47:37] yeah, no more exceptions.
[00:51:13] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[00:52:49] (PS1) Ori.livneh: apache::vhost: replace generic params with literal values; remove ref to A2mod [operations/puppet] - https://gerrit.wikimedia.org/r/139307
[00:54:15] I am trying to open an article and I get "[799823e8] 2014-06-13 00:53:18: Fatal exception of type MWException"
[00:54:37] It seems to be an error which I cannot reproduce while logged out, so probably preferences-specific or something else?
[00:55:01] vvv: is it a math-related article?
[00:55:13] It might contain math symbols
[00:55:25] ACKNOWLEDGEMENT - Puppet freshness on tin is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 18:36:54 UTC ori.livneh Broken by https://gerrit.wikimedia.org/r/#/c/138769/ , fixed in https://gerrit.wikimedia.org/r/#/c/139307/
[00:56:09] It does, in fact, contain <math> tag
[00:56:40] ugh
[00:56:59] It looks like it gets back when I switch to PNG
[00:57:02] vvv: if you go to preferences and change your math type to PNG, it should work
[00:57:23] vvv: what's the article? and what was your math preference set to before?
[00:57:48] https://en.wikipedia.org/wiki/Omega -- they were set to "show plain TeX + MathJax"
[00:58:32] I assume this is a known bug?
[00:58:53] yeah
[00:59:24] Could you give me the bug#?
[01:00:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=65793
[01:02:12] I'll post the traceback in a sec
[01:26:33] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 19:24:26 UTC
[01:36:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data exceeded the critical threshold [500.0]
[01:50:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[02:22:50] ori: the submodule thing fixed what looked like a varnish issue, or not?
[02:35:44] !log LocalisationUpdate completed (1.24wmf8) at 2014-06-13 02:34:41+00:00
[02:35:49] Logged the message, Master
[02:38:03] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100%
[02:38:53] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:03] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:03] PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:03] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:03] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:03] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:13] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:13] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:23] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:24] PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:39:43] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:40:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0]
[02:43:47] wtf
[02:44:13] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[02:44:33] RECOVERY - Host lvs3003 is UP: PING WARNING - Packet loss = 61%, RTA = 96.00 ms
[02:44:33] RECOVERY - Host cp3013 is UP: PING WARNING - Packet loss = 61%, RTA = 95.32 ms
[02:44:33] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 95.16 ms
[02:44:33] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 95.38 ms
[02:44:33] RECOVERY - Host lvs3004 is UP: PING OK - Packet loss = 0%, RTA = 96.33 ms
[02:44:34] RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 95.45 ms
[02:44:34] RECOVERY - Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 96.34 ms
[02:44:35] RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 96.36 ms
[02:44:43] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 96.78 ms
[02:44:53] RECOVERY - Host lvs3002 is UP: PING OK - Packet loss = 0%, RTA = 95.29 ms
[02:45:03] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 96.16 ms
[02:45:03] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 95.30 ms
[02:52:07] whatever that was, it caused a massive 5xx spike
[02:52:53] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 40.4921116667
[02:56:53] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.282772272727
[03:00:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[03:12:31] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-13 03:11:28+00:00
[03:12:36] Logged the message, Master
[03:14:13] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[03:54:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 13 03:53:17 UTC 2014 (duration 53m 16s)
[03:54:28] Logged the message, Master
[04:27:33] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 19:24:26 UTC
[05:18:13] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 83 data above and 9 below the confidence bounds
[05:37:42] (CR) Faidon Liambotis: "Should probably be part of Ori's puppet mediawiki refactor?" [operations/puppet] - https://gerrit.wikimedia.org/r/115133 (https://bugzilla.wikimedia.org/61090) (owner: Hashar)
[05:57:37] (PS1) Springle: repool db1051 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139320
[05:58:06] (CR) Springle: [C: 2] repool db1051 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139320 (owner: Springle)
[05:58:12] (Merged) jenkins-bot: repool db1051 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139320 (owner: Springle)
[05:58:53] !log springle Synchronized wmf-config/db-eqiad.php: repool db1051 (duration: 00m 14s)
[05:58:58] Logged the message, Master
[06:07:57] (PS1) Springle: Depool db1062 for schema changes. Move s1 vslow/api/dump back to db1051. Raise db1061 load back to normal. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139321
[06:08:39] (CR) Springle: [C: 2] Depool db1062 for schema changes. Move s1 vslow/api/dump back to db1051. Raise db1061 load back to normal. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139321 (owner: Springle)
[06:08:45] (Merged) jenkins-bot: Depool db1062 for schema changes. Move s1 vslow/api/dump back to db1051. Raise db1061 load back to normal. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139321 (owner: Springle)
[06:09:33] !log springle Synchronized wmf-config/db-eqiad.php: depool db1062 (duration: 00m 12s)
[06:09:37] Logged the message, Master
[06:19:57] (PS4) Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737)
[06:23:10] (PS6) Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618)
[06:49:38] (PS1) Springle: repool db1062, warm up. depool db1065 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139324
[06:50:48] (CR) Springle: [C: 2] repool db1062, warm up. depool db1065 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139324 (owner: Springle)
[06:50:53] (Merged) jenkins-bot: repool db1062, warm up. depool db1065 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139324 (owner: Springle)
[06:51:51] !log springle Synchronized wmf-config/db-eqiad.php: repool db1062, depool db1065 (duration: 00m 09s)
[06:51:56] Logged the message, Master
[07:08:53] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 325 seconds
[07:09:53] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds
[07:11:03] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 303 seconds
[07:11:26] Krinkle: ping?
[07:11:36] last call
[07:11:39] what's up
[07:11:40] hah
[07:11:48] is cvn-app4 yours?
[07:11:53] It is
[07:11:59] Coren told me it is eating labs NFS
[07:12:03] ah, heh
[07:12:18] 40MB/s+
[07:12:23] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 322 seconds
[07:12:24] I'm assuming that's the matter?
[07:12:28] yeah
[07:12:39] paravoid: It's running a cluster of irc bots that feed off of irc.wikimedia.org
[07:12:43] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 341 seconds
[07:12:47] why does it need so much I/O?
[07:12:55] the horrible creatures that I so nicely inherited from another maintainer are written in C# and use an SQLite database
[07:13:11] and it opens/closes the file (which, since 2007, has grown to about 12MB) on each sql query
[07:13:19] lol
[07:13:27] x 12 bots
[07:13:28] does it have to be in NFS?
[07:13:33] can't it be local?
[07:13:43] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay -1 seconds
[07:13:54] well, so as horrible as it is even off NFS, it doesn't have to be on NFS
[07:13:58] I just didn't know that /data/project was NFS
[07:14:03] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[07:14:15] all I knew was 1) /data/project is persistent, 2) anything on the instance is temporary
[07:14:23] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds
[07:14:24] yeah that's partially true
[07:14:34] I figured it'd all be virtual, but I guess because there is more than one labs virt host, it is real NFS underneath.
[07:14:47] and because I share them between instances and don't want to lose data, I have them on NFS for now
[07:14:57] I talked with Coren and I'll move them to local store, no worries.
[07:15:08] any ETA on that?
[07:15:28] I'll set up a cron to create read-only copies for the API servers (which are serving from cvn.wmflabs.org/api.php) to read from.
[07:16:04] those would still be read live from NFS, but Coren doesn't think that'll be an issue. And if needed I can set up another cron that will pull it from there
[07:16:21] Well, last week I've been shifting from one fix to the next in wmf work. This is volunteer time.
[07:16:34] I expect to get to it this monday.
[07:16:40] it's killing labs
[07:16:51] Coren said he can bottleneck it
[07:16:56] yeah but he didn't
[07:17:00] I might try to force the port to 100mbps
[07:18:13] (PS1) Springle: repool db1065, warm up. depool db1066 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139325
[07:18:20] This is the first time since 2010 that the cvn bots have run on a proper server and that they're all running from the same infrastructure and actually maintained to some degree. I've been hoping to get graphs on memory, network, cpu and disk stats
[07:18:23] but ganglia is being flaky
[07:18:45] I'm kind of running blind (short of running top/ps all the time on individual nodes)
[07:19:08] (CR) Springle: [C: 2] repool db1065, warm up. depool db1066 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139325 (owner: Springle)
[07:19:13] Actually, why don't I give it a shot now
[07:19:14] (Merged) jenkins-bot: repool db1065, warm up. depool db1066 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139325 (owner: Springle)
[07:19:25] is it very hard to just restart the bots with a non-/data/project path?
[07:19:46] There's quite a lot of infrastructure, yes.
[07:19:54] !log springle Synchronized wmf-config/db-eqiad.php: repool db1065, depool db1066 (duration: 00m 13s)
[07:19:58] Logged the message, Master
[07:20:01] especially because it's interconnected.
[07:20:33] the bots communicate and there is a web API that should be near real-time because people click links on the irc feed and then interact further on-wiki where javascript fetches more info from the API. It should be in sync.
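Editor's sketch: the plan discussed above (keep the SQLite file on local disk, and have a cron job publish read-only copies for the API servers) might look roughly like this in Python. This is illustrative only: the bots in the log are written in C#, and the function name `publish_snapshot`, the file name, and the directory arguments are hypothetical. It relies on `sqlite3.Connection.backup` (Python 3.7+) to take a consistent copy and on `os.replace` to swap the new copy into place with a single rename.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def publish_snapshot(local_db, shared_dir, name="snapshot.sqlite"):
    """Copy a consistent snapshot of local_db into shared_dir.

    The copy is written to a temporary file in the same directory and
    then renamed over the final name, so a reader never opens a
    half-written file.
    """
    final_path = os.path.join(shared_dir, name)
    fd, tmp_path = tempfile.mkstemp(dir=shared_dir, suffix=".tmp")
    os.close(fd)
    try:
        with closing(sqlite3.connect(local_db)) as src, \
                closing(sqlite3.connect(tmp_path)) as dst:
            # Uses SQLite's online backup API to copy the whole database.
            src.backup(dst)
        # mkstemp creates the file as 0600; make it world-readable so a
        # different user (e.g. the web server) can read the snapshot.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

The design choice here is that the hot path (every bot query) only ever touches local disk, while the shared filesystem sees one sequential write per cron interval. How often that needs to run depends on how fresh the API has to be, which the log leaves open.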
[07:24:10] paravoid: I enabled labs role mnt by accident yesterday and want to switch to srv
[07:24:21] There's nothing on it yet, but it seems it doesn't want to switch
[07:24:42] I disabled the role but the mount remains (as puppet naturally doesn't unmount it)
[07:25:41] I enabled the other role in a separate run, but it didn't come up. I guess it doesn't automatically unmount when trying to mount elsewhere
[07:28:06] Hm.. I guess I can just sudo umount /mnt, and then rerun puppet
[07:28:19] <_joe_> Krinkle: not if you don't tell it to puppet, and yes that would solve it
[07:28:33] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Thu 12 Jun 2014 19:24:26 UTC
[07:28:39] that felt scary and low level linuxy.
[07:28:42] <_joe_> (puppet does not manage removed properties at all)
[07:28:43] never touched stuff like that
[07:28:51] <_joe_> mount(1)?
[07:28:57] <_joe_> wow.
[07:29:19] yeah, I know that about puppet (the infamous ensure=>absent to stay forever in our source code)
[07:31:10] (PS1) Withoutaname: Enable Echo on Wikimedia wikis by default [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139326
[07:33:39] _joe_: Hm.. it seems puppet is stuck on cvn-app4
[07:33:49] the lock file has been there since 1:30 UTC yesterday
[07:34:09] (I tried to run puppet manually and it said lock file exists)
[07:34:13] <_joe_> Krinkle: labs? do I have access? if so, let me take a look
[07:34:17] checking syslog tells me the automated one was skipped as well
[07:34:24] Jun 13 07:23:01 cvn-app5 CRON[27536]: (root) CMD (timeout -k 300 1800 puppet agent --onetime --verbose --no-daemonize --splay --splaylimit 60 --show_diff >> /var/log/puppet.log 2>&1)
[07:34:25] Jun 13 07:23:02 cvn-app5 puppet-agent[27538]: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)
[07:34:28] yeah, labs, sorry.
[07:34:41] app5, not app4.
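Editor's sketch: the cron line quoted above runs the agent under `timeout -k 300 1800`, so a cron-started run should not survive much longer than 1800 + 300 seconds. A lock file older than that is a strong hint that the run which created it is gone. The Python below is a minimal illustration under that assumption. Judging staleness by file age is a heuristic chosen for this example, not something Puppet itself does, and the function name `lock_looks_stale` is hypothetical. A manually started run is not bounded by the cron timeout, so the check says nothing certain about those.

```python
import os
import time

# Taken from the cron line in the log: `timeout -k 300 1800 puppet agent ...`
RUN_TIMEOUT_S = 1800  # timeout sends SIGTERM after this many seconds
KILL_GRACE_S = 300    # and SIGKILL this many seconds after that


def lock_looks_stale(lock_path, now=None):
    """Return True if lock_path exists and is older than the longest
    time a cron-started agent run could have been alive."""
    try:
        mtime = os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False  # no lock at all, so nothing to be stale
    if now is None:
        now = time.time()
    return (now - mtime) > (RUN_TIMEOUT_S + KILL_GRACE_S)
```

With the path from the syslog message this would be called as `lock_looks_stale("/var/lib/puppet/state/agent_catalog_run.lock")`. A True result only suggests a dead run; confirming that no puppet process is still alive before removing the lock remains the prudent step.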
[07:34:44] <_joe_> Krinkle: I'd say a run of puppet died
[07:34:50] <_joe_> and left the lock there
[07:35:00] <_joe_> ps -ef | fgrep puppet
[07:35:13] <_joe_> and, we may continue this in query maybe
[07:41:33] PROBLEM - MySQL Slave Running on db1021 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table dewiki._page_new doesnt exist on query. Default data
[07:41:53] grr
[07:42:33] RECOVERY - MySQL Slave Running on db1021 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[07:46:51] (PS1) Springle: repool db1066, warm up. depool db1070 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139327
[07:47:32] (PS1) Faidon Liambotis: mail: use real booleans rather than quoted [operations/puppet] - https://gerrit.wikimedia.org/r/139328
[07:47:50] (CR) Springle: [C: 2] repool db1066, warm up. depool db1070 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139327 (owner: Springle)
[07:48:52] !log springle Synchronized wmf-config/db-eqiad.php: repool db1066, depool db1070 (duration: 00m 07s)
[07:48:58] Logged the message, Master
[07:49:58] paravoid: Can you see how cvn-apache5 does in comparison to cvn-app4? (assume the other cvn- instances are not notable in their NFS I/O)
[07:50:30] Krinkle: apt-get install iotop; iotop ;)
[07:52:22] (PS2) Faidon Liambotis: mail: use real booleans rather than quoted [operations/puppet] - https://gerrit.wikimedia.org/r/139328
[07:52:28] paravoid: thx
[07:52:48] cvn-app4 shows 10 to 40 MB/s for each of the bots
[07:53:06] cvn-apache shows mostly 0 and spikes of 5-20 MB/s presumably during a web request
[07:53:18] so I guess that one opens/closes on each query as well
[07:53:20] (CR) Faidon Liambotis: [C: 2] mail: use real booleans rather than quoted [operations/puppet] - https://gerrit.wikimedia.org/r/139328 (owner: Faidon Liambotis)
[07:53:23] I'll keep them both local
[07:53:31] nod
[07:53:32] push from app to nfs, and pull from apache
[08:01:51] (PS1) Springle: Remove db1011 from s4; no longer in the shard. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139329
[08:02:11] (CR) Springle: [C: 2] Remove db1011 from s4; no longer in the shard. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139329 (owner: Springle)
[08:02:17] (Merged) jenkins-bot: Remove db1011 from s4; no longer in the shard. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139329 (owner: Springle)
[08:05:14] (PS1) Springle: repool db1070, warm up. depool db1071 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139330
[08:05:39] (CR) Springle: [C: 2] repool db1070, warm up. depool db1071 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139330 (owner: Springle)
[08:05:45] (Merged) jenkins-bot: repool db1070, warm up. depool db1071 for schema changes. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139330 (owner: Springle)
[08:06:13] !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, depool db1071 (duration: 00m 06s)
[08:06:19] Logged the message, Master
[08:10:20] !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, depool db1071 (duration: 00m 12s)
[08:10:24] Logged the message, Master
[08:17:40] aptitude why mpt-status
[08:17:40] i nagios-plugins-extra Suggests nagios-plugins-contrib
[08:17:40] p nagios-plugins-contrib Suggests mpt-status
[08:17:53] somehow I doubt this makes sense in a labs machine
[08:18:03] well VM to be more precise
[08:19:12] disregard
[08:25:15] (PS1) Alexandros Kosiaris: Minor lint base::monitoring::host [operations/puppet] - https://gerrit.wikimedia.org/r/139332
[08:25:19] (CR) Filippo Giunchedi: [C: 2 V: 2] Removed reference to unused -v option in jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/139208 (owner: Aaron Schulz)
[08:26:53] (PS2) Filippo Giunchedi: Removed unused "forkcount" stuff from jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/139191 (owner: Aaron Schulz)
[08:27:04] (CR) Filippo Giunchedi: [C: 2 V: 2] Removed unused "forkcount" stuff from jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/139191 (owner: Aaron Schulz)
[08:27:14] <_joe_> godog: you're brave enough to tackle jobs-loop?
[08:27:18] <_joe_> wow.
[08:28:29] not my work :)) merely an observer
[08:28:41] (PS4) Nikerabbit: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095
[08:30:05] it'd make for some nice entries for http://seeninproduction.tumblr.com though