[00:01:55] (03CR) 10Dzahn: "Tim, do you have any idea why apt doesn't like to read a gzipped Packages.gz anymore (it's the one that is being added in here). We had to" [puppet] - 10https://gerrit.wikimedia.org/r/145573 (owner: 10Tim Landscheidt) [00:03:43] (03PS4) 10BBlack: SNI nginx ssl on varnish boxes at all sites [puppet] - 10https://gerrit.wikimedia.org/r/161180 [00:11:53] (03CR) 10BBlack: "Very basic testing with curl seems to work for me now." [puppet] - 10https://gerrit.wikimedia.org/r/161180 (owner: 10BBlack) [00:14:36] (03CR) 10BBlack: "star/unified backwards in example command descriptions above :)" [puppet] - 10https://gerrit.wikimedia.org/r/161180 (owner: 10BBlack) [00:20:32] how goes the SWAT? [00:21:13] ori, RoanKattouw: is it ok if I deploy OCG? [00:21:20] cscott: yes [00:21:21] go for it [00:24:44] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures [00:34:07] (03PS1) 10Ejegg: Change donate cookie to 250 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161402 [00:35:33] does anyone know what happened to deployment-graphite.eqiad.wmflabs ? it's "Host deployment-graphite.eqiad.wmflabs not found: 3(NXDOMAIN)" on deployment-pdf01 (beta/labs) [00:43:33] !log updated OCG to version ce16f7adb60d7c77409e2e11ba0e5d6cce6955d5 [00:43:38] Logged the message, Master [00:45:34] (03PS1) 10Dzahn: add redirects for wikimania.org/.com [puppet] - 10https://gerrit.wikimedia.org/r/161405 [00:46:08] ocg looks good [00:51:42] ori, congrats to getting the beta feature out :) [00:51:47] fyi, https://en.wikipedia.org/w/index.php?title=MediaWiki:Tag-HHVM-description&oldid=626155109 [00:52:27] might want to set this as default text for the revtag so it can be translated [00:56:54] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [00:58:40] Eloquence: yeah, we should do that in the extension... [01:01:34] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 93, initializing_shards: 3, number_of_data_nodes: 3 [01:01:54] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 93, initializing_shards: 3, number_of_data_nodes: 3 [01:01:54] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 93, initializing_shards: 3, number_of_data_nodes: 3 [01:39:13] PROBLEM - RAID on es1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [01:42:51] Eloquence: thanks! [01:46:24] ori: congrats! [01:47:24] ragesoss: thanks! :) [01:47:44] 9 users are trying this feature! [01:48:05] reckless fools! 
[01:48:11] also, 10 :P [01:48:48] ACKNOWLEDGEMENT - RAID on es1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle RT 8384 [02:05:24] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3874 MB (3% inode=99%): [02:29:06] (03CR) 10Dzahn: "isn't this what was supposed to work in the latest release but not quite yet on the one we were running? last time Ariel and I looked at t" [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [02:36:34] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 1 failures [02:37:34] PROBLEM - puppet last run on mw1064 is CRITICAL: CRITICAL: Puppet has 1 failures [02:38:25] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-19 02:38:25+00:00 [02:38:31] Logged the message, Master [02:38:34] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 1 failures [02:54:54] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [02:55:34] RECOVERY - puppet last run on mw1064 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:56:34] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [03:00:53] RECOVERY - Disk space on virt0 is OK: DISK OK [03:01:33] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [03:11:43] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-19 03:11:43+00:00 [03:11:50] Logged the message, Master [03:19:35] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [03:36:43] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [03:45:33] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-19 03:45:33+00:00 [03:45:39] Logged the message, Master [03:54:43] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [04:23:18] (03CR) 10Ori.livneh: "The diff of the two config files amounts to:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/159738 (owner: 10Alexandros Kosiaris) [04:26:30] (03CR) 10Ori.livneh: "Looks good apart from what I mentioned inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [05:01:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Sep 19 05:01:54 UTC 2014 (duration 1m 52s) [05:01:58] Logged the message, Master [05:26:28] (03PS1) 10Ori.livneh: hhvm: make the admin server available on job runners [puppet] - 10https://gerrit.wikimedia.org/r/161424 [05:42:52] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: make the admin server available on job runners [puppet] - 10https://gerrit.wikimedia.org/r/161424 (owner: 10Ori.livneh) [06:28:23] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Epic puppet fail [06:28:24] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Epic puppet fail [06:28:43] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Epic puppet fail [06:28:45] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Epic puppet fail [06:28:54] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Epic puppet fail [06:29:24] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:33] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:54] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:23] 
PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:34] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:54] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:56] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:57] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:23] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:08] (03PS1) 10BBlack: Add SNI test hostname [dns] - 10https://gerrit.wikimedia.org/r/161426 [06:34:53] (03CR) 10BBlack: [C: 032] Add SNI test hostname [dns] - 10https://gerrit.wikimedia.org/r/161426 (owner: 10BBlack) [06:45:23] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:46:04] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:13] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:54] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:54] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:47:03] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:47:23] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:47:33] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:44] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:54:02] (03PS1) 10BBlack: Another SNI test hostname... 
[dns] - 10https://gerrit.wikimedia.org/r/161430 [06:54:15] (03CR) 10BBlack: [C: 032 V: 032] Another SNI test hostname... [dns] - 10https://gerrit.wikimedia.org/r/161430 (owner: 10BBlack) [06:56:16] (03CR) 10Giuseppe Lavagetto: "I do understand why you want to do this - I just think the cons outwheigh the pros here." [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [07:01:28] (03CR) 10Giuseppe Lavagetto: [C: 031] "As you may have understood I'm not a fan of HSTS :) but I don't want to block this - it's not that fundamental. The change is obviously go" [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [07:13:54] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Ok so, first of all, reloading an apache "gone rouge" is part of ops responsibilities, so we're ok continuing to do that I guess :) On the" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159636 (owner: 10Hoo man) [07:15:10] (03PS4) 10Giuseppe Lavagetto: mediawiki: make HAT appservers a separate cluster in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/160624 [07:38:45] _joe_: have you managed to update the hhvm package finally ? [07:41:43] <_joe_> hashar: we'll do that on beta today [07:41:47] <_joe_> not in prod though [07:42:07] <_joe_> ori spoke with hhvm engineers and someone said they have some pretty severe regression [07:42:15] <_joe_> vs no new important feature [07:47:52] (03CR) 10Filippo Giunchedi: [C: 031] "before this gets merged I think it'd be nice to get an idea of how much root gets logged in usually from non-console, should be relatively" [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [07:54:13] ori: _joe_ should I let you guys upgrade hhvm package on beta cluster? IIRC it is ensure => present [07:54:44] <_joe_> hashar: yes [07:54:44] I dont want to be messy /D [07:54:59] we also have hhvm being used on the continuous integration jenkins slaves in labs [07:55:04] <_joe_> I still need a) to upload the packages b) to rebuild extensions [07:55:18] ideally we would have puppet ensure a given version to make sure the contint labs box are updated as well [07:56:54] <_joe_> mmmh dunno [07:59:16] maybe 'present' in production, 'latest' in beta? [07:59:21] for everything, i mean [08:02:51] (03CR) 10Filippo Giunchedi: salt: make grain-ensure.py operate locally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [08:04:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "Alea iacta est." [puppet] - 10https://gerrit.wikimedia.org/r/160624 (owner: 10Giuseppe Lavagetto) [08:08:45] _joe_: ori: hiera( 'hhvm_version' ) [08:08:57] <_joe_> hashar: $hhvm_version [08:09:03] <_joe_> no need to hiera() that [08:09:04] (03CR) 10Ori.livneh: salt: make grain-ensure.py operate locally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [08:09:21] <_joe_> hashar: https://gerrit.wikimedia.org/r/#/c/160924/ btw [08:09:31] I only have a big picture of what hiera does :( looking forward to know more and help migrate our manifests [08:10:14] which reminds me that openstack is using hiera and recently started some kind of overhaul of their infra [08:10:23] need to find their RFC / blueprints and point you to them [08:10:29] ori: I guess a separate config is more trouble than worth? 
(re: grain-ensure.py) [08:11:21] godog: actually, that's probably a good idea [08:11:39] godog: you're probably right that that is more likely to be stable over time [08:12:17] (03CR) 10JanZerebecki: [C: 031] "Can be done next week." [puppet] - 10https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [08:12:48] (03CR) 10Ori.livneh: [C: 04-1] "per filippo; i'll follow up with a patch that uses a separate config file" [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [08:12:56] ori: cool, yeah looking at how it is done in the newer version is probably worth going with the config [08:13:36] how can I check the status of a table ( bounce_records ) in the server deployement-mediawiki-02 ( beta ) ? When I give $ mysql, it says mysql is not installed [08:16:33] (03CR) 10JanZerebecki: [C: 04-1] "Looks good, except as Ori said." [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [08:18:12] tonythomas: on deployment-bastion (the main work machine) you can use a mysql wrapper script [08:18:16] $ sql enwiki [08:18:19] (mw@deployment-db1) [enwiki]> Bye [08:18:26] and, I think there is some issue with the shortcut to extensions in /srv/common-local/w [08:18:34] I think it even gives you write access, so be very careful [08:18:34] tonythomas01@deployment-mediawiki02:/srv/common-local/w$ ls -l extensions [08:18:34] lrwxrwxrwx 1 mwdeploy mwdeploy 45 Apr 21 16:55 extensions -> /usr/local/apache/common-local/php/extensions [08:18:41] which doesnt seem to exist [08:18:48] the paths are a mess [08:18:56] usr/local/apache/common-local/php-master/extensions [08:18:59] exists though [08:19:10] I think they got changed recently in favor of /srv/mediawiki [08:19:47] it hasn't been announced publicly though [08:20:04] okey. so currently, extensions are taken up from /usr/local/apache/common-local/php-master/extensions right [08:20:04] ? [08:21:19] and the beta ( http://deployment.wikimedia.beta.wmflabs.org/ ) host is deployment-mediawiki02 or deployment-db1 ? [08:21:27] I have forwarded the email to the qa list [08:21:32] oeky. [08:22:03] tonythomas: https://lists.wikimedia.org/pipermail/qa/2014-September/001962.html [08:22:10] we have yet to clear up the old paths :( [08:22:16] they might still be used somehow [08:22:31] so basically /srv/mediawiki [08:22:38] and on bastion that is something else [08:23:05] and beta cluster is automatically updated via a Jenkins job that runs scap for us (scap being the deploy utility to push mw setup to the application servers) [08:23:14] so usually, you dont have to mess with the files :] [08:23:27] ok. anyway, looks like I dont have the access. tonythomas01@deployment-mediawiki02:~$ sql enwiki [08:23:27] The program 'sql' is currently not installed. To run 'sql' please ask your administrator to install the package 'parallel' [08:24:10] deployment-bastion [08:24:13] :D [08:24:21] oh. true. :) [08:24:31] on the production cluster, human being use a deployment box named tin.eqiad.wmnet [08:24:40] which is the one we usually use to interact with mediawiki [08:24:56] deployment-bastion is the equivalent of tin [08:25:00] and has all the utility needEd [08:25:07] such as scap (the deployment script) [08:25:08] ok. anyway, our extension is not yet installed in prod, so wont be a necessity [08:25:16] and a bunch of "useful" scripts such as sql [08:25:26] k. 
let me try the sql command here [08:25:33] note that anything that is breaking on beta might well break on prod later on [08:25:47] so you have to get it fixed properly in puppet / mediawiki git repos whatever [08:25:52] dont ever do manual hack :] [08:26:03] or we will end up facing the same issue later in prod [08:26:08] yeah. :) our puppet patch got merged yesterday though [08:26:12] \O/ [08:26:22] now we just need to observe things from this table 'bounce_records' [08:26:27] gerrit.wikimedia.org/r/#/c/155753/ the one [08:26:33] to update the database schema, there is a slight different between prod and beta [08:26:39] in beta, that is done by using update.php [08:26:57] so if the patches are properly registered, they will be applied automatically on beta once per hour (via a jenkins job) [08:27:07] in production, one need to fill a database schema change bug [08:27:20] ok. I need to check which db http://deployment.wikimedia.beta.wmflabs.org/ use [08:27:24] that is then reviewed / death with by the DBA on the prod server BEFORE the change is merged/deployed [08:27:25] ok [08:27:39] <_joe_> death with :P [08:27:55] cause some SQL schema might end up being bad for production and pre commit review might not have catcher it [08:27:58] catched it [08:28:07] so we have our DBA do a post review of any SQL change [08:28:18] * _joe_ imagines sean with a big axe killing poor developers [08:28:28] also, some sql changes might take a while in production (think potentially X days) [08:29:02] _joe_: back in the old days jamesday / domas were killing bad code with live hack and the magic operator: // [08:30:26] ah [08:30:29] _joe_: http://dom.as/2007/11/15/optimization-operator/ :D [08:30:58] <_joe_> eheh [08:32:51] I can still remember domas joining in #mediawiki stating "hi I am 15 years old, how can help?" :D [08:33:09] tonythomas: what's the best place where to look for the status of VERP things? :) [08:33:46] Nemo_bis: yesterday, our router is live in polonium, lead and everywhere - gerrit.wikimedia.org/r/#/c/155753/ got merged [08:33:56] which makes beta ready to recieve bounce too ;) [08:34:27] I was just looking for our bounce_records table to instantiate some bounces, and check whether it gets recorded into the table as expected [08:34:45] ( we got beta producing VERP return paths in send mail long back ) [08:35:00] (03PS1) 10Ori.livneh: Add '-hhvm' to profiler ID when running on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161435 [08:35:00] gerrit.wikimedia.org/r/#/c/155753/ took bit of ~ 1 month ( almost 50 PS ) :D [08:35:03] hashar: ^ [08:35:52] Nemo_bis: we will have to test for almost 2 weeks here in beta, before we push things into prod - as per Jeff and Lego [08:36:44] and if you can, what database name do http://deployment.wikimedia.beta.wmflabs.org/ use ? [08:37:32] glad to hear about it, you could just paste the same things you told me here at https://bugzilla.wikimedia.org/show_bug.cgi?id=46640 [08:37:51] oh. forgot to update all those. will do [08:40:07] ori: yeah you should head to bed now [08:40:34] tonythomas: I think deployment.wikimedia is labswiki [08:40:37] unsure though [08:40:48] hashar: thanks. let me try that out [08:41:40] ori: is that going to namespace the metrics in graphite ? [08:41:54] hashar: yep [08:44:28] hashar: do you think this might be due to lack of access ? I should get the table headers atleast right ? 
[08:44:28] mysql> select * from bounce_records; [08:44:28] Empty set (0.00 sec) [08:44:52] (03CR) 10Hashar: [C: 031] "I guess namespacing the metrics in graphite needs an additional change somewhere else. We have profiling under MediaWiki , will probably " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161435 (owner: 10Ori.livneh) [08:45:59] (03CR) 10Hashar: [C: 04-1] "test2 as well :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161435 (owner: 10Ori.livneh) [08:50:03] tonythomas: na that just show all records in the bounce_records table [08:50:19] tonythomas: you want: show columns from bounce_records; [08:50:34] or show columns from bounce_records \G [08:50:34] :D [08:53:03] hashar: oh :D looks like my knowledge of mysql is outdated [08:53:37] ah. that one shows : ) [08:57:37] mediawiki also have a sql.php [08:57:41] though I never played with it [08:57:59] mwscript sql.php --wiki=enwiki [08:58:25] that yields var_dump() [09:08:17] (03CR) 10Filippo Giunchedi: [C: 031] Add Grafana module & role [puppet] - 10https://gerrit.wikimedia.org/r/133274 (owner: 10Ori.livneh) [09:14:51] <_joe_> !log restarted hhvm on mw1053, stuck to 100% cpu since last restart (activating stats) [09:14:56] Logged the message, Master [09:22:28] InfluxDB !!! :D [09:23:56] (03CR) 10Hashar: "If we can ever get that for the beta cluster, I will owe you several rounds of your favorite beverages." [puppet] - 10https://gerrit.wikimedia.org/r/133274 (owner: 10Ori.livneh) [09:25:18] (03PS2) 10Ori.livneh: salt: make grain-ensure.py operate locally [puppet] - 10https://gerrit.wikimedia.org/r/161332 [09:28:44] (03CR) 10Filippo Giunchedi: [C: 031] salt: make grain-ensure.py operate locally [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [09:34:06] ori: thanks for sticking with it :) LGMT [09:34:13] LGTM even [09:48:44] PROBLEM - Apache HTTP on mw1129 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.077 second response time [09:50:44] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.134 second response time [09:52:13] PROBLEM - Apache HTTP on mw1172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.057 second response time [09:52:13] PROBLEM - Apache HTTP on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.078 second response time [09:52:14] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1503 bytes in 3.107 second response time [09:52:34] PROBLEM - Apache HTTP on mw1123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.056 second response time [09:52:57] <_joe_> uh? 
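A rough sketch of the table-inspection workflow hashar walks tonythomas through above (roughly 08:18-08:58), pulled together in one place. It assumes the beta-cluster enwiki database used in that exchange (whether the main beta portal maps to labswiki is left open above), and the COUNT(*) query is only an illustrative addition:

  # On deployment-bastion (the beta equivalent of tin), not on the app servers.
  # The wrapper opens a read/write session, so be careful:
  $ sql enwiki
  mysql> SHOW COLUMNS FROM bounce_records \G
  mysql> SELECT COUNT(*) FROM bounce_records;

  # Or go through MediaWiki's own maintenance entry point
  # (output is var_dump()-style, as noted above):
  $ mwscript sql.php --wiki=enwiki
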
[09:53:04] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [09:53:04] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.977 second response time [09:53:04] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [09:53:43] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.119 second response time [09:54:05] ah, that explains my 500s [09:54:59] [Error] Failed to load resource: the server responded with a status of 403 (Forbidden) (arrow-expanded.svg, line 0) [09:55:02] [Error] Failed to load resource: the server responded with a status of 403 (Forbidden) (break.png, line 0) [09:55:05] also nice [09:55:23] https://en.wikipedia.org/w/skins/Vector/images/arrow-expanded.svg [09:55:23] PROBLEM - Apache HTTP on mw1212 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.059 second response time [09:55:23] PROBLEM - Apache HTTP on mw1038 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.101 second response time [09:56:23] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.242 second response time [09:56:23] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.849 second response time [09:59:40] PROBLEM - Apache HTTP on mw1096 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.068 second response time [09:59:40] PROBLEM - Apache HTTP on mw1059 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 3.062 second response time [10:00:03] PROBLEM - Apache HTTP on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.078 second response time [10:00:03] PROBLEM - Apache HTTP on mw1151 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [10:00:07] <_joe_> http://gdash.wikimedia.org/dashboards/reqerror/ [10:00:09] ooops [10:00:24] <_joe_> hashar: someone did something? [10:00:30] PROBLEM - Apache HTTP on mw1168 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.092 second response time [10:00:31] PROBLEM - Apache HTTP on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 3.053 second response time [10:00:31] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.087 second response time [10:00:32] not me [10:00:33] <_joe_> I don't see these errors in fatal.log [10:00:40] PROBLEM - Apache HTTP on mw1157 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.067 second response time [10:00:40] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.230 second response time [10:00:50] hhvm ? [10:00:59] <_joe_> thedj: nope [10:01:09] something storming dbs [10:01:09] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.138 second response time [10:01:09] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [10:01:14] <_joe_> thedj: something is extremely slow on the backend [10:01:17] <_joe_> maybe some db? [10:01:23] (03CR) 10Alexandros Kosiaris: "@ori. Yeah I know, I am now testing on squid3. 
IIRC correctly from install-server there are going to be a couple of other changes as well," [puppet] - 10https://gerrit.wikimedia.org/r/159738 (owner: 10Alexandros Kosiaris) [10:01:24] all s1 slaves hit max connections [10:01:30] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [10:01:32] PROBLEM - Apache HTTP on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.075 second response time [10:01:33] PROBLEM - Apache HTTP on mw1054 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.067 second response time [10:01:38] we had a few occurrences around 8:50 UTC [10:01:42] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [10:01:47] spiked started after 9:50UTc [10:02:05] which wikis are on S1? [10:02:30] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.320 second response time [10:02:32] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.191 second response time [10:02:32] PROBLEM - Apache HTTP on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.055 second response time [10:02:33] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [10:02:37] hashar: enwiki [10:02:54] blames the english community [10:03:31] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [10:03:48] back to normalish now on s1 slaves [10:03:50] we are lucky to have you around springle :-] [10:04:10] well i didn't do much. something else caused masses of hits on dbs [10:04:12] <_joe_> springle: what extactly happened? [10:04:16] then went away [10:04:18] still digging [10:04:19] <_joe_> mmmh [10:04:33] <_joe_> maybe hhvm servers can help us [10:04:38] <_joe_> they do log slow queries [10:05:34] pool counter reports some lucene/elasticsearch queue being full [10:05:45] _joe_: so does mediawiki [10:06:04] and a spam of ':revid:626188722' (ArticleView): Pool queue is full [10:06:08] (on enwiki [10:06:09] hashar: yeah en.wp reported that since sept/11 elasticsearch has not been updating [10:06:26] <_joe_> springle: when did the spike began [10:06:48] some occurrences started at 8:50 am utc apparently [10:06:58] last 6 hours of 500 errors https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1hours&from=-6%20hour&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22) [10:07:08] something wrote to s1, caused unusual slave lag, everything did MASTER_POS_WAIT, backed up, max_connections [10:07:36] <_joe_> mmh [10:07:44] https://en.wikipedia.org/w/index.php?oldid=626188722 Scottish independence referendum, 2014 [10:09:03] _joe_: around 08:20 load started to climb on s1 [10:09:38] both read and write [10:10:24] s1 master shows relatively normal query distribution, lots of LinksUpdate [10:12:10] dont we use pool counter for db connections as well? 
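springle's read of the incident above (writes caused unusual slave lag, readers piled up in MASTER_POS_WAIT, connections hit max_connections) can be spot-checked from a shell with database access; a minimal sketch, where the slave hostname is only a placeholder and the client is assumed to run with suitable credentials (or locally on the slave itself):

  $ SLAVE=db1051.eqiad.wmnet   # placeholder -- substitute an actual s1 slave

  # Replication lag; Seconds_Behind_Master should normally sit near 0:
  $ mysql -h "$SLAVE" -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_SQL_Running'

  # Readers blocked waiting for the slave to catch up show up as long-running
  # MASTER_POS_WAIT() calls in the process list:
  $ mysql -h "$SLAVE" -e "SHOW PROCESSLIST" | grep -ci MASTER_POS_WAIT
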
[10:12:45] (03PS1) 10Ori.livneh: Don't manipulate the environment to determine TZ offset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161439 (https://bugzilla.wikimedia.org/71036) [10:13:32] anyway at 8:50 we had a drop of poolcounter metrics https://graphite.wikimedia.org/render/?title=PoolCounter%20Client%20Sampled%20Call%20Rate%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=cactiStyle(MediaWiki.PoolCounter.Client.*.count) [10:13:39] no clue what it means [10:15:48] <_joe_> springle: delete traffic has gone up as well it seems [10:16:25] _joe_: yep "both read and write" ^ [10:16:48] <_joe_> yes I was trying to understand what really happened [10:17:14] and somehow graphite missing bunch of stats :/ [10:17:29] https://graphite.wikimedia.org/render/?width=783&height=308&_salt=1411121811.445&from=-6hours&target=MediaWiki.JobRunner.run-ParsoidCacheUpdateJob.count [10:17:31] that is for MediaWiki.JobRunner.run-ParsoidCacheUpdateJob.count [10:17:38] nothing shown before 9:10am [10:18:42] <_joe_> hashar: that is when I restarted the hhvm JR [10:18:51] <_joe_> but it's not doing so much traffic btwe [10:19:39] could it be it started processing jobs again [10:19:43] at a high rate? [10:20:48] <_joe_> no [10:21:05] <_joe_> it's actually slower than traditional jobrunners [10:26:49] _joe_: http://aerosuidae.net/paste/a/541c04e3 [10:26:53] from binlog [10:27:33] that is a non-trivial LinksUpdate spike, but it does look like legit write traffic; just a lot of it [10:27:41] <_joe_> LinksUpdate::updateLinksTimestamp [10:27:45] <_joe_> what does that? [10:29:57] _joe_: LinusUpdate::doUpdate, indirectly. "Update link tables with outgoing links from an updated article" [10:30:16] LinksUpdate::doUpdate rather [10:31:12] i don;t know if that is triggered by web hits or a job [10:33:28] might be both [10:33:42] what is ArticleCompileProcessor? [10:34:22] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:34:59] (random guess: changes to a template?) [10:35:11] <_joe_> godog: I guess so [10:35:20] <_joe_> mmh and that is hhvm [10:35:48] <_joe_> ouch [10:39:15] ArticleCompileProcessor is in PageTriage extension [10:39:28] no idea if it's normal to see that have high write traffic [10:39:35] I suspect it is related to search somehow [10:39:40] but havn't noticed it before in top 10 lists [10:45:30] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 444 bytes in 0.048 second response time [10:49:59] * YuviPanda waves at _joe_ [10:50:11] I'm in the UK so I think for next week at least our timezones should not be too bad [10:51:21] <_joe_> YuviPanda: ok, whenever you have a patch, ping me [10:51:31] _joe_: will do [10:51:53] (03PS1) 10Giuseppe Lavagetto: hhvm: disable collection of sql stats [puppet] - 10https://gerrit.wikimedia.org/r/161444 [10:51:59] _joe_: I'm going to move the commons tuff to a module called 'nagios-config', we can rename it later for bikeshedding (unless you've another name in mind now) [10:52:21] <_joe_> YuviPanda: my idea is - you take whatever is good out of the mess that our icinga/nagios manifests are [10:52:28] <_joe_> YuviPanda: nagios_config, please!!! [10:52:31] ah [10:52:32] right [10:52:34] nagios_config [10:52:36] <_joe_> :) [10:52:47] <_joe_> or call it nagios_common [10:52:52] <_joe_> maybe [10:52:52] hmm, that sounds better [10:52:59] nagios_common then [10:53:31] _joe_: I'm going to do it piecemeal. 1. 
pick something useful, 2. move it into module, 3. use it in both icinga and shinken [10:53:35] (per patch) [10:53:48] this way patches will be small, and more easily understood [10:56:00] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:12] <_joe_> YuviPanda: I agree fully [10:56:29] _joe_: \o/ [10:56:33] moving check plugins now [10:56:51] hmm, let me start smaller, and move the resources.cfg custom templates [10:57:04] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: disable collection of sql stats [puppet] - 10https://gerrit.wikimedia.org/r/161444 (owner: 10Giuseppe Lavagetto) [10:58:02] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:00:10] lunch [11:13:40] (03PS1) 10Yuvipanda: nagios_common: Extract user_definitions from icinga [puppet] - 10https://gerrit.wikimedia.org/r/161446 [11:13:57] _joe_: small starter ^ [11:14:21] _joe_: I suppose when we have hiera enabled, we can default $config_dir to looking up from hiera [11:14:35] although then I've to get hiera working for labs, but I'm sure that can be worked around :) [11:15:13] <_joe_> YuviPanda: hiera _is_ active in labs [11:15:28] _joe_: right, but not in prod [11:15:39] _joe_: so we can't use it for the common modules, since icinga runs in prod [11:16:02] _joe_: oh, I can use hiera without having to add support to wikitech? [11:16:08] * YuviPanda hasn't kept up [11:16:08] <_joe_> well, we can fix that today maybe [11:16:23] 'that' -> no hiera in prod? :) [11:16:26] <_joe_> YuviPanda: you /can/, it's just not such a good idea [11:16:41] true, but I've no idea how much work it'll be to get that there. [11:16:44] <_joe_> YuviPanda: yep [11:16:47] PHP, ugh. [11:17:49] _joe_: but I suppose in the meantime I can just pass $config_dir around [11:17:57] I'd keep it later on as well anyway, just have it default to hiera value [11:18:00] <_joe_> yes [11:18:49] _joe_: cool. wanna merge this one? :) [11:19:13] next stop would be to move the entire custom wmf plugins, and then all plugins. and then I can include 'all plugins' in shinken [11:19:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "small comments, LGTM otherwise." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/161446 (owner: 10Yuvipanda) [11:19:34] <_joe_> YuviPanda: welcome in the nitpick realm :) [11:19:40] ah [11:20:00] <_joe_> but since we're starting over, let's do it as good as possible [11:20:21] _joe_: yeah, makes sense [11:20:27] _joe_: also realized I need to make user/group configurable as well [11:20:32] _joe_: since shinken expects it to be shinken/shinken [11:21:41] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.186 second response time [11:22:01] <_joe_> mmh very well [11:22:41] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71306 bytes in 0.178 second response time [11:25:29] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:36] (03PS2) 10Yuvipanda: nagios_common: Extract user_definitions from icinga [puppet] - 10https://gerrit.wikimedia.org/r/161446 [11:30:02] _joe_: ^ [11:30:17] _joe_: I'm conflicted on setting defaults for config_dir [11:30:25] I could set it to icinga's, and simplify prod code [11:30:36] but then the prod and labs will be different... [11:30:39] same for owner and group [11:30:41] <_joe_> YuviPanda: do it. 
[11:30:43] (03PS1) 10Springle: depool db1072. seems more susceptible to replag; find out why. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161447 [11:30:50] <_joe_> YuviPanda: it will go in hiera in labs [11:30:52] <_joe_> for now [11:30:56] hmm, right [11:31:03] <_joe_> YuviPanda: do that. define it via hiera in labs [11:31:06] (03CR) 10Springle: [C: 032] depool db1072. seems more susceptible to replag; find out why. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161447 (owner: 10Springle) [11:31:11] <_joe_> right in your change [11:31:11] (03Merged) 10jenkins-bot: depool db1072. seems more susceptible to replag; find out why. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161447 (owner: 10Springle) [11:31:33] _joe_: it's not used in labs in this change. the entire plugins class I want to move over before I use in labs. [11:32:07] <_joe_> ok, so - use prod values as defaults [11:32:10] !log springle Synchronized wmf-config/db-eqiad.php: depool db1072. seems more susceptible to replag; find out why. (duration: 00m 10s) [11:32:15] Logged the message, Master [11:32:16] <_joe_> which we will override via hiera in labs [11:32:24] (03PS3) 10Yuvipanda: nagios_common: Extract user_definitions from icinga [puppet] - 10https://gerrit.wikimedia.org/r/161446 [11:32:25] _joe_: yeah [11:32:59] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Extract user_definitions from icinga [puppet] - 10https://gerrit.wikimedia.org/r/161446 (owner: 10Yuvipanda) [11:33:04] q [11:33:10] grrr [11:33:20] (03PS4) 10Yuvipanda: nagios_common: Extract user_definitions from icinga [puppet] - 10https://gerrit.wikimedia.org/r/161446 [11:42:35] _joe_: wanna merge this ^? or should I submit dependent patches as well? [11:43:23] <_joe_> YuviPanda: submit them [11:43:28] heh alright [11:43:45] <_joe_> now I gotta go fetch my step-daughter, eat lunch, sleep a little [11:45:04] * YuviPanda continues eating chocolate cake [11:58:09] I wanted some hack to get the email confirmed of my fake user in deployement.beta.wmflabs. I tried to set, user_email_authenticated=true -- but it looks like the change is getting reverted once I refresh my wiki page [12:01:52] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:28] https://commons.wikimedia.org/w/index.php?title=Special:RecentChanges&tagfilter=HHVM why this tags on prod? [12:05:13] and why are deletions tagged as (Tag: HHVM) ? 
[12:05:19] it's a beta feature [12:05:24] new* [12:05:38] https://lists.wikimedia.org/pipermail/wikitech-l/2014-September/078694.html [12:05:57] thanks [12:07:20] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.131 second response time [12:08:17] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71306 bytes in 0.175 second response time [12:09:10] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [12:13:02] (03PS1) 10Springle: repool db1072 with ReadAheadNone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161449 [12:13:38] (03CR) 10Springle: [C: 032] repool db1072 with ReadAheadNone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161449 (owner: 10Springle) [12:13:40] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.160 second response time [12:13:42] (03Merged) 10jenkins-bot: repool db1072 with ReadAheadNone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161449 (owner: 10Springle) [12:14:15] !log springle Synchronized wmf-config/db-eqiad.php: repool db1072 with ReadAheadNone (duration: 00m 09s) [12:14:21] Logged the message, Master [12:15:43] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71306 bytes in 0.146 second response time [12:20:22] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:34] mark: LinksUpdate traffic is spiking. from enwiki binlog http://aerosuidae.net/paste/b/541c20cc [12:27:05] it's normal to see LinksUpdate high on the list, but that ratio is way out of whack [12:28:32] where are those queries coming from? [12:30:55] so basically we're seeing many more edits on enwiki than usual? [12:32:54] mark: i think so [12:33:08] the queries are primarily coming from wikiadmin [12:33:43] various mw nodes: mw1001, mw1009, etc [12:35:00] so job runners [12:35:16] right [12:36:00] !log temporarily disable log fsync on enwiki slaves [12:36:06] Logged the message, Master [12:36:43] now client connections spiking again [12:37:09] wikiuser connections [12:37:19] reading from slaves, not writers [12:40:09] and down again [12:40:46] (03Abandoned) 10Hoo man: Allow deployers to graceful apache [puppet] - 10https://gerrit.wikimedia.org/r/159636 (owner: 10Hoo man) [12:50:55] http://gdash.wikimedia.org/dashboards/editpage/ [12:51:06] _joe_: i suppose restarting hhvm is something deployers can't do? [12:51:11] looks... different... since approximately the time it started [12:51:12] in case it crashes [12:51:29] <_joe_> aude: not right now, no [12:51:35] hmmm ok [12:51:40] <_joe_> but ops should be around and able to do that [12:51:42] if enough ops are around on the weekend [12:51:43] but it could as well be a secondary effect [12:51:59] * aude obviously wants to investigate and fix this issue [12:52:02] deployments during a weekend? [12:52:09] mark: restart hhvm [12:52:21] or do we leave crashed until monday> [12:52:24] ? [12:53:02] http://gdash.wikimedia.org/dashboards/poolcounter/ [12:53:40] what happened at 8:50? [12:53:41] hashar: ^ you mentioned that. what were you thinking about search earlier? 
[12:53:57] that might have been unrelated [12:54:25] but at 9:00 we had a loooot of elastica exceptions due to a partial shard failure (not sure what that means) [12:54:53] then I have seen a few search related issue, but that might just have been caused by the job runners being overloaded [13:01:01] and what happened here? http://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&h=mw1053.eqiad.wmnet&m=cpu_report&r=day&s=by%20name&hc=4&mc=2 [13:01:39] actually looks like there it went back to normal [13:01:45] that's the hhvm jobrunner? [13:02:05] (03PS1) 10Giuseppe Lavagetto: HAT: depool faulty servers, monitor hhvm rendering [puppet] - 10https://gerrit.wikimedia.org/r/161450 [13:02:15] <_joe_> mark: the JR had the mysql stats bug [13:02:19] <_joe_> I wrote about before [13:02:25] <_joe_> so it was doing _nothing_ [13:02:38] <_joe_> hanging at 100% CPU in a deadlock [13:02:59] ok [13:03:05] so job runners are definitely spiky the last few hours [13:03:11] but again, that might be because DBs are unhappy etc [13:03:17] <_joe_> yeah [13:03:31] iidc springle mentionner lot of LinksUpdate occuring [13:03:36] but that might just be the usual traffic [13:03:57] hashar: no it's defeinitely high. see ratio here: http://aerosuidae.net/paste/b/541c20cc [13:04:05] yeah [13:04:09] but where does it come from [13:04:30] <_joe_> a template change could trigger that? [13:04:46] and of course none are profiled :D [13:04:58] (03CR) 10Giuseppe Lavagetto: [C: 032] HAT: depool faulty servers, monitor hhvm rendering [puppet] - 10https://gerrit.wikimedia.org/r/161450 (owner: 10Giuseppe Lavagetto) [13:06:20] the wikiuser client spikes come after bursts of wikiadmin doing LinksUpdate and *::invalidateCache [13:07:21] but most wikiadmin connections on enwiki master have been sleeping for ~20min [13:07:34] did someone do something 20min ago? [13:08:35] so do job runners use wikiadmin? [13:11:24] yes [13:11:34] ok i'm looking at logstash [13:11:45] filtered to runJobs on enwiki [13:11:56] and I see a ton of CirrusSearchLinksUpdate [13:12:07] i have a suspicion it's related :P [13:12:22] ah eheh [13:12:34] how can I get a query string out of logstash [13:12:46] this is the first time i'm using it :P [13:13:05] springle: the count of LinksUpdate::doIncrementalUpdate() https://graphite.wikimedia.org/render/?width=743&height=297&_salt=1411132355.8&from=-12hours&target=MediaWiki.LinksUpdate.doIncrementalUpdate.count [13:13:07] * springle no use with logstash either [13:13:17] https://logstash.wikimedia.org/#/dashboard/elasticsearch/LinksUpdate%20issues#dashboard/temp/e75cDLfUS4mHBFF4g5nnIg [13:13:19] try that [13:13:37] for some reason the graphite metrics have holes before 8:50am :( [13:14:10] i have no idea why cirrussearch needs to update links [13:14:20] i'll leave that for the mediawiki afficionados [13:14:28] that logstash links does not work for me :/ [13:14:35] can you log in [13:14:38] and see on the dashboard [13:14:45] "LinksUpdate issues"? [13:14:55] ah yeah https://logstash.wikimedia.org/#/dashboard/elasticsearch/LinksUpdate%20issues [13:15:05] cool [13:16:02] so yeah [13:16:04] springle: can you log in? [13:16:12] i see it [13:16:24] seems to me cirrus search is hooked in links update to update the search index? [13:17:11] would cirrus cause writes on the master though? 
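The graphite links being passed around can also be queried directly when the dashboards are too coarse; a small sketch against the render API's machine-readable formats, using the same LinksUpdate metric hashar links to below and assuming you can reach the host with whatever authentication it requires (the 12-hour window is arbitrary):

  # Raw datapoints as JSON:
  $ curl -s 'https://graphite.wikimedia.org/render/?target=MediaWiki.LinksUpdate.doIncrementalUpdate.count&from=-12hours&format=json'

  # Last few non-null samples via the csv format (null values leave the value
  # column empty, i.e. a trailing comma):
  $ curl -s 'https://graphite.wikimedia.org/render/?target=MediaWiki.LinksUpdate.doIncrementalUpdate.count&from=-12hours&format=csv' \
      | grep -v ',$' | tail -5

The same trick works for the reqstats.5xx and PoolCounter targets in the earlier URLs, which makes it easier to line up the holes in the metrics with the 08:50 drop being discussed.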
[13:17:17] or could it be yet another symptom [13:17:34] no clue :( [13:18:29] hashar: I find that our patch that got merged yesterday ( https://gerrit.wikimedia.org/r/#/c/155753/ ) is still not applied in the default beta host deployment-mediawiki02. [13:19:13] it hooks in 'LinksUpdateComplete' for sure [13:19:16] exim logs there tells that the bounce email coming all the way back till deployment-mediawiki02 and getting re-routed since no router is there [13:19:25] cirrusSearchLinksUpdate was set as low priority in the jobrunner config by giuseppe [13:19:41] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:19:41] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:19:51] can't login to jenkins [13:20:11] i'm in now but got a 503 [13:20:17] springle: can we say the outage was due to too many client connections being opened on the mysql master/slaves of S1? [13:24:01] (03PS1) 10coren: Labs: move labsdb1001 to labs-support1-a [dns] - 10https://gerrit.wikimedia.org/r/161451 [13:24:31] cmjohnson1: ^^ for when we are ready [13:24:57] ok [13:25:06] I see a lot of refreshLinks as well [13:26:01] hashar: well, a little more complicated than that [13:26:18] here's your problem: https://en.wikipedia.org/w/index.php?title=Template:Redirect_template&action=history [13:28:01] though that one started on 8:30 [13:28:17] so the job doing that started a bit later or something? [13:28:32] there were other edits to it before that today [13:28:35] so that seems to match up [13:28:37] yeah [13:28:42] https://en.wikipedia.org/w/index.php?title=Special:RecentChangesLinked&hidebots=0&days=30&namespace=10&target=Wikipedia%3ADatabase+reports%2FTemplates+transcluded+on+the+most+pages [13:28:46] gripping them till 9:30 at least [13:28:55] first edit at 3:33 [13:29:10] mark: Ugh. Maybe we need a maximum-number-of-transclusions limit to edit like we have for maximum-number-of-revs for deletion. Having someone make a succession of tiny edits to a template like that is begging for trouble. [13:29:14] (that's the link I use, from my bookmarks) [13:29:18] fgrep Redirect_template /a/mw-log/runJobs.log|fgrep -v STARTING [13:29:20] on fluorine [13:29:32] mark that will give you the completion of jobs with their execution times [13:29:44] ok [13:29:56] time is t=123 (in milisec) [13:30:02] definitely fast one, but a lot of them per seconds [13:30:19] well anyway [13:30:24] i think we can conclude it's the edits to that template [13:30:44] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:30:44] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:30:50] I suggest to superprotect it, to avoid drama [13:30:56] https://en.wikipedia.org/wiki/Template:Redirect_template/core bah [13:31:06] I am not even sure why that kind of feature is a template when it can be build in [13:31:14] so are we saying: 1. that template edit. 2. wikiadmin job write spike. 3. page caches invalidated. 4. wikiuser read spike ? 
[13:31:31] springle: or something like that yes [13:31:58] I don't know mediawiki nearly well enough to conclude anything definitively, but it sounds like it [13:32:05] 1 135 407 refreshLinks jobs for Template:Redirect_template [13:32:10] ok [13:32:53] andit was edited twice [13:32:57] thrice [13:35:08] hashar: can you calculate when it will finish? [13:35:12] coren ping me when labsdb1001 is ready to move [13:35:24] no clue how many links it is going to update [13:36:20] mark: there was a previous edit at 03:34:16 which enqueued as well [13:38:22] I wish we could get ride of templates such as Template:Redirect template [13:40:37] mark: hashar: thanks for pursuing all that. am handling the labsdb1001 move right now. brb [13:40:51] i can do the outage email afterwards if you wish [13:40:54] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:40:54] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:41:08] i am trying to figure out the mess of jobs that got triggered [13:41:16] apparently we have several jobs enqueued [13:41:19] cmjohnson1: It's all yours. [13:41:25] PROBLEM - Host labsdb1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:41:32] k [13:41:54] PROBLEM - Apache HTTP on mw1217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.078 second response time [13:42:09] springle: if you could, please do [13:42:15] PROBLEM - Apache HTTP on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.067 second response time [13:42:20] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.082 second response time [13:42:20] PROBLEM - Apache HTTP on mw1170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.061 second response time [13:42:21] PROBLEM - Apache HTTP on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.085 second response time [13:42:25] PROBLEM - Apache HTTP on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.062 second response time [13:42:25] PROBLEM - Apache HTTP on mw1051 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.066 second response time [13:42:25] PROBLEM - Apache HTTP on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.066 second response time [13:42:26] PROBLEM - Apache HTTP on mw1084 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.086 second response time [13:42:46] wikiuser spike again [13:42:50] PROBLEM - Apache HTTP on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1503 bytes in 3.065 second response time [13:42:54] what are the queries? 
[13:42:55] PROBLEM - Apache HTTP on mw1168 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.071 second response time [13:43:16] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.058 second response time [13:43:16] PROBLEM - Apache HTTP on mw1199 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.074 second response time [13:43:16] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1503 bytes in 3.060 second response time [13:43:25] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.098 second response time [13:43:26] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [13:43:26] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [13:43:27] PROBLEM - Apache HTTP on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.077 second response time [13:43:27] PROBLEM - Apache HTTP on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.094 second response time [13:43:27] PROBLEM - Apache HTTP on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.099 second response time [13:43:27] PROBLEM - Apache HTTP on mw1111 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.109 second response time [13:43:35] PROBLEM - Apache HTTP on mw1074 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.072 second response time [13:43:36] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [13:43:51] PROBLEM - Apache HTTP on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.102 second response time [13:43:52] PROBLEM - Apache HTTP on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.144 second response time [13:43:55] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [13:44:05] PROBLEM - Apache HTTP on mw1175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.066 second response time [13:44:05] PROBLEM - Apache HTTP on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.098 second response time [13:44:16] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.059 second response time [13:44:16] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.096 second response time [13:44:16] PROBLEM - Apache HTTP on mw1172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.060 second response time [13:44:28] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [13:44:29] PROBLEM - Apache HTTP on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.057 second response time [13:44:29] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [13:44:30] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [13:44:30] RECOVERY 
- Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [13:44:30] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.101 second response time [13:44:30] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [13:44:35] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time [13:44:35] PROBLEM - Apache HTTP on mw1094 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.054 second response time [13:44:47] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.102 second response time [13:44:47] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time [13:44:55] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [13:45:05] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.105 second response time [13:45:06] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.109 second response time [13:45:17] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [13:45:17] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [13:45:18] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [13:45:25] PROBLEM - Apache HTTP on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1488 bytes in 0.049 second response time [13:45:25] PROBLEM - Apache HTTP on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.057 second response time [13:45:33] (03CR) 10coren: [C: 032 V: 032] "Migration in progress, box is on its way to its new home." [dns] - 10https://gerrit.wikimedia.org/r/161451 (owner: 10coren) [13:45:35] PROBLEM - Apache HTTP on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.113 second response time [13:45:36] PROBLEM - Apache HTTP on mw1105 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.099 second response time [13:45:36] PROBLEM - Apache HTTP on mw1077 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.073 second response time [13:45:36] PROBLEM - Apache HTTP on mw1069 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.099 second response time [13:45:36] PROBLEM - Apache HTTP on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [13:45:36] PROBLEM - Apache HTTP on mw1102 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.079 second response time [13:45:43] :S [13:45:45] PROBLEM - Apache HTTP on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.088 second response time [13:46:04] is this all paine ellsworth's fault? 
[13:46:05] PROBLEM - Apache HTTP on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.065 second response time [13:46:06] PROBLEM - Apache HTTP on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.059 second response time [13:46:06] PROBLEM - Apache HTTP on mw1179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.061 second response time [13:46:07] PROBLEM - Apache HTTP on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 3.058 second response time [13:46:15] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1488 bytes in 0.044 second response time [13:46:15] PROBLEM - Apache HTTP on mw1089 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.066 second response time [13:46:16] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.072 second response time [13:46:32] well placing blame is not that helpful [13:46:35] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:35] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.105 second response time [13:46:35] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.087 second response time [13:46:35] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.050 second response time [13:46:35] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.103 second response time [13:46:35] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.060 second response time [13:46:45] PROBLEM - Apache HTTP on mw1088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.049 second response time [13:46:45] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.051 second response time [13:46:46] PROBLEM - Apache HTTP on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.073 second response time [13:46:55] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.064 second response time [13:47:06] PROBLEM - Apache HTTP on mw1215 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.061 second response time [13:47:06] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [13:47:06] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [13:47:08] :( [13:47:15] PROBLEM - Apache HTTP on mw1153 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.066 second response time [13:47:19] Error: 2006 MySQL server has gone away ? 
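The "MySQL server has gone away" question just above, and the "Too many connections" errors quoted below it, both point at the enwiki (s1) databases hitting their connection ceiling while the job flood runs. A quick sanity check is to compare the live connection count against max_connections and break the threads down by user; the PHP/mysqli sketch below is only an illustration of that check (host, user and password are placeholders, and the real triage was presumably done with the mysql client directly).

```php
<?php
// Hypothetical check: how close is this DB host to its connection limit, and who holds the slots?
$db = mysqli_connect('db-host.example', 'monitor_user', 'secret'); // placeholders
if (!$db) {
    die('connect failed: ' . mysqli_connect_error() . "\n");
}

// Configured ceiling vs. current usage
$limit = $db->query("SHOW GLOBAL VARIABLES LIKE 'max_connections'")->fetch_row()[1];
$used  = $db->query("SHOW GLOBAL STATUS LIKE 'Threads_connected'")->fetch_row()[1];
printf("connections in use: %d of %d\n", $used, $limit);

// Break connections down by user and command (e.g. application threads piling up in Sleep)
$res = $db->query(
    "SELECT USER, COMMAND, COUNT(*) AS n
       FROM information_schema.PROCESSLIST
      GROUP BY USER, COMMAND
      ORDER BY n DESC"
);
while ($row = $res->fetch_assoc()) {
    printf("%-12s %-10s %d\n", $row['USER'], $row['COMMAND'], $row['n']);
}
```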
[13:47:26] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.118 second response time [13:47:26] PROBLEM - Apache HTTP on mw1155 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.074 second response time [13:47:37] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 444 bytes in 0.096 second response time [13:47:40] PROBLEM - Apache HTTP on mw1183 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.084 second response time [13:47:40] PROBLEM - Apache HTTP on mw1096 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.207 second response time [13:47:42] PROBLEM - Apache HTTP on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.062 second response time [13:47:42] PROBLEM - Apache HTTP on mw1064 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.084 second response time [13:47:42] PROBLEM - Apache HTTP on mw1074 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 1.257 second response time [13:47:46] I'm getting "(Cannot contact the database server: Too many connections (10.64.48.21))" [13:47:46] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.062 second response time [13:47:54] sumanah: known [13:47:56] PROBLEM - Apache HTTP on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.081 second response time [13:47:56] PROBLEM - Apache HTTP on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.068 second response time [13:47:56] PROBLEM - Apache HTTP on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.076 second response time [13:47:56] PROBLEM - Apache HTTP on mw1057 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.062 second response time [13:47:56] PROBLEM - Apache HTTP on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [13:47:56] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.119 second response time [13:47:57] PROBLEM - Apache HTTP on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.067 second response time [13:47:59] sumanah: yep [13:48:00] can we stop further cache invalidations? and the job altogether perhaps? [13:48:07] too many connections on mysql databases [13:48:40] so how can we remove those jobs [13:48:45] Erm, where is authdns-update invoked from nowadays? [13:48:45] or at least reduce [13:49:01] can we throttle them back? [13:49:36] Coren: ns[012] [13:49:52] good (bad?) 
morning :) [13:50:00] https://www.mediawiki.org/wiki/Manual:Job_queue#Changes_introduced_in_MediaWiki_1.23 [13:50:21] but that's generic mediawiki [13:50:27] PROBLEM - Apache HTTP on mw1137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.160 second response time [13:50:28] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [13:50:37] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.088 second response time [13:50:37] PROBLEM - Apache HTTP on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.094 second response time [13:50:37] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.133 second response time [13:50:37] PROBLEM - Apache HTTP on mw1123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.116 second response time [13:50:47] PROBLEM - Apache HTTP on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.075 second response time [13:50:47] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [13:50:47] PROBLEM - Apache HTTP on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.075 second response time [13:50:48] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.079 second response time [13:50:49] PROBLEM - Apache HTTP on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.069 second response time [13:50:49] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:50:49] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:50:49] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 71712 bytes in 0.460 second response time [13:50:57] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [13:50:57] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [13:51:08] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.107 second response time [13:51:08] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.075 second response time [13:51:17] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.098 second response time [13:51:21] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [13:51:22] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [13:51:22] PROBLEM - Apache HTTP on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.066 second response time [13:51:22] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [13:51:22] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [13:51:22] RECOVERY - Apache HTTP on mw1176 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [13:51:22] PROBLEM - Apache HTTP on mw1051 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.071 second response time [13:51:23] PROBLEM - Apache HTTP on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.059 second response time [13:51:27] PROBLEM - Apache HTTP on mw1098 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 3.087 second response time [13:51:27] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.093 second response time [13:51:27] PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.090 second response time [13:51:27] PROBLEM - Apache HTTP on mw1116 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.096 second response time [13:51:27] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [13:51:28] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.089 second response time [13:51:28] PROBLEM - Apache HTTP on mw1058 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 3.067 second response time [13:51:29] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.080 second response time [13:51:32] yeah $wgJobRunRate is the poor man job runner [13:51:37] PROBLEM - Apache HTTP on mw1121 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.085 second response time [13:51:38] PROBLEM - Apache HTTP on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.071 second response time [13:51:38] PROBLEM - Apache HTTP on mw1144 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.095 second response time [13:51:38] PROBLEM - Apache HTTP on mw1128 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.079 second response time [13:51:38] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.079 second response time [13:51:48] PROBLEM - Apache HTTP on mw1073 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.085 second response time [13:51:48] (03PS1) 10Hoo man: Cut down the number of basic job runners [puppet] - 10https://gerrit.wikimedia.org/r/161452 [13:51:50] PROBLEM - Apache HTTP on mw1182 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [13:51:51] PROBLEM - Apache HTTP on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.064 second response time [13:52:05] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [13:52:15] PROBLEM - Apache HTTP on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.107 second response time [13:52:16] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [13:52:16] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1782 bytes in 0.056 second response time [13:52:23] PROBLEM - Apache HTTP on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 
0.072 second response time [13:52:23] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1488 bytes in 0.043 second response time [13:52:23] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.096 second response time [13:52:23] PROBLEM - Apache HTTP on mw1056 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.099 second response time [13:52:23] PROBLEM - Apache HTTP on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.063 second response time [13:52:25] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 3.049 second response time [13:52:53] PROBLEM - Apache HTTP on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.069 second response time [13:53:03] PROBLEM - Apache HTTP on mw1076 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.058 second response time [13:53:04] PROBLEM - Apache HTTP on mw1065 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.080 second response time [13:53:04] PROBLEM - Apache HTTP on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.078 second response time [13:53:04] PROBLEM - Apache HTTP on mw1110 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.098 second response time [13:53:04] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.073 second response time [13:53:05] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [13:53:05] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.059 second response time [13:53:05] PROBLEM - Apache HTTP on mw1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.100 second response time [13:53:05] PROBLEM - Apache HTTP on mw1075 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.126 second response time [13:53:06] PROBLEM - Apache HTTP on mw1086 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.112 second response time [13:53:06] PROBLEM - Apache HTTP on mw1088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.104 second response time [13:53:07] PROBLEM - Apache HTTP on mw1050 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.571 second response time [13:53:07] PROBLEM - Apache HTTP on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.082 second response time [13:53:08] PROBLEM - Apache HTTP on mw1043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.072 second response time [13:53:08] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [13:53:09] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.097 second response time [13:53:09] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.085 second response time [13:53:10] PROBLEM - Apache HTTP on mw1037 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.094 second response time [13:53:10] PROBLEM - Apache HTTP 
on mw1036 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.102 second response time [13:53:13] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.080 second response time [13:53:17] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:18] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [13:53:18] PROBLEM - Apache HTTP on mw1196 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.058 second response time [13:53:19] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [13:53:19] PROBLEM - Apache HTTP on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.045 second response time [13:53:19] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.093 second response time [13:53:19] PROBLEM - Apache HTTP on mw1066 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.089 second response time [13:53:19] PROBLEM - Apache HTTP on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.075 second response time [13:53:20] PROBLEM - Apache HTTP on mw1083 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [13:53:20] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.079 second response time [13:53:21] PROBLEM - Apache HTTP on mw1130 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.091 second response time [13:53:21] PROBLEM - Apache HTTP on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.070 second response time [13:53:24] PROBLEM - Apache HTTP on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.061 second response time [13:53:24] PROBLEM - Apache HTTP on mw1188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.072 second response time [13:53:24] PROBLEM - Apache HTTP on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.060 second response time [13:53:24] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.096 second response time [13:53:29] refreshLinks is already low prio [13:53:35] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [13:53:35] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [13:53:35] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [13:53:35] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 444 bytes in 0.050 second response time [13:53:35] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1488 bytes in 0.066 second response time [13:53:43] PROBLEM - Apache HTTP on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.059 second response time [13:53:43] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time [13:53:43] 
RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71393 bytes in 3.607 second response time [13:53:53] PROBLEM - Apache HTTP on mw1155 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.077 second response time [13:53:53] PROBLEM - Apache HTTP on mw1060 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.080 second response time [13:53:53] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.119 second response time [13:53:53] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [13:54:03] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.098 second response time [13:54:08] mark: yep [13:54:11] PROBLEM - Apache HTTP on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.062 second response time [13:54:11] PROBLEM - Apache HTTP on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.065 second response time [13:54:12] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [13:54:12] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71393 bytes in 2.693 second response time [13:54:19] we could also temporary not run them at all, if you favour that [13:54:24] but maybe we should do something [13:54:25] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.079 second response time [13:54:25] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [13:54:25] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.112 second response time [13:54:26] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.106 second response time [13:54:26] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [13:54:26] PROBLEM - Apache HTTP on mw1149 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.078 second response time [13:54:26] PROBLEM - Apache HTTP on mw1079 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.064 second response time [13:54:27] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [13:54:27] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [13:54:28] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.115 second response time [13:54:28] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.100 second response time [13:54:29] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [13:54:29] PROBLEM - Apache HTTP on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.092 second response time [13:54:30] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [13:54:30] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.082 second 
response time [13:54:31] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.079 second response time [13:54:36] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [13:54:36] PROBLEM - Apache HTTP on mw1070 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.069 second response time [13:54:36] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.085 second response time [13:54:36] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.062 second response time [13:54:36] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.105 second response time [13:54:37] PROBLEM - Apache HTTP on mw1044 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.103 second response time [13:54:37] PROBLEM - Apache HTTP on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.079 second response time [13:54:38] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [13:54:38] PROBLEM - Apache HTTP on mw1142 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1503 bytes in 3.072 second response time [13:54:39] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.093 second response time [13:54:39] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [13:54:40] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.103 second response time [13:54:40] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time [13:54:46] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:46] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [13:54:56] PROBLEM - Apache HTTP on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.063 second response time [13:54:56] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time [13:54:56] PROBLEM - Apache HTTP on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.065 second response time [13:54:56] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.198 second response time [13:55:10] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.098 second response time [13:55:11] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.058 second response time [13:55:11] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 444 bytes in 0.061 second response time [13:55:12] PROBLEM - Apache HTTP on mw1170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.063 second response time [13:55:12] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.201 second response time [13:55:20] well [13:55:21] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 
0.089 second response time [13:55:26] i think i'm gonna kill a few jobrunners [13:55:26] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.094 second response time [13:55:26] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [13:55:26] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.108 second response time [13:55:26] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [13:55:26] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.097 second response time [13:55:27] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.331 second response time [13:55:27] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.107 second response time [13:55:32] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.118 second response time [13:55:33] PROBLEM - Apache HTTP on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.063 second response time [13:55:33] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:33] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.098 second response time [13:55:52] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.125 second response time [13:55:52] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [13:55:52] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.115 second response time [13:55:52] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time [13:55:52] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [13:55:53] PROBLEM - Apache HTTP on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.082 second response time [13:55:53] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [13:55:54] PROBLEM - Apache HTTP on mw1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.076 second response time [13:55:54] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time [13:55:55] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.133 second response time [13:55:55] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.102 second response time [13:55:56] PROBLEM - Apache HTTP on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 1.865 second response time [13:55:56] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.367 second response time [13:55:57] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.087 second response time [13:55:57] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.071 second response 
time [13:55:58] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.095 second response time [13:55:58] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [13:55:59] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [13:55:59] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.100 second response time [13:56:00] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [13:56:00] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.131 second response time [13:56:02] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.371 second response time [13:56:02] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.464 second response time [13:56:02] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [13:56:02] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time [13:56:14] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [13:56:14] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.058 second response time [13:56:14] PROBLEM - Apache HTTP on mw1154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.073 second response time [13:56:14] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [13:56:23] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.111 second response time [13:56:23] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 71647 bytes in 3.373 second response time [13:56:28] !log Stopped jobrunners on mw1001-1003 [13:56:30] !log killing any sleeping connection on enwiki db slaves to make room [13:56:33] Logged the message, Master [13:56:34] puppet will probably restart them [13:56:37] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [13:56:37] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.087 second response time [13:56:37] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time [13:56:38] PROBLEM - Apache HTTP on mw1219 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.065 second response time [13:56:38] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [13:56:38] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [13:56:38] Logged the message, Master [13:56:48] PROBLEM - Apache HTTP on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.051 second response time [13:56:48] PROBLEM - Apache HTTP on mw1046 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.053 second response time [13:56:51] RECOVERY - 
Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.113 second response time [13:56:51] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [13:56:51] PROBLEM - Apache HTTP on mw1057 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.130 second response time [13:56:51] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [13:56:51] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.106 second response time [13:56:51] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.118 second response time [13:56:52] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.064 second response time [13:56:52] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.111 second response time [13:56:58] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [13:56:58] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.107 second response time [13:56:58] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.133 second response time [13:57:10] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.491 second response time [13:57:23] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.058 second response time [13:57:28] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.096 second response time [13:57:33] PROBLEM - Apache HTTP on mw1042 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1473 bytes in 0.082 second response time [13:57:33] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.066 second response time [13:57:45] (03Abandoned) 10Hoo man: Cut down the number of basic job runners [puppet] - 10https://gerrit.wikimedia.org/r/161452 (owner: 10Hoo man) [13:57:48] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.064 second response time [13:57:49] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [13:57:50] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.102 second response time [13:57:50] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [13:57:50] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [13:57:50] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.105 second response time [13:57:50] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [13:57:50] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.108 second response time [13:57:51] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.110 second response time [13:57:51] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 400 bytes in 0.098 second response time [13:57:52] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [13:58:08] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [13:58:08] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.097 second response time [13:58:12] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.096 second response time [13:58:12] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.129 second response time [13:58:13] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.062 second response time [13:58:28] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [13:58:31] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [13:58:31] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time [13:58:31] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.085 second response time [13:58:40] (03Abandoned) 10Alexandros Kosiaris: Remove puppet freshness check and all dependencies [puppet] - 10https://gerrit.wikimedia.org/r/143304 (owner: 10Alexandros Kosiaris) [13:59:41] and now we have a refresh links for commonswiki: Module:I18n/date [14:00:42] (03CR) 10Alexandros Kosiaris: [C: 032] nfs cleanups [puppet] - 10https://gerrit.wikimedia.org/r/160984 (owner: 10Alexandros Kosiaris) [14:01:35] http://gdash.wikimedia.org/dashboards/reqerror/ [14:01:36] not good [14:01:38] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:01:38] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:01:42] why did it become so much worse suddenly? [14:02:07] <_joe_> mark: when dbs get connections maxed out [14:02:27] i guess because now is the busiest time of day [14:02:41] _joe_: not that simple [14:02:49] for last hour: [14:02:50] 445226 enwiki: cirrusSearchLinksUpdate [14:02:50] 515622 enwiki: refreshLinks [14:02:50] 708081 commonswiki: cirrusSearchLinksUpdate [14:02:50] 955759 commonswiki: refreshLinks [14:02:57] <_joe_> springle: error rate soars for that reason I guess [14:02:58] which is a lot of jobs :-] [14:03:13] (all jobs are counted twice though) [14:03:45] the jobrunners seem to go in waves. binch of heavy writes, which invaidates caches, then wikiuser apaches show up in stampede. it isn't db simply eventually maxing out; it's practically instant when it happens [14:04:44] <_joe_> why is this happening all of a sudden today? [14:04:55] because of that template edit [14:05:11] and commons had a huge template being edited a few minutes ago [14:05:22] ah [14:05:25] so that was that new spike? [14:05:29] I guess so [14:05:56] Is the issue persisting? If so, we should announce it. I can do it if you confirm. 
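One of the mitigations logged above was stopping the job runners on mw1001-1003 (with the caveat that puppet would restart them) and "killing any sleeping connection on enwiki db slaves to make room": idle application threads were holding connection slots that the stampeding Apaches could not get. The sketch below illustrates that kind of cleanup only; the user name, idle threshold and credentials are assumptions, not the exact commands that were run.

```php
<?php
// Hypothetical cleanup: kill idle (Sleep) connections for one application user
// so that live requests can obtain a connection slot again.
$db = mysqli_connect('db-slave.example', 'admin_user', 'secret'); // placeholders
if (!$db) {
    die('connect failed: ' . mysqli_connect_error() . "\n");
}

// 'wikiuser' and the 10-second idle threshold are illustrative assumptions.
$res = $db->query(
    "SELECT ID
       FROM information_schema.PROCESSLIST
      WHERE USER = 'wikiuser'
        AND COMMAND = 'Sleep'
        AND TIME > 10"
);

while ($row = $res->fetch_assoc()) {
    // KILL drops that thread; an idle pooled connection is simply reopened later.
    $db->query('KILL ' . (int)$row['ID']);
}
```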
[14:06:12] I would trash our job queue system entirely to be honest [14:06:13] well [14:06:14] we're ok now [14:06:28] and we can kill more job runners indeed [14:06:30] ok :) Thanks [14:06:31] trying to find some useful metrics [14:06:43] coren: booting up now [14:06:47] someone proposed a change in operations/puppet to reduce the number of normal jobrunners [14:06:47] mark: got a few moments? [14:07:03] hoo: can you just write a response in RT please? [14:07:04] hashar: that was me [14:07:04] thanks [14:07:08] mark: that last apache outage was not due to commons. it's still enwiki [14:07:13] mark: I can [14:07:15] springle: ok [14:07:20] commons slaves are getting hammered, but nothing like [14:07:28] the job runners that I stopped have been restarted [14:07:44] (03PS1) 10Aude: Bump wgJobBackoffThrottling['cirrusSearchLinksUpdate'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161454 [14:08:08] that essentially reverts https://gerrit.wikimedia.org/r/#/c/158401/ [14:08:11] if that helps [14:08:21] put it back at 5 [14:08:43] it's at 1.25 now [14:08:55] i'm not sure [14:09:00] refreshLinks is probably the actual issue [14:09:04] maybe [14:09:06] aude: that would affect db read traffic, but not write? [14:09:24] (which would still be handy -- just wondering) [14:09:25] i think write? [14:09:36] cirrus causes writes to the master db? [14:09:38] why does it write? [14:09:43] doubt it [14:09:47] i look at what it does :) [14:09:56] there is a bunch of cirrusSearchLinksUpdateSecondary [14:09:58] no clue what it is [14:10:23] i think that's the actual update of the elastic index [14:10:47] i was looking at that code at some point earlier [14:10:58] yeah [14:11:16] http://www.gossamer-threads.com/lists/wiki/mediawiki-cvs/424877 [14:11:34] so would help with reads [14:11:39] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:11:41] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:11:47] probably not the main issue [14:11:59] yup [14:12:07] well i'll leave it up to sean to determine whether that would be useful [14:12:35] maybe manybubbles knows more :) [14:13:06] hopefully ;) [14:13:10] if that can reduce the surge of wikiuser traffic on slaves after these jobs execute, then lets try it [14:14:01] cirrusSearchLinksUpdate{,Secondary} on enwiki https://graphite.wikimedia.org/render/?width=617&height=320&_salt=1411136011.322&from=-12hours&target=MediaWiki.stats.job-pop-cirrusSearchLinksUpdateSecondary-enwiki.count&target=MediaWiki.stats.job-pop-cirrusSearchLinksUpdate-enwiki.count [14:14:04] not sure it helps in anything [14:14:44] springle: presumably these would be wikiadmin though? [14:15:20] replied on RT [14:15:46] oh right [14:15:49] mark: yeah ok [14:17:10] i would prefer to defer to manybubbles [14:17:14] i'm not sure increasing is right [14:17:33] yeah we have no real evidence yet that it would help [14:17:40] (03Abandoned) 10Aude: Bump wgJobBackoffThrottling['cirrusSearchLinksUpdate'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161454 (owner: 10Aude) [14:18:12] how are DBs doing now? 
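For context on the abandoned patch above: $wgJobBackoffThrottling is the MediaWiki setting that caps how many jobs of a given type a runner may execute per second, and the debate is whether cirrusSearchLinksUpdate should stay at 1.25 (its value at the time, per the chat) or go back to 5. The sketch below only restates that knob; the file layout is illustrative, and only the setting name and the two numbers come from the log.

```php
<?php
// The throttle under discussion: per job runner, at most this many jobs
// of the given type are popped per second (fractional values are allowed).
$wgJobBackoffThrottling = [
    // Lowered from 5 to 1.25 by an earlier change; the abandoned patch above
    // would have raised it back to 5 to ease the Cirrus-related read load.
    'cirrusSearchLinksUpdate' => 1.25,
];
```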
[14:18:17] i thikn we'd want the opposite [14:18:41] mark: happy at the moment [14:19:17] how come [14:21:13] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:21:13] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:24:33] (03PS1) 10Jackmcbarn: Add OTRS-member group to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) [14:31:24] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:31:24] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:36:57] (03CR) 10Anomie: [C: 031] Don't manipulate the environment to determine TZ offset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161439 (https://bugzilla.wikimedia.org/71036) (owner: 10Ori.livneh) [14:37:51] !log initiated rsync of tridge data that is to be kept to nas1001-a [14:37:56] Logged the message, Master [14:37:56] (03CR) 10Ottomata: "Ah, that's because analytics VLAN is firewalled off from the rest of production." [puppet] - 10https://gerrit.wikimedia.org/r/160467 (owner: 10Giuseppe Lavagetto) [14:38:17] once the rsync is complete, we can kill tridge [14:39:41] finally tampa is going [14:41:18] mark: Last thing we need from (probably) you to finish the labsdb1001 migration is the firewall rules for the new IP (and cleanup of the old ones) [14:42:32] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:42:32] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:45:18] Coren: ok [14:49:17] Coren: new ip has been added [14:49:19] when can I remove the old? [14:49:51] mark: Now if you want; the box is up and working in the new row now. :-) [14:50:18] ok [14:50:37] mark: email sent [14:50:43] i have to sleep [14:50:57] thanks so much! [14:50:58] sleep well [14:51:00] and have a nice weekend [14:51:13] outage reports are allowed to wait btw [14:51:13] Thanks springle; good weekend! [14:51:25] no need to lose sleep over them as long as they capture the information later ;) [14:51:45] i would forget too much [14:52:09] hehe [14:52:11] Coren: removed [14:52:44] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:52:44] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [14:53:33] mark: RT 6860 correspondingly closed as resolved. :-) [14:53:41] excellent, thanks [14:53:57] (03PS1) 10Andrew Bogott: Add service aliases for ldap servers. [dns] - 10https://gerrit.wikimedia.org/r/161461 [14:55:28] (03CR) 10Andrew Bogott: Add service aliases for ldap servers. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/161461 (owner: 10Andrew Bogott) [14:56:33] godog (or someone else who's familiar with our DNS setup), I could use advice about ^ [14:57:03] hm, bblack maybe [14:57:29] !log Jenkins friday deploy: migrate all MediaWiki extension qunit jobs to Zuul cloner. 
[14:57:34] Logged the message, Master [14:57:35] andrewbogott: yep that'll do, I'll comment [14:57:41] thx [14:57:54] (03CR) 10Mark Bergsma: [C: 031] Add service aliases for ldap servers. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/161461 (owner: 10Andrew Bogott) [14:58:38] (03CR) 10Filippo Giunchedi: [C: 031] "what Mark said :)" [dns] - 10https://gerrit.wikimedia.org/r/161461 (owner: 10Andrew Bogott) [15:02:52] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:02:52] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:03:03] (03CR) 10Alexandros Kosiaris: [C: 032] librenms: Provide a database purge script [puppet] - 10https://gerrit.wikimedia.org/r/160950 (owner: 10Alexandros Kosiaris) [15:10:25] someone was looking for me? [15:11:27] gooood morning manybubbles :-] [15:12:02] manybubbles: tldr; we had two outages today (around 9am UTC and 1pm UTC) caused by a myriad of jobs saturating the number of mysql connection for the enwiki S1 shard [15:12:23] due to some heavily used template being edited, which creates million of jobs [15:12:25] ah - I saw the email [15:12:44] refreshLinks and some CirrusSearchOnLinksUpdate{,secondary} , the later apparently hitting the master db [15:12:53] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:12:54] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:13:06] I believe chad set them up to hit the master db but I don't recall why [15:13:19] ^d: ^^ - looks like cirrus's jobs were causing contention [15:13:25] they should all be read traffic [15:13:26] not just cirrus really [15:13:33] they were also just a symptom [15:13:39] k [15:14:00] but sure, anything that reduces read load too will help [15:14:42] manybubbles: might be relevant https://graphite.wikimedia.org/render/?width=617&height=320&_salt=1411136011.322&from=-12hours&target=MediaWiki.stats.job-pop-cirrusSearchLinksUpdateSecondary-enwiki.count&target=MediaWiki.stats.job-pop-cirrusSearchLinksUpdate-enwiki.count&target=MediaWiki.stats.job-pop-refreshLinks-enwiki.count [15:14:58] (03PS1) 10Giuseppe Lavagetto: icinga: fix duplicate contactgroup def [puppet] - 10https://gerrit.wikimedia.org/r/161464 [15:14:59] that is graph for some jobs pop on enwiki [15:15:35] <_joe_> hashar: ^^ icinga was not reloading since forever due to this [15:15:42] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] icinga: fix duplicate contactgroup def [puppet] - 10https://gerrit.wikimedia.org/r/161464 (owner: 10Giuseppe Lavagetto) [15:15:44] manybubbles: and the 50 hours graph https://graphite.wikimedia.org/render/?width=617&height=320&_salt=1411136011.322&from=-50hours&target=MediaWiki.stats.job-pop-cirrusSearchLinksUpdateSecondary-enwiki.count&target=MediaWiki.stats.job-pop-cirrusSearchLinksUpdate-enwiki.count&target=MediaWiki.stats.job-pop-refreshLinks-enwiki.count [15:16:18] whelp, that was me, _joe_ [15:16:20] sorry [15:16:28] hashar: makes sense - cirrus gets one linksUpdate job per refresh links job - its how it knows that the page might have changed [15:16:43] yeah we figured that out eventually [15:17:19] first we were thinking it was cirrussearchlinksupdate doing link updates but that just didn't make sense ;) [15:17:36] as in, 
link update writes in mysql [15:19:11] mark: yeah - its a bad name for the job but one I haven't had the willpower to rename because renaming them is a pain [15:19:26] its called that because it chains on the links update process - but thats not a good excuse [15:19:38] it needs the links update process to have been completed, actually [15:19:50] <_joe_> YuviPanda: what's worse, it was not the only error [15:20:02] did I fuck up more? [15:20:26] <_joe_> no [15:20:39] <_joe_> the next one, I introduced with my change this morning [15:20:39] * YuviPanda should be doubly careful when merging things on machines he doesn't have access to [15:20:43] <_joe_> and I kinda expected it [15:22:12] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:22:12] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:32:22] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:32:22] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:35:18] <_joe_> YuviPanda: do you see where the hell hostrgoups get defined in icinga? [15:35:48] _joe_: hmm, no? I presumed they were from naggen2 [15:35:52] or similarly generated [15:36:40] * _joe_ facepalms [15:36:54] <_joe_> it's a lot of time we don't introduce a new hostgroup I'd say [15:36:56] _joe_: do tell me if you find out [15:36:58] off for the week-end have a good afternoon [15:37:08] <_joe_> YuviPanda: we don't [15:37:15] oh [15:37:15] wat [15:37:24] _joe_: we don't use hostgroups? [15:37:39] <_joe_> we do, but they're not in any icinga config file I can find [15:42:39] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:42:39] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:52:07] hey guys; Varnish seems to be 503ing when attempting to create a new account - 10.64.0.105 I believe is the IP the request is being serviced through :) [15:52:28] I have HHVM enabled so ori ^ (just for relevance :) ) [15:52:52] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:52:52] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:53:29] ori: disabled HHVM and fixed so - HHVM issue :) [15:53:47] JohnLewis: file buggg! [15:53:49] ori is asleep [15:53:51] (I hope) [15:54:03] yuvipanda: mkay :p [15:55:28] yuvipanda: does HHVM have its own component anywhere or is it one of these lonely services which only have a keyword? [15:55:36] I think keyword [15:56:25] blurg; I'll file it under General/Unknown then and add the keyword [15:58:37] <_joe_> JohnLewis: a tag [15:58:43] <_joe_> HHVM as a tag [15:58:49] _joe_: kk [15:58:56] <_joe_> thanks a lot! 
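The graphite.wikimedia.org links pasted earlier in the conversation use Graphite's render API to plot the MediaWiki.stats.job-pop-* counters (how many jobs of each type were popped per wiki). The same API can return the series as data rather than a PNG by requesting JSON; the sketch below is an illustration (format=json is standard Graphite behaviour, but the endpoint's access rules and retention are assumptions).

```php
<?php
// Fetch the enwiki refreshLinks job-pop counter for the last 12 hours as JSON
// instead of the PNG the dashboard links render. Metric name taken from the log.
$base   = 'https://graphite.wikimedia.org/render/';
$params = http_build_query([
    'target' => 'MediaWiki.stats.job-pop-refreshLinks-enwiki.count',
    'from'   => '-12hours',
    'format' => 'json',   // standard Graphite render API option
]);

$series = json_decode(file_get_contents($base . '?' . $params), true); // assumes the endpoint is reachable

// Each element carries 'target' and 'datapoints' => [[value, timestamp], ...]
foreach ($series as $s) {
    $points = array_filter($s['datapoints'], function ($p) { return $p[0] !== null; });
    printf("%s: %d non-null datapoints\n", $s['target'], count($points));
}
```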
[16:00:04] (03PS1) 10Giuseppe Lavagetto: adding appservers_hhvm monitor group [puppet] - 10https://gerrit.wikimedia.org/r/161467 [16:00:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] adding appservers_hhvm monitor group [puppet] - 10https://gerrit.wikimedia.org/r/161467 (owner: 10Giuseppe Lavagetto) [16:03:08] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [16:03:09] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [16:04:08] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures [16:06:38] (03PS1) 10Jackmcbarn: Re-enable the Lua profiler on production HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161468 [16:07:00] ori: ^ [16:08:35] ottomata: thanks for the offer to help! [16:09:02] (03PS1) 10Giuseppe Lavagetto: adding mysql_codfw hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/161469 [16:09:08] <_joe_> GRRRRR [16:09:17] certainly, shinkin sounds awesome [16:09:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] adding mysql_codfw hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/161469 (owner: 10Giuseppe Lavagetto) [16:10:22] <_joe_> yuvipanda: lemme fix icinga for today [16:10:30] <_joe_> on monday we can work on shinken [16:10:42] _joe_: oooh, cool [16:13:14] <_joe_> oh my finally [16:13:20] <_joe_> icinga reloaded [16:13:30] <_joe_> I can get on with the weekend [16:13:33] :D [16:13:52] _joe_: have fun! I'll hopefully have moved the custom checks + config into nagios_common by then (will keep the patches open) [16:14:52] PROBLEM - Host mathoid.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [16:15:52] should mathoid be paging ? [16:16:02] (cuz it totally is, not sure if its fully online) [16:16:03] PROBLEM - mathoid on sca1001 is CRITICAL: Connection refused [16:16:04] PROBLEM - LDAPS on labcontrol2001 is CRITICAL: Connection refused [16:16:14] PROBLEM - mathoid on sca1002 is CRITICAL: Connection refused [16:16:19] <_joe_> not sure [16:16:35] I just handed over the servers for it not htat long ago, but I suppose it could be in service.... [16:16:40] <_joe_> robh: long story short: icinga did not reload since forevew [16:16:54] ah [16:17:13] so it may now be generating some alerts that have been in an alert state for awhile. [16:17:21] <_joe_> btw if you see an 'HHVM rendering' alarm, just get on the server and 'restart hhvm' [16:17:29] <_joe_> yes [16:17:29] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 2 failures [16:17:39] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.123 second response time [16:17:55] <_joe_> like this one ^^ [16:18:03] <_joe_> it will happen often I guess [16:18:19] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:18:19] PROBLEM - LDAP on labcontrol2001 is CRITICAL: Connection refused [16:18:20] PROBLEM - HHVM rendering on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.166 second response time [16:21:09] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:34] !log restarted hhvm on mw1019 + 1021 [16:21:39] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.224 second response time [16:21:41] Logged the message, Master [16:21:57] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 1.010 second response time [16:23:10] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [16:25:14] _joe_: what's your take on lowering the job runner parallelism to roughly match the # of hyperthreads? [16:26:05] <_joe_> gwicke: +1, but my brain is completely fried now [16:26:15] <_joe_> I'm off for realz :) [16:26:21] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:35] okay, I'll look into it [16:26:39] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:42] _joe_: enjoy your weekend! [16:28:49] PROBLEM - check if dhclient is running on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:29:00] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.173 second response time [16:29:30] RECOVERY - DPKG on fenari is OK: All packages OK [16:29:40] PROBLEM - HHVM rendering on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.171 second response time [16:29:49] RECOVERY - check if dhclient is running on fenari is OK: PROCS OK: 0 processes with command name dhclient [16:30:20] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.164 second response time [16:33:29] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 5.986 second response time [16:33:39] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [16:37:53] godog: btw, i'm trying to take it easy with puppet merges, so if you're up for merging https://gerrit.wikimedia.org/r/#/c/161332/ or https://gerrit.wikimedia.org/r/#/c/133274/ (both of which you +1'd), that'd be cool. but also cool if you want to wait. [16:40:59] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:09] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:59] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:09] RECOVERY - DPKG on fenari is OK: All packages OK [16:44:04] (03PS1) 10GWicke: Lower the basic job runner parallelism [puppet] - 10https://gerrit.wikimedia.org/r/161473 [16:44:50] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:58] (03CR) 10Awight: [C: 032] Change donate cookie to 250 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161402 (owner: 10Ejegg) [16:45:05] (03Merged) 10jenkins-bot: Change donate cookie to 250 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161402 (owner: 10Ejegg) [16:47:10] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:47:51] (03PS2) 10Ori.livneh: Don't manipulate the environment to determine TZ offset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161439 (https://bugzilla.wikimedia.org/71036) [16:48:00] (03CR) 10Ori.livneh: [C: 032] Don't manipulate the environment to determine TZ offset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161439 (https://bugzilla.wikimedia.org/71036) (owner: 10Ori.livneh) [16:48:00] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:05] (03Merged) 10jenkins-bot: Don't manipulate the environment to determine TZ offset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161439 (https://bugzilla.wikimedia.org/71036) (owner: 10Ori.livneh) [16:48:09] (03PS2) 10GWicke: Lower the basic job runner parallelism [puppet] - 10https://gerrit.wikimedia.org/r/161473 [16:48:28] ori: sure! next week tho :) [16:48:35] godog: np! have a good weekend [16:48:46] ori: you too! [16:48:49] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output [16:48:59] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [16:50:13] (03CR) 10Ori.livneh: [C: 031] "Gabriel, thanks for the patch and the thoughtful analysis in the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke) [16:50:19] RECOVERY - DPKG on fenari is OK: All packages OK [16:50:29] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [16:55:30] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:36] whatsup [16:55:39] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:29] not fenari [16:59:31] RECOVERY - DPKG on fenari is OK: All packages OK [16:59:52] ori++ [17:05:34] poor fenari [17:05:52] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:52] PROBLEM - check if dhclient is running on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:06:51] RECOVERY - check if dhclient is running on fenari is OK: PROCS OK: 0 processes with command name dhclient [17:06:52] RECOVERY - DPKG on fenari is OK: All packages OK [17:07:05] !log restarting apache on fenari [17:07:11] Logged the message, Master [17:07:41] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:08:32] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output [17:08:51] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [17:09:17] (03CR) 10BBlack: [C: 032] Lower the basic job runner parallelism [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke) [17:09:31] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.094 second response time [17:10:42] hmmm no grrrit-wm ? 
[17:10:58] it's here [17:11:01] perhaps lagged [17:13:01] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:16:04] !log initiating controlled shutdown of kafka broker analytics1021 to test some kafkatee weirdness, as well as a potential kafka/zookeeper bug [17:16:11] Logged the message, Master [17:24:11] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 15.0 [17:24:21] ah, thought i marked that as ok [17:24:23] in icinga [17:26:59] (03PS1) 10Yuvipanda: nagios_common: Refactor custom command definitions [puppet] - 10https://gerrit.wikimedia.org/r/161478 [17:27:09] ottomata: ^ is the direction of the refactor [17:27:35] akosiaris: ^ [17:28:04] it's a first pass, probably has bugs, but I suppose nagios_common::check_command can be used for most / all of our custom check commands [17:28:15] and then they can be aggregated in a check_commands class that I can then include in shinken [17:28:30] _joe_: ^ if you are around (which I hope you aren't) [17:30:52] !log turned down apache prefork procs on fenari to reduce swapping [17:30:58] Logged the message, Master [17:31:43] (I donno why apache children keep eating so much memory there, probably a leak of some kind, but fewer leaky children -> don't try to swapdie when puppet runs) [17:31:44] (03PS1) 10Ori.livneh: mediawiki::syslog: add docs [puppet] - 10https://gerrit.wikimedia.org/r/161480 [17:32:12] (03CR) 10Ottomata: "There is already a monitoring module (albeit only for a git_merge) check. I'm not sure of the history of this, but that might be a good '" [puppet] - 10https://gerrit.wikimedia.org/r/161446 (owner: 10Yuvipanda) [17:32:22] (03PS2) 10Ori.livneh: mediawiki::syslog: add docs [puppet] - 10https://gerrit.wikimedia.org/r/161480 [17:32:31] (03CR) 10Ori.livneh: [C: 032 V: 032] "docs-only" [puppet] - 10https://gerrit.wikimedia.org/r/161480 (owner: 10Ori.livneh) [17:32:47] uh-oh, just got "all servers are busy" on enwiki [17:33:35] (03CR) 10Yuvipanda: "I'm ok with moving this into the monitoring class instead of this. That's mostly just a naming bikeshed, and can be whatever :)" [puppet] - 10https://gerrit.wikimedia.org/r/161446 (owner: 10Yuvipanda) [17:34:42] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.173 second response time [17:34:57] (03CR) 10Ottomata: "As for your subsequent check_command commit, we could thne have:" [puppet] - 10https://gerrit.wikimedia.org/r/161446 (owner: 10Yuvipanda) [17:35:06] don't restart hhvm on mw1020 please [17:35:07] i want to look [17:35:30] (03CR) 10Ottomata: "I love naming bikesheds!" [puppet] - 10https://gerrit.wikimedia.org/r/161446 (owner: 10Yuvipanda) [17:36:07] ottomata: bikeshed moved to the ops@ list :) [17:36:20] (03CR) 10Ottomata: nagios_common: Refactor custom command definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [17:36:22] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [17:37:38] (03PS2) 10Yuvipanda: nagios_common: Refactor custom command definitions [puppet] - 10https://gerrit.wikimedia.org/r/161478 [17:37:42] ottomata: it was a typo! fixed [17:38:17] (03CR) 10Dzahn: add redirects for wikimania.org/.com (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [17:38:20] danke! 
[17:38:22] :) [17:39:01] ottomata: :D but does the general define based approach look ok? [17:40:51] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.511 second response time [17:41:22] (03CR) 10Andrew Bogott: [C: 032] labs-vagrant: Fix initial clone and ownership [puppet] - 10https://gerrit.wikimedia.org/r/161353 (owner: 10BryanDavis) [17:42:23] (03CR) 10Dzahn: "but another question now to all: redirect wikimania.org still to wikimania2014.wikimedia.org or already to wikimania2015.wikimedia.org ?" [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [17:42:42] yuvipanda: just to jog my memory, we add check commadns now just by editing the checkcommands.cfg.erb file, ja? [17:42:50] ottomata: ya [17:43:08] ottomata: with the refactor, checkcommands will only contain things that come with the icinga/shinken package by default. [17:43:12] and the define turns this into single file per command? [17:43:13] ottomata: all our custom ones will have their own .cfg [17:43:24] ottomata: well, per set of commands. check_graphite plugin provides 3 commands [17:43:39] ah because content => [17:43:46] you still have template control [17:43:49] ottomata: yea [17:43:56] i like very much :) [17:43:59] ottomata: the current code lets me use either templates or files [17:44:06] ja [17:44:27] ottomata: :D only some config files require templates (to specfy mysql password, for example) [17:45:08] aye [17:45:20] ottomata: no clear way to test this, tho :( [17:45:22] ja, reviewing... [17:45:24] ottomata: since no icinga in labs [17:45:27] ottomata: yay reviewing :) [17:45:41] ottomata: I'll move all the other custom checks one check at a time. [17:46:33] Where are we running authdns-update these days? Rubidium? [17:47:39] andrewbogott: any of the current 4x ns machines, it will update the rest [17:47:50] great [17:47:51] ns[012].wikimedia.org, or ns1-baham.wikimedia.org [17:47:55] andrewbogott: i lazily do "ssh root@ns0" [17:48:17] (or atlernatively, the real hostnames are mexia, rubidium, eeden.esams, and baham (all .wm.o)) [17:48:21] heh, which gets me rubidium [17:48:24] *alternatively [17:48:26] (03CR) 10Ottomata: nagios_common: Refactor custom command definitions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [17:48:32] yuvipanda: some comments on PS1 [17:48:39] * yuvipanda checks [17:48:48] (03CR) 10Andrew Bogott: [C: 032] Add service aliases for ldap servers. [dns] - 10https://gerrit.wikimedia.org/r/161461 (owner: 10Andrew Bogott) [17:51:50] (03CR) 10Yuvipanda: nagios_common: Refactor custom command definitions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [17:51:56] ottomata: responded! [17:52:44] manybubbles: btw, where's the ES code that writes to ganglia? [17:52:48] in cirrus, or part of ES? [17:53:44] manybubbles: we can send graphite es data with https://github.com/spinscale/elasticsearch-graphite-plugin [17:54:34] (03CR) 10Andrew Bogott: "Yep, I get a clean labs-vagrant run on a new box now -- works the first time. Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/161353 (owner: 10BryanDavis) [17:55:09] ottomata: I'm going to wait on more review before I convert the other custom commands [17:55:29] (03CR) 10Mark Bergsma: [C: 031] Allocate sandbox vlans for codfw and ulsfo [dns] - 10https://gerrit.wikimedia.org/r/158636 (owner: 10Mark Bergsma) [17:56:01] <^d> yuvipanda: Some script in puppet pulls the stats from ES's /_cluster/stats/ api. 
[17:56:11] ^d: oh [17:56:35] ^d: it'll be good for betacluster monitoring of ES/Cirrus if we can put those in graphite instead [17:56:59] <^d> That plugin looks promising. [17:57:17] <^d> The bugs/todo are worrying ;-) [17:57:21] ^d: indeed :) [17:57:30] but... what's the worst that can happen? :D [17:57:36] !log ori Synchronized wmf-config/CommonSettings.php: I3e1bd5e4bb: Don't manipulate the environment to determine TZ offset (Bug: 71036) (duration: 00m 13s) [17:57:42] Logged the message, Master [17:58:11] <^d> yuvipanda: Pushing from the plugin seems a little nicer on ES too than querying from the rest api. [17:58:19] ^d: yeah [17:58:34] less fragile [17:58:49] <^d> Wanna plug it in beta and play with it? [17:59:11] ^d: hmm, don't have time anytime soon :( doing a biggish refactor of ops/puppet [17:59:32] <^d> Not today anyway, I'm not working ;-) [17:59:38] ^d: aah :D [17:59:54] * yuvipanda clues greg-g in to the above conversation, just to keep him notified [18:00:00] ^d: let me file ab ug [18:00:13] * greg-g is in back to back to back meetings for another hour [18:00:19] anything time sensitive? [18:00:22] <^d> Nope. [18:00:26] coolio, tah! [18:00:43] <^d> yuvipanda: Bug sounds good. We'll start having a look at it next week perhaps :) [18:01:18] (03CR) 10Ottomata: nagios_common: Refactor custom command definitions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [18:04:50] ^d: wheee [18:10:15] (03PS2) 10Dzahn: redirect wikimania.org/.com to wikimania2015 [puppet] - 10https://gerrit.wikimedia.org/r/161405 [18:13:01] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.177 second response time [18:14:55] (03CR) 10Yuvipanda: nagios_common: Refactor custom command definitions (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [18:16:42] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.137 second response time [18:17:06] ^d: https://bugzilla.wikimedia.org/show_bug.cgi?id=71055 [18:17:39] !log restarting hhvm on mw1020 [18:17:45] Logged the message, Master [18:18:00] <^d> yuvipanda: sweet, thx. yeah I'll poke that plugin next week when I'm back :) [18:18:04] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.330 second response time [18:18:32] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.264 second response time [18:19:40] !log Jenkins: reverting job mwext-VisualEditor-qunit to previous state (i.e. without Zuul cloner) [18:19:46] Logged the message, Master [18:20:00] AH yuvipanda, but plugin_content, etc. won't work for the check_ganglia, use case, btw [18:20:03] which is a deb package + symlink [18:20:05] but ja :/ [18:20:40] maybe if you could set them both to false or undef to disable management of the plugin file :) [18:21:19] (03CR) 10Ottomata: nagios_common: Refactor custom command definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [18:21:41] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.195 second response time [18:21:52] lol [18:22:18] !log restarting hhvm on mw1020 (again!) [18:22:23] Logged the message, Master [18:22:24] <^d> hashar: Thanks for finishing off that elasticsearch-swift jenkins job for me. Very useful! 
[18:22:32] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.446 second response time [18:22:41] PROBLEM - check if dhclient is running on virt1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:01] PROBLEM - DPKG on virt1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:08] PROBLEM - RAID on virt1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:12] PROBLEM - check configured eth on virt1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:13] PROBLEM - Disk space on virt1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:13] PROBLEM - puppet last run on virt1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:32] RECOVERY - check if dhclient is running on virt1009 is OK: PROCS OK: 0 processes with command name dhclient [18:23:46] RECOVERY - DPKG on virt1009 is OK: All packages OK [18:23:52] RECOVERY - RAID on virt1009 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 [18:24:02] RECOVERY - check configured eth on virt1009 is OK: NRPE: Unable to read output [18:24:11] RECOVERY - Disk space on virt1009 is OK: DISK OK [18:24:11] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 368 seconds ago with 0 failures [18:24:43] ^d: I have just deployed it hehe :] [18:24:56] ^d: if you know of other repos that could use jobs, be bold! [18:25:17] <^d> All my repos are covered :) [18:25:23] \O/ [18:25:34] time for me to disappear [18:31:48] ottomata: for ganglia we just don't have to use this :) [18:32:04] :) [18:32:18] ottomata: or I can add a plugin_target, command_target :) [18:32:23] (03CR) 10JanZerebecki: [C: 031] redirect wikimania.org/.com to wikimania2015 [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [18:32:24] err [18:32:25] not target [18:32:26] src? [18:32:29] or something [18:32:29] :) [18:32:33] oh yeah [18:32:36] target :) [18:33:21] (03CR) 10Yuvipanda: nagios_common: Refactor custom command definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161478 (owner: 10Yuvipanda) [18:33:41] PROBLEM - HHVM rendering on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.169 second response time [18:41:41] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.166 second response time [18:42:24] (03PS1) 10Dzahn: rename mw-rc-irc not from rc-pmtpa to rc-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/161497 [18:43:01] ori: so are you still owrking on any particular hhvm server [18:43:07] or do i need to start restarting proceses? =] [18:43:26] (you mentioned troubleshooting it earlier so i didnt wanna mess with it if you were) [18:44:11] (03CR) 10Dzahn: "11:32 -!- rc-pmtpa [~rc-pmtpa@special.user]" [puppet] - 10https://gerrit.wikimedia.org/r/161497 (owner: 10Dzahn) [18:44:47] robh: the issue is known and we have a fix queued up, just working out the deployment mechanics. i don't think restarting would help much, but i'm not opposed to it either [18:45:07] well, you know more about it than me ;] [18:45:12] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.176 second response time [18:45:20] heh, i didnt do anything. 
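Tying off the check_ganglia point above: with a define shaped like the earlier sketch, a plugin that is installed by a deb package (plus a symlink) would simply omit the plugin content/source, so Puppet only manages the Icinga command definition and leaves the plugin file alone. This mirrors the "set them both to false or undef" idea; the resource title and source path here are illustrative.

    # check_ganglia's plugin comes from a package + symlink, so only the
    # command definition fragment is managed (uses the sketched define above).
    nagios_common::check_command { 'check_ganglia':
        config_source => 'puppet:///modules/nagios_common/checkcommands/check_ganglia.cfg',
    }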
[18:45:23] it'll flap until we push the fix, hopefully within an hour or so [18:45:36] i need to get greg's OK and he's in a meeting [18:45:37] ok, i just wanted to make sure that we werent neglecting it or something [18:45:40] cool [18:45:44] nod, thanks! [18:45:47] (03CR) 10Legoktm: [C: 04-1] "See https://gerrit.wikimedia.org/r/#/c/136965/" [puppet] - 10https://gerrit.wikimedia.org/r/161497 (owner: 10Dzahn) [18:48:13] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.169 second response time [18:49:29] (03CR) 10Dzahn: [C: 04-2] "Legoktm: oh haha, thanks for pointing that one out. then all that's left is wondering about how the migration to rcstream is going" [puppet] - 10https://gerrit.wikimedia.org/r/161497 (owner: 10Dzahn) [18:50:38] (03Abandoned) 10Dzahn: rename mw-rc-irc not from rc-pmtpa to rc-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/161497 (owner: 10Dzahn) [18:51:13] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.184 second response time [18:51:24] (03CR) 10Dzahn: "it already said "rename .. not" instead of "rename .. bot" :)" [puppet] - 10https://gerrit.wikimedia.org/r/161497 (owner: 10Dzahn) [18:56:28] (03CR) 10Dzahn: "alright, so let's have one separate SSL cert that covers all monitoring tools and make it a general "don't put monitoring tools behind mis" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [18:58:45] (03CR) 10Dzahn: "i'll give up on this one and make new smaller changes. it's a lot of rebasing to keep up with icinga changes, because currently we have mu" [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [18:59:02] <_joe_> !log rolling restart of hhvm servers [18:59:02] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 71509 bytes in 1.517 second response time [18:59:04] (03CR) 10RobH: "right now we have the policy of no wildcard certificates outside of misc-web-lb or mail ssl termination. (so a single cert for all monito" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [18:59:08] Logged the message, Master [18:59:22] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.228 second response time [18:59:33] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.168 second response time [18:59:33] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.212 second response time [18:59:43] (03Abandoned) 10Dzahn: turn icinga into module [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [19:02:33] PROBLEM - HHVM rendering on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.164 second response time [19:06:15] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [19:06:41] (03CR) 10Dzahn: "if it would be a single cert with all the individual service names as SANs but not a "star" cert, how problematic would it be to add addit" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [19:06:54] i'll check out analytics1021 [19:07:02] jgage [19:07:03] i'm on it [19:07:06] cool ok [19:07:10] i'm messing with it (see SAL) [19:07:16] i keep setting it as in maintenance mode... 
[19:07:21] maybe the time has past [19:07:23] passed [19:07:37] ah, i didn't check that one, [19:07:39] yeah i shut it down [19:07:41] it sbeing a jerk [19:07:49] typical :P [19:08:22] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [19:08:59] !log restarted hhvm on mw1021 [19:09:05] Logged the message, Master [19:09:47] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.175 second response time [19:11:20] (03CR) 10RobH: "that would be preferred, as it locks it down to those specific service fqdn. costwise it is identical to buying individual certs for each" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [19:13:51] (03CR) 10Dzahn: "ok, thanks.:) i guess we should continue either on ops list as suggested by _joe_ or on an ticket right away. let me abandon this specific" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [19:14:13] (03Abandoned) 10Dzahn: tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [19:18:58] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [19:34:06] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Epic puppet fail [19:34:12] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5954.83058067 [19:46:47] (03PS1) 10RobH: adding ldap-[codfw|eqiad].wikimedia.org certificates [puppet] - 10https://gerrit.wikimedia.org/r/161518 [19:48:15] (03CR) 10RobH: [C: 032] adding ldap-[codfw|eqiad].wikimedia.org certificates [puppet] - 10https://gerrit.wikimedia.org/r/161518 (owner: 10RobH) [19:53:17] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:01:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [20:03:32] (03PS1) 10Andrew Bogott: Switch to new service-specific ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/161528 [20:04:18] robh: so, the obvious thing to do is just ^. But from what you said about globalsign vs. rapidcert… there must be more to it? [20:04:35] e.g. specifying a 'provider' arg to install_certificate? [20:04:58] yea, the ca_name = 'Equifax_Secure_CA.pem' [20:04:59] (03CR) 10MaxSem: [C: 031] redirect wikimania.org/.com to wikimania2015 [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [20:05:04] that has to change to the globalsign one [20:06:10] (03CR) 10RobH: [C: 04-1] "see inline comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161528 (owner: 10Andrew Bogott) [20:06:18] updated with inline comment [20:06:56] (03CR) 10RobH: "pls note my recommended change will break the cert chain on virt0, as it wasnt reissued on globalsign." 
[puppet] - 10https://gerrit.wikimedia.org/r/161528 (owner: 10Andrew Bogott) [20:07:00] but it will totally break virt0 [20:07:06] PROBLEM - HHVM rendering on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.165 second response time [20:07:07] unless you break it out on its own [20:07:43] andrewbogott: So, I'd move the ca_name statement to within the case statement per hostname [20:07:48] rather than before it [20:08:14] and then have current one just for virt0, and the new addition of the globalsign CA to just virt1000 and labcontrol2001 [20:09:29] (03CR) 10RobH: Switch to new service-specific ldap certs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/161528 (owner: 10Andrew Bogott) [20:09:43] that leaves everything working. [20:11:16] (03PS2) 10Andrew Bogott: Switch to new service-specific ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/161528 [20:11:27] robh: ok, this patch also adds an arg to install_certificate which I'm unsure about [20:11:32] why did it work leaving that empty before? [20:12:01] huh... [20:12:14] good question. [20:12:31] oh, i think these may have it defined in the vhost? [20:12:41] that is the other place you can point at it maybe, but dunno [20:12:44] checkin. [20:13:13] no, its ldap, duh, wont matter [20:13:14] hrmm [20:13:16] It looks to me like if that arg is empty it just adds… everything [20:13:17] andrewbogott: I dunno =P [20:13:27] oh, yea, that is messy but would work [20:13:31] i guess [20:13:34] So I guess I'll leave it empty -- worked before, will probably still work :) [20:14:01] well, adding all the intermediary certs isnt a ton of overhead on a system [20:14:08] and certs are non private anyhow [20:14:21] so i suppose it doesnt matter, im cool if you left it in to try out too ;] [20:14:33] its not going to remove certs off exisitng hosts. [20:14:38] heh, why not? Presumably applying this will leave the old... [20:14:41] yep [20:14:43] yeah, what you said [20:14:50] i'd say leave it and see how it goes iwth labcontrol2001 [20:14:55] So, happy with that patch as is? [20:15:20] (03CR) 10RobH: [C: 031] "looks good to me, just putting +1 to leave for andrew to +2 and merge." [puppet] - 10https://gerrit.wikimedia.org/r/161528 (owner: 10Andrew Bogott) [20:15:30] yep [20:16:33] (03CR) 10Andrew Bogott: [C: 032] Switch to new service-specific ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/161528 (owner: 10Andrew Bogott) [20:16:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:18:10] (03PS5) 10Ori.livneh: wmflib: add ensure_service() [puppet] - 10https://gerrit.wikimedia.org/r/149778 [20:18:51] (03CR) 10Ori.livneh: [C: 032] "Chase said he's OK with this, and on reflection I think it's better to have this as a complement to ensure_directory / ensure_link, becaus" [puppet] - 10https://gerrit.wikimedia.org/r/149778 (owner: 10Ori.livneh) [20:23:18] PROBLEM - Certificate expiration on virt1000 is CRITICAL: SSL error: [Errno 185090050] _ssl.c:340: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib [20:23:27] (03CR) 10Dzahn: "there is also a global place where we configure which certs belongs to which CA. in manifests/certs.pp from line 150 which i used before. " [puppet] - 10https://gerrit.wikimedia.org/r/161528 (owner: 10Andrew Bogott) [20:25:03] !log Deployed Ic71064e08 (type hint fix for Wikidata) to wmf21/22. 
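A condensed sketch of what robh is suggesting above: pick the CA inside the per-hostname case statement, so virt0 keeps its existing Equifax-chained certificate while the new GlobalSign-issued ldap-* certs go only to virt1000 and labcontrol2001. The CA file names and hostnames come from the discussion; the install_certificate parameter name, the virt0 certificate placeholder, and the eqiad/codfw mapping of the two new hosts are assumptions.

    # Sketch only: per-host cert/CA selection instead of one ca_name up front.
    case $::hostname {
        'virt0': {
            $ldap_cert = 'virt0-existing-cert'      # placeholder: whatever virt0 already serves
            $ca_name   = 'Equifax_Secure_CA.pem'    # unchanged, so its chain keeps working
        }
        'virt1000': {
            $ldap_cert = 'ldap-eqiad.wikimedia.org'
            $ca_name   = 'GlobalSign_CA.pem'
        }
        'labcontrol2001': {
            $ldap_cert = 'ldap-codfw.wikimedia.org'
            $ca_name   = 'GlobalSign_CA.pem'
        }
        default: {
            fail("No LDAP certificate defined for ${::hostname}")
        }
    }

    install_certificate { $ldap_cert:
        ca => $ca_name,   # parameter name assumed; leaving it empty reportedly pulls in every CA
    }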
[20:25:09] Logged the message, Master [20:34:18] RECOVERY - LDAPS on labcontrol2001 is OK: TCP OK - 0.044 second response time on port 636 [20:34:27] RECOVERY - LDAP on labcontrol2001 is OK: TCP OK - 0.043 second response time on port 389 [20:34:27] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:34:37] PROBLEM - Certificate expiration on labcontrol2001 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed [20:34:47] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.191 second response time [20:37:01] andrewbogott: Coren: labcontrol2001 having some ssl issue apparently ^^^ [20:37:22] hashar: labcontrol2001 didn't exist until today so I'm not worried :) [20:37:46] andrewbogott: proactive monitoring of non existent box. We are a good org! [20:40:27] PROBLEM - HHVM rendering on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.176 second response time [20:40:37] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.166 second response time [20:43:54] (03PS1) 10Ori.livneh: ensure_service(): fix docs, add more usage [puppet] - 10https://gerrit.wikimedia.org/r/161581 [20:47:40] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.328 second response time [20:47:45] <_joe_> !log restarted hhvm on mw1018, cleaning the cache as well [20:47:52] Logged the message, Master [20:49:52] (03PS1) 10Dzahn: include GlobalSign CA on labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161583 [20:49:55] andrewbogott: ^ [20:50:27] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.190 second response time [20:50:31] (03CR) 10jenkins-bot: [V: 04-1] include GlobalSign CA on labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161583 (owner: 10Dzahn) [20:50:38] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.231 second response time [20:50:52] !log restarted HHVM and cleared bytecode cache on all HHVM app servers [20:50:57] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.243 second response time [20:50:58] Logged the message, Master [20:51:03] (03PS2) 10Dzahn: include GlobalSign CA on labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161583 [20:52:22] (03CR) 10Andrew Bogott: [C: 031] include GlobalSign CA on labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161583 (owner: 10Dzahn) [20:55:30] (03PS3) 10Dzahn: include GlobalSign CA on labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161583 [20:55:58] (03CR) 10Dzahn: [C: 032] include GlobalSign CA on labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161583 (owner: 10Dzahn) [20:56:33] (03CR) 10Dzahn: "same as in manifests/role/cache.pp: include certificates::globalsign_ca" [puppet] - 10https://gerrit.wikimedia.org/r/161583 (owner: 10Dzahn) [21:10:35] (03PS1) 10Ori.livneh: Set a 'php' Salt grain, indicating the name of the PHP runtime [puppet] - 10https://gerrit.wikimedia.org/r/161587 [21:21:57] PROBLEM - Certificate expiration on labcontrol2001 is CRITICAL: SSL error: [Errno 185090050] _ssl.c:340: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib [21:25:43] (03PS2) 10Ori.livneh: Set a 'php' Salt grain, indicating the name of the PHP runtime [puppet] 
- 10https://gerrit.wikimedia.org/r/161587 [21:36:45] PROBLEM - HHVM rendering on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.178 second response time [21:36:54] PROBLEM - HHVM rendering on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.171 second response time [21:37:17] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.179 second response time [21:37:33] hm [21:37:39] it's interesting that it happens more or less at the same time [21:37:54] (03PS1) 10Aaron Schulz: Removed redundant config due to new job runner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161590 [21:38:58] (03PS1) 10Dzahn: include GlobalSign CA cert on neon for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/161592 [21:40:07] (03PS2) 10Dzahn: include GlobalSign CA cert on neon for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/161592 [21:41:44] (03PS3) 10Dzahn: include GlobalSign CA cert on neon for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/161592 [21:42:25] (03CR) 10Dzahn: [C: 032] "@neon:/etc/icinga# /usr/lib/nagios/plugins/check_cert labcontrol2001.wikimedia.org 636 GlobalSign_CA.pem" [puppet] - 10https://gerrit.wikimedia.org/r/161592 (owner: 10Dzahn) [21:43:07] akosiaris: Hey so I hear you're the one that set up the sca cluster? [21:43:24] greg-g OK'd Flow deploying a trivial JS fix, is anyone deploying? [21:43:30] I'm supposed to do the puppetization of Citoid because Gabriel doesn't have time to do it, and I have some questions [21:43:46] I'm looking at the mathoid puppetization but I'm not sure which parts should get cleaned up now that we have sca [21:44:24] spagewmf: shouldn't be. All's clear [21:44:40] ori: any reasons spagewmf shouldn't deploy a js fix for flow? [21:44:49] (I can't imagine it'd affect you, but, just checking) [21:45:00] Like, should we still have mathoid.svc.eqiad.wmnet and citoid.svc.eqiad.wmnet and everyotherservice.svc.eqiad.wmnet , or just have one sca.eqiad.wmnet ? [21:45:10] greg-g: nope [21:45:15] greg-g: as in, no reason not to go ahead [21:45:23] * greg-g counts the negatives.... [21:45:25] cool [21:45:25] what happened to /a/common 8-) thx ori [21:46:15] gwicke: Also do you know anything about what I just asked akosiaris? ---^^ [21:46:21] cit.oid.eqiad.wmnet all the oid's :) *jk* [21:46:53] oid.svc.eqiad.wmnet [21:47:14] mutante: actually...... [21:47:17] ;) [21:47:21] We really should have used oid1001 instead of sca1001 and bacronymed OID :P [21:47:51] <^demon|away> Or we could stop calling everything oid. [21:47:53] <^demon|away> ^ my vote [21:48:03] ^demon|away: you're no fun. :P [21:48:05] not sure if parav-oid agrees [21:48:20] ^demon|away: but then you'd be "demonoid" and we'd all be surveilled for piracy [21:48:58] <^demon|away> I guess I should move search.svc.eqiad.wmnet to searchoid.svc.... [21:48:58] dear anthrop-oid [21:49:07] see?! [21:49:53] <^demon|away> Except it's not an -oid and it won't run on sca :) [21:50:58] <^demon|away> oh welloid. [21:51:07] <^demon|away> guess we can't have everythingoid. [21:51:32] can't av-oid it [21:52:47] (03CR) 10Dzahn: "this changed it from" [puppet] - 10https://gerrit.wikimedia.org/r/161592 (owner: 10Dzahn) [21:53:30] where's marktraceur when you need him? 
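On the 'php' Salt grain change merged above: one common way to set a static grain is a Puppet-managed /etc/salt/grains file. Whether the actual patch does it that way or through the minion config isn't shown in the log, so treat this purely as an illustration, including the runtime values.

    # Illustration only: a static grain recording which PHP runtime the host
    # runs, so minions can later be targeted by it.
    $php_runtime = 'hhvm'   # e.g. 'zend' on non-HHVM app servers; value chosen per role

    file { '/etc/salt/grains':
        ensure  => present,
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        content => "php: ${php_runtime}\n",
    }

Once the grain is in place, only HHVM app servers can be targeted with Salt's grain selector, e.g. salt -G 'php:hhvm' test.ping.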
[21:53:32] (03PS1) 10Aaron Schulz: Added refreshLinks to $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161603 [21:54:19] Hi [21:54:28] greg-g: What I can do with you [21:54:31] for you [21:54:39] <^demon|away> mutante: I'm going to come up with a service I can acronym as OPI, then add the -oid suffix. [21:54:45] <^demon|away> So we'll have opiod. [21:54:58] <^demon|away> *opioid. [21:55:02] <^demon|away> Bleh, can't spell [21:55:04] Oh, puns, not fixing things [21:55:31] I'm starting to get annoid at gwicke's naming schemes. [21:55:36] <^demon|away> http://www.scrabblefinder.com/ends-with/oid/ - playing with this. [21:55:39] >.< [21:55:55] <^demon|away> You know what also ends in -oid? [21:55:56] ^demon|away: hehe [21:55:56] <^demon|away> Hemorrhoid [21:57:27] If we ever set up a laser notification system in the office for icinga alerts we can call it PinkFloid [21:57:57] !log spage Synchronized php-1.24wmf21/extensions/Flow/modules/new/components/flow-board.js: Flow bug 71054 backport (duration: 00m 04s) [21:58:04] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.122 second response time [21:58:04] Logged the message, Master [21:58:04] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.178 second response time [21:58:19] zoidberg [21:58:23] marktraceur: you're good witht he 70s rock band puns [21:58:25] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 71508 bytes in 0.180 second response time [21:58:40] see also: LEDzpln [21:58:55] Heh [21:59:30] <^demon|away> greg-g: Doesn't end in -oid, fail. [22:00:16] ^demon|away: it's an irc bot that controls a LED rope-light thingy at a friends house [22:00:31] (03CR) 10Andrew Bogott: [C: 031] Set a 'php' Salt grain, indicating the name of the PHP runtime [puppet] - 10https://gerrit.wikimedia.org/r/161587 (owner: 10Ori.livneh) [22:00:55] <^demon|away> Oh, I thought it was an airship made of heavy metals. [22:01:01] heh [22:01:22] (03PS3) 10Ori.livneh: Set a 'php' Salt grain, indicating the name of the PHP runtime [puppet] - 10https://gerrit.wikimedia.org/r/161587 [22:01:30] (03CR) 10Ori.livneh: [C: 032 V: 032] Set a 'php' Salt grain, indicating the name of the PHP runtime [puppet] - 10https://gerrit.wikimedia.org/r/161587 (owner: 10Ori.livneh) [22:02:01] (03PS1) 10Catrope: Fix typo in mathoid LVS group description [puppet] - 10https://gerrit.wikimedia.org/r/161609 [22:02:35] greg-g: much thanks, Flow fix is deployed. (As always, I'm unsure how to get ResourceLoader to serve the new code beyond "wait a bit".) [22:03:06] spagewmf: wait a bit! [22:03:11] *bits [22:06:53] RoanKattouw: you can add '.scv.eqiad' to the file 'typos' in the repo root, and jenkins will check for it if it ever creeps back in [22:07:10] ori: Oooh cool. Will do that [22:07:31] hashar set that up iirc [22:07:46] (03PS2) 10Catrope: Fix typo in mathoid LVS group description [puppet] - 10https://gerrit.wikimedia.org/r/161609 [22:07:54] I want that in my repositories [22:08:00] Then I can put "urinary operator" in it :P [22:09:42] haha [22:10:49] jenkins stuck? https://integration.wikimedia.org/zuul/ [22:13:05] legoktm: not sure, that huge string on the right in gate-and-submit is probably due to hashar's change he made and announced [22:13:28] but they're all stuck at "queued" [22:13:49] even the test jobs are stuck [22:13:50] yeah... 
[22:13:55] or check-voter [22:14:01] the only jobs running are browser tests: https://integration.wikimedia.org/ci/ [22:14:07] (and scap in beta) [22:14:11] I'll restart Jenkins [22:14:20] It'll wait until those browsertest jobs are done though [22:14:47] * greg-g nods [22:15:21] hah now it is running an oojs-ui-npm job [22:15:57] Actually you know what I'm gonna kill those browsertest jobs because they take such a long time [22:16:26] !log Restarting Jenkins [22:16:31] Logged the message, Mr. Obvious [22:16:44] RoanKattouw: I propose '↷' as the symbol for the urinary operator [22:16:49] x↷y [22:16:56] read: "x urinates on y" [22:17:11] lol [22:18:24] it means that x accepts y as a parameter, and that x superficially appears to be side-effect free but in fact modifies y in subtle ways [22:18:51] Amazing [22:19:17] ori: James_F|Away asks if you Flow-ified the talk page for your BetaFeature [22:19:24] I have no idea what he's talking about but maybe you do [22:19:25] <^demon|away> RoanKattouw: "depreciated" would be a nice one too. [22:19:30] Oh yes [22:19:33] <^demon|away> We basically never depreciate our code :p [22:19:34] RoanKattouw: is that cockney slang? [22:19:45] derpecated [22:19:55] <^demon|away> herp derp. [22:20:19] i haven't flowified it. how do you flowify a talk page? [22:20:23] I don't know [22:20:29] configuration setting [22:20:32] maybe you just have to feel it [22:20:32] <^demon|away> There's a config variable. [22:20:42] <^demon|away> You move the old page (if there's one) to some /Archive or w/e [22:20:45] would it help to flowify it? [22:20:46] <^demon|away> Then enable it in config. [22:20:52] <^demon|away> Before someone edits again :) [22:21:18] Hm... 'SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed ' [22:21:19] And then it's useful to move all your updates to another subject page, because otherwise watchlist is not amused [22:21:22] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=lab [22:21:29] <^demon|away> ori: It's that occupy setting you like :) [22:21:30] what's the config var? [22:21:35] 'flowify' sounds like it would involve glow sticks [22:21:37] oh [22:21:41] <^demon|away> $wgFlowOccupyPage or w/e. [22:21:54] oh right, taht [22:21:55] that [22:21:57] <^demon|away> $wmgFlowOccupyPages [22:22:25] that array is going to get awfully long [22:22:52] <^demon|away> Probably. [22:23:04] James_F|Away: why should I add it? (I'm not opposed, just trying to understand.) [22:23:19] akosiaris: I guess this depends on whether you can have an LVS IP for multiple ports though. The way the config is set up makes it look like you can't [22:23:55] ori: there's a rule that all Beta Features' talk pages are Flowified [22:24:14] I messed up getting all the t's dotted and i's crossed for your beta feature :/ [22:24:26] greg-g: ah, OK. I can take care of it, if you like. [22:24:44] sure, probably easy, especially if there's no content there yet [22:24:59] ori: RSS also needs config tweak https://www.mediawiki.org/wiki/HHVM#Current_work [22:25:34] Nemo_bis: ack [22:25:39] greg-g: what's the reason for the rule? [22:25:46] <^demon|away> While we're on the subject...is there some view in Flow that allows you to really hide topics that are hidden/deleted. [22:25:54] ori: consistency and dogfooding [22:25:59] <^demon|away> It makes it basically impossible to follow along on a page that gets spammed. 
[22:26:06] greg-g: it's a bit strange, no -- presumably one reason to go to a beta feature's talk page is to complain that it isn't working well for you [22:26:16] so it's strange to force another beta feature on that [22:26:39] flow's not a "beta feature" just "in-development" [22:26:44] <^demon|away> I'm not a huge fan of the rule. I like Flow, but I think it's turned Talk:Search into a tad of a mess. [22:26:47] but I understand [22:27:06] it seems to me that an interface for assessing a proposed change to the software should itself be kept stable and free of such changes [22:27:50] or, to make sure you get quality feedback from all users (including new) you make it as easy as possible to give it? [22:28:11] well, ok, but it's "in-development" [22:28:13] it was a product decision that I'm OK with [22:28:33] i'll just make the change; i'm just a bit surprised. [22:28:37] * greg-g nods [22:28:58] i like flow, fwiw, and am optimistic about its future [22:29:26] sorry, I failed to follow: https://www.mediawiki.org/wiki/Beta_Features#Creating_Your_Own [22:30:18] <^demon|away> Maybe I can de-flowify talk:search after it's no longer a beta feature? [22:30:39] greg-g: nah, it's my oversight. [22:31:06] ori: I'm the one who's supposed to be walking around with a clipboard, not you ;) [22:31:24] greg-g: :-) [22:31:33] * James_F blames himself. [22:31:41] BLAME EVERYONE! [22:31:52] WFM. [22:32:11] Also, I'm British. It's my national ur-type to apologise for everything. [22:33:23] <^demon|away> James_F, greg-g: Nope, Jimmy's fault according to BLAMEWHEEL [22:35:06] * James_F grins. [22:36:25] marktraceur: latest name is 'restbase', so clean break right there [22:38:29] RoanKattouw: I'm not sure what your context was, but you can have LVS do different things for different ports for the same service IP [22:38:51] How do I do that in puppet? [22:38:53] That's how SSL currently operates for our primary prod traffic in eqiad. same front-end IP, 443 goes to different hosts than port 80. [22:39:13] The context is that there's mathod.svc.eqiad.wmnet, and I'm creating citoid.svc.eqiad.wmnet, but both will point to the same pool of boxes just a different port [22:39:41] oh that's exactly the opposite of what I thought you meant. [22:40:08] What did you think I meant? [22:40:26] svc.eqiad pointing to two different boxes depending on port [22:40:28] on the front edge of things (whatever is trying to reach (math|cit)oid.svc), are they both port 80 services? [22:41:20] or is the "user" of (math|cit)oid.svc using alternative port numbers for those services? [22:43:57] I'm kind of assuming the answer is "yes, they both need to be reachable by the service consumer on their respective port 80". In which case they need two separate service IPs in LVS. [22:47:07] for that matter, I'm not sure that LVS will rewrite the port number, either. [22:48:08] (so you couldn't share the backend pool just because the port number differentiates there) [22:48:23] yeah the way we implement it, you can't [22:49:04] RoanKattouw: ^ ? [22:51:14] now I'm reviewing the current mathoid and ocg setups, it makes me wonder why they have odd port numbers to begin with. [22:51:53] I mean, it's ok if it works (if the consumers already know the magic port number to use with that service address), but why not use 80 and make it default? [23:03:10] (03PS1) 10Aklapper: Adjust list of uninstalled Phabricator applications. 
[puppet] - 10https://gerrit.wikimedia.org/r/161624 [23:05:18] (03CR) 10Aklapper: "I am entirely clueless if the format "PhabricatorApplicationDiviner" is correct or whether this need to be "PhabricatorDivinerApplication"" [puppet] - 10https://gerrit.wikimedia.org/r/161624 (owner: 10Aklapper) [23:12:39] bblack: They can easily be both non-port-80 services [23:13:29] yeah I see that now, but I'm wondering why you'd want that. unless the goal was explicitly to share backend pool hosts between services. [23:13:47] e.g. in the ocg case, only ocg lives on the ocg backend boxes. why port 8000 when 80 would work fine? [23:15:09] in any case, assuming you're happy with oddball port numbers, and mathoid+citoid use the same backend pool on those different port numbers, you can set it up in puppet by defining a second LVS service stanza with a different port number, but the same everything else [23:15:20] and just CNAME citoid.svc.eqiad.wmnet -> mathoid.svc.eqiad.wmnet [23:15:34] (or give them separate service IPs, either way) [23:15:40] Aaaah OK [23:15:42] Nice [23:15:56] Yeah I was going to say, I'd be happy for them to both be non-80 [23:16:09] I just don't necessarily see the point of setting up identical but duplicative LVS groups [23:16:27] Because it's two services now, but what happens when we have more? [23:16:35] well the only reason anything would be required to be non-port-80 is for the express support of sharing backend hosts between two distinct LVS services. [23:16:56] Id' say e.g. in the ocg case this isn't the case, so port 80 would have been less-confusing [23:17:01] Coren: I believe we have job de-duplication. [23:17:08] Coren: So a series of successive edits shouldn't really matter... [23:18:28] bblack: What's weird about ocg? [23:19:13] ocg is the only service on that frontend IP (ocg.svc.eqiad.wmnet), and it's the only service using its pool of backend machines. Nothing else is using port 80 there. So it would make sense to use port 80, as that's the default port for HTTP-based things anyways. [23:19:24] it seems odd it's using 8000 instead for no apparent reason. [23:20:38] someday someone will want to use your service API from some new language and be working from scratch and have to remember "The location of this service isn't the obvious http://ocg.svc.eqiad.wmnet, it's http://ocg.svc.eqiad.wmnet:8000" [23:21:04] bblack: it'll all be nicely proxied from 80 Real Soon Now [23:21:22] but other than that those are all purely internal services [23:21:48] not using port 80 avoids the need to start the daemon as root [23:21:59] and do all the privilege dropping dance [23:22:04] yeah but if we didn't care to be descriptive to make developers' lives easier, we could've just named them svc0x5ef67a.wmnet or whatever. I'm just saying, use defaults where applicable. [23:22:27] bblack: developers shouldn't have to care about the backend [23:23:12] the URL is location of the API, not a library written in one language that wraps said URL [23:23:18] that's at least the idea behind https://www.mediawiki.org/wiki/Requests_for_comment/PHP_Virtual_REST_Service and https://github.com/gwicke/restbase [23:23:53] gwicke: What are your future plans for proxying there? 
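A schematic of "a second LVS service stanza with a different port number, but the same everything else" from the exchange above: same front-end IP, same backend pool, different port, and citoid.svc.eqiad.wmnet simply a CNAME to mathoid.svc.eqiad.wmnet on the DNS side (or its own service IP, either way). The hash layout, field names, ports, and IP below are simplified placeholders, not the real lvs::configuration structure.

    # Schematic only: two LVS services sharing a front-end IP and backend pool.
    $sca_backends = ['sca1001.eqiad.wmnet', 'sca1002.eqiad.wmnet']

    $lvs_services = {
        'mathoid' => {
            'description' => 'Mathoid, mathoid.svc.eqiad.wmnet',
            'ip'          => '10.2.2.99',   # placeholder service IP
            'port'        => 10042,         # placeholder port
            'hosts'       => $sca_backends,
        },
        'citoid'  => {
            'description' => 'Citoid, citoid.svc.eqiad.wmnet (CNAME to mathoid.svc in DNS)',
            'ip'          => '10.2.2.99',   # same front-end IP...
            'port'        => 1970,          # ...different port, hence its own rules on the LVS hosts
            'hosts'       => $sca_backends,
        },
    }

Each entry still ends up as its own set of ipvsadm rules on the LVS hosts, which is the duplication being weighed in the conversation.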
[23:23:59] ok [23:24:15] restbase is basically two things: 1) a proxy, and 2) a storage service [23:24:21] Because right now I'm just copying the mathoid stuff around but that doesn't seem like something we should do every single time [23:24:28] gwicke: so all of our PHP REST APIs would funnel through a single IP address in that plan and differentiate on URL path or Host header or something? [23:24:41] bblack: single external REST API [23:25:00] for example, you ask for HTML, which normally means that you get it straight from storage [23:25:16] if that's not found, the restbase handler calls parsoid to generate it [23:25:22] well yes but /wiki/ and /math/ will hit different code, so the service IP or hostname is no longer the differentiator, right? [23:25:41] it's path based [23:25:49] so yes [23:25:57] so all such services would just live at e.g. api.svc.eqiad.wmnet, or whatever [23:25:58] possibly host name if we like [23:26:07] *nod* [23:27:52] hmm the virtual rest service moves service IPs down into itself though. I'm not super-enthused about that, as they're low-level details we'd normally find in puppet [23:28:35] I guess it could source that config from stuff deployed via puppet, or mediawiki-config (which already has service IP stuff in it too), or something [23:29:17] well, or use hostnames there and assign the IPs in DNS [23:29:21] bblack: it could just as well send most of that to restbase [23:29:39] the point is that it's a config issue, not something a developer needs to deal with directly [23:30:09] /services/foo can map to anything in the backend [23:30:28] it might as well be local code [23:30:44] yeah if it's local code that's fine. [23:31:38] I worry a little if we end up saying, for example, that /services/ocg maps over to ocg100[123].eqiad.wmnet IP addrs directly in that config somewhere that it's not obvious when looking at this stuff from an ops perspective. It would be nice to source that config from somewhere opsy. [23:32:15] bblack: agreed [23:32:18] or people will renumber networks and refactor infrastructure without realizing what they're breaking [23:32:24] mediawiki-config? [23:32:30] and/or puppet [23:32:37] yeah maybe, since we already know to double-check mediawiki-config for such things [23:33:18] we should definitely avoid to manually pass around host names and such in every bit of code using a service [23:34:37] bblack: completely unrelated question -- what's the latest on ESI in Varnish? [23:34:50] is that still unstable with gzip? [23:35:27] yeah, unstable in general I think, not just gzip [23:35:34] but definitely with gzip [23:35:59] I haven't yet built varnish4, that's the next time we'll try it out and see if the situation is improved basically [23:36:02] (03CR) 10Ori.livneh: [C: 032] "no-op." [puppet] - 10https://gerrit.wikimedia.org/r/161581 (owner: 10Ori.livneh) [23:36:11] (well, varnish4 + whatever hacks we need to make it do all the things our varnish3 does) [23:36:18] *nod* [23:36:33] how far out do you see that roughly? [23:36:57] uhm I think mark said it was being done this quarter in general? [23:37:08] ahh, nice [23:37:29] I haven't revisited varnish seriously in a while, but if that's in his quarterly plan, than yeah I need to make time for it somehow. 
[23:37:59] the reason I'm asking is that we have this longer-term idea of including static page content in a page view [23:38:27] basically the entire content area [23:39:00] could do that in PHP too, but Varnish could be nicer for purging [23:39:22] yeah [23:39:33] ESI in general is powerful and attractive, if it's reliable. [23:39:35] as the chrome will still change with user rights etc for now [23:40:21] k, I'll follow your Varnish 4 work [23:40:24] thanks! [23:40:48] RoanKattouw: rewinding to a much earlier unanswered question: when it's 3 or 4 or 300 services sharing backend hosts with separate ports, it'll be that many LVS service definitions. You can factor out the commonality with variables and defines though. [23:41:27] after all the munging by various templates and tools, it comes down to each service port needs its own separate set of ipvsadm rules on the LVS hosts. [23:42:06] bblack: another option could be to actually use different internal ip ranges for each service [23:42:25] that might be easier to map to containers in the future [23:42:40] Right [23:42:52] For now my strategy is to just duplicate everything and see how much I get yelled at [23:43:00] ok [23:43:31] for now since they're on separate ports and such anyways, you might get yelled at less if you just re-use the front IP and CNAME citoid to mathoid in DNS, instead of allocating another separate IP. [23:54:40] (03PS1) 10Dzahn: fix SSL cert monitoring on LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161631 [23:56:15] (03PS2) 10Dzahn: fix SSL cert monitoring on LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/161631 [23:59:13] (03CR) 10Dzahn: [C: 032] "fixing https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=expiration without touching monitoring of certs on nginx clust" [puppet] - 10https://gerrit.wikimedia.org/r/161631 (owner: 10Dzahn)
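On "you can factor out the commonality with variables and defines" from the LVS thread above, a minimal sketch of collapsing the near-identical stanzas. The wrapper name, fields, and values are made up for illustration, and the body only notice()s what the real tree would feed into the LVS/pybal configuration.

    # Sketch: each extra service on the shared sca pool only states its port.
    define sca_lvs_service (
        $port,
        $ip    = '10.2.2.99',   # shared placeholder front-end IP
        $hosts = ['sca1001.eqiad.wmnet', 'sca1002.eqiad.wmnet'],
    ) {
        # Placeholder body: in the real tree this would feed lvs::configuration
        # (and ultimately pybal/ipvsadm) rather than just logging.
        notice("LVS service ${title}: ${ip}:${port}")
    }

    sca_lvs_service { 'mathoid': port => 10042 }
    sca_lvs_service { 'citoid':  port => 1970 }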