[00:09:27] (03PS1) 10Yuvipanda: dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 [00:15:46] hey, can someone look up for me the version of jetty on one of solr* hosts, please? [00:23:22] paravoid: heads up; we're launching another onslaught [00:25:06] mwalker: did you find the cause for the 4xx? [00:25:52] no; I think it's in a banner -- but I cant prove that -- I'm guessing the 4xx will return; but I sort of want to make sure [00:26:09] It could have been in that particular banner; or it could be in our backend workflow [00:26:10] (03PS2) 10Yuvipanda: dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 [00:27:03] okay, I'll get a sample and try to find outliers [00:27:03] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 (owner: 10Yuvipanda) [00:27:41] Is it just me, or does https://test.wikipedia.org/wiki/Main_Page not have any CSS? [00:28:27] not just you [00:29:18] paravoid: where is the tap for the 4xx plot? [00:29:28] is it all cluster wikis? or everything served through varnish? [00:29:54] the latter [00:30:09] (03PS3) 10Yuvipanda: dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 [00:31:03] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 (owner: 10Yuvipanda) [00:33:20] (03PS4) 10Yuvipanda: dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 [00:34:19] paravoid: I'm not seeing it happen again; so it might have just been that banner... do you know what the unit on the Y axis is? count / sec? count / min? [00:35:12] 4xx/min [00:35:44] what you mean by "that banner"? 
[00:36:57] the centralnotice banner / object we were serving at the time [00:36:59] it's variable [00:37:10] (03PS5) 10Yuvipanda: dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 [00:37:11] Coren: can you merge? ^ [00:37:15] Coren: tested! [00:38:42] anyone else to give me a +2 on a labs specific patch that's not being used anywhere yet? [00:41:39] (03PS1) 10Springle: depool slaves for packages upgrade, including mariadb 5.5.34 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97656 [00:42:06] (03CR) 10Springle: [C: 032] depool slaves for packages upgrade, including mariadb 5.5.34 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97656 (owner: 10Springle) [00:43:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [00:43:24] !log springle synchronized wmf-config/db-eqiad.php 'depool slaves for packages upgrade' [00:43:40] Logged the message, Master [00:45:36] (03PS6) 10Yuvipanda: dynamicproxy: Make 404 fallback servers configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 [00:45:38] * yuvipanda_ gently pokebegs ori-l [00:46:33] I don't know anything about the setup [00:47:26] it isn't really running anywhere, and I tested it, but I'll just wait up for one of the labs folks to show up tomorrow I guess [00:47:32] thanks anyway ori-l [00:47:53] let me read the patch [00:47:56] * yuvipanda_ puts a pox on the bigass one-size-fits-all puppet-repo [00:48:13] ori-l: no trailing commas in this one :) [00:48:18] ori-l: and thanks :) [00:49:09] it looks safe to merge, but I think you're better off waiting. I don't have the attention span to meaningfully review it, and it looks substantial enough that it could very plausibly be improved by having someone with some context look at it [00:49:22] * paravoid concurs fwiw [00:49:40] alright :) [00:49:55] working... slightly odd hours with the puppet repo is just frustrating. 
[00:50:32] the most recent patchset is 5 minutes old [00:50:44] ori-l: I edited the commit message [00:51:05] ok, 40 minutes old :P [00:51:27] ori-l: okay, maybe I'm being whinier than I should be :P [00:56:24] (03PS5) 10Dzahn: bugzilla module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 [01:00:38] ori-l: btw, in your nginx patch I think you mentioned that you tested the patch on vagrant. [01:00:51] ori-l: did you just clone the module into vagrant and test it out? [01:01:04] s/clone/cp/ [01:01:07] PROBLEM - Puppet freshness on db1034 is CRITICAL: No successful Puppet run for 3d 15h 42m 42s [01:01:09] yes [01:01:32] ori-l: ah, okay. [01:01:41] nice that that was modular enough :) [01:02:18] ori-l: should perhaps split it out to its own module, when you have the time (might also make the license situation less murky) :) [01:02:58] (03CR) 10coren: [C: 032] "Yuvi says it works. If it breaks, blame him. :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97655 (owner: 10Yuvipanda) [01:03:16] thanks Coren :D [01:03:28] PROBLEM - mysqld processes on db1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:04:07] oops, should have silenced [01:07:57] (03PS6) 10Dzahn: bugzilla module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 [01:09:03] (03CR) 10Dzahn: "fixed issues found in labs testing and some comments by Alex" (039 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [01:11:27] (03CR) 10Dzahn: "see inline comments on PS4, many comments have been addresses but not all. 
though now i can say as long as you comment the require "passwo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [01:15:58] PROBLEM - Puppet freshness on db1034 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [01:16:38] RECOVERY - Puppet freshness on db1034 is OK: puppet ran at Tue Nov 26 01:16:32 UTC 2013 [01:16:58] PROBLEM - Puppet freshness on db1034 is CRITICAL: No successful Puppet run for 0d 0h 0m 25s [01:17:58] PROBLEM - Puppet freshness on db1034 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [01:19:38] RECOVERY - Puppet freshness on db1034 is OK: puppet ran at Tue Nov 26 01:19:27 UTC 2013 [01:20:56] (03PS1) 10Springle: warm up slaves after package upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97661 [01:21:24] (03CR) 10Springle: [C: 032] warm up slaves after package upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97661 (owner: 10Springle) [01:21:59] hey paravoid, how would i disable puppet freshness checks for professor? [01:22:12] (03CR) 10Dzahn: bugzilla module - WIP (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [01:22:17] it's what you and akos were suggesting initially, i think it was right in hindsight [01:22:30] !log springle synchronized wmf-config/db-eqiad.php 'warm up slaves after package upgrade' [01:22:45] Logged the message, Master [01:22:45] oh, it's 3:20 AM for you. don't answer:P [01:23:14] ori-l: disable in what way? really not known by icinga or just not reporting anywhere besides web ui [01:23:42] not reporting anywhere besides web ui [01:24:02] ori-l: got icinga-admin login? [01:24:20] no. 
i can look in the usual places [01:24:32] https://icinga-admin.wikimedia.org/icinga try it [01:24:52] oh, that worked [01:25:01] :) [01:25:17] sweet, and it's already disabled [01:25:20] so, you can either do 'scheduled downtime' or just "disable notifications" [01:25:22] or both [01:25:30] so probably pvoid or akos had the good sense to just go ahead and do it [01:25:37] and you can do so for the host and /or all the services on it [01:25:51] it should give you a checkbox for that if you from the host overview page [01:26:05] and you then put an RT ticket in the comment field and send :P) [01:26:08] do what? [01:26:31] * AaronSchulz wonders why loginwiki is on s3 instead of s7 [01:26:59] now it's just 2 SPOFs (loginwiki/centralauth) ;) [01:27:21] experience has shown that if we lose a shard, we lose everything [01:27:41] so it might matter less in practice [01:27:46] not that I disagree [01:28:15] and it'd be nice to not lose appservers altogether when a shard gets overloaded [01:41:51] marktraceur: when you were doing your upload wizard stuff earlier; what exactly were you doing? [01:42:12] ... or more precisely; was it something that would've caused a bunch of HTTP 4xx's? [01:45:16] (03CR) 10GWicke: "Alternative approach (gzip support in Parsoid) is implemented in https://gerrit.wikimedia.org/r/#/c/97652/. It looks like that works, but " [operations/puppet] - 10https://gerrit.wikimedia.org/r/97647 (owner: 10GWicke) [01:45:31] mwalker: It, um...probably shouldn't have? [01:45:42] mwalker: Do you have logs of what the 4xxs are trying to reach? [01:46:01] Wikimedia Errors? [01:46:04] Getting HTTP 503 [01:46:07] paravoid: ^ [01:46:09] Wuuuuh oh. [01:46:37] Krenair: on what site? enwiki seems to be fine [01:46:52] enwiki... it's fine now. 
[01:48:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [01:50:24] eh, WTF is https://nah.wikipedia.org/wiki/Tehtlahtol - it seems to be causing a lion's share of PHP timeouts [01:51:58] (03PS1) 10Springle: slaves to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97663 [01:55:43] (03PS1) 10Ori.livneh: Configure gdash to be served by nginx::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/97664 [01:56:05] (03CR) 10Springle: [C: 032] slaves to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97663 (owner: 10Springle) [01:56:19] (03PS1) 10Tim Starling: wikipedia.com.il -> wikipedia.co.il [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97665 [01:56:58] !log springle synchronized wmf-config/db-eqiad.php 'slaves to full steam' [01:57:11] (03CR) 10Ori.livneh: [C: 032] Configure gdash to be served by nginx::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/97664 (owner: 10Ori.livneh) [01:57:22] Logged the message, Master [01:57:46] springle: did you see the conversation between me and ori yesterday about the pc1001 overload? [01:57:58] TimStarling: no [01:58:03] channel? 
[01:58:05] here [01:58:27] I was intending to summarize it and reply on ops@, but I haven't gotten to it yet [01:58:32] I can grab the log for you, hang on [01:59:02] the TL;DR is that it didn't seem like MediaWiki was running a higher-than-normal rate of queries on that database host [01:59:29] there was no increase in replace, delete or select query rate that we could see, in the ganglia data [01:59:53] in fact there was a reduction in query rate as the server slowed down [02:00:32] the bit that confused me was the spike in mysql_bytes_received [02:00:40] yeah, that was a metric bug [02:00:56] same as the spikes in all the other mysql_* metrics [02:01:08] it only affects a single sample, when the server restarts [02:01:08] oh [02:01:45] someone should probably fix that [02:01:53] if a counter resets to 0, gmetad assumes it was due to overflow [02:02:05] gmetad is not at fault [02:02:32] oh, i see -- it's the gmond module [02:02:33] ganglia gives you a few different ways to deal with such issues [02:02:43] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:02:54] it's the fault of the metric script for not implementing one of them [02:03:47] OK, I'll file a bug for it and assign it to myself [02:05:14] I love that there's a pc1001 and a cp1001 [02:05:24] (03CR) 10Tim Starling: [C: 032 V: 032] Generate redirects.conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [02:05:38] given that, i'm tempted to reduce pc100[123] max_connections. plus aaron's SqlBagOStuff commit last week to reduce txn times would presumably help [02:06:05] ori-l: what does cp stand for anyway? [02:06:14] it was fairly obvious when we called them sq* [02:06:48] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201387) [02:06:52] you mean, the ambiguity was confined to 'squid' and 'sql'? 
:) [02:07:05] (03CR) 10Tim Starling: [C: 032 V: 032] wikipedia.com.il -> wikipedia.co.il [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97665 (owner: 10Tim Starling) [02:07:48] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:07:49] I mean, given the function of the server, it was easy to guess where the name came from [02:08:26] (03PS1) 10Ori.livneh: Fix template file path in Gdash role [operations/puppet] - 10https://gerrit.wikimedia.org/r/97667 [02:10:34] where are the parsercache dbs configured at the mediawiki level? not db-eqiad.php [02:11:30] $pcServers in CommonSettings.php [02:13:04] thanks [02:13:18] load is balanced via consistent hashing of keys [02:13:52] but the slow query log on pc1001 shows a wide distribution of keys [02:14:25] atm pc100[123] only replicate to the old pmtpa boxes. is there a need for an eqiad slave each? if only for fail over [02:14:54] definitely [02:15:38] $pcTemplate says 'type' => 'mysql' . is there another parsercache type? [02:16:08] i mean, besides gwicke's cassandra ideas [02:16:11] !log LocalisationUpdate completed (1.23wmf4) at Tue Nov 26 02:16:11 UTC 2013 [02:16:26] Logged the message, Master [02:17:20] AFAIK any of the classes implementing the object cache interface can be used; current this includes redis, memcached, APC, a few other things [02:17:22] that type is the $type parameter to DatabaseBase::factory(), i.e. the DBMS [02:17:39] right [02:17:40] so it can be pgsql, sqlite, oracle, etc. [02:18:31] but you see that there is also a class option which is set to SqlBagOStuff [02:18:58] that can instead be RedisBagOStuff etc. to get non-SQL storage [02:19:55] the last item from the investigation yesterday is that cpu_wio climbed to ~5% starting on the 21st, which could correspond to 1 of 24 cores being maxed; I didn't see anything unusual around that time in syslog. 
It doesn't correlate with a cron job or a Puppet config change [02:20:46] it could have been a slow query, or many slow queries [02:20:58] they would only be logged in the slow query log if they finished before the server was restarted [02:22:17] That seems like a bug. [02:22:51] heh, good luck with that [02:23:23] The query has to finish successfully to be logged as a slow query? [02:23:25] oh, maybe it's not clear, we are talking about the slow query log which is part of MySQL [02:23:29] Isn't there a query slayer? [02:23:40] yes, springle [02:24:13] !log LocalisationUpdate completed (1.23wmf5) at Tue Nov 26 02:24:12 UTC 2013 [02:24:18] presumably you would be laughed at if you asked Oracle to ensure queries running at the time of a server kill were entered into the slow query log [02:24:28] Logged the message, Master [02:24:30] The server kill I understand. [02:24:36] But it seems reasonable to log long-running queries. [02:24:48] Especially if they don't get logged if they're killed by something other than a restart. [02:24:51] I don't know MySQL internals well enough if they'd be in the binlog [02:24:52] Which was really my question, I think. [02:25:08] !log deploying new generated redirects.conf [02:25:12] there was nothing in pc1001 processlist runing longer than the 1000+ SqlBagOStuff txns siting waiting to COMMIT. the other 4000 or so were running slow but as a result of whatever allowed all those writers to pile up [02:25:21] Logged the message, Master [02:25:39] innodb was fighting to get undo slots and unable to get ahead again [02:26:20] hence wanting to decrease max_connections. normal traffic is only hundreds concurrent, so it should be safe enough [02:26:49] maybe domas has an idea, he's online in #mediawiki_security [02:28:58] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 3d 0h 10m 56s [02:30:32] He's going to recommend using mysql-facebook. 
;-) [02:32:32] pc1001 still is the facebook branch [02:32:36] although the 5.1 one [02:33:02] Ah, nice. [02:33:09] Facebook: pwn3d [02:45:34] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15977 MB (3% inode=99%): [03:10:04] (03PS3) 10Tim Starling: Normalise the path part of URLs in the text frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 [03:12:49] !cp is caching proxy (squid or varnish) [03:12:49] Key was added [03:13:39] (03CR) 10Tim Starling: [C: 032] Normalise the path part of URLs in the text frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [03:14:08] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 26 03:14:08 UTC 2013 [03:14:24] Logged the message, Master [03:16:42] !pc is parser cache (mysql) [03:16:42] Key was added [03:18:31] RECOVERY - Disk space on cerium is OK: DISK OK [03:19:51] <^d> !es is external store (and sometimes elasticsearch, but we try not to) [03:19:51] Key was added [03:19:54] <^d> mutante: ^ [03:20:32] :) [03:21:03] !mw [03:21:06] !mw is what we never call mw but appserver [03:21:06] Key was added [03:21:32] <^d> But MW is MediaWiki ;-) [03:21:59] !mw is mediawiki but we usually call them appserver [03:22:00] This key already exist - remove it, if you want to change it [03:22:09] !mw del [03:22:09] Successfully removed mw [03:22:12] !mw is mediawiki but we usually call them appserver [03:22:12] Key was added [03:23:51] !osm is open street map [03:23:51] Key was added [03:24:01] !tmh is timed media handler [03:24:01] Key was added [03:24:39] !srv are the old appservers, now "mw" [03:25:21] !srv is old app servers, now "mw" [03:25:22] Key was added [03:29:30] !log deploying varnish normalize_path (I687f6773) [03:29:44] Logged the message, Master [03:38:59] (03PS1) 10Springle: Reduce max_connections for parsercache dbs. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/97671 [03:40:18] (03CR) 10Springle: [C: 032] Reduce max_connections for parsercache dbs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97671 (owner: 10Springle) [03:47:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [03:48:27] (03PS1) 10Springle: Add mysqld process icinga check. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97672 [03:49:57] (03CR) 10Springle: [C: 032] Add mysqld process icinga check. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97672 (owner: 10Springle) [04:04:19] (03PS1) 10Springle: depool slaves for package upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97674 [04:04:54] (03CR) 10Springle: [C: 032] depool slaves for package upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97674 (owner: 10Springle) [04:05:38] !log springle synchronized wmf-config/db-eqiad.php 'depool slaves for package upgrade' [04:05:52] Logged the message, Master [04:13:09] (03PS1) 10MZMcBride: Create "Draft" namespace on the English Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 [04:15:13] (03CR) 10Dzahn: role and module structure for ishmael (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [04:29:38] (03CR) 10Dzahn: "since this broke once before i'd really appreciate if somebody else could take a look here and confirm" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91209 (owner: 10Reedy) [04:32:53] (03CR) 10Dzahn: "if anyone has good or new ideas here.. 
please go ahead, i'm currently not sure what to do with it" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/65443 (owner: 10Dzahn) [04:34:32] (03CR) 10Dzahn: "this would move dsh to a module, should be ok, but we don't seem to use that monitoring check (yet) somebody started and how much longer a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [04:35:21] (03CR) 10Dzahn: "not needed anymore but a reminder for me to change the update method to use curl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95317 (owner: 10Dzahn) [04:36:37] (03CR) 10Dzahn: "the related RT #5928 is stalled. JeremyB? odder? wanna pick it up?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/88705 (owner: 10Dzahn) [04:38:08] (03CR) 10Dzahn: "thanks for the info, so.. hmm.. what do we want regarding those NS changes then?" [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [04:41:22] mutante: you have mail... [04:42:28] (03CR) 10Dzahn: "hrmm.. meanwhile it's just" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96935 (owner: 10Dzahn) [04:44:58] (03CR) 10Dzahn: "so did i poossibly some too fast then that work again meanwhile? wanna just amend this change?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96935 (owner: 10Dzahn) [04:45:01] jeremyb: got it, thanks [04:45:03] and good night [04:47:11] you too :) [04:49:36] PROBLEM - Puppet freshness on tungsten is CRITICAL: No successful Puppet run for 3d 19h 31m 10s [04:50:34] (03PS1) 10Springle: depool the correct slave this time! [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97677 [04:51:01] (03CR) 10Springle: [C: 032] depool the correct slave this time! 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97677 (owner: 10Springle) [04:51:45] !log springle synchronized wmf-config/db-eqiad.php 'depool slaves for package upgrade' [04:52:00] Logged the message, Master [05:03:55] (03CR) 10Jeremyb: "waiting on legal, repoked them" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/88705 (owner: 10Dzahn) [05:11:09] (03CR) 10Swalling: [C: 04-1] "Like I said on the bug, I think there are unanswered questions regarding functionality and permissions for this namespace that are (for no" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [05:20:32] (03CR) 10Ori.livneh: [C: 032] Fix template file path in Gdash role [operations/puppet] - 10https://gerrit.wikimedia.org/r/97667 (owner: 10Ori.livneh) [05:22:08] RECOVERY - Puppet freshness on tungsten is OK: puppet ran at Tue Nov 26 05:22:00 UTC 2013 [05:29:51] (03PS1) 10Ori.livneh: tungsten: serve gdash on gdash.wm.o vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97678 [05:29:58] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 3d 3h 11m 57s [05:31:31] (03PS1) 10Springle: warm up slaves after package upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97679 [05:31:53] (03CR) 10Springle: [C: 032] warm up slaves after package upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97679 (owner: 10Springle) [05:32:38] !log springle synchronized wmf-config/db-eqiad.php 'warm up slaves after package upgrade' [05:32:54] Logged the message, Master [05:34:22] (03CR) 10Ori.livneh: "err: Failed to apply catalog: Could not find dependency File[/etc/init/ocg-collection.conf] for Service[ocg-collection] at /etc/puppet/man" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96811 (owner: 10Mwalker) [05:35:11] thanks ori-l -- the plan is to fix that tomorrow [05:36:07] (03PS1) 10Ori.livneh: role::ocg: fix typo in config file name 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/97680 [05:36:10] carpe diem [05:37:09] (03CR) 10Ori.livneh: [C: 032] tungsten: serve gdash on gdash.wm.o vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97678 (owner: 10Ori.livneh) [05:37:19] (03CR) 10Ori.livneh: [C: 032] role::ocg: fix typo in config file name [operations/puppet] - 10https://gerrit.wikimedia.org/r/97680 (owner: 10Ori.livneh) [05:38:05] that works too [05:38:06] :) [05:38:53] you wouldn't also happen to have the power to create git repositories too would you? [05:39:03] mutante: 5343 could use an update too (seems to now be available for anyone to register) but I don't want to say anything while the squatter's still listed on CC [05:39:05] RECOVERY - Puppet freshness on rhodium is OK: puppet ran at Tue Nov 26 05:38:57 UTC 2013 [05:40:13] mwalker: what's the repo? [05:40:54] operations/ocg-config [05:42:02] oh [05:42:08] i'd prefer to leave that to jeff [05:42:22] if i create it with the default ACLs, you won't be able to push to it anyway [05:42:39] jeff doesn't know how to do it / may not have privs [05:42:40] if i give you push rights, I'd be overstepping what I'm supposed to do [05:42:50] oh [05:43:02] if you at least create it; I can get jeff to give me privs [05:43:26] or; we can leave it as an operations only +2 repo [05:43:32] something to hash out tomorrow [05:43:52] https://dpaste.de/33mU/raw [05:44:01] specifically err: /Stage[main]/Role::Ocg::Collection/Service[ocg-collection]/ensure: change from stopped to running failed: Could not find init script for 'ocg-collection' [05:45:01] luckily I do have shell on rhodium! 
[05:45:04] * mwalker rummages [05:45:51] I wonder if it needs a /etc/init.d/ocg-collection script as well as an upstart config [05:46:14] no, it doesn't [05:46:27] Upstart-job does that, but it's not necessary [05:46:29] hurm; initctl list shows the service [05:46:37] it needs provider => upstart, [05:46:44] ahhh [05:46:46] on the service resource [05:47:00] what should the short description of the repo be? [05:47:18] it's usually a single sentence [05:47:25] Configuration files for Offline Content Generation servers? [05:47:37] Haughty Capital Letters [05:47:43] :) [05:47:51] it is the name of the role [05:48:10] could mention something about PDF rendering and mathoid [05:48:32] so, Configuration files for offline content generation servers (pdf rendering, mathoid) [05:49:33] https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/ocg-config [05:49:35] oh [05:49:41] I already went with the Haughty Capital Letters [05:50:34] Jeff_Green: fwiw, I ran: ssh -p 29418 gerrit gerrit create-project --require-change-id --owner=opssoftware --parent=operations/software --description='"Configuration files for Offline Content Generation servers"' operations/software/ocg-config [05:51:31] is there an existing gerrit group for ocg stuff? [05:52:26] we were just using extension-collection so far [05:52:40] ori-l: i don't know [05:52:49] morning [05:52:50] Jeff_Green: what are you still doing up! 
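[editor's note] The "Could not find init script for 'ocg-collection'" error above goes away, per ori-l's suggestion, once the service resource names Upstart as its provider. A minimal sketch of what that might look like follows. The resource titles come from the error messages quoted in the log; the `source` path is a hypothetical placeholder, and this is not the content of the actual change (97681).

```puppet
# Sketch only: the 'source' path below is an assumption, not taken from the log.
file { '/etc/init/ocg-collection.conf':
    ensure => present,
    source => 'puppet:///modules/ocg/ocg-collection.conf',  # hypothetical location
}

service { 'ocg-collection':
    ensure   => running,
    # Without an explicit provider, Puppet looked for an init script
    # and failed; Upstart jobs live in /etc/init/*.conf instead.
    provider => upstart,
    require  => File['/etc/init/ocg-collection.conf'],
}
```

Declaring the file resource also addresses the earlier "Could not find dependency File[/etc/init/ocg-collection.conf]" failure, since the `require` then points at a resource that exists in the catalog.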
[05:53:04] Jeff_Green: sorry, I didn't mean to ping; I figured you'd get it tomorrow sometime [05:53:06] trying to detangle the fundraising mysql privs snafu [05:53:18] morning pv [05:53:25] oh :'( [05:58:00] (03PS1) 10Mwalker: Stating that OCG is an upstart job and setting config repo [operations/puppet] - 10https://gerrit.wikimedia.org/r/97681 [05:58:29] (03PS1) 10Springle: slaves to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97682 [05:58:55] (03CR) 10Springle: [C: 032] slaves to full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97682 (owner: 10Springle) [05:59:45] !log applying workaround for Ganglia XSS https://github.com/ganglia/ganglia-web/issues/218 [05:59:49] !log springle synchronized wmf-config/db-eqiad.php 'slaves to full steam after package upgrade' [05:59:54] no CVE yet [05:59:58] Logged the message, Master [06:00:14] Logged the message, Master [06:01:15] (03PS1) 10Ori.livneh: Add missing trailing semicolon to Gdash Nginx config [operations/puppet] - 10https://gerrit.wikimedia.org/r/97683 [06:02:30] (03PS2) 10Mwalker: Stating that OCG is an upstart job and setting config repo [operations/puppet] - 10https://gerrit.wikimedia.org/r/97681 [06:02:38] silly whitespace [06:03:17] (03CR) 10Ori.livneh: [C: 04-1] "Small nit" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97681 (owner: 10Mwalker) [06:03:31] oh [06:04:08] (03CR) 10Ori.livneh: [C: 032] Add missing trailing semicolon to Gdash Nginx config [operations/puppet] - 10https://gerrit.wikimedia.org/r/97683 (owner: 10Ori.livneh) [06:04:21] ori-l: what was your small nit? [06:05:31] the whitespace, and ensure not being the first property [06:06:35] ah; ok; I'll flip the ensure [06:06:37] but, um, [06:06:52] * mwalker waits [06:07:03] you asked for operations/ocg-config, rather than operations/software/ocg-config [06:07:16] so the mistake was mine. 
you shouldn't change the manifest [06:07:40] oh; I figured you had a reason to put it under software [06:07:48] I just guessed at where it should live [06:08:04] paravoid: ... where should a configuration repo live? [06:08:06] gah, I rushed it without thinking. [06:08:13] I should delete the repo. [06:09:44] (03PS1) 10Ori.livneh: Add missing trailing semicolon to Gdash Nginx config [operations/puppet] - 10https://gerrit.wikimedia.org/r/97684 [06:11:18] I deleted it [06:11:30] good to get faidon's two cents on repo namespace [06:13:17] (03CR) 10Ori.livneh: [C: 032] Add missing trailing semicolon to Gdash Nginx config [operations/puppet] - 10https://gerrit.wikimedia.org/r/97684 (owner: 10Ori.livneh) [06:13:29] fatal: Project not found: operations/software/ocg-config [06:13:29] fatal: The remote end hung up unexpectedly [06:13:31] :-P [06:19:49] hurm; that doesn't sound good [06:20:01] but jeremyb; maybe you know -- where should a configuration repo live? [06:20:16] Isn't there a wmf-config directory? [06:20:47] there is; but it doesn't seem like that's where this should live [06:20:51] but... maybe it should [06:21:17] a single file (or maybe two files); for a server that wont host a MediaWiki install [06:21:27] mwalker: jeremyb was just confirming i deleted it [06:21:43] shouldn't it be in puppet? [06:22:06] לַיְלָה טוֹב [06:22:14] you too [06:22:18] I was about to say that. [06:22:23] !log pwercycled labstore1001, unreachable, nothing on mgmt console [06:22:38] Logged the message, Master [06:22:53] mwalker: https://git.wikimedia.org/repositories/ [06:23:01] operations/puppet probably, though I'm not really paying attention in here. [06:23:14] neither am I, evidently [06:23:15] :/ [06:23:16] If that ocg repo was deleted, it's still lingering on GitHub and git.wm.o. [06:23:24] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 16%, RTA = 0.29 ms [06:23:33] i deleted it on github [06:23:38] git.wm.o is just a caching thing [06:23:55] Fair enough. 
[06:24:07] Strangely, my GitHub news feed only announces the creation. [06:24:25] I toyed with putting it into puppet as an erb; but that didn't seem correct to have service configuration in with server configuration [06:25:14] apache conf, puppet, mediawiki conf are all directly under operations [06:25:35] how does parsoid get configured now? or does it not have configuration? [06:26:28] good night [06:26:30] :) [06:26:32] good night [06:26:38] mwalker, there's a distinction drawn between modules and roles [06:27:15] a module -- in its platonic form, at least -- provides a set of generic abstractions for provisioning and configuring a particular application [06:28:53] a role declares how a module is configured on some particular node or nodes [06:29:57] so; should ocg actually be written as a module? [06:30:04] yes [06:30:10] ideally you'd have a generic ocg module [06:30:26] ah; ok -- I'll work that in [06:30:33] and you'd configure it by instantiating the module from a role file in manifests/role/ocg.pp or whatever [06:31:05] configuration files are typically either managed by passing all the specific values as parameters to the module [06:31:15] and having the module itself generate the config file based on the parameters [06:31:27] or, when it is impractical to do it that way [06:31:28] this makes it much easier for a labs instance owner to re-use your module by setting up their own role with their own config params [06:31:39] (for example) [06:32:12] yeah [06:33:10] when it's impractical to have the module generate the configuration files (because the set of configuration options supported by the application is too large for each one to be represented as a parameter) [06:33:23] you can have the role deploy files via the files/ dir and templates via the templates/ dir [06:33:57] it's entirely appropriate for files in either of those locations to contain site-specific configs [06:35:02] TL;DR: blah blah puppet blah blah blah [06:35:30] puppet == evil [06:35:30] 
heh, I wanted to bring up those ocg issues myself, but I was thinking of doing it after the sprint is over as to not block it [06:35:33] thanks ori-l :) [06:35:48] I'd rather do it right the first time [06:35:58] possibly the second time [06:36:14] in case of large ice cube in the way of the first time [06:36:16] :P [06:37:16] now you are eligible to participate in the deep philosophical conversations we have about module/role edge-cases [06:37:24] (03CR) 10Nemo bis: "As said on the bug, I don't see there is any unanswered question here. The requested thing is actually very simple: titles that used a sub" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [06:37:41] (03CR) 10TTO: "The only unanswered question I see is that of permissions, and the obvious answer there is that anyone should be allowed to create pages i" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [06:39:07] mumble mumble shadows on cave walls mumble butterflies mumble [06:40:35] we also have role modules! [06:40:37] (no kidding) [06:41:08] hmm... 
that gets very meta [06:41:45] * paravoid waits for reqerror to start increasing again [06:42:06] I prefer role modules to having those config-specific templates and files live in the generic module, my 2 cents [06:42:36] i do too [06:42:49] for non-trivial software, the possibilities are: [06:43:02] either the generic module *really* tried to be generic, and you have a wall of 500 parameters [06:43:07] and a template files that reads [06:43:15] <% if @foo %> [06:43:20] foo = <%= @foo %> [06:43:23] <% end if %> [06:43:27] x500 [06:43:40] or the person writing the module cheated and put only the config vars they needed [06:43:40] (03PS1) 10Faidon Liambotis: reqstats 5xx: shorten period to -1hours-now [operations/puppet] - 10https://gerrit.wikimedia.org/r/97685 [06:43:45] so you get laughably specific parameters [06:44:31] (03CR) 10Faidon Liambotis: [C: 032 V: 032] reqstats 5xx: shorten period to -1hours-now [operations/puppet] - 10https://gerrit.wikimedia.org/r/97685 (owner: 10Faidon Liambotis) [06:44:43] ori-l: I've seen the pattern, sure [06:44:56] ori-l: sometimes even the latter is useful too, though [06:45:54] yeah, it depends [06:46:39] I don't think I've had a single positive experience with any form of ruby packaging ever [06:46:51] :-D [06:47:15] (03PS1) 10ArielGlenn: fix typo that broke decomm racktables host check [operations/software] - 10https://gerrit.wikimedia.org/r/97686 [06:48:22] (03CR) 10ArielGlenn: [C: 032] fix typo that broke decomm racktables host check [operations/software] - 10https://gerrit.wikimedia.org/r/97686 (owner: 10ArielGlenn) [07:06:40] (03PS1) 10ArielGlenn: in host reports, skip rejected salt keys for salt source [operations/software] - 10https://gerrit.wikimedia.org/r/97687 [07:13:36] (03PS1) 10Ori.livneh: Gdash module: remove professor-specific hacks; use canonical locations [operations/puppet] - 10https://gerrit.wikimedia.org/r/97688 [07:14:17] oh that was fast, I was just looking at it [07:14:53] it being professor or the 
module? [07:15:01] (03CR) 10Ori.livneh: [C: 032] Gdash module: remove professor-specific hacks; use canonical locations [operations/puppet] - 10https://gerrit.wikimedia.org/r/97688 (owner: 10Ori.livneh) [07:15:03] professor was whining [07:15:10] hence I was about to look at the module :-) [07:15:16] the check is disabled in icinga [07:15:20] where is it whining? [07:15:34] I run a report to see what's going on [07:15:42] my report pointed at professor [07:15:42] ahh [07:16:04] basically, it's running lucid and it has a bunch of things that are unpuppetized / undebianized [07:16:10] tungsten is slotted to replace it [07:16:18] I'll just put in my notes that professor is on its way out and tungsten is n its way in [07:16:22] it was getting to be a bit much to write puppet stuff that was compatible with both [07:16:26] yep, thanks [07:17:00] thanks for looking after it [07:17:11] heh sure [07:25:07] (03PS2) 10Matanya: wikidata_singlenode : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 [07:26:01] (03CR) 10jenkins-bot: [V: 04-1] wikidata_singlenode : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 (owner: 10Matanya) [07:28:10] (03PS3) 10Matanya: wikidata_singlenode : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 [07:32:20] (03CR) 10ArielGlenn: [C: 032] in host reports, skip rejected salt keys for salt source [operations/software] - 10https://gerrit.wikimedia.org/r/97687 (owner: 10ArielGlenn) [07:40:35] (03CR) 10Matanya: "Change was tested in labs, and it is a noop." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 (owner: 10Matanya) [07:52:08] (03PS1) 10Ori.livneh: Specify hasrestart => true for nginx and uwsgi services [operations/puppet] - 10https://gerrit.wikimedia.org/r/97689 [07:53:41] (03CR) 10Ori.livneh: [C: 032] wikidata_singlenode : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 (owner: 10Matanya) [07:53:57] (03CR) 10Ori.livneh: [C: 032] Specify hasrestart => true for nginx and uwsgi services [operations/puppet] - 10https://gerrit.wikimedia.org/r/97689 (owner: 10Ori.livneh) [07:56:50] thanks ori-l [07:56:55] !log depooling ssl1003 for quick test of puppet config [07:56:56] np [07:57:12] Logged the message, Master [08:00:21] !log re-pooled ssl1003 [08:00:34] Logged the message, Master [08:02:08] ori-l: note that after depooling you have to wait a bit for connections to finish [08:02:12] not sure if you knew, just making sure [08:02:38] i waited for the network graph in ganglia to go flat [08:02:43] is that enough? [08:02:47] where do you usually look? [08:02:59] I usually just do netstat [08:03:14] ganglia going flat is also enough, sure [08:03:14] heh, yes, that's easier than fishing around in ganglia [08:03:16] good call [08:04:46] paravoid: curl -H 'Host: gdash.wikimedia.org' tungsten.eqiad.wmnet [08:04:48] \o/ [08:05:06] yay [08:12:56] (03PS1) 10Yuvipanda: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 [08:33:53] (03PS2) 10Yuvipanda: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 [08:39:54] (03PS3) 10Yuvipanda: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 [08:41:56] wow, this actually works! [08:41:58] * yuvipanda fistpumps [08:42:24] all the tabs! :) [08:42:40] thanks matanya for cleaning up our puppet stuff [08:43:19] my plesure aude :) [08:47:23] hashar: around? 
[08:49:07] matanya: yup [08:51:36] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 3d 23h 33m 10s [08:51:37] (03PS1) 10Ori.livneh: Serve gdash.wikimedia.org on misc varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/97693 [08:56:36] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:57:03] (03CR) 10Odder: [C: 031] "Didn't get any reply from Stuart, so please go ahead. We might always re-add his blog in the future." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96935 (owner: 10Dzahn) [08:58:46] (03CR) 10Odder: "Can't pick it up, @Dzahn, I don't have an account on RT and can only participate in tickets I'm specifically CC'd to." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/88705 (owner: 10Dzahn) [08:59:47] (03CR) 10Odder: "I'd suggest changing wikimedia.li and wikimedia.pl to ns*.wikimedia.org, and removing the others." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [09:03:51] hashar: in modules/zuul/manifests/init.pp line 21, can you please explain why do you have a value there rather than UNSET? it is declared in the role anyway [09:05:19] (03PS4) 10Yuvipanda: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 [09:08:15] matanya: at which commit ? 
[09:08:31] hashar: on prod [09:09:19] so that should be 6241272 "wmf: soften requirements" [09:09:25] err no [09:10:16] 1e3adfd WMF: debug time to launch a build in Jenkins [09:10:35] arhgh [09:10:38] in puppet [09:10:39] (03PS1) 10Ori.livneh: add self to icinga auth groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/97694 [09:10:54] ori-l: hey [09:11:01] hey [09:11:25] (03CR) 10Faidon Liambotis: [C: 032] add self to icinga auth groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/97694 (owner: 10Ori.livneh) [09:11:31] matanya: the Zuul module comes from OpenStack infra [09:11:38] paravoid: thanks, appreciate it [09:11:42] pgehres is still on there, btw [09:11:43] matanya: so yeah it ends up with some default values, no point in cleaning them up though [09:11:53] matanya: don't waste your time on zuul module :] [09:12:14] matanya: a memcached module, though...... [09:12:23] another 503 spike [09:12:26] who the fuck knows why [09:12:35] hashar: I want to change it to UNSET for lint pass. I hate lint warnings [09:13:00] ori-l: memcached is in m, i'm in z now :) [09:13:09] ah, eqiad-esams packet loss event [09:13:33] how did you notice packet loss? [09:13:39] er, rather, how did you notice the 503 spike? [09:13:43] you just have it open at this point? [09:13:50] I just have it open... 
[09:13:53] heh [09:14:11] no time to proactively fix issues, so all I can do is just be reactive now :) [09:17:37] (03PS1) 10Matanya: zuul : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97695 [09:18:32] come on jenkins [09:20:41] ori-l: I wonder if I should just make the check be "5xx the past XX minutes" [09:20:47] or if it'll just be too spammy [09:21:05] not yet, i don't think [09:21:06] otoh, if we did have errors the past hour, it might make sense to just monitor it manually after that [09:22:01] (03PS1) 10Matanya: webperf :lint-clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97697 [09:27:08] (03PS1) 10Ori.livneh: gdash.wm.o: noc -> misc varnish [operations/dns] - 10https://gerrit.wikimedia.org/r/97698 [09:27:56] (03CR) 10Ori.livneh: [C: 04-1] "Needs I088a655a1" [operations/dns] - 10https://gerrit.wikimedia.org/r/97698 (owner: 10Ori.livneh) [09:29:10] good night [09:33:09] bye ori-l [09:34:33] (03CR) 10Hashar: [C: 04-1] "Please do not set invalid default values, ideally they should be sane value and not UNSET." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97695 (owner: 10Matanya) [09:34:41] matanya: ^^^^ [09:34:57] matanya: can't split strings in puppet exec command [09:35:05] looking [09:36:44] thanks hashar. Any suggestion on a better default, and a way to cut the lines over 80 chars? if not, i can just leave this alone [09:38:09] matanya: iirc Puppet Forge recommends having a default.pp that lists all the defaults [09:38:12] and passing them to the class [09:38:33] as for the values, we could look them up but honestly I don't really want to spend 1 + hour figuring sane values for each parameter [09:38:38] so yeah just leave it alone sorry [09:38:50] no problem hashar thanks [09:39:00] we also have a lot of classes not having defaults, so maybe we can fix those lints later on, but I don't think it should be a priority right now. 
[09:39:12] (03Abandoned) 10Matanya: zuul : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97695 (owner: 10Matanya) [09:39:21] as for splitting long lines, I have NO idea how it can be done in puppet. Ruby might have a way to do so. [09:39:43] what is your proirity hashar ? [09:40:06] triggering browser tests and preparing integration tests for VisualEditor / Parsoid :D [09:41:17] :) yes, and puppet wise? [09:44:31] matanya: puppet is never a priority to me [09:44:52] that is merely the bottleneck / entry gate toward deploying to production [09:45:18] so puppet manifests are only a priority if it is preventing me from deploying something [09:45:25] nice defenition [09:46:26] as an example, I recently had to add a new firewall rule on a machine. The previous system (Augeas) was not maintained anymore by anyone so had to refactor a bunch of manifests to use the new firewall system (ferm). [09:46:33] and add my rule there :] [09:46:47] while at it we firewalled out ssh as well [09:48:22] oops [10:05:11] matanya: that was expected :-D [10:05:33] so ok, then :) [10:05:37] matanya: we have bastion as entry point, so we should connect via the proxy instead of directly to the host [10:06:12] yeah, that is smoewhat an annoting way to work, but i see where i comes from [10:06:18] *y [11:01:48] (03PS1) 10Jforrester: Split up dblists for VisualEditor with comments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97710 [11:01:49] (03PS1) 10Jforrester: Enable VisualEditor as opt-in on svwikitionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 [11:02:45] (03CR) 10jenkins-bot: [V: 04-1] Split up dblists for VisualEditor with comments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97710 (owner: 10Jforrester) [11:02:46] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor as opt-in on svwikitionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [11:04:26] 
(03PS5) 10Yuvipanda: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 [11:05:58] (03CR) 10Catrope: [C: 04-1] Enable VisualEditor as opt-in on svwikitionary (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [11:06:12] (03PS2) 10Jforrester: Enable VisualEditor as opt-in on svwikitionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 [11:06:33] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [11:06:52] (03PS1) 10Yuvipanda: dynamicproxy: Separate api configuration into own file [operations/puppet] - 10https://gerrit.wikimedia.org/r/97712 [11:06:53] (03PS1) 10Yuvipanda: dynamicproxy: Ensure that default site config is always enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/97713 [11:07:22] (03Abandoned) 10Jforrester: Split up dblists for VisualEditor with comments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97710 (owner: 10Jforrester) [11:07:37] (03PS1) 10Jforrester: Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 [11:07:58] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 (owner: 10Jforrester) [11:09:28] (03PS3) 10Jforrester: Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 [11:11:35] (03PS4) 10Jforrester: Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 [11:12:06] (03PS1) 10ArielGlenn: add amssq31-46 to dhcp, guess they never made it in [operations/puppet] - 10https://gerrit.wikimedia.org/r/97715 [11:13:24] (03CR) 10ArielGlenn: [C: 032] add amssq31-46 to dhcp, guess they never 
made it in [operations/puppet] - 10https://gerrit.wikimedia.org/r/97715 (owner: 10ArielGlenn) [11:19:40] (03PS2) 10Jforrester: Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 [11:32:00] !jenkins operations-mw-config-tests [11:32:01] https://integration.wikimedia.org/ci/job/operations-mw-config-tests [11:42:40] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [11:43:30] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [11:46:15] (03CR) 10Lydia Pintscher: [C: 04-1] "I just talked to DanielK about this. I'll have to put more thought into the whole thing. I will try to do this over the next week." (031 comment) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/65443 (owner: 10Dzahn) [11:51:50] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 3h 0m 14s [14:04:04] !log jenkins / zuul: cherry picked a change from upstream https://review.openstack.org/#/c/43978/ and deployed it in production. 1e3adfd..1b47026 (tagged wmf-deploy-20131126). That should slightly speed up Zuul when it receives a change, previously it would issue git remote update whenever an event is received which takes a few seconds for mediawiki/core. 
[14:04:18] Logged the message, Master [14:31:24] grumble grumble [14:47:48] (03PS2) 10Yuvipanda: dynamicproxy: Separate api configuration into own file [operations/puppet] - 10https://gerrit.wikimedia.org/r/97712 [14:47:49] (03PS2) 10Yuvipanda: dynamicproxy: Ensure that default site config is always enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/97713 [14:47:50] (03PS6) 10Yuvipanda: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 [14:48:22] (03CR) 10Yuvipanda: [C: 04-1] "Need to test wit wikitech with help of andrew before merging" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97712 (owner: 10Yuvipanda) [14:52:13] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 6h 0m 37s [14:53:03] (03PS7) 10coren: dynamicproxy: Add URL based router [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 (owner: 10Yuvipanda) [14:53:15] yuvipanda: hey [14:53:19] hey paravoid [14:53:24] yuvipanda: you're based in india, right? [14:53:28] paravoid: yup [14:53:39] are you there now? [14:53:42] paravoid: yup [14:53:56] cool, can you help me with something? [14:53:58] sure! [14:54:02] do you see fundraising banners? [14:54:10] let me check [14:54:47] paravoid: nope [14:54:47] (03CR) 10coren: [C: 032] "Seems reasonably sane." [operations/puppet] - 10https://gerrit.wikimedia.org/r/97690 (owner: 10Yuvipanda) [14:54:53] hm [14:54:54] so [14:55:08] something started yesterday (the 25th) at approx 12:00 UTC [14:55:23] and we have tons of requests for http://en.wikipedia.org\ [14:55:23] am I supposed to be? [14:55:27] i.e. "GET \" [14:55:27] oh [14:55:44] hm, most of this is coming from Google [14:55:46] that's strange [14:55:54] 70% is coming from google.co.in [14:56:02] that's weird [14:56:45] paravoid: I tried a bunch of queries, seem fine... 
[14:57:14] hm, ok [14:57:22] hmm, why is this even 503ing [14:57:29] okay, I'll debug a bit further, I might ping you in a bit [14:57:46] paravoid: sure! [14:58:22] yuvipanda, what happened? [14:58:44] MaxSem: paravoid is investigating a bunch of requests for '\' that seem to be coming from google.co.in [14:59:47] hm [14:59:52] I think I know... [15:00:42] honorary mention in iptables?:P [15:01:35] damn, not it [15:02:18] 2566 GET http://en.wikipedia.org\ - text/html; charset=utf-8 https://www.google.co.in/ - Mozilla/5.0 (Windows NT 6.1; W [15:02:21] OW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36 en-US,en;q=0.8 - [15:03:00] 26837 times yesterday [15:03:19] 51854 today [15:05:26] 4897 RxHeader c ki/Sati_(practice) HTTP/1.1 [15:05:28] omfg [15:06:02] 2909 RxRequest c GET [15:06:02] 2909 RxURL c \ [15:06:02] 2909 RxProtocol c [15:06:02] 2909 RxHeader c ki/Irandaam_Ulagam HTTP/1.1 [15:06:02] 2909 RxHeader c Host: en.wikipedia.org [15:06:07] what. the. [15:08:09] paravoid: So MediaWiki:Foo.js pages ... [15:08:43] RoanKattouw: sorry, putting off another 3 fires atm [15:08:55] er [15:09:02] putting out even :) [15:10:17] this is one of the strangest things I've ever seen [15:10:28] seems to be random linebreak? [15:10:34] yes [15:10:48] not very random [15:11:10] these headers start with 'ki/' [15:12:46] all kinds of UAs, but isolated mostly in India [15:12:48] hmm, I just clicked through from google to sati_(practice) a couple of times [15:12:52] no 503 [15:13:02] someone wrote their crawler in hardcore C and missed a couple of bytes? [15:13:14] referrer google.co.in? 
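A minimal Python sketch of the failure mode described above, where the request line arrives as `GET \` and the rest of the path (`ki/Sati_(practice) HTTP/1.1`) lands on the next line and is read as a header. The function name `check_request_head` and both regular expressions are illustrative assumptions, not what Varnish does internally; the sketch only checks the `method SP target SP HTTP-version` shape of the first line and the `name: value` shape of the lines after it.

```python
import re

# Request line shape: method SP request-target SP HTTP-version.
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ HTTP/\d\.\d$")
# Header line shape: field-name ":" field-value (field-name is an HTTP token).
HEADER_LINE = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+:.*$")


def check_request_head(raw: str) -> list:
    """Return a list of problems found in a raw HTTP request head."""
    lines = raw.split("\r\n")
    problems = []
    if not REQUEST_LINE.match(lines[0]):
        problems.append(f"bad request line: {lines[0]!r}")
    for line in lines[1:]:
        # Empty lines mark the end of the head; skip them.
        if line and not HEADER_LINE.match(line):
            problems.append(f"bad header line: {line!r}")
    return problems


# A well-formed request, and one split the way the varnishlog output shows
# (hypothetical reconstruction: "GET \" followed by a line break).
good = "GET /wiki/Sati_(practice) HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"
broken = "GET \\\r\nki/Sati_(practice) HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"

for problem in check_request_head(broken):
    print(problem)
```

Run against the broken sample, the sketch flags both the truncated request line and the stray `ki/...` line, which matches the `RxURL c \` / `RxHeader c ki/...` pairs in the paste.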
[15:15:06] yuvipanda: about 70% of them, yes; although it could just be lots of hits from India [15:15:11] and lots from Google [15:15:25] MaxSem: all kinds of UAs, 78k hits since yesterday [15:16:06] might be because google is directing them to badly crawled links;) [15:17:09] I initially thought it was https://gerrit.wikimedia.org/r/#/c/96941/3/modules/varnish/templates/vcl/wikimedia.vcl.erb [15:17:15] but the timeline doesn't fit [15:17:27] it's off by 15 hours [15:18:21] it's a good fit though [15:18:36] MaxSem: badly crawled links wouldn't explain browsers sending completely invalid HTTP requests [15:19:43] (03PS4) 10Jforrester: Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [15:20:41] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [15:22:03] by coincidence, are these requests coming via https? [15:22:42] (03PS5) 10Jforrester: Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [15:22:46] not that I can see of [15:22:54] but it was a good idea [15:25:07] I even grepped for IPs, fearing it might be a proxy of some sorts [15:33:27] 08.33 < ori-l> -tech is basically dead <-- why not add it to the Wikimedia error then ;) 500 million worldwide-distributed users' eyeballs are more reliable than one icinga-wm :P [15:40:27] Nemo_bis: so true :D [15:56:39] (03CR) 10Odder: "Also, CC'ing Saper who's been directly involved in daily management of whatever's located under *.wikimedia.pl." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [16:03:03] (03CR) 10MZMcBride: "TTO: I wonder if we should give "createpage" back to everyone and simply restrict main namespace page creation for anons." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [16:09:55] (03PS1) 10Ottomata: Updating varnishkafka.conf.erb with stats configs [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/97748 [16:10:43] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka.conf.erb with stats configs [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/97748 (owner: 10Ottomata) [16:23:36] yay to the disappearance of old etherpad, it's been a long time coming! [16:28:56] (03CR) 10Swalling: "re: "The only unanswered question I see is that of permissions, and the obvious answer there is that anyone should be allowed to create pa" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [16:39:53] (03CR) 10Nemo bis: ""The permissions issue is not just page creation, but also publishing to mainspace."" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [16:40:11] (03CR) 10Legoktm: "@TTO, the discussion on the village pump right now seems that people understood that anons would be allowed to create pages there, even if" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [17:07:32] (03CR) 10Swalling: "Anon creation of drafts is one of the more fundamental things that we should be adding after the fact in some vague way." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [17:13:06] (03CR) 10Nemo bis: "Steven, that's not what I read in the bug comments. :) If that's your interpretation, I suggest to add it on bugzilla so that they can con" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [17:15:18] (03CR) 10Swalling: "@Nemo: it's pretty damn clear." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [17:20:58] (03CR) 10Nemo bis: "Steven, sure, it's pretty damn clear to me too, that it means the opposite of what you're saying. ;) Let's not be talibans. I added a clar" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [17:44:23] (03PS1) 10Yuvipanda: dynamicproxy: Prioritize url routes by length [operations/puppet] - 10https://gerrit.wikimedia.org/r/97758 [17:46:10] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Prioritize url routes by length [operations/puppet] - 10https://gerrit.wikimedia.org/r/97758 (owner: 10Yuvipanda) [17:52:39] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 9h 1m 3s [17:53:51] any idea why jenkins -1'd this? https://gerrit.wikimedia.org/r/#/c/97758/ [17:54:14] (03CR) 10Yuvipanda: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97758 (owner: 10Yuvipanda) [17:55:50] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15882 MB (3% inode=99%): [17:58:49] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15919 MB (3% inode=99%): [18:02:20] (03PS3) 10Yuvipanda: dynamicproxy: Separate api configuration into own file [operations/puppet] - 10https://gerrit.wikimedia.org/r/97712 [18:02:27] (03PS3) 10Yuvipanda: dynamicproxy: Ensure that default site config is always enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/97713 [18:02:49] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15638 MB (3% inode=99%): [18:19:46] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15659 MB (3% inode=99%): [19:00:48] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15842 MB (3% inode=99%): [19:05:43] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15765 MB (3% inode=99%): 
[19:07:12] I wonder how hard it would be to interpolate the system role definition for the node into the IRC alerts [19:08:09] gwicke: it looks like you're administering cerium, right? [19:16:31] !log reedy updated /a/common to {{Gerrit|Idb0e7956e}}: slaves to full steam [19:16:47] Logged the message, Master [19:17:01] lies [19:17:08] (03PS1) 10Reedy: Non Wikipedias to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97772 [19:19:08] ori-l: yes [19:20:24] one of the partitions is filling up in cassandra load testing, no biggie [19:20:56] !log Reindexing GeoData [19:21:11] Logged the message, Master [19:21:52] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97772 (owner: 10Reedy) [19:22:00] (03Merged) 10jenkins-bot: Non Wikipedias to 1.23wmf5 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97772 (owner: 10Reedy) [19:22:18] that was quick [19:22:43] RECOVERY - Disk space on cerium is OK: DISK OK [19:23:37] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All non wikipedias to 1.23wmf5 [19:23:52] Logged the message, Master [19:23:53] ori-l: there are a lot of content handler exceptions in the log [19:24:19] not that it's flooding, but still [19:33:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Revert to 1.23wmf4 on wikisources [19:33:54] Logged the message, Master [19:43:15] (03PS8) 10Ottomata: Setting up varnishkafka on mobile varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [19:43:49] ottomata: the final patch :P [19:44:03] ha, maybe! [19:44:32] * drdee keeps fingers crossed [19:46:01] (03CR) 10Edenhill: [C: 031] Setting up varnishkafka on mobile varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [19:52:18] Hello! 
I'm having an issue with spam-bots on a MediaWiki-based website, and I'm told I should block requests with empty user-agent strings. I'm also told that Wikimedia blocks empty user-agent strings, and I dropped in to ask what WMF's particular method is. [19:52:43] (I'm assuming Apache's mod_rewrite, but.. I don't know for sure, so that's why I'm asking here.) [19:52:53] Any info you can give me would be appreciated. :D [19:53:14] I doubt that helps [19:53:22] also, this is a question for #mediawiki [19:53:57] (03CR) 10Ori.livneh: "To make it possible to use Kafka elsewhere, it'd be nice if the class name was specific to the use-case, rather than 'varnish::kafka'" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [19:56:01] Nemo_bis: it's not related to mediawiki, it's related to WMF's configurations, because I'm assuming them to use a sort of best-practice [19:56:58] (03CR) 10Ottomata: "Ori, this patch is specifically for varnishkafka. Kafka can still be used elsewhere using the kafka module or role::analytics::kafka* rol" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [19:57:09] No_one_at_all: wrong assumption [19:57:14] Nemo_bis: and at least 50% of our current issue is traceable to spambots using empty useragents, so it's a simple band-aid that might do quite a bit of good [19:57:39] Nemo_bis: ok. I'm just asking, though. Kent hoyt. [19:57:47] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Separate api configuration into own file [operations/puppet] - 10https://gerrit.wikimedia.org/r/97712 (owner: 10Yuvipanda) [19:58:10] (03PS1) 10Reedy: Revert "Non Wikipedias to 1.23wmf5" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97782 [19:58:23] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Ensure that default site config is always enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/97713 (owner: 10Yuvipanda) [19:58:37] Reedy: what went boom? 
[19:58:48] Wasn't supposed to deploy today [19:58:58] ProofreadPage is/was broken on wikisources on wmf5 though [19:59:09] ah, right, no deploys this week [19:59:59] (03CR) 10Reedy: [C: 032] Revert "Non Wikipedias to 1.23wmf5" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97782 (owner: 10Reedy) [20:00:24] (03Merged) 10jenkins-bot: Revert "Non Wikipedias to 1.23wmf5" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97782 (owner: 10Reedy) [20:01:54] paravoid: fundraising just reran the test from yesterday that we thought was causing 4xx errors -- there was no repeated spike; so it came from something else [20:05:17] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Not wmf5 day today [20:05:31] Logged the message, Master [20:11:40] paravoid: got a minute for https://gerrit.wikimedia.org/r/#/c/97693/ ? [20:16:34] !log csteipp synchronized php-1.23wmf5/extensions/TimedMediaHandler 'bug56699' [20:16:49] Logged the message, Master [20:17:39] (03PS1) 10Odder: Add a MassMessage-related user group on Meta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97790 [20:21:01] !log csteipp synchronized php-1.23wmf4/extensions/TimedMediaHandler 'bug56699' [20:21:16] Logged the message, Master [20:21:56] (03CR) 10John F. 
Lewis: [C: 031] Add a MassMessage-related user group on Meta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97790 (owner: 10Odder) [20:24:28] PROBLEM - Varnish HTTP mobile-frontend on cp3012 is CRITICAL: Connection timed out [20:29:48] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: Connection timed out [20:30:14] pages [20:30:48] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.292 second response time [20:32:18] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:48] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: Connection timed out [20:34:48] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.326 second response time [20:35:08] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.295 second response time [20:35:21] (03PS1) 10Ottomata: Adding JsonLogster parser. [operations/debs/logster] - 10https://gerrit.wikimedia.org/r/97830 [20:37:44] (03CR) 10PiRSquared17: [C: 031] Add a MassMessage-related user group on Meta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97790 (owner: 10Odder) [20:37:58] (03PS1) 10Andrew Bogott: Add admins::labs class to other labs boxes. 
[operations/puppet] - https://gerrit.wikimedia.org/r/97831 [20:38:18] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [20:39:08] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.289 second response time [20:43:18] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [20:44:08] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.290 second response time [20:47:49] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: Connection timed out [20:48:06] (CR) Milimetric: [C: 1 V: 1] "just one small style thing" (1 comment) [operations/debs/logster] - https://gerrit.wikimedia.org/r/97830 (owner: Ottomata) [20:49:18] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [20:49:48] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.288 second response time [20:50:08] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.293 second response time [20:52:38] (PS2) Ottomata: Adding JsonLogster parser. [operations/debs/logster] - https://gerrit.wikimedia.org/r/97830 [20:53:09] (PS3) Ottomata: Adding JsonLogster parser. [operations/debs/logster] - https://gerrit.wikimedia.org/r/97830 [20:53:38] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 12h 2m 2s [20:53:55] (PS4) Ottomata: Adding JsonLogster parser. 
[operations/debs/logster] - https://gerrit.wikimedia.org/r/97830 [20:54:28] PROBLEM - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /mnt/tmp 1474 MB (1% inode=99%): [20:54:48] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: Connection timed out [20:56:18] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [20:56:28] what is "testdb" on praesodymium? [20:56:46] praseodymium damn it [20:56:48] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.297 second response time [20:56:59] (CR) Andrew Bogott: [C: 2] Add admins::labs class to other labs boxes. [operations/puppet] - https://gerrit.wikimedia.org/r/97831 (owner: Andrew Bogott) [20:57:17] cassandra test host according to puppet [20:57:33] I saw cerium geeetting pretty full earlier [20:58:08] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22809 bytes in 0.297 second response time [20:59:05] apergos: gwicke is on it, talking to him [20:59:08] RECOVERY - Varnish HTTP mobile-frontend on cp3012 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.194 second response time [20:59:11] great [21:00:33] mhoover: OK, access to the other labs boxes should trickle in over the next hour. [21:00:47] Have you had a chance to look at the official openstack puppet modules? [21:01:18] half hour [21:01:48] if you mean production boxes that service labs [21:01:55] ok, half hour :) [21:02:08] I'm just trying to keep expectations low [21:02:08] what fixed his access the other day? [21:02:19] or her [21:02:22] mutante, previously I had given him sudo but no actual account. [21:02:38] ah:) got it [21:04:20] !log killed attached gdb on cp3012's varnishd frontend, which restarted it... 
[21:04:35] Logged the message, Master [21:04:51] oh so I remember you had given him the actual account on whatever box [21:05:08] and he couldn't get on, and I checked perms (which were ok) but by then he had given up and gotten off irc [21:05:12] andrewbogott: ok, cool. thx andrew [21:05:14] so I never heard the end of the story [21:05:20] i've read through the docs yes [21:05:50] apergos: i hear he existed in sudo but the account just didn't [21:05:56] so unknown users in auth.log [21:06:04] apergos: Oh, I guess I never heard the end either -- when I left last night it still wasn't working... [21:06:33] works now, yes. at least on virt1000 [21:06:34] but then .. i could su to the user later [21:06:41] and i can sudo [21:06:42] and there were no further attempts in the log [21:06:46] well that was earlier, I saw an attempt after the unknown user [21:06:47] ah, nice [21:06:48] who knows [21:06:50] ok [21:07:08] at least it was from the right ip, but yet not unknown user and not pubkey either [21:07:14] oh well, if solved, good [21:07:30] the virt* servers are all openstack? [21:07:47] are there several or just virt1000 [21:08:07] pretty much. In pmtpa virt0 is running keystone and the web interface and such... [21:08:26] and virt1000 is va? [21:08:35] virt[7..11] are compute nodes [21:08:56] And, yeah, in eqiad (virginia) we'll have virt1000 - virt1009. [21:09:07] Still wrangling the other hosts, but virt1000 should be all set. [21:09:20] ok [21:09:24] It might be due for a fresh install of ubuntu, I'm not sure. [21:09:47] For puppet testing and development you probably want to use labs though. Have you done any of that yet? [21:10:11] looks like virt1000 is 12.04 [21:10:33] i haven't done anything with the labs yet [21:10:34] that's fine then, Precise is what we run everywhere [21:10:37] but would like to [21:10:45] ok. 
Have a look at this page: https://wikitech.wikimedia.org/wiki/Special:NovaInstance [21:10:54] ACKNOWLEDGEMENT - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /mnt/tmp 0 MB (0% inode=99%): daniel_zahn not production - test hosts, ask gwicke, dont need to be monitored [21:11:00] In the filter box up top, type 'openstack' and hit 'Set filter' [21:11:01] do i join a project or create a new one? looks like the pref method is to join [21:11:37] (I added you as an admin to the 'openstack' project, which is probably the right place for you to do this.) [21:11:46] got it thx :) [21:12:44] So, if you visit that page you see an 'add instance' link for the openstack project? [21:14:03] looking for add instance [21:14:39] oh ok [21:14:40] yes [21:14:43] small link heh [21:14:58] pmtpa add instance ok [21:14:59] (PS2) Ottomata: [not ready for review] Productionizing Wikimetrics [operations/puppet] - https://gerrit.wikimedia.org/r/96042 (owner: Milimetric) [21:15:08] !log praseodymium, cerium, xenon: disabled icinga notifications and scheduled 1yr downtime for host and all services per gwicke, they are test hosts and not prod. [21:15:11] gwicke: apergos ^ [21:15:21] Logged the message, Master [21:15:45] ok (but if puppet doesn't run I shall be annoyed, so no filling the root partition :-P) [21:15:49] mhoover: Create yourself an instance and then configure as per https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [21:16:29] That will let you tinker with puppet locally and try out various upstream packages before actually merging anything into our production repo [21:16:43] sweet. doing... [21:16:57] You can make yourself a little test cluster that way, do some trial runs. 
[21:16:57] let me test the new trigger for that [21:17:01] !self | mhoover [21:17:01] mhoover: https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [21:17:49] also links from the puppet coding page now [21:17:57] mhoover: of course you're welcome to test on the live eqiad machines as well, but for puppet purposes it's best to get your patches ready and tested beforehand... [21:18:19] otherwise it's edit/review/merge/test/repeat [21:18:33] (CR) jenkins-bot: [V: -1] [not ready for review] Productionizing Wikimetrics [operations/puppet] - https://gerrit.wikimedia.org/r/96042 (owner: Milimetric) [21:18:39] right [21:19:21] (PS3) Milimetric: [not ready for review] Productionizing Wikimetrics [operations/puppet] - https://gerrit.wikimedia.org/r/96042 [21:20:10] (CR) jenkins-bot: [V: -1] [not ready for review] Productionizing Wikimetrics [operations/puppet] - https://gerrit.wikimedia.org/r/96042 (owner: Milimetric) [21:20:51] mhoover: I'd recommend that you look at how virt0 is puppetized currently but start more-or-less from scratch (rather than copying) when puppetizing virt1000 because no one likes the status quo. [21:21:03] apergos: I promise root won't be filled ;) [21:21:33] yeah, sounds like a plan. i'd rather run through it step by step [21:21:35] worksforme [21:21:39] wanna see everything that's going on [21:21:59] (good thing that root has 5% space reserved ;) [21:22:20] gzips gwicke [21:23:07] :-D [21:24:10] mhoover: you can use that self-hosted-puppetmaster stuff on a per-instance basis or you can set up a little cluster with one puppetmaster and several clients. [21:24:44] ok [21:24:59] and the default base setup should have my key on there? [21:25:04] or is that another process [21:25:36] If your key is in wikitech then it should just show up on new instances. 
[21:36:39] (PS2) Dzahn: more broken planet feeds [operations/puppet] - https://gerrit.wikimedia.org/r/96935 [21:40:28] (PS3) Dzahn: more broken planet feeds [operations/puppet] - https://gerrit.wikimedia.org/r/96935 [21:43:03] (CR) Dzahn: [C: 2] more broken planet feeds [operations/puppet] - https://gerrit.wikimedia.org/r/96935 (owner: Dzahn) [21:48:00] (PS1) Ottomata: Initial deb packaging [operations/debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/97848 [21:51:42] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [21:52:40] hey [21:53:02] Ryan_Lane , paravoid is the git-deploy you use this one ? https://github.com/git-deploy/git-deploy [21:53:16] or this one ? https://github.com/mislav/git-deploy [21:54:11] andrewbogott: ^^ ? [21:54:22] mutante: ^^ ? [21:54:26] no idea, sorry [21:54:38] https://wikitech.wikimedia.org/wiki/Git-deploy [21:54:43] ah, thanks [21:54:50] hmm not helpful [21:54:54] just use instructions [21:55:00] oo https://wikitech.wikimedia.org/wiki/Trebuchet/Design [21:57:42] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [22:00:22] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:42] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [22:02:18] average: well.... [22:02:34] Ryan_Lane: yes ? [22:02:39] * apergos raises an eye at labstore4 [22:02:52] is someone looking at that? (hint, it's midnight, I probably won't) [22:02:57] average: we use two things [22:03:07] average: https://github.com/sjn/git-deploy [22:03:09] as a frontend [22:03:22] and https://wikitech.wikimedia.org/wiki/Trebuchet as a backend [22:03:32] I'll be rewriting the frontend eventually [22:04:02] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [22:04:09] apergos: looks like Coren is handling it [22:04:15] ah cool [22:04:20] I'm signing off so... 
[22:04:28] have a good rest of the day folks [22:04:41] * Ryan_Lane waves [22:06:42] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [22:09:12] average: git-deploy is a thin wrapper around what actually does the deployment, which is trebuchet (a two-phase salt stack based system) [22:20:22] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [22:21:42] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms [22:58:13] !log deploying gerrit 92925 & 91124 (apache-config), makes /entity/ URLs on wikidata 303 and removes non-existent noncom wiki [22:58:28] Logged the message, Master [22:59:04] makes that nomcom [23:00:04] (CR) Ori.livneh: "DNS change: I85dffdc8a" [operations/puppet] - https://gerrit.wikimedia.org/r/97693 (owner: Ori.livneh) [23:00:37] Ryan_Lane: got a minute to review https://gerrit.wikimedia.org/r/#/c/97693/ ? [23:01:03] I'm moving gdash to misc-varnish at p-void's suggestion [23:01:53] looks fine to me, though I haven't really dealt with this much [23:02:21] should I wait for pv then? [23:05:22] (CR) Dzahn: "15:02 < mutante> !log deploying gerrit 92925 & 91124 (apache-config), makes /entity/ URLs on wikidata 303 and removes non-existent noncom " [operations/apache-config] - https://gerrit.wikimedia.org/r/92925 (owner: Dzahn) [23:10:32] (PS1) Yurik: Zero: Changed 470-01 to whitelist all languages [operations/puppet] - https://gerrit.wikimedia.org/r/97860 [23:10:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [23:17:39] heyhey, I'm from the Flow team. We're drafting an email about our planned deployment on the 4th – basically to try and make sure that we've thought of all the things that might go wrong and have backout/mitigation plans for them. [23:17:40] But I guess we're not quite sure what the most likely things for you guys to deal with are. 
[23:17:41] We've started here: http://etherpad.wikimedia.org/p/Flow-ops-l_mail [23:17:42] For example, we've thought of the possibility of a memcached overload, but since we have a 1.5T cluster it seems very unlikely. [23:17:43] We've looked at all of our hooks, and none of them seem to run DB queries, or, indeed, do much at all, unless the user is on a flow-enabled page. [23:23:32] how many complaints on a village pump does it take to bring the cluster down? :) [23:41:53] ksnider: hi, wanted to ask about how we could go about getting an independent memcache instance to attach Flow to. We imagine these should be able to run on the same machines as existing memcache with a ridiculously low max memory (64M?). This is primarily to isolate any potential problems Flow could cause since it caches all data models in memcache [23:49:26] Hello [23:50:03] anyone? [23:50:16] ebernhardson: annoyingly, memcached is a puppet class, so you'll run into a multiple definition conflict if you try to provision a second instance on an existing memcached node [23:50:24] a class rather than a resource, I mean. [23:50:46] ori-l: can you run maintenance scripts? [23:50:50] so that rules out using one of the mc* hosts, unless you want to do some puppet refactoring, which i don't imagine you do [23:50:56] PiRSquared: depends [23:51:03] namespaceDupes.php on the ang* wikis? [23:54:02] andre__: https://bugzilla.wikimedia.org/show_bug.cgi?id=56634#c15 [23:54:34] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 15h 2m 58s [23:54:40] Can you please fix it? [23:56:10] ori-l: heh, probably not :) Any suggestions? At this point we really only need a very small piece of memory i don't suppose it matters that much where it is [23:56:27] <^demon|away> If it's such a small piece of memory why not just keep it in the main cache? 
[23:56:32] I actually don't think that memcached usage will be an issue unless we have well over a million keys [23:56:42] <^demon|away> We have tools to flush bad entries if they became any sort of problem (which I can't possibly foresee) [23:56:50] ^demon|away: well, when i described to Ryan_Lane how our memcache worked he seemed worried [23:56:58] ^demon: so we're trying to find places that would make Flow cause trouble [23:57:02] and mitigating said trouble [23:57:21] PiRSquared: https://dpaste.de/V9Ma/raw [23:57:33] ^demon|away: so it's been on our notes to 'resolve the worry' but not entirely sure how. Personally i would be very surprised if in total it used >50M of memory across the cluster [23:57:40] ori-l: thanks [23:58:01] ^demon|away: to summarize, every data model and every query within flow is cached in memcache [23:58:02] PiRSquared: note i didn't fix [23:58:24] (one exception, detecting red links still hits the page table) [23:58:31] ebernhardson: is the 50M estimate for the test or for full deployment across WMF? [23:58:41] ori-l: is it too intensive? [23:58:43] <^demon|away> ebernhardson: I got that much from reading docs :) [23:58:44] ori-l: test, for a full deployment on millions of pages it would certainly be much higher [23:58:58] ebernhardson: it's fine to just use the standard memcaches, then [23:59:09] Is there any way we can fix it ourselves? A bit confused... [23:59:53] hi rschen7754 [23:59:58] hi
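Editor's note on the memcached sizing exchange above (23:17 to 23:58): the figures quoted in channel (a "1.5T" cluster, a proposed 64M instance, a guess of under 50M for the test, "well over a million keys") can be sanity-checked with simple arithmetic. The following is only a minimal sketch in Python. The average item size and per-item overhead are assumptions that do not come from the log, "1.5T" is read as 1.5 TiB, and the function names are hypothetical.

```python
# Back-of-envelope memcached footprint estimate.
# ASSUMPTIONS (not from the log): the average item size, the per-item
# overhead, and reading "1.5T" as 1.5 TiB. Function names are illustrative.

MIB = 1024 ** 2
TIB = 1024 ** 4


def estimated_bytes(num_keys, avg_item_bytes, per_item_overhead=64):
    """Rough memory use: each item costs its payload plus a fixed overhead."""
    return num_keys * (avg_item_bytes + per_item_overhead)


def keys_that_fit(budget_bytes, avg_item_bytes, per_item_overhead=64):
    """How many whole items of the assumed size fit in a memory budget."""
    return budget_bytes // (avg_item_bytes + per_item_overhead)


def fraction_of_cluster(used_bytes, cluster_bytes):
    """Share of total cluster memory consumed by used_bytes."""
    return used_bytes / cluster_bytes


if __name__ == "__main__":
    cluster = int(1.5 * TIB)  # "1.5T cluster" quoted in channel
    # Hypothetical workload: one million keys averaging 1 KiB each.
    used = estimated_bytes(1_000_000, 1024)
    print(f"{used / MIB:.0f} MiB used, "
          f"{fraction_of_cluster(used, cluster):.4%} of the cluster")
    # Hypothetical capacity of the proposed 64M isolated instance.
    print(keys_that_fit(64 * MIB, 1024), "items fit in 64 MiB")
```

Under these assumed sizes, a million keys would occupy well under a tenth of a percent of the shared cluster, which is in line with the conclusion reached in channel that the standard memcached hosts are fine for the test. Real item sizes for Flow would need to be measured before relying on any such estimate.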