[00:00:05] RoanKattouw, ^d, marktraceur, MaxSem, spagewmf: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141104T0000).
[00:00:10] I'll do it today
[00:00:35] I have to test a patch, but it's low-impact technically
[00:00:48] (CR) Catrope: [C: 2] Enable Flow on officewiki on test page Talk:Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/170383 (owner: Spage)
[00:00:57] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=72%):
[00:01:02] (CR) Catrope: [C: 2] Enable Flow on mw.org usability research test page [mediawiki-config] - https://gerrit.wikimedia.org/r/170250 (owner: Spage)
[00:01:07] (Merged) jenkins-bot: Enable Flow on officewiki on test page Talk:Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/170383 (owner: Spage)
[00:01:12] (Merged) jenkins-bot: Enable Flow on mw.org usability research test page [mediawiki-config] - https://gerrit.wikimedia.org/r/170250 (owner: Spage)
[00:02:07] (CR) Dzahn: "identical to Change-Id: I77cd9296d301b984f9" [puppet] - https://gerrit.wikimedia.org/r/170847 (owner: Dzahn)
[00:05:07] !log catrope Synchronized wmf-config/InitialiseSettings.php: Flow on officewiki and mw.org research page (duration: 00m 04s)
[00:06:16] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[00:06:45] spagewmf: ----^^
[00:07:15] kaldari, marktraceur: You guys around?
[00:07:19] (Abandoned) Reedy: Enable Flow on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/145174 (owner: Reedy)
[00:07:20] Yup
[00:07:47] Tending some chicken, eating some more different chicken, but here
[00:09:24] !log catrope Synchronized php-1.25wmf5/extensions/VisualEditor: SWAT (duration: 00m 04s)
[00:09:28] Logged the message, Master
[00:09:38] SWAT all the things!
[00:10:01] I have to wait for the RL cache, I guess.
[00:10:46] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[00:11:17] marktraceur: ?debug=true is your friend.
[00:11:31] Your mom is my friend
[00:11:40] marktraceur: Also your thing isn't deployed yet, be patient
[00:11:44] lol.
[00:11:45] Oh.
[00:11:47] lol.
[00:11:48] marktraceur: !safeplace
[00:12:08] Does fake marktraceur have James_Fs mum on facebook?
[00:12:11] RoanKattouw, ebernhardson is taking spage's place today
[00:12:13] You're right of course, it isn't. Not at all. Nobody is safe. *flashlight on face*
[00:12:26] !log catrope Synchronized php-1.25wmf6/extensions/VisualEditor: SWAT (duration: 00m 04s)
[00:12:31] !log catrope Synchronized php-1.25wmf6/extensions/MobileFrontend: SWAT (duration: 00m 04s)
[00:12:32] Logged the message, Master
[00:12:35] !log catrope Synchronized php-1.25wmf6/extensions/MultimediaViewer: SWAT (duration: 00m 03s)
[00:12:39] Logged the message, Master
[00:12:41] (PS1) BBlack: Switch misc SSL to new sni.wm.o cert [puppet] - https://gerrit.wikimedia.org/r/170864
[00:12:41] ebernhardson: OK then you should know that Flow as just deployed to officewiki is broken according to James_F
[00:12:45] Logged the message, Master
[00:13:09] RoanKattouw: :S has resource loader had a chance to pick up new assets yet?
[00:13:17] kaldari, marktraceur: Your deploys are done
[00:13:17] actually dont worry, i'll look into it :)
[00:13:30] ebernhardson: That doesn't help with errors about contacting the Parsoid server, I'd imagine :S
[00:14:00] Reedy: My mother is not on Facebook, so… no? :-)
[00:14:04] hmm, interesting ok
[00:14:15] heh
[00:14:45] Either debug=true didn't help or I'm doing something wrong.
[00:15:12] only wmf6, so did you try group0 wiki?
[00:15:34] Ah, that would be why probably
[00:15:39] Silly marktraceur not being prepared
[00:16:45] OK, looks good
[00:16:58] So say we all.
[00:22:06] (CR) BBlack: [C: 2] Switch misc SSL to new sni.wm.o cert [puppet] - https://gerrit.wikimedia.org/r/170864 (owner: BBlack)
[00:51:03] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 304 seconds
[00:52:08] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 364 seconds
[00:53:06] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[00:53:37] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[01:36:36] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[01:55:46] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[02:20:08] (PS1) BBlack: Enable (non-EC)DHE key exchange in compat cipher list [puppet] - https://gerrit.wikimedia.org/r/170879
[02:22:11] (CR) BBlack: [C: -1] "This needs some careful review. There could be reasons we didn't want (non-EC)DHE key exchange that I'm not aware of. Possibly those rea" [puppet] - https://gerrit.wikimedia.org/r/170879 (owner: BBlack)
[02:24:57] (CR) BBlack: "Note: pinkunicorn is testing this ciphersuite setting now manually with puppet disabled if you want to look at it (in addition to the new " [puppet] - https://gerrit.wikimedia.org/r/170879 (owner: BBlack)
[02:29:27] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail
[02:50:17] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[03:32:13] !log restart db2017
[03:32:25] Logged the message, Master
[03:45:59] (PS2) Springle: mysql - lint fixes [puppet] - https://gerrit.wikimedia.org/r/170486 (owner: Dzahn)
[03:47:13] (CR) Springle: [C: 1] mysql - lint fixes [puppet] - https://gerrit.wikimedia.org/r/170486 (owner: Dzahn)
[03:50:16] (CR) Springle: [C: 1] "Path merge conflict." [puppet/mariadb] - https://gerrit.wikimedia.org/r/170491 (owner: Dzahn)
[03:51:19] (CR) Springle: [C: 1] mysql_wmf - autoload layout and lint fixes [puppet] - https://gerrit.wikimedia.org/r/170479 (owner: Dzahn)
[03:53:55] (PS2) Springle: WIP dbproxy monitoring [puppet] - https://gerrit.wikimedia.org/r/170663
[04:14:07] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail
[04:14:47] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:36] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0
[04:19:06] PROBLEM - Disk space on search1019 is CRITICAL: DISK CRITICAL - free space: /a 8013 MB (3% inode=99%):
[04:33:06] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:52:47] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[05:31:09] (PS3) Springle: WIP dbproxy monitoring [puppet] - https://gerrit.wikimedia.org/r/170663
[05:36:57] (PS4) Springle: WIP dbproxy monitoring [puppet] - https://gerrit.wikimedia.org/r/170663
[06:08:10] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail
[06:09:43] (PS9) Springle: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: Reedy)
[06:13:04] (CR) Springle: [C: 2] Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: Reedy)
[06:21:47] What's up with the OCG disk space alerts? I've seen that acknowledged repeatedly and I could have sworn there were some patches too.
[06:26:30] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[06:28:39] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:28:49] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:19] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:50] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:01] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:49] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:31] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[06:31:52] !log force logrotate ocg1001
[06:32:00] Logged the message, Master
[06:44:20] springle: I pinged you for a review of https://github.com/gwicke/restbase-cassandra/pull/15, in case you'd like to check the branch before a merge to master
[06:45:13] gwicke: ok, thanks
[06:45:29] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:08] springle: it's a major clean-up, so might be more fun to review the entire thing after that's merged
[06:46:09] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:47:20] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:35] <_joe_> gwicke: what's the reason why you chose cassandra?
[06:48:15] _joe_: it's a fairly stable and widely used option that has good support for replication & compression
[06:48:25] <_joe_> lemme rephrase that: is restbase going to have a R/W ratio that is going to be skewed towards writes?
[06:48:50] no, I expect restbase to be very read-heavy
[06:48:55] <_joe_> because cassandra is good at writing a lot, and sucks at reading a lot. At least, that was the situation last time I tried
[06:49:24] <_joe_> gwicke: have you tried/evaluated couchbase?
[06:49:50] I haven't evaluated it, but have been looking at it as an interesting alternative
[06:50:19] the goal of restbase is very much to provide an interface abstraction so that we can swap out the backends without clients noticing
[06:50:49] <_joe_> I got that
[06:50:53] another interesting solution that springle brought up is http://hyperdex.org/
[06:52:12] there are basically free software solutions in that space, including Riak as well; additionally, there are the cloud services like DynamoDB, Google DataStore & Azure
[06:52:56] the interface currently defines features that are implementable using most of those backends
[06:53:03] <_joe_> I have no experience with hyperdex, I used riak and I wouldn't advise its use in a read-heavy context
[06:53:34] which issues did you see with Riak or Cassandra reads?
[06:53:46] <_joe_> anyways, ops need to operate and test backends extensively - all those nosql things tend to have horrible failure modes
[06:54:01] (PS1) Faidon Liambotis: varnish: allow POST for EventLogging on bits [puppet] - https://gerrit.wikimedia.org/r/170883
[06:54:20] _joe_: I did some stress testing with cassandra, see https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage/Cassandra_testing
[06:54:21] <_joe_> gwicke: in the case of cassandra, both latencies and throughput (but it was 2012, so maybe that's better now)
[06:54:26] (PS2) Faidon Liambotis: varnish: allow POST for EventLogging on bits [puppet] - https://gerrit.wikimedia.org/r/170883
[06:54:52] <_joe_> as for Riak, I used it and had a few horrible crashes under high read load
[06:54:52] that was before restbase was ready; I intend to do a bunch more once it's deployed to the test boxes
[06:55:33] the latencies I've been seeing with Cassandra looked reasonable (<1ms for small queries)
[06:55:57] <_joe_> gwicke: yeah my second point was "operational stability": you need someone in ops to be part of those tests and to understand how to operate those things under a failure scenario
[06:56:01] and <10ms seems to be common in Cassandra production clusters
[06:56:20] _joe_: yeah, agreed
[06:56:36] he's tried :)
[06:56:40] <_joe_> the one thing I like about swift is that failure of one node doesn't make the whole cluster sore. Try that with riak, and start crying
[06:56:42] involving us
[06:57:12] paravoid: I keep trying ;)
[06:57:23] <_joe_> it's obviously our fault as always :P
[06:57:30] gwicke: any further thoughts yet on starting with multiple backends for comparison and as fallback options? even an RDBMS backend as canonical implementation?
[06:58:01] springle: I was thinking about adding a simple sqlite backend
[06:58:02] <_joe_> springle: shhh you would find out mysql is better than most of those things under real-world load
[06:58:19] useful for testing and very small installs
[06:58:30] could add mysql too
[06:58:34] (CR) Ori.livneh: "I would probably move the if (req.url ~ "^/event\.gif") to the top of the subroutine, instead." [puppet] - https://gerrit.wikimedia.org/r/170883 (owner: Faidon Liambotis)
[06:58:43] _joe_: well, maybe, but HA is an issue too
[06:59:25] and complexity of manual sharding vs. having that handled by the DB
[06:59:39] also other nice bits like compression
[06:59:41] gwicke: sqlite or mysql should be simple right, as you have few wheels to reinvent. that might make it easier for restbase to gain traction without backend holding it back
[06:59:55] <_joe_> gwicke: well, most of the complexity they keep away from you is usually "compute a ketama hash"
[07:00:33] _joe_: yeah, and rebalancing / replicating dynamically when a new node joins etc
[07:01:04] which is not that easy
[07:01:24] <_joe_> gwicke: have you operated a production nosql cluster during a rebalance?
[07:01:27] springle: yup
[07:01:50] _joe_: did you try cassandra before vnodes were introduced?
[07:01:58] <_joe_> if you did, you wouldn't be so enthusiastic about rebalancing automatically
[07:02:01] <_joe_> gwicke: no
[07:02:02] that makes a big difference for rebalancing
[07:02:50] <_joe_> gwicke: I never got to the point where cassandra served a lot of prod traffic as I was able to tear it down to pieces by just powering down a VM while doing some heavy read traffic
[07:02:51] hmm, if a mysql backend isn't too much work
[07:03:00] it certainly adds load, but with vnodes the load of bootstrapping a node is well spread across the other nodes
[07:03:14] I think it might allow us to at least deploy restbase in smaller steps
[07:03:28] <_joe_> gwicke: what I saw, and couchbase was similar in this, is that latencies got 10x
[07:03:34] paravoid: we'd soon run into perf issues for the ExternalStore use case
[07:03:35] and evolve the architecture incrementally as we gain experience
[07:03:39] <_joe_> the 99th percentile at least
[07:04:07] it's fine for small datasets, but not so great to store all HTML for each revision ever
[07:04:11] <_joe_> and under heavy read load it took hours to rebalance
[07:04:35] (PS3) Faidon Liambotis: varnish: allow POST for EventLogging on bits [puppet] - https://gerrit.wikimedia.org/r/170883
[07:04:58] I'll leave that to you/springle :)
[07:05:28] _joe_: I guess the heavy read load also has something to do with it
[07:05:44] <_joe_> gwicke: my advice to you/springle is do some semi-real-world load testing, and make restbase record the 99th percentile of the latency to the datastore
[07:05:54] but yeah, if the load is already borderline then adding more nodes will make it even slower
[07:05:59] (CR) Ori.livneh: [C: 1] varnish: allow POST for EventLogging on bits [puppet] - https://gerrit.wikimedia.org/r/170883 (owner: Faidon Liambotis)
[07:06:08] (CR) Faidon Liambotis: "Sounds reasonable, done. Hopefully we'll get rid of HHVM & geoiplookup by the end of the year and we'll simplify all this :)" [puppet] - https://gerrit.wikimedia.org/r/170883 (owner: Faidon Liambotis)
[07:06:23] HHVM conditionals, that is :)
[07:06:46] _joe_: restbase has statsd logging built in already, so we'll get that data soon
[07:06:58] <_joe_> paravoid: oh I thought your rewrite of mediawiki in spring framework was done :P
[07:07:36] <_joe_> or was it in haskell?
[07:07:54] groovy.
[07:08:11] <_joe_> no, I bet you are moving to Swift
[07:08:18] <_joe_> as I know you love apple
[07:08:33] ;)
[07:08:55] http://svn.codehaus.org/groovy/trunk/groovy/groovy-core/src/main/org/codehaus/groovy/runtime/ArrayUtil.java
[07:09:27] <_joe_> ori: ROTFL
[07:10:06] the mind boggles
[07:10:13] gwicke: another one i'd like to see in action is http://www.rethinkdb.com/ . no clue about real world performance
[07:10:24] springle: +1
[07:10:31] nice querying API
[07:11:05] * _joe_ throws http://docs.spring.io/spring/docs/2.5.x/api/org/springframework/aop/framework/AbstractSingletonProxyFactoryBean.html at ori
[07:11:07] springle: IIRC their replication stuff is pretty unfinished
[07:11:24] their office is not far from ours
[07:11:34] <_joe_> springle: last time I looked seemed to be quite unfinished
[07:11:47] <_joe_> great ideas, implementation was still lagging behind
[07:11:48] could visit them some time
[07:12:05] us trialing it might light a fire under their collective ass
[07:12:14] gwicke: could be fun
[07:12:25] (also, datastax is two blocks the other direction)
[07:13:36] ori: will look out for one of their meetups
[07:13:59] <_joe_> or, we could ask google if they can let us use spanner
[07:14:27] <_joe_> and this is one of the rare moments where I envy you for being in SF
[07:15:34] <_joe_> springle: rethinkdb also allows you to do a lot of querying logic, I'm not sure how much of that is in the scope of restbase
[07:15:46] yeah, spanner would be great to have
[07:16:39] _joe_: you mean filtering?
[07:16:55] <_joe_> gwicke: subquerying, joins, etc
[07:16:59] <_joe_> do we need that?
[07:17:07] _joe_: much is beyond restbase storage scope, true, but see the wheels gwicke is having to reinvent for cassandra. having the functionality available would be just nice to have
[07:17:24] * _joe_ nods
[07:17:40] _joe_: a lot of that is hard to implement efficiently in a distributed setting
[07:17:41] <_joe_> yeah whenever you have to run around the limitations of your data store, it sucks
[07:17:43] that's like saying: don't use an RDBMS unless you want all of SQL 2003 ;)
[07:18:12] <_joe_> gwicke: about spanner - it's probably absolutely awesome; what they claim about it is borderline implausible (XA global transactions with latencies within 10ms)
[07:18:42] <_joe_> springle: no my point was: doing joins in a distributed env is very very hard to get right
[07:18:50] it's plausible with stable paxos members
[07:18:58] and async replication
[07:19:11] _joe_: as mysql ndbcluster proved
[07:19:24] <_joe_> so usually products with more sophisticated querying mechanisms are more fragile
[07:19:51] <_joe_> it's like saying you use a formula one to go around rome; you may do, but you'd be better off with a smart
[07:20:04] <_joe_> but wait
[07:20:23] <_joe_> we're a bunch of fools. We could use MongoDB
[07:20:29] ;)
[07:20:33] hehe
[07:21:04] <_joe_> well mongo got a bad rep, but most other products are not so much better
[07:22:00] their replication & scaling story is not very compelling
[07:22:34] <_joe_> btw, I think most issues have been fixed since, but http://aphyr.com/posts/294-call-me-maybe-cassandra/
[07:22:52] yeah, I know that article
[07:23:04] <_joe_> (all of aphyr call-me-maybe series are quite interesting)
[07:23:13] it's good that he found the bugs in the paxos implementation before we ran into them
[07:23:21] <_joe_> gwicke: not that I care a lot about that
[07:23:45] <_joe_> I mean if we lose some data during a tragic network partition, that's not a tragedy
[07:24:29] <_joe_> but what I love about his articles is that he gives you an assessment of the storage solidity
[07:24:30] the timestamp issue was actually also a problem in the node-uuid package
[07:24:51] I fixed that in https://github.com/gwicke/node-uuid/commit/b7be71edc9aae7fa5314ccfb068eb35084506910
[07:30:02] <_joe_> springle: hyperdex seems interesting indeed.
[07:31:43] yeah, it has some interesting architecture ideas
[07:32:13] not many people loudly proclaiming its use in production
[07:32:48] <_joe_> springle: he.
[07:34:25] could be interesting to evaluate it for secondary indexing
[07:36:17] bed time for me, see you tomorrow!
[07:38:38] (PS1) Ori.livneh: Consolidate debugging-related configurations in hhvm::debug [puppet] - https://gerrit.wikimedia.org/r/170886
[07:38:57] _joe_: tempted to write a mariadb storage engine connector for hyperdex :) it has a C api
[07:39:01] * YuviPanda|zzz waves
[07:39:23] http://techblog.netflix.com/2014/11/introducing-dynomite.html
[07:39:31] _joe_: i think it could work quite well with engine and index condition pushdown
[07:39:35] look at the date :)
[07:41:43] <_joe_> springle: eheh
[07:50:15] Dynomite seems like it would make a good cache, but there are a few @todo tasks for proper HA and durability
[09:47:51] (CR) Glaisher: [C: 1] Enable global AbuseFilter on medium sized Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/170311 (owner: Hoo man)
[10:03:46] (CR) Alexandros Kosiaris: [C: -1] puppetmaster - lint fixes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/170471 (owner: Dzahn)
[10:42:52] !log depooled wtp1021,wtp1022,wtp1023 for re-installation with trusty
[10:43:00] Logged the message, Master
[10:44:52] <_joe_> akosiaris: are you using wmf-reimage?
[10:45:14] <_joe_> helps when reimaging a few servers with the whole puppet/salt shit
[10:45:14] nope, just plain re-installation
[10:45:28] well, let's see :-)
[10:45:29] thanks
[10:48:21] <_joe_> akosiaris: is a script you can run on palladium, first removes all keys/facts from puppet and salt, then loops polling for the new keys to sign
[10:51:07] hehe, year I just read the code
[10:51:10] yeah*
[10:51:21] not bad :-)
[10:51:59] (CR) Zfilipin: "What is preventing this commit from being merged into master?" [puppet] - https://gerrit.wikimedia.org/r/166046 (owner: Hashar)
[10:52:44] useful indeed... Now if only I add an ipmitool chassis power cycle command somewhere in there ...
[10:53:11] actually... hmmm I was joking but this is not a bad idea....
[10:54:40] <_joe_> LOLWUT APACHE
[10:54:53] (CR) Hashar: "> What is preventing this commit from being merged into master?" [puppet] - https://gerrit.wikimedia.org/r/166046 (owner: Hashar)
[10:55:41] <_joe_> mod_proxy_fastcgi will _not_ speak to hhvm when it correctly drains connections and reuse its own port
[10:55:53] <_joe_> moar fun
[10:56:16] <_joe_> akosiaris: feel free to do that
[10:56:47] (CR) Yuvipanda: [C: 2] contint: ruby2.0 on Trusty slaves [puppet] - https://gerrit.wikimedia.org/r/166046 (owner: Hashar)
[11:06:08] yay, wtp1023 seems to have a broken DRAC...
[11:08:55] (CR) Hashar: [C: 1] "I am not familiar with Apache 2.4 new rules system. But that seems fine." [puppet] - https://gerrit.wikimedia.org/r/170792 (owner: Krinkle)
[11:16:59] PROBLEM - Host wtp1021 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[11:20:13] PROBLEM - Host wtp1022 is DOWN: PING CRITICAL - Packet loss = 100%
[11:37:35] PROBLEM - Host db1017 is DOWN: PING CRITICAL - Packet loss = 100%
[11:39:27] that's me (db1017) silencing
[11:39:38] ah :)
[11:39:53] hey sean, yep that's your loaner :)
[11:40:24] godog: yeah, took me a minute to remember the box number :) np
[11:41:53] godog: +1 to renaming it sometime, btw ;)
[11:42:43] springle: yeah that's too misleading :(
[11:48:24] RECOVERY - Host db1017 is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[11:59:46] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 675 bytes in 0.069 second response time
[12:04:12] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.085 second response time
[12:26:10] (PS1) Aude: Bump cache epoch on wikidata, due to more UI changes [mediawiki-config] - https://gerrit.wikimedia.org/r/170908
[12:29:53] PROBLEM - Host wtp1021 is DOWN: PING CRITICAL - Packet loss = 100%
[12:30:22] RECOVERY - Host wtp1021 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms
[12:35:32] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 4 failures
[12:41:07] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[12:41:29] !log repooled wtp1021,wtp1022,wtp1023
[12:41:36] Logged the message, Master
[12:53:02] akosiaris: what drac version is on wtp1023 ?
[12:54:02] it's a poweredge r420 so iDRAC 7
[12:54:14] well known to have problems ...
[12:54:31] matanya: see this http://www.softpanorama.org/Hardware/Dell/Servers/DRAC/can_not_connect_to_idrac7.shtml
[12:55:44] akosiaris: i had that very problem a few days ago and upgrading the drac firmware resolved it
[12:56:21] matanya: I 'd be happy if that works for us too. Wanna update the ticket ? or should I ?
[12:56:38] i'm just suggesting :)
[12:56:49] and it is a good suggestion anyway
[12:57:29] if you are willing to invest some time after re-imaging the parsoid boxes, it might be good to test that
[12:58:02] my will is not the issue here. I am unable to access the drac
[12:58:36] hmmm perhaps a powercycle could fix it...
[12:58:56] I kind of doubt it though... BMCs are not really connected to the rest of the box in that way
[12:58:57] akosiaris: you don't have to access the drac in order to upgrade firmware
[12:59:33] exactly. And the drac is not responding when I try to access it via the box's OS
[13:00:04] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141104T1300). Please do the needful.
[13:03:38] !log reedy Purged l10n cache for 1.25wmf4
[13:03:45] Logged the message, Master
[13:04:26] akosiaris: just running the latest update (DELL iDRAC 1.57.57.bin) won't work ?
[13:04:46] also marked as urgent :)
[13:04:51] http://www.dell.com/support/home/us/en/19/product-support/product/poweredge-r420/drivers
[13:05:15] (PS1) Reedy: Non wikipedias to 1.25wmf6 [mediawiki-config] - https://gerrit.wikimedia.org/r/170912
[13:07:34] matanya: it might. Need to test it after (and if) we get drac responsive again
[13:08:43] Not sure why would you need to get access to drac to test a fix to a drac problem, but whatever, i said enough
[13:10:56] matanya: niah, I am intrigued. please continue
[13:11:13] I am missing something perhaps, which is why I am asking
[13:11:42] Reedy: https://gerrit.wikimedia.org/r/#/c/170908/
[13:12:03] the changes are more minor this time but we think this is still needed
[13:12:14] akosiaris: wget the .bin to wtp1023, run ./nameofbin.bin, type q to skip the relnotes, let the update run, and reboot if asked
[13:12:53] after the server is back if reboot was needed, the drac is accessible again
[13:13:07] at least this was the solution in my case
[13:13:42] aude: Looks like the config below it can go too at some point
[13:14:13] (CR) Reedy: [C: 2] Non wikipedias to 1.25wmf6 [mediawiki-config] - https://gerrit.wikimedia.org/r/170912 (owner: Reedy)
[13:14:21] (Merged) jenkins-bot: Non wikipedias to 1.25wmf6 [mediawiki-config] - https://gerrit.wikimedia.org/r/170912 (owner: Reedy)
[13:14:49] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.25wmf6
[13:14:55] Logged the message, Master
[13:15:07] ah, yes
[13:15:32] (PS2) Reedy: Bump cache epoch on wikidata, due to more UI changes [mediawiki-config] - https://gerrit.wikimedia.org/r/170908 (owner: Aude)
[13:15:40] (CR) Reedy: [C: 2] Bump cache epoch on wikidata, due to more UI changes [mediawiki-config] - https://gerrit.wikimedia.org/r/170908 (owner: Aude)
[13:15:48] (Merged) jenkins-bot: Bump cache epoch on wikidata, due to more UI changes [mediawiki-config] - https://gerrit.wikimedia.org/r/170908 (owner: Aude)
[13:16:00] matanya: in the installation instructions it clearly says: " 2. Open the iDRAC7 Web interface using a supported Web browser.
[13:16:00] 3. Log in as administrator.
[13:16:00] 4. Go to Overview -> iDRAC Settings -> Update and Rollback -> Update. The
[13:16:00] Firmware Update page is displayed. "
[13:16:28] Is the supported browser IE6? :P
[13:16:36] Reedy: :P
[13:16:52] akosiaris: yeah, that doesn't work, my way or the high way :)
[13:17:24] Reedy: Google Chrome version 28 :P
[13:17:27] I am impressed
[13:17:36] Version 28!?
[13:17:40] who uses that anymore!
[13:17:44] dell
[13:17:58] akosiaris: sadly, i had to find that out through countless talks with support
[13:18:09] (PS2) Reedy: Removed $wgMaxArticleSize [mediawiki-config] - https://gerrit.wikimedia.org/r/170527 (owner: Dereckson)
[13:18:12] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[13:18:13] so take it as a gift to a valued ops member
[13:18:21] (CR) Reedy: [C: 2] "Looks like it's been 2048 for a long time" [mediawiki-config] - https://gerrit.wikimedia.org/r/170527 (owner: Dereckson)
[13:18:28] (Merged) jenkins-bot: Removed $wgMaxArticleSize [mediawiki-config] - https://gerrit.wikimedia.org/r/170527 (owner: Dereckson)
[13:18:37] matanya: wikitech! :)
[13:18:41] (PS1) Aude: Remove obsolete wikidata cache key config [mediawiki-config] - https://gerrit.wikimedia.org/r/170914
[13:18:57] (PS2) Aude: Remove obsolete wikidata cache key config [mediawiki-config] - https://gerrit.wikimedia.org/r/170914
[13:19:00] Reedy: will do, need to bring up 75 vm's first
[13:19:20] (CR) Reedy: [C: 2] Remove obsolete wikidata cache key config [mediawiki-config] - https://gerrit.wikimedia.org/r/170914 (owner: Aude)
[13:19:21] matanya: thanks
[13:19:28] (Merged) jenkins-bot: Remove obsolete wikidata cache key config [mediawiki-config] - https://gerrit.wikimedia.org/r/170914 (owner: Aude)
[13:19:55] (PS3) Reedy: Improving comments for wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/170526 (owner: Dereckson)
[13:20:00] (CR) Reedy: [C: 2] Improving comments for wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/170526 (owner: Dereckson)
[13:20:07] (Merged) jenkins-bot: Improving comments for wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/170526 (owner: Dereckson)
[13:20:09] (PS2) Reedy: Adding tools.wikimedia.pl to Commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/170740 (https://bugzilla.wikimedia.org/72897) (owner: Dereckson)
[13:20:21] (CR) Reedy: [C: 2] Adding tools.wikimedia.pl to Commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/170740 (https://bugzilla.wikimedia.org/72897) (owner: Dereckson)
[13:20:22] akosiaris: if we are already at it, you can also burn the bin on a drive and boot off it and it works, but that is really top secret! :D
[13:20:28] (Merged) jenkins-bot: Adding tools.wikimedia.pl to Commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/170740 (https://bugzilla.wikimedia.org/72897) (owner: Dereckson)
[13:20:55] (PS2) Reedy: Set timezone on cs.wikipedia and cs.wikinews to Europe/Prague [mediawiki-config] - https://gerrit.wikimedia.org/r/170517 (https://bugzilla.wikimedia.org/71902) (owner: Dereckson)
[13:21:02] (CR) Reedy: [C: 2] Set timezone on cs.wikipedia and cs.wikinews to Europe/Prague [mediawiki-config] - https://gerrit.wikimedia.org/r/170517 (https://bugzilla.wikimedia.org/71902) (owner: Dereckson)
[13:21:10] (Merged) jenkins-bot: Set timezone on cs.wikipedia and cs.wikinews to Europe/Prague [mediawiki-config] - https://gerrit.wikimedia.org/r/170517 (https://bugzilla.wikimedia.org/71902) (owner: Dereckson)
[13:21:24] (PS2) Reedy: Typo in labs wgContentHandlerUseDB config [mediawiki-config] - https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) (owner: Spage)
[13:21:29] (CR) Reedy: [C: 2] Typo in labs wgContentHandlerUseDB config [mediawiki-config] - https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) (owner: Spage)
[13:21:36] (Merged) jenkins-bot: Typo in labs wgContentHandlerUseDB config [mediawiki-config] - https://gerrit.wikimedia.org/r/170124 (https://bugzilla.wikimedia.org/49193) (owner: Spage)
[13:22:15] (PS2) Reedy: Enable JPG thumbnail chaining on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/170747 (https://bugzilla.wikimedia.org/67525) (owner: Gilles)
[13:22:22] (CR) Reedy: [C: 2] Enable JPG thumbnail chaining on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/170747 (https://bugzilla.wikimedia.org/67525) (owner: Gilles)
[13:22:29] (Merged) jenkins-bot: Enable JPG thumbnail chaining on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/170747 (https://bugzilla.wikimedia.org/67525) (owner: Gilles)
[13:22:45] (PS2) Reedy: eswikivoyage: Give sysops "abusefilter-modify-restricted" [mediawiki-config] - https://gerrit.wikimedia.org/r/170487 (https://bugzilla.wikimedia.org/62321) (owner: Hoo man)
[13:22:49] (CR) Reedy: [C: 2] eswikivoyage: Give sysops "abusefilter-modify-restricted" [mediawiki-config] - https://gerrit.wikimedia.org/r/170487 (https://bugzilla.wikimedia.org/62321) (owner: Hoo man)
[13:22:57] (Merged) jenkins-bot: eswikivoyage: Give sysops "abusefilter-modify-restricted" [mediawiki-config] - https://gerrit.wikimedia.org/r/170487 (https://bugzilla.wikimedia.org/62321) (owner: Hoo man)
[13:23:29] (PS2) Reedy: Enable global AbuseFilter on medium sized Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/170311 (owner: Hoo man)
[13:23:34] (CR) Reedy: [C: 2] Enable global AbuseFilter on medium sized Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/170311 (owner: Hoo man)
[13:23:41] (Merged) jenkins-bot: Enable global AbuseFilter on medium sized Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/170311 (owner: Hoo man)
[13:24:47] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 19s)
[13:24:54] Logged the message, Master
[13:29:24] (PS2) Reedy: Bump Epochs to 20130601000000 [mediawiki-config] - https://gerrit.wikimedia.org/r/170590
[13:29:29] * Reedy contemplates
[13:32:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[13:35:12] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:35:33] PROBLEM - Host wtp1024 is DOWN: PING CRITICAL - Packet loss = 100%
[13:36:12] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:36:12] RECOVERY - Host wtp1024 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[13:39:40] !log upgrading apache2 throughout the mw cluster
[13:39:46] Logged the message, Master
[13:40:48] !log depool wtp1017, wtp1018, wtp1019, wtp1020 from trusty reinstall
[13:40:52] Logged the message, Master
[13:44:16] aude: bump all the epochs? :D
[13:45:19] :D
[13:46:08] I'm tempted to jfdi
[13:46:18] What could go wrong?
[13:46:31] more server load than expected
[13:46:33] I still don't understand why the epochs aren't dynamic.
[13:46:44] but I'm bumping it from start of 2013 to mid 2013
[13:48:02] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:48:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[13:50:21] (CR) Reedy: [C: 2] Bump Epochs to 20130601000000 [mediawiki-config] - https://gerrit.wikimedia.org/r/170590 (owner: Reedy)
[13:50:27] (Merged) jenkins-bot: Bump Epochs to 20130601000000 [mediawiki-config] - https://gerrit.wikimedia.org/r/170590 (owner: Reedy)
[13:51:25] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 15s)
[13:51:32] Logged the message, Master
[13:52:35] mw1144 and mw1193 need graceful
[13:52:50] * aude goes to eliminate the closure in that wikibase settings file
[13:53:03] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[13:53:55] Reedy: ^ is that something we can do?
[13:54:02] I think so [13:54:07] I'm just trying to find it [13:54:17] ok [13:54:39] !bug 1 [13:54:49] aaah [13:54:52] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:55:03] I'm sure I've seen hoo do it for single boxes recently [13:55:36] /etc/sudoers.d/wikidev_apache [13:55:41] That sounds suspicious [13:55:44] some of those might be api boxes [13:55:48] or both [13:57:15] sudo /usr/sbin/apache2ctl -k graceful [13:57:30] ah, ok [13:57:34] I love how it gives absolutely no output [13:57:43] PROBLEM - DPKG on amssq62 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:57:53] !log graceful apache on mw1144 [13:58:01] Logged the message, Master [13:58:11] !log graceful apache on mw1193 [13:58:15] Logged the message, Master [13:59:26] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [13:59:32] Logged the message, Master [13:59:53] RECOVERY - DPKG on amssq62 is OK: All packages OK [14:00:43] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 1 failures [14:01:21] thanks :) [14:01:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:02:12] PROBLEM - DPKG on amssq61 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:04:13] RECOVERY - DPKG on amssq61 is OK: All packages OK [14:06:13] PROBLEM - DPKG on amssq60 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:06:14] !log upgrading kernels on amssq* [14:06:21] Logged the message, Master [14:06:42] <_joe_> akosiaris: mmmh you might want to check that with brandon [14:08:13] RECOVERY - DPKG on amssq60 is OK: All packages OK [14:08:32] _joe_: kernels ? on those old boxes ? 
[14:08:38] bblack: ^ [14:08:59] <_joe_> oh sorry, I was thinking of varnish in esams [14:09:07] ok [14:09:11] bblack: disregard :-) [14:09:30] yeah, I made sure I would not be upgrading varnish :-) [14:09:32] not crazy [14:10:23] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 3 failures [14:11:03] PROBLEM - DPKG on amssq59 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:12:12] RECOVERY - DPKG on amssq59 is OK: All packages OK [14:14:33] PROBLEM - DPKG on amssq58 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:16:13] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: puppet fail [14:16:42] RECOVERY - DPKG on amssq58 is OK: All packages OK [14:18:22] PROBLEM - DPKG on amssq56 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:19:22] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:19:33] RECOVERY - DPKG on amssq56 is OK: All packages OK [14:22:39] PROBLEM - DPKG on amssq55 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:23:39] RECOVERY - DPKG on amssq55 is OK: All packages OK [14:26:38] PROBLEM - DPKG on amssq54 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:28:19] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:47] RECOVERY - DPKG on amssq54 is OK: All packages OK [14:30:39] PROBLEM - DPKG on amssq53 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:31:14] (03PS1) 10Filippo Giunchedi: add missing sdl1 during init [software/swift-ring] - 10https://gerrit.wikimedia.org/r/170918 [14:31:16] (03PS1) 10Filippo Giunchedi: codfw: add missing sdj1 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/170919 [14:31:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add missing sdl1 during init [software/swift-ring] - 10https://gerrit.wikimedia.org/r/170918 (owner: 10Filippo Giunchedi) [14:31:41] (03CR) 10Filippo Giunchedi: [C: 
032 V: 032] codfw: add missing sdj1 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/170919 (owner: 10Filippo Giunchedi) [14:32:47] RECOVERY - DPKG on amssq53 is OK: All packages OK [14:34:59] PROBLEM - DPKG on amssq52 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:37:00] RECOVERY - DPKG on amssq52 is OK: All packages OK [14:40:08] PROBLEM - DPKG on amssq51 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:41:28] RECOVERY - DPKG on amssq51 is OK: All packages OK [14:43:39] PROBLEM - check if dhclient is running on wtp1020 is CRITICAL: Timeout while attempting connection [14:43:45] PROBLEM - check if salt-minion is running on wtp1020 is CRITICAL: Timeout while attempting connection [14:43:46] PROBLEM - DPKG on amssq50 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:44:45] PROBLEM - check configured eth on wtp1017 is CRITICAL: Connection refused by host [14:44:45] PROBLEM - DPKG on wtp1017 is CRITICAL: Connection refused by host [14:44:56] PROBLEM - check if dhclient is running on wtp1017 is CRITICAL: Connection refused by host [14:44:56] PROBLEM - Disk space on wtp1017 is CRITICAL: Connection refused by host [14:45:16] PROBLEM - check if salt-minion is running on wtp1017 is CRITICAL: Connection refused by host [14:45:17] PROBLEM - parsoid disk space on wtp1017 is CRITICAL: Connection refused by host [14:45:27] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [14:45:35] PROBLEM - puppet last run on wtp1017 is CRITICAL: Connection refused by host [14:45:36] PROBLEM - RAID on wtp1017 is CRITICAL: Connection refused by host [14:45:56] RECOVERY - DPKG on amssq50 is OK: All packages OK [14:46:35] PROBLEM - RAID on wtp1019 is CRITICAL: Connection refused by host [14:46:36] PROBLEM - check configured eth on wtp1019 is CRITICAL: Connection refused by host [14:46:37] PROBLEM - DPKG on wtp1019 is CRITICAL: Connection refused by host [14:46:46] PROBLEM - check if dhclient is running on wtp1019 is CRITICAL: Connection 
refused by host [14:46:46] PROBLEM - Disk space on wtp1019 is CRITICAL: Connection refused by host [14:47:06] PROBLEM - check if salt-minion is running on wtp1019 is CRITICAL: Connection refused by host [14:47:06] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [14:47:16] PROBLEM - parsoid disk space on wtp1019 is CRITICAL: Connection refused by host [14:47:16] PROBLEM - puppet last run on wtp1020 is CRITICAL: Connection refused by host [14:47:25] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [14:47:25] PROBLEM - RAID on wtp1020 is CRITICAL: Connection refused by host [14:47:25] PROBLEM - puppet last run on wtp1019 is CRITICAL: Connection refused by host [14:47:35] PROBLEM - DPKG on wtp1020 is CRITICAL: Connection refused by host [14:47:36] PROBLEM - Disk space on wtp1020 is CRITICAL: Connection refused by host [14:47:36] PROBLEM - check configured eth on wtp1020 is CRITICAL: Connection refused by host [14:48:06] PROBLEM - parsoid disk space on wtp1020 is CRITICAL: Connection refused by host [14:48:06] PROBLEM - DPKG on amssq49 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:50:06] RECOVERY - DPKG on amssq49 is OK: All packages OK [14:51:55] PROBLEM - DPKG on amssq48 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:54:05] RECOVERY - DPKG on amssq48 is OK: All packages OK [14:56:36] PROBLEM - DPKG on amssq47 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:57:36] RECOVERY - DPKG on amssq47 is OK: All packages OK [14:58:06] PROBLEM - NTP on wtp1017 is CRITICAL: NTP CRITICAL: No response from NTP server [15:00:06] PROBLEM - NTP on wtp1019 is CRITICAL: NTP CRITICAL: No response from NTP server [15:01:05] PROBLEM - NTP on wtp1020 is CRITICAL: NTP CRITICAL: No response from NTP server [15:02:46] (03PS1) 10Glaisher: Redirect ve.wikimedia.org to wikimedia.org.ve [puppet] - 10https://gerrit.wikimedia.org/r/170925 [15:02:53] (03CR) 10jenkins-bot: [V: 04-1] Redirect ve.wikimedia.org to wikimedia.org.ve [puppet] - 
10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [15:32:05] RECOVERY - RAID on wtp1017 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:32:15] RECOVERY - check configured eth on wtp1017 is OK: NRPE: Unable to read output [15:32:16] RECOVERY - DPKG on wtp1017 is OK: All packages OK [15:32:25] RECOVERY - check if dhclient is running on wtp1017 is OK: PROCS OK: 0 processes with command name dhclient [15:32:26] RECOVERY - Disk space on wtp1017 is OK: DISK OK [15:32:45] RECOVERY - parsoid disk space on wtp1017 is OK: DISK OK [15:32:56] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:35:23] (03PS2) 10Glaisher: Redirect ve.wikimedia.org to wikimedia.org.ve [puppet] - 10https://gerrit.wikimedia.org/r/170925 [15:35:32] morebots, did you survive the netsplit? [15:35:32] I am a logbot running on tools-exec-14. [15:35:32] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [15:35:32] To log a message, type !log <msg>. [15:40:17] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:47:26] PROBLEM - Host wtp1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:16] RECOVERY - Host wtp1017 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [15:49:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:50:29] RECOVERY - parsoid disk space on wtp1019 is OK: DISK OK [15:50:31] (03CR) 10Andrew Bogott: "I want to install this on wikitech, but I'm unclear on whether or not the new horizon apache conf will play nice with the existing mediawi" [puppet] - 10https://gerrit.wikimedia.org/r/170340 (owner: 10Andrew Bogott) [15:50:36] RECOVERY - RAID on wtp1019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:50:41] manybubbles, ^d, marktraceur: Who wants to SWAT today? Looks easy enough.
[15:50:42] ori: if/when you have a moment, could you look at https://gerrit.wikimedia.org/r/#/c/170340/ ? I want that class to coexist with the mediawiki install on wikitech, and I'm unclear how the new apache conf will interact with the mediawiki vhosts [15:50:45] RECOVERY - check configured eth on wtp1019 is OK: NRPE: Unable to read output [15:50:46] RECOVERY - DPKG on wtp1019 is OK: All packages OK [15:50:51] anomie: I can take it! [15:50:55] manybubbles: Ok! [15:50:55] RECOVERY - check if dhclient is running on wtp1019 is OK: PROCS OK: 0 processes with command name dhclient [15:51:15] RECOVERY - Disk space on wtp1019 is OK: DISK OK [15:51:26] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Puppet has 1 failures [15:51:55] gi11es: has https://gerrit.wikimedia.org/r/#/c/170816/ already been deployed? [15:52:04] it looks like its been merged to core's release branch [15:52:15] RECOVERY - parsoid disk space on wtp1020 is OK: DISK OK [15:52:25] manybubbles: no idea, Reedy would know, I guess [15:52:35] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:38] RECOVERY - RAID on wtp1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:52:45] RECOVERY - DPKG on wtp1020 is OK: All packages OK [15:52:46] RECOVERY - check if dhclient is running on wtp1020 is OK: PROCS OK: 0 processes with command name dhclient [15:52:46] RECOVERY - check configured eth on wtp1020 is OK: NRPE: Unable to read output [15:52:56] RECOVERY - Disk space on wtp1020 is OK: DISK OK [15:53:07] (03PS1) 10Alexandros Kosiaris: servermon: Execute make_updates every hour [puppet] - 10https://gerrit.wikimedia.org/r/170934 [15:53:28] same question for https://gerrit.wikimedia.org/r/#/c/170860/ [15:53:31] (03CR) 10Krinkle: "The same pattern is already used by pybal/manifests/web.pp" [puppet] - 10https://gerrit.wikimedia.org/r/170792 (owner: 10Krinkle) [15:53:55] I think Roan did those last night [15:53:57] I think 
[15:54:12] ah! I'm looking at the super early SWAT@! [15:54:22] ignore me [15:54:26] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:54:27] heh [15:54:30] fsckin timezones [15:54:54] I'm just used to looking at the first thing on the list [15:55:21] it looks like the only thing for this SWAT window is https://gerrit.wikimedia.org/r/#/c/170747/ which Reedy already deployed [15:56:56] PROBLEM - Host wtp1019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:57:46] RECOVERY - Host wtp1019 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [15:59:45] (03PS1) 10BryanDavis: logstash: Drop spammy parsoid messages [puppet] - 10https://gerrit.wikimedia.org/r/170935 [15:59:57] PROBLEM - Host wtp1020 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141104T1600). Please do the needful. [16:00:24] Good luck manybubbles [16:00:26] RECOVERY - Host wtp1020 is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [16:00:32] marktraceur: thanks! [16:00:53] gi11es: is https://gerrit.wikimedia.org/r/#/c/170747/ already live? [16:00:58] if so then there isn't a SWAT [16:01:53] I'll try to find out [16:02:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:04:10] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [16:04:43] RECOVERY - NTP on wtp1020 is OK: NTP OK: Offset -0.00492978096 secs [16:05:08] PROBLEM - check if salt-minion is running on wtp1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:09:12] _joe_: heya, are you going to try to do anything with Beta Cluster hhvm's and your testing of the restarts today? 
[16:09:41] RECOVERY - check if salt-minion is running on wtp1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:10:13] <_joe_> greg-g: nope [16:10:32] <_joe_> greg-g: if hhvm crashes in beta, that's another story [16:10:57] <_joe_> but I am trying to understand how to reload the server gracefully [16:11:00] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time [16:11:36] _joe_: let me rephrase: I'd love it if what you're testing could be tested in Beta Cluster as well so that we can get rid of the 503s there as they're hurting dev teams (browser test failures due to them) [16:11:40] :) [16:12:11] <_joe_> greg-g: eh, we'll see [16:12:18] <_joe_> but I guess the problem there is different [16:12:44] there it's that scap restarts hhvm out from underneath users/browser tests [16:14:23] <_joe_> ok [16:15:00] It really is the same problem that we will see in production. The only difference is that in beta we have people who actually notice rather than just incrementing a 5xx counter in graphite. [16:18:04] bd808|voted: thanks for the logstash dashboard fix last night. any chance you could set up a matching dashboard on logstash-beta? [16:18:12] bd808|voted: also, i don't think parsoid has a dashboard yet. [16:19:51] cscott: I could probably do that today, or you could give it a shot and I can help if you get stuck. [16:20:16] learning to do new things is always good i guess! [16:20:37] Teach a person to fish and all that sort of thing :) [16:21:04] !log Jenkins: disconnecting/reconnecting gearman client , killing deployment-bastion.eqiad slave in an attempt to remove a deadlock {{bug|70597}} [16:21:04] (03CR) 10Subramanya Sastry: "Bryan: we have discussed this aspect over the last couple weeks .. whether to send these log events to logstash or not. 
On the rare occasi" [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [16:21:11] Logged the message, Master [16:21:16] teach a person to fish and suddenly he's spending all his time fishing ;) [16:21:36] cscott: that's bd808's MO [16:22:00] RECOVERY - check if salt-minion is running on wtp1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:22:51] RECOVERY - check if salt-minion is running on wtp1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:23:37] anomie: it's deployed [16:23:46] bd808|voted: even better, write up how to do it on a wiki page somewhere ;) [16:23:52] manybubbles: ^ [16:23:53] gi11es: ? [16:24:03] (03PS1) 10Rush: phab only manage repo directory if set [puppet] - 10https://gerrit.wikimedia.org/r/170937 [16:24:26] anomie: force of habit since you handle most swats I attend :) [16:24:30] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Puppet has 1 failures [16:24:40] !log Zuul on hold, waiting for beta cluster related jobs to complete [16:24:41] cscott: http://www.elasticsearch.org/guide/en/kibana/current/index.html [16:24:45] Logged the message, Master [16:24:55] bd808|voted: that doesn't look like a wiki to me ;) [16:25:00] (03CR) 10Subramanya Sastry: "And, these come in pairs actually .." [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [16:25:12] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Puppet has 1 failures [16:25:46] gi11es: thanks! [16:25:52] ok so SWAT was a noop today [16:26:51] <^d> manybubbles: When do we want to put the new shard/replica config live? [16:26:56] !log Jenkins restarting Gearman client [16:27:04] Logged the message, Master [16:27:15] ^d: meh - whenever. you wanna do the rebuilds for it? [16:27:28] <_joe_> bd808|voted: the problem in prod is much smaller, and scap doesn't restart hhvm at the moment, right? 
[16:27:32] I think we actually have rebuilds needed for everything to get some no config any way [16:27:40] bd808|voted: http://www.elasticsearch.org/guide/en/kibana/current/saving-and-loading-dashboards.html looks like the link i was looking for. [16:28:27] _joe_: I think that scap restarts hhvm everywhere now, but let me check to see if that change has been deployed in production. Ori definitely merged it. [16:28:27] <^d|voted> manybubbles: Rebuilds everywhere, wheeee :) [16:28:57] paravoid, springle, _joe_: dynomite looks interesting [16:29:43] (03CR) 10Rush: [C: 032] phab only manage repo directory if set [puppet] - 10https://gerrit.wikimedia.org/r/170937 (owner: 10Rush) [16:29:50] the netflix folks are doing good work in that space in general; another example (very similar to restbase) is http://techblog.netflix.com/2013/12/staash-storage-as-service-over-http.html [16:30:13] _joe_: It looks like that patch has not been pulled in to production yet, but that is just because scap hasn't been updated there. I was actually going to update it sometime this week for another change that was merged. [16:30:21] RECOVERY - check if salt-minion is running on wtp1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:32:34] PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Puppet has 1 failures [16:33:01] (03PS1) 10Rush: phab should get standard [puppet] - 10https://gerrit.wikimedia.org/r/170943 [16:33:14] !log Shutting down Jenkins to remove a deadlock :-( [16:33:19] Logged the message, Master [16:33:42] manybubbles: Feel like adding another pair of backports to the SWAT? [16:33:56] anomie: can do. [16:34:07] swat was noop today so I should do something [16:34:29] what about writing a 101 write tests tutorial ? :D [16:34:36] manybubbles: Basically, backport of https://gerrit.wikimedia.org/r/#/c/170938/. I'll submit the patches momentarily. 
[16:35:06] (03CR) 10Ottomata: "Note that this change will give access to stat1002, not stat1003, but that is what is being asked for in the original RT ticket." [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [16:35:39] (03CR) 10Ottomata: [C: 04-1] "This will grant access to stat1002 AND the analytics hadoop cluster AND the private webrequest data in Hadoop. I do not think this is wha" [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [16:36:59] (03CR) 10BryanDavis: "If there is actual value to these messages I'm not against them. The "completed processing" messages seemed to contain some data (the elap" [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [16:37:11] anomie: fine by me [16:37:25] hmm, Jenkins seems to be stuck again [16:37:45] * marktraceur greases Jenkins up with some lard [16:38:25] marktraceur: Does that mean you're fixing it, or just being humorous? [16:38:35] Humour, sorry [16:38:43] I can try to restart it, though, if hashar isn't already [16:38:50] see log [16:38:58] Looks like he already is [16:39:04] it is restarting already :d [16:39:05] There you go. 
[16:39:17] I had to shut it down to remove a deadlock that prevented jobs from executing on the beta cluster [16:40:45] manybubbles: First patch: https://gerrit.wikimedia.org/r/170947 [16:40:46] (03CR) 10Rush: [C: 032] phab should get standard [puppet] - 10https://gerrit.wikimedia.org/r/170943 (owner: 10Rush) [16:41:24] bah still broken [16:42:11] manybubbles: Second patch: https://gerrit.wikimedia.org/r/170948 [16:42:21] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time [16:42:40] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:43:48] !log manybubbles Synchronized php-1.25wmf5/extensions/UniversalLanguageSelector/: SWAT update uls (duration: 00m 04s) [16:43:53] Logged the message, Master [16:44:03] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.006 second response time [16:44:09] anomie: synced wmf5 [16:44:20] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:44:32] manybubbles: Works on enwiki now [16:45:03] !log manybubbles Synchronized php-1.25wmf6/extensions/UniversalLanguageSelector/: SWAT update uls (duration: 00m 04s) [16:45:04] anomie: ^^^^ [16:45:09] Logged the message, Master [16:45:18] manybubbles: Works [16:45:29] yay SWAT success story [16:50:21] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.032 second response time [16:50:50] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:50:52] !log restarting Zuul/Jenkins entirely [16:50:58] Logged the message, Master [16:53:03] (03CR) 10Yuvipanda: [C: 04-1] mysql - lint fixes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170486 (owner: 10Dzahn) [16:53:23] !log repool wtp1017,wtp1018,wtp1019,wtp1020 [16:53:29] Logged the message, Master [17:01:56] (03PS1) 10Rush: 
phab explicity default setting is deprecated [puppet] - 10https://gerrit.wikimedia.org/r/170951 [17:02:27] (03PS2) 10Andrew Bogott: Minor changes for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/169608 [17:02:29] (03PS3) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [17:03:59] (03CR) 10Cscott: [C: 04-1] "My general position is that we should figure out how to prune old entries from logstash, rather than logging less. The primary benefit of" [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [17:07:54] (03CR) 10Rush: [C: 032] phab explicity default setting is deprecated [puppet] - 10https://gerrit.wikimedia.org/r/170951 (owner: 10Rush) [17:07:59] (03CR) 10Ori.livneh: [C: 032] Consolidate debugging-related configurations in hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/170886 (owner: 10Ori.livneh) [17:08:50] (03CR) 10Giuseppe Lavagetto: [C: 031] "I fail to see how having 2 lines of log with redundant information is a benefit at all. If the completed parsing line has timing info, thi" [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [17:11:16] (03CR) 10Christopher Johnson (WMDE): [C: 031] "I completely agree with Jan on this. Once the tag change in the manifest is merged, the update script should be run without concern. I do " [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [17:11:31] (03PS1) 10Ori.livneh: HHVM: install-pkg-src -> debug/install-pkg-src [puppet] - 10https://gerrit.wikimedia.org/r/170952 [17:11:41] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: install-pkg-src -> debug/install-pkg-src [puppet] - 10https://gerrit.wikimedia.org/r/170952 (owner: 10Ori.livneh) [17:11:43] (03CR) 10BryanDavis: "Logstash keeps data for ~31 days. 
We do this by creating a new index each day (think database tables) and dropping indexes each night (03CR) 10Subramanya Sastry: "Couple of questions then:" [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [17:13:17] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:18] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:37] PROBLEM - puppet last run on mw1029 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:47] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:47] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:58] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures [17:15:17] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:15:57] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:16:09] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:16:28] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:16:47] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:17:07] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:17:42] (03CR) 10Subramanya Sastry: "Ah, I posted my comments after Scott, and missed guiseppe and bryan's comments." 
[puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [17:22:04] (03Abandoned) 10Ori.livneh: Add '-hhvm' to profiler ID when running on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161435 (owner: 10Ori.livneh) [17:26:18] (03PS1) 10Alexandros Kosiaris: Make wtp a ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/170954 [17:26:21] enwiki is taking a while for me.... [17:26:42] and now its fine. ok [17:30:56] (03PS1) 10Cmjohnson: Removing old public ip's for aluminium [dns] - 10https://gerrit.wikimedia.org/r/170955 [17:33:40] (03CR) 10Cmjohnson: [C: 032 V: 031] Removing old public ip's for aluminium [dns] - 10https://gerrit.wikimedia.org/r/170955 (owner: 10Cmjohnson) [17:35:17] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures [17:48:59] (03Abandoned) 10Dzahn: ssl: sni.wm.org, add CA name for correct chain [puppet] - 10https://gerrit.wikimedia.org/r/170847 (owner: 10Dzahn) [17:52:57] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:00:35] akosiaris: hmm, I hadn't actually thought of having one shinken instance per project-with-own-puppetmaster. would still be somewhat complicated, though. [18:01:18] YuviPand: yeah it would... but it would solve some of these problems [18:01:31] (03CR) 10Dzahn: [C: 031] Change phab_update_tag script to remove library lock file [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [18:01:38] and would narrow down false negatives [18:01:41] yeah, but then there's this remote execution thing... [18:01:52] well more like "user does not care about the alert" thing [18:02:08] well, that's just a matter of who gets alerted on what... [18:03:24] on another note, did you get your root access YuviPand ? [18:03:45] ok, connection dropped [18:03:47] akosiaris: even if we have project specific shinken, alerting rules would still be in puppet. 
[18:03:49] akosiaris: I did! :) [18:03:57] there is a postgresql labsadmin user password waiting for you :-) [18:04:06] with createdb and createrole privs :-) [18:04:09] aha! cool :) [18:04:30] I'll start playing with it in a couple of days. I'm primarily helping out qgil with some phab stuff atm. [18:04:37] will rescue my python code and adapt it... [18:04:39] should be fun [18:04:52] would also be nice to replace the current mysql account creation perl script with python [18:04:59] https://gerrit.wikimedia.org/r/#/c/170684/ [18:05:05] for the user privs, name etc [18:05:18] palladium private repo for the actual password [18:05:20] cool [18:05:38] and now that we talked about it I will also create a cname for the service [18:05:56] should make it better than labsdb1004 :P [18:06:05] yeaaaaah :) [18:06:20] we should also create entries for the labsdb wikis [18:06:29] /etc/hosts for that copy pasted manually is lame [18:07:05] not sure I wanna know what was being done and why but it sounds lame ... [18:07:18] so yeah, let's do it :-) [18:08:33] hmm... I did not remember having it done already [18:08:40] https://gerrit.wikimedia.org/r/#/c/153614/a [18:08:53] actually https://gerrit.wikimedia.org/r/#/c/153614/ [18:09:09] yeah [18:09:20] but ... I just kind of worried... it is not that long in the past, I should have remembered doing it :-( [18:09:21] (03PS6) 10Dzahn: Change phab_update_tag script to remove library lock file [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [18:09:36] got worried* [18:09:36] well, this is the problem when you're too productive [18:09:49] yeah.. let's pretend it's that ... 
sounds nice [18:09:57] :) [18:14:22] (03CR) 10Dzahn: [C: 032] Change phab_update_tag script to remove library lock file [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [18:14:28] (03CR) 10Yuvipanda: [C: 04-1] mysql_wmf - autoload layout and lint fixes (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [18:14:53] mutante: going through your and john's lint patches now [18:15:37] YuviPand: thanks, it's a mixed bag, because some of them have been made with the auto "--fix" magic, which isn't thaaat magic after all [18:16:05] puppet-lint has some mode where it tries to fix fo you [18:16:07] mutante: ah, heh :) [18:16:30] (03CR) 10Yuvipanda: [C: 04-1] "Minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170466 (owner: 10John F. Lewis) [18:18:01] (03PS2) 10Yuvipanda: ceph: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170488 (owner: 10John F. Lewis) [18:18:03] YuviPand: do you have all the logins yet? [18:18:08] mutante: I do! [18:18:26] YuviPand: great, and you found where the passwords for ops are stored? ok :) [18:19:14] also, let me know if questions about using private repo. now you dont have to ask for contacts changes anymore, heh [18:19:27] mutante: :D heh, yeah! [18:19:28] mutante: will do! [18:20:05] (03CR) 10Yuvipanda: [C: 032] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/170488 (owner: 10John F. Lewis) [18:20:52] (03PS2) 10Yuvipanda: extdist: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170497 (owner: 10John F. Lewis) [18:21:55] (03CR) 10Yuvipanda: [C: 032] "Thanks for the patch! \o/" [puppet] - 10https://gerrit.wikimedia.org/r/170497 (owner: 10John F. Lewis) [18:22:01] :)) [18:25:10] do we still use ceph anywhere? [18:31:15] (03CR) 10Dzahn: "John F. Lewis: any comments on https://gerrit.wikimedia.org/r/#/c/170925/ ? 
that suggests to revert this one, i didn't look very close yet" [puppet] - 10https://gerrit.wikimedia.org/r/159356 (https://bugzilla.wikimedia.org/70579) (owner: 10John F. Lewis) [18:33:03] (03CR) 10Dzahn: "i see a freshly installed wiki at http://ve.wikimedia.org/wiki/P%C3%A1gina_principal but there is no content yet. what's up with that, is" [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [18:33:54] (03CR) 10Dzahn: "could you link to a Bugzilla if there is one for the entire ve.wikimedia wiki setup" [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [18:34:18] (03PS1) 10Yuvipanda: Kill ceph module [puppet] - 10https://gerrit.wikimedia.org/r/170974 [18:39:36] (03CR) 10Yuvipanda: base: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170477 (owner: 10John F. Lewis) [18:41:36] cmjohnson: regarding https://rt.wikimedia.org/Ticket/Display.html?id=8809 [18:41:44] (03CR) 10Dzahn: [C: 032] "yep, this is an Apache 2.2->2.4 thing that is specifically in the upgrade docs. 
http://httpd.apache.org/docs/2.4/upgrading.html syntax lo" [puppet] - 10https://gerrit.wikimedia.org/r/170792 (owner: 10Krinkle) [18:41:55] i told akosiaris you might save the trip to eqiad [18:42:04] (03PS4) 10Yuvipanda: contint: Clean up order of statements [puppet] - 10https://gerrit.wikimedia.org/r/168629 (owner: 10Krinkle) [18:44:01] (03CR) 10Yuvipanda: [C: 032] contint: Clean up order of statements [puppet] - 10https://gerrit.wikimedia.org/r/168629 (owner: 10Krinkle) [18:44:38] (03PS4) 10Yuvipanda: contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [18:45:00] (03CR) 10Dzahn: [C: 031] contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [18:45:57] (03CR) 10Dzahn: [C: 032] remove en2.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/170138 (owner: 10Dzahn) [18:46:19] (03CR) 10Yuvipanda: [C: 04-1] "minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [18:46:28] * Reedy waits for the community to call for mutantes head [18:46:40] (03PS3) 10Yuvipanda: contint: Fix allow/deny rules for Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/170792 (owner: 10Krinkle) [18:46:55] Reedy: nooo :o [18:47:03] matanya: i need to be there to check it [18:47:41] cmjohnson: i had the same issue and a frimware upgrade solved it [18:47:42] I have something else I need to get there so will be heading out in about 5 mins or so [18:47:53] never mind then :) [18:52:52] (03CR) 10Dzahn: [C: 04-2] svn - move from antimony to zirconium [puppet] - 10https://gerrit.wikimedia.org/r/170752 (owner: 10Dzahn) [18:55:12] (03PS1) 10Ori.livneh: hhvm: add hhvm_health Ganglia module and hhvmadm script [puppet] - 10https://gerrit.wikimedia.org/r/170987 [18:56:40] (03CR) 10Ori.livneh: [C: 032] hhvm: add hhvm_health Ganglia module and hhvmadm script [puppet] - 
10https://gerrit.wikimedia.org/r/170987 (owner: 10Ori.livneh) [18:57:08] (03CR) 10Dzahn: [C: 04-2] puppetmaster - lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170471 (owner: 10Dzahn) [18:59:12] (03PS2) 10Dzahn: puppetmaster - lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170471 [19:00:06] (03CR) 10Dzahn: [C: 031] "now only touching labs.pp" [puppet] - 10https://gerrit.wikimedia.org/r/170471 (owner: 10Dzahn) [19:08:20] (03CR) 10Dzahn: mysql_wmf - autoload layout and lint fixes (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [19:10:01] (03CR) 10Yuvipanda: [C: 031] "Hmm, alright. I'll merge tomorrow if nobody else gets to it first :)" [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [19:10:19] (03CR) 10Dzahn: contint: Move /srv/localhost/qunit resource out of qunit_localhost class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [19:10:27] (03CR) 10Yuvipanda: "I can submit follow up patches for the concerns I cited as well, if nobody else gets to it first :)" [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [19:14:11] (03PS1) 10Filippo Giunchedi: jheapdump: gdb-based heap dump for JVM [puppet] - 10https://gerrit.wikimedia.org/r/170996 [19:14:47] (03CR) 10Dzahn: mysql - lint fixes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170486 (owner: 10Dzahn) [19:14:48] (03PS3) 10Dzahn: mysql: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170486 [19:14:58] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [19:16:06] (03CR) 10John F. Lewis: "The wiki was set up ages ago from what I can tell. 
Quite recently the chapter asked for the wiki to be opened which I did and then a few d" [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [19:18:33] matanya: at data center...there is more than just a problem than drac...dmesg has a bunch of segfault errors and getting amber leds. Prolly memory issue [19:18:55] ah, too bad :/ [19:19:37] (03PS2) 10Dzahn: mariadb - lint fixes [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/170491 [19:21:14] (03CR) 10Dzahn: [C: 032] mariadb - lint fixes [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/170491 (owner: 10Dzahn) [19:22:29] !log rebooting wtp1023 [19:22:39] Logged the message, Master [19:24:27] PROBLEM - Host wtp1023 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:27:23] (03CR) 10Dzahn: [C: 032] "harmless aligning, no boolean stuff, labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/170471 (owner: 10Dzahn) [19:28:43] (03CR) 10Matanya: "Per style guides." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [19:29:27] RECOVERY - Host wtp1023 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [19:30:11] (03CR) 10Manybubbles: "Left some questions. Should we use puppet to make sure gcore is installed? 
It looks like that is part of gdb but I don't think we're mak" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [19:32:47] (03CR) 10Dzahn: "i was under the impression this is just about the web based part, "viewvc" to view old SVN files at http://svn.wikimedia.org/viewvc and re" [puppet] - 10https://gerrit.wikimedia.org/r/170752 (owner: 10Dzahn) [19:32:55] (03Abandoned) 10Dzahn: svn - move from antimony to zirconium [puppet] - 10https://gerrit.wikimedia.org/r/170752 (owner: 10Dzahn) [19:34:01] PROBLEM - Host elastic1022 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:34:42] (03CR) 10Ottomata: jheapdump: gdb-based heap dump for JVM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [19:36:09] (03PS4) 10Dzahn: openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170499 [19:36:55] (03CR) 10jenkins-bot: [V: 04-1] openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170499 (owner: 10Dzahn) [19:38:09] (03PS5) 10Dzahn: openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170499 [19:39:20] YuviPanda: still here? 
labmon server seems low disk, meta issue :p [19:40:02] nevermind, it's toollabs [19:43:07] (03PS1) 10Steinsplitter: Adding *.wikiportret.nl to wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171012 (https://bugzilla.wikimedia.org/72953) [19:48:32] (03CR) 10Aaron Schulz: [C: 031] Revert "Set wgMathDisableTexFilter to fix performance regression" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: 10Physikerwelt) [20:10:47] (03PS6) 10Dzahn: openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170499 [20:16:13] (03PS1) 10Jforrester: enwiki: Add Draft: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 [20:26:33] (03CR) 10Dzahn: apache: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170466 (owner: 10John F. Lewis) [20:26:56] (03PS2) 10Dzahn: apache: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170466 (owner: 10John F. Lewis) [20:28:07] (03CR) 10Hashar: "The change is technically fine, but I would like people to confirm wikiportret.nl is actually suitable for whitelisting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171012 (https://bugzilla.wikimedia.org/72953) (owner: 10Steinsplitter) [20:29:27] (03CR) 10Dzahn: [C: 04-2] "needs to be changed to give access to stat1002, not stat1003, see the ticket has been renamed as well" [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [20:29:52] mutante: no access to the ticket, please do it :/ [20:30:32] matanya: don't know yet which class it is :p [20:30:42] just that the node was wrong [20:30:51] he needs eventlogging [20:30:56] flow event logging [20:31:01] it is in admin/data/data.yaml [20:31:04] ottomata: ^ [20:31:05] looking, ok [20:31:53] i wonder if the other for Reedy is ok or same thing [20:32:10] (03CR) 10Ottomata: "Sorry, the change is good, it is the commit message that is not correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [20:32:11] hmm.. request says per https://bugzilla.wikimedia.org/show_bug.cgi?id=69277#c86 [20:32:49] (03PS2) 10Matanya: access: give spage access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/169990 [20:32:56] ah, mutante, if the request is for hadoop cluster access and webrequest data [20:33:01] then analytics-privatedata-users is correct [20:33:20] see updated commit messages [20:33:35] ottomata: thanks [20:36:42] (03PS3) 10Dzahn: access: give spage access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [20:38:41] (03CR) 10Dzahn: [C: 031] "lgtm, per comments above, removed trailing whitespace, just that the ticket doesn't have manager approval on it. i guess the 3 days are ov" [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [20:55:59] (03CR) 10Dzahn: "< ottomata> ah, mutante, if the request is for hadoop cluster access and webrequest data" [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [20:59:35] (03CR) 10Dzahn: "asked for manager approval, otherwise, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [20:59:48] (03CR) 10Sjoerddebruin: [C: 031] "Good site, see the category with uploads here: https://commons.wikimedia.org/wiki/Category:Wikiportrait_uploads" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171012 (https://bugzilla.wikimedia.org/72953) (owner: 10Steinsplitter) [21:00:48] (03CR) 10Dzahn: [C: 031] "sounds like Venezuela chapter changed their mind then? 
eh, if they request it to be reverted, i guess we should then" [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [21:02:19] (03PS7) 10Dzahn: openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170499 [21:02:57] (03CR) 10Dzahn: [C: 031] "added proof that compilation is identical in compiler on a few selected virt hosts from each role" [puppet] - 10https://gerrit.wikimedia.org/r/170499 (owner: 10Dzahn) [21:06:24] (03PS2) 10Dzahn: elasticsearch: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170496 (owner: 10John F. Lewis) [21:06:26] (03CR) 10Andrew Bogott: [C: 031] "Looks good -- thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/170499 (owner: 10Dzahn) [21:08:39] (03CR) 10Steinsplitter: "It is no related to GWT, it is some uncontroversial change (see Sjored's comment above)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171012 (https://bugzilla.wikimedia.org/72953) (owner: 10Steinsplitter) [21:09:35] (03PS3) 10Ori.livneh: apache: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170466 (owner: 10John F. Lewis) [21:09:42] (03CR) 10Ori.livneh: [C: 032 V: 032] apache: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170466 (owner: 10John F. Lewis) [21:11:08] (03CR) 10Dzahn: [C: 04-2] "i wouldn't do this one, the one thing looked better before and the other needs careful checking (quote booleans) separate from lint change" [puppet] - 10https://gerrit.wikimedia.org/r/170494 (owner: 10John F. Lewis) [21:11:46] ori: thx! [21:11:54] (03PS7) 10Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 [21:15:43] (03PS3) 10Dzahn: elasticsearch: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170496 (owner: 10John F. Lewis) [21:15:59] (03CR) 10Dzahn: "Hosts where compilation is identical:" [puppet] - 10https://gerrit.wikimedia.org/r/170496 (owner: 10John F. 
Lewis) [21:17:27] (03CR) 10Dzahn: [C: 031] elasticsearch: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170496 (owner: 10John F. Lewis) [21:17:43] (03CR) 10Dzahn: [C: 032] openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170499 (owner: 10Dzahn) [21:19:53] (03CR) 10Dzahn: "noop - reduced number of WARNs from 378 to 235, still stuff to do" [puppet] - 10https://gerrit.wikimedia.org/r/170499 (owner: 10Dzahn) [21:21:57] (03CR) 10Dzahn: [C: 032] "noop" [puppet] - 10https://gerrit.wikimedia.org/r/170496 (owner: 10John F. Lewis) [21:26:41] (03CR) 10John F. Lewis: "I'd like to see a patch closing the wiki before this is merged however. Having open wikis but the actual domain going somewhere different " [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [21:27:20] (03PS2) 10Dzahn: cxserver: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170490 (owner: 10John F. Lewis) [21:28:15] (03CR) 10Dzahn: "re: PS1 - don't really like that style, but if the lint check wants it so.." [puppet] - 10https://gerrit.wikimedia.org/r/170490 (owner: 10John F. Lewis) [21:36:58] (03CR) 10Dzahn: [C: 032] "diff is just whitespace, noop on iron" [puppet] - 10https://gerrit.wikimedia.org/r/170463 (owner: 10John F. Lewis) [21:41:30] (03CR) 10Dzahn: [C: 031] "should be noop, needs rebase fix though" [puppet] - 10https://gerrit.wikimedia.org/r/170484 (owner: 10John F. Lewis) [21:44:20] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%): [21:44:29] (03CR) 10Dzahn: [C: 032] "fwiw, this doesn't seem to be used in prod/site.pp (yet?) just for beta labs" [puppet] - 10https://gerrit.wikimedia.org/r/170490 (owner: 10John F. 
Lewis) [21:51:17] (03PS4) 10Dzahn: access: give spage access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [21:52:33] (03CR) 10Dzahn: [C: 032] "has approval now" [puppet] - 10https://gerrit.wikimedia.org/r/169990 (owner: 10Matanya) [21:54:05] (03Abandoned) 10John F. Lewis: diamond: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170494 (owner: 10John F. Lewis) [21:55:17] (03PS2) 10Yuvipanda: Fix typos [puppet] - 10https://gerrit.wikimedia.org/r/169981 (owner: 10Tim Landscheidt) [21:55:48] YuviPanda: we want to override you on https://gerrit.wikimedia.org/r/#/c/168630/ [21:56:25] YuviPanda: and another one https://gerrit.wikimedia.org/r/#/c/170484/ lgtm [21:56:48] spagewmf: @stat1002 - Admin::User[spage]/File[/home/spage/.ssh/authorized_keys]/ensure: created [21:57:08] (03CR) 10Yuvipanda: [C: 031] contint: Move /srv/localhost/qunit resource out of qunit_localhost class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [21:57:13] (03CR) 10Yuvipanda: [C: 032] "Thanks for the patch! \o/" [puppet] - 10https://gerrit.wikimedia.org/r/169981 (owner: 10Tim Landscheidt) [21:57:27] :) yay [21:57:34] (03PS5) 10Yuvipanda: contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [21:57:40] mutante: I'm going to merge 168630 [21:57:52] ok, go ahead, would have done it too [21:58:19] (03CR) 10Yuvipanda: [C: 032] contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [22:00:04] spagewmf, ebernhardson: Respected human, time to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141104T2200). Please do the needful. [22:01:04] mutante: did you see https://gerrit.wikimedia.org/r/#/c/170974/ [22:04:06] YuviPanda: yes, i did. 
would like para-void to ack it [22:04:13] cool :) [22:04:19] (03PS2) 10Yuvipanda: fatalmonitor: up the number of lines processed [puppet] - 10https://gerrit.wikimedia.org/r/170787 (owner: 10MaxSem) [22:04:48] (03CR) 10Yuvipanda: [C: 032] "My russian colleague said it's ok!" [puppet] - 10https://gerrit.wikimedia.org/r/170787 (owner: 10MaxSem) [22:05:07] bah jenkins [22:05:08] Y U SO SLOW [22:07:41] (03PS3) 10Yuvipanda: contint: switch Zuul conf to new repository [puppet] - 10https://gerrit.wikimedia.org/r/166012 (owner: 10Hashar) [22:08:21] (03CR) 10Yuvipanda: [C: 032] contint: switch Zuul conf to new repository [puppet] - 10https://gerrit.wikimedia.org/r/166012 (owner: 10Hashar) [22:08:43] YuviPanda: on 170792 , jenkins is done, note how the button changed to "submit" [22:08:51] (but we dont see those actions on IRC) [22:10:03] (03PS2) 10Dzahn: svn - move certificate installation into role [puppet] - 10https://gerrit.wikimedia.org/r/170748 [22:10:18] (03PS2) 10Dzahn: gitblit - move certificate installation into role [puppet] - 10https://gerrit.wikimedia.org/r/170751 [22:11:06] (03CR) 10Yuvipanda: [C: 032] svn - move certificate installation into role [puppet] - 10https://gerrit.wikimedia.org/r/170748 (owner: 10Dzahn) [22:12:07] (03CR) 10Dzahn: "don't want these on node level. could even go one step further and put it inside the module, because $svnhost is already set to the variab" [puppet] - 10https://gerrit.wikimedia.org/r/170748 (owner: 10Dzahn) [22:12:20] YuviPanda: oh, you beat me to that as well. 
thanks [22:14:38] mutante: https://gerrit.wikimedia.org/r/#/c/170751/2 is why you do dependent commits :D [22:14:43] this should've depended on the svn one [22:15:12] actually i tried to _not_ make them dependent [22:15:15] because code-wise they are not [22:15:20] git wise they are [22:15:44] ok, fixing [22:19:36] (03PS3) 10Dzahn: gitblit - move certificate installation into role [puppet] - 10https://gerrit.wikimedia.org/r/170751 [22:20:16] (03CR) 10Yuvipanda: [C: 04-1] "I'm confused, it comments out a bunch of stuff as well?" [software] - 10https://gerrit.wikimedia.org/r/169253 (owner: 10Tim Landscheidt) [22:20:46] (03CR) 10Yuvipanda: [C: 032] gitblit - move certificate installation into role [puppet] - 10https://gerrit.wikimedia.org/r/170751 (owner: 10Dzahn) [22:23:17] for sale by Flow team cheap: 100 minutes of primetime deploy window [22:23:46] ottomata: ping [22:24:37] pong [22:26:12] gwicke: ^ [22:26:12] :) [22:26:22] hey [22:26:49] I was just looking at the cassandra puppetization, and wondered if an ntp client is implicitly installed & configured already [22:28:52] gwicke: yes, hosts get NTP client from "standard" [22:29:01] as long as there is an "include standard" on the node [22:29:05] they get this: [22:29:12] include role::ntp [22:29:49] okay, cool [22:30:22] (03CR) 10Tim Landscheidt: "Apparently, in the past several debugging statements were commented out, but not the variable assignments that lead up to them (cf. https:" [software] - 10https://gerrit.wikimedia.org/r/169253 (owner: 10Tim Landscheidt) [22:32:08] (03CR) 10Yuvipanda: "Ah, I see. Then we should just remove them rather than leave dead code lying around. It can always be restored from git if necessary." 
[software] - 10https://gerrit.wikimedia.org/r/169253 (owner: 10Tim Landscheidt) [22:34:15] (03Abandoned) 10Yuvipanda: First pass at a labsconsole puppet setup [puppet] - 10https://gerrit.wikimedia.org/r/53989 (owner: 10Andrew Bogott) [22:37:59] Reedy: what's up with the mass of patches from a while ago going 'Apache config for transitionteamwiki using mod_proxy_fcgi ' [22:38:45] YuviPanda: Ask ori or _joe_ ;) [22:38:47] I just did the legwork [22:39:17] I guess a lot of them could be replaced with includes files stuff [22:40:19] (03PS2) 10Reedy: Apache config for wikidatawiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147436 [22:40:24] aaah [22:40:26] I see [22:40:32] I shall let it be then [22:41:01] that one looks funny now [22:41:07] seems there's already a partial stanza for it [22:48:03] (03PS1) 10Dzahn: mariadb - update submodule [puppet] - 10https://gerrit.wikimedia.org/r/171145 [22:50:56] (03CR) 10Dzahn: [C: 032] mariadb - update submodule [puppet] - 10https://gerrit.wikimedia.org/r/171145 (owner: 10Dzahn) [22:55:14] (03PS4) 10Dzahn: contint: Fix allow/deny rules for Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/170792 (owner: 10Krinkle) [22:57:35] bblack: ping [22:59:37] (03CR) 10Subramanya Sastry: "See https://gerrit.wikimedia.org/r/#/c/171147" [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [23:00:58] (03CR) 10Subramanya Sastry: "Giuseppe .. and apologies for misspelling your name in my earlier comment." [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [23:05:33] (03CR) 10Dzahn: [C: 031] fix up ordering for salt-minion package, config, service [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [23:08:02] (03CR) 10Ori.livneh: [C: 04-1] fix up ordering for salt-minion package, config, service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [23:08:10] gwicke: pong! 
[23:08:39] bblack: so I was just looking into our handling of leap seconds in the context of the cassandra & restbase puppetization [23:08:59] that sounds fun [23:09:15] I was wondering if we could enable smearing to keep the time monotonically increasing, but then noticed that it's not even sure that another leap second will be scheduled [23:09:26] https://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds [23:10:02] I'm not sure on what level you want to have this conversation. there is much to know about timekeeping in general :) [23:10:11] so I guess we can just keep an eye on that & deal with it in case another leap seconds is going to happen [23:10:13] what's the functional problem for cassandra? [23:10:42] (Linux does, in general, offer many clocking options, some of which are more monotonic than others) [23:11:14] one issue was that JVMs would generally lock up, as they are using time internally for concurrency control [23:11:47] would lock up under what condition? that the wallclock goes through a calendar leap second? [23:11:54] cassandra also uses timestamps for revisioning and conflict resolution (in the eventual consistency last-write-wins sense) [23:12:41] what timestamps does it use, though? 
[23:12:43] so while it's unlikely to matter for anything involving human interaction, it could cause some inconsistency for things that happen in sub-second timescales [23:13:29] bblack: it uses microsecond-resolution unix timestamp [23:13:31] s [23:13:47] so, under Linux there are several different ways to get a timestamp [23:14:27] usually if you want something that doesn't get hosed by things like leap seconds or system admins setting the date randomly or ntp step adjustments, it's clock_gettime() with CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW [23:14:35] (which shouldn't have these problems, unless there's a bug) [23:14:35] (03CR) 10Dzahn: [C: 031] servermon: Execute make_updates every hour [puppet] - 10https://gerrit.wikimedia.org/r/170934 (owner: 10Alexandros Kosiaris) [23:15:20] if java or cassandra are expecting monotonic behavior from the wall clock, that's their bug, IMHO [23:15:40] (03CR) 10Dzahn: [C: 031] Make wtp a ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/170954 (owner: 10Alexandros Kosiaris) [23:15:51] (03PS2) 10Dzahn: Make wtp a ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/170954 (owner: 10Alexandros Kosiaris) [23:16:04] RoanKattouw: Yes [23:16:04] bblack: IIRC, the monotonic clock isn't directly related to UTC though? [23:16:27] yeah, that's what the docs say [23:16:32] nevermind, that was an old ping [23:16:34] no, it's not, it's just a clock [23:16:45] UTC, by definition, has funny things like leap seconds, and is a matter of adjustment. [23:16:49] yeah, that's a problem for something like timeuuids [23:16:56] saying you want monotonic UTC is like saying you want dry water [23:17:19] it's doable, for example by smearing leap seconds as done by google [23:17:48] but in any case, right now it's not even clear that there will be more leap seconds in the foreseeable future [23:17:59] sure, but then that's something else entirely. given CLOCK_MONOTONIC + CLOCK_REALTIME, one could do that oneself too I guess. 
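[editor's note] The "CLOCK_MONOTONIC + CLOCK_REALTIME, one could do that oneself" idea can be sketched roughly as follows. This is only an illustration, not anything Cassandra or the JVM actually does. The class name `AnchoredMonotonicClock` is hypothetical. On Linux, Python's `time.time()` reads CLOCK_REALTIME and `time.monotonic()` reads CLOCK_MONOTONIC. The sketch assumes the wall clock is already correct when the anchor is taken; if it is not, every later reading inherits the error.

```python
import time


class AnchoredMonotonicClock:
    """Hypothetical helper: wall-clock-like timestamps that never go backwards.

    Reads the wall clock exactly once, then advances only by the monotonic
    clock, so later steps of the wall clock (admin running `date`, NTP step,
    leap second) do not affect the readings.
    """

    def __init__(self):
        # Anchor: one reading of each clock, taken as close together as possible.
        self._anchor_wall = time.time()       # CLOCK_REALTIME on Linux
        self._anchor_mono = time.monotonic()  # CLOCK_MONOTONIC on Linux

    def now(self):
        # Elapsed monotonic time since the anchor, added to the anchored wall time.
        elapsed = time.monotonic() - self._anchor_mono
        return self._anchor_wall + elapsed


clock = AnchoredMonotonicClock()
readings = [clock.now() for _ in range(1000)]
```

The trade-off is that such a clock drifts away from true UTC whenever the wall clock is corrected after the anchor was taken, so it is "UTC-like" rather than UTC.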
[23:18:25] the result isn't UTC, though. I guess what you're looking for is a network-synchronized monotonic clock more than a variant of UTC, right? [23:19:06] I'm looking for generally UTC (for the external usefulness), with the added constraint of being monotonic [23:19:11] right now the two are the same [23:19:55] well, kinda [23:19:56] or rather, UTC is also monotinic for the foreseeable future [23:20:14] except that the system's conception of UTC isn't monotonic [23:20:43] (there are other esoteric reasons, but the big one is the administrator can just change the date/time at will) [23:20:53] normally the ntp client smears out corrections to avoid non-monotonicity [23:20:56] whereas CLOCK_MONOTONIC is independent of that mess [23:21:22] the normal adjustment by NTP is called slewing, and that's used when the system is off official UTC, but not too far off, to avoid stepping. [23:21:31] yup [23:21:33] but NTP refuses to slew if the diff is too great, by default [23:21:54] so google's thing was they hacked NTP to do a network-wide slew in place of actual leap seconds. [23:21:55] in a DC stepping should be very rare [23:22:45] (you'd be surprised how often stepping ends up happening on reboots though. hw clocks get bad. sometimes something you care about happens before the step on bootup) [23:23:49] e.g. if you step on startup to correct for a bad hw clock, you've already got network at that point. so some software somewhere is going to go look at statefiles related to network interfaces and dhcpcd and notice a bad time on the fs relative to the now-correct time) [23:24:08] that's a good point, maybe I should look into making the startup dependent on ntp having synced already [23:24:33] well yeah, and ideally the init system dependencies should take care of making sure that NTP happens before most other things, one hopes. [23:24:55] how does clock_monotonic_raw fit in? it's the absolute linear no time smear ntp goodness to side step these issues sometimes? 
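[editor's note] The leap-second smearing mentioned above amounts to spreading the one inserted second over a long window instead of repeating a second. A minimal sketch, assuming a simple linear ramp over a 24-hour window; the function name and the window length are illustrative assumptions and this is not Google's actual algorithm:

```python
def smeared_fraction(seconds_until_leap, window=86400.0):
    """Fraction of the inserted leap second already absorbed by the smear.

    seconds_until_leap: seconds remaining until the leap second (<= 0 once past).
    window: length of the smear window in seconds (assumed 24 hours here).
    """
    if seconds_until_leap >= window:
        return 0.0  # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0  # the whole extra second has been absorbed
    # Linear ramp from 0.0 at the start of the window to 1.0 at the leap.
    return 1.0 - seconds_until_leap / window


halfway = smeared_fraction(43200.0)
```

With this ramp each smeared second is longer than a real one by 1/86400, so the clock stays monotonic at the cost of disagreeing with true UTC by up to a second during the window.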
[23:25:02] over my head :) but curious [23:25:12] yeah, even CLOCK_MONOTONIC is good enough for most things [23:25:27] _RAW is even better [23:25:51] I think the crux of the issue, and the reason we can't just tell cassandra "hey use the monotonic clock" is that they expect timestamps to agree network-wide. [23:26:05] yup [23:26:16] and the monotonic clock isn't gauranteed to have any sane absolute value, the rules are all relative there [23:26:35] and timestamps are used to implement functionality like 'state of x at time y', in which case it's more useful to use something like UTC [23:26:49] but still, from the authors-of-cassandra perspective, this is a solvable problem using an NTP-synced CLOCK_REALTIME + CLOCK_MONOTONIC and adjusting internal timekeeping on your own. [23:27:19] bblack: the spanner paper has fun stuff on clocks [23:27:28] I think the google blanket fix of slewing UTC for their datacenter sounds like basically a workaround for the fact that they expect so much so software won't try to do that at all, or won't do it right. [23:27:39] they also consider clock uncertainty [23:27:55] cassandra is relatively ghetto in comparison [23:28:08] fortunately it doesn't matter too much for our current use cases [23:28:24] http://en.wikipedia.org/wiki/Vector_clock <- this is the underlying bit of cassandra that cares [23:28:28] (or something very much like it) [23:28:41] cassandra doesn't use vector clocks [23:29:01] yeah that's what's puzzling [23:29:11] because vector clocks should be ok with per-machine monotonic time [23:29:33] so what does cassandra use? [23:29:50] on issue with vector clocks is that they don't necessarily capture the app semantics unless you use them there as well [23:30:04] the other is performance [23:30:09] (or unless app semantics are constrained to a single clocked transaction) [23:31:03] http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks [23:31:31] ^ seems relevant. 
also seems to indicate this only ends up affecting the "who wins on concurrent update" part of cassandra if leap seconds bleed through? [23:32:09] it depends on the data you are looking at [23:32:38] for time series data that's keyed off a timestamp non-monotonic time is not ideal [23:32:59] well [23:33:02] but you also want a time that makes global sense to the user [23:33:26] we should probably differentiate between different ways of breaking monotonicity. [23:34:15] besides aren't all leap seconds forwards? how did we ever get a technically-not-monotonic leap second update? [23:34:28] I think for now we are fine courtesy of the discussion to abolish leap seconds [23:34:42] but that aside, it's never moved backwards, right? [23:34:50] It did once [23:34:52] I think [23:34:52] IIRC it did [23:35:48] I read an article about google manipulating their ntp servers so that this leap second would actually happen as a slow down over a day (or so) [23:36:10] bblack: it looks like you are right, good point [23:37:00] hmm, actually no [23:37:01] oh I see [23:37:06] it's an extra second that's inserted [23:37:08] http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html [23:37:10] the problem is in the POSIX wallclock representation [23:37:27] so it does create non-monotonicity [23:37:35] because it goes 59.01 -> 59.99 -> 59.01 -> 59.99 -> 60 [23:37:54] because it goes 59.01 -> 59.99 -> 59.01 -> 59.99 -> 0 [23:37:56] ^ more correct [23:39:27] let's just keep an eye on the leap second discussion & do something about it in case the decision is to schedule another one [23:42:15] well, I mostly agree it might be a good idea to smear ourselves because software sucks, but that aside I still tend to contend that there's no such thing as "monotonic UTC", and that apps should use the timescale that makes sense for what they're doing, and if what you're doing needs something like "monotonic UTC", that's a problem for your app to solve (which can be solved in software based on those two clocks), not something to blame the OS or NTP for. 
[23:42:43] "your app" being cassandra in this case [23:43:07] it's not really fair of cassandra to blame the underlying layer if it wants behavior that wasn't guaranteed by the well-described existing behavior. [23:43:10] I think there is a good use case for monotonic time that's closely matching UTC [23:43:29] ideally exactly, which is true right now [23:43:31] there is, but it doesn't exist from a software author's perspective all packaged up for you and ready to use [23:44:01] but, importantly, it's kind of hard to do that and avoid corner cases, too. [23:44:39] you'd have to ensure that servers are stepped over the network before any software using such a thing starts, and there would be no provision for bad time or time corrections. [23:45:31] e.g. if the sysadmin goes and does something dumb like execute the date command and send us backwards 23 years, what does your CLOCK_UTC_MONOTONIC do on the next call 23ns later? does it hang? does it halt the machine? or does it give a non-monotonic answer? [23:46:51] there are systems that rule out issues like this as part of their protocols (see spanner for example) [23:47:47] this hasn't made it into popular open source implementations yet, but I'm hopeful that it eventually will [23:48:33] yeah [23:49:20] I think there's room for this, but if I were implementing a generic library to provide what you're looking for, I probably wouldn't mention UTC as being part of the output, because of all the corner cases. [23:50:36] you'd need some API to initialize such a clock, which blocks if NTP isn't already in sync down to a certain precision, and then looks at both clocks. then it can offer guarantees from there going forward that it's giving monotonic counts based on an initial UTC time that are network-valid. [23:51:32] but if that initial time was wrong, everything's wrong. 
and if the system goes through a leap second or some such thing, your network-monotonic clock will no longer be counting strict UTC [23:52:19] (03PS1) 10Ori.livneh: memcached: tidy [puppet] - 10https://gerrit.wikimedia.org/r/171153 [23:52:57] either way, slewing UTC for the system clocks is a hack to get around app software that hasn't bothered to solve this problem at their own layer :) [23:52:59] (03CR) 10jenkins-bot: [V: 04-1] memcached: tidy [puppet] - 10https://gerrit.wikimedia.org/r/171153 (owner: 10Ori.livneh) [23:54:21] this whole thing is probably why djb advocates TAI timestamps in his software [23:55:36] http://cr.yp.to/proto/utctai.html [23:56:53] it's sort of like how most sane developers now accept that all database/internal timestamps should be UTC and timezones are an artifact of the presentation layer, because it's the only way to make things sane. [23:57:22] <^d|voted> James_F: Ping for swat [23:57:35] DJB is taking that a step further and saying that even UTC is a question of the presentation layer and only TAI is consistent enough for internal use. You'd adjust leap seconds for display based on a history of UTC leap seconds, much like we adjust timezones for display with tzdata. [23:58:42] (since TAI is basically UTC without leap seconds) [23:58:53] <^d|voted> Time is hard :\
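Note on the clock described at 23:50:36 (read the wall clock once at initialization, then count forward on the monotonic clock): a minimal Python sketch of that idea follows. The class name AnchoredMonotonicClock and its injectable wall/mono parameters are hypothetical, invented here for illustration. The sketch does not wait for NTP sync before anchoring, and as pointed out in the discussion it inherits any error in the initial wall time and stops matching strict UTC once a leap second passes.

```python
import time


class AnchoredMonotonicClock:
    """Hypothetical helper: one wall-clock reading plus monotonic elapsed time."""

    def __init__(self, wall=time.time, mono=time.monotonic):
        # Read both clocks once. Assumption: the wall clock is already
        # NTP-synced at this point; nothing here verifies that.
        self._anchor_wall = wall()
        self._anchor_mono = mono()
        self._mono = mono

    def now(self):
        # Later steps of the wall clock (date command, leap second, NTP step)
        # are never consulted, so the result cannot go backwards as long as
        # the monotonic source does not.
        return self._anchor_wall + (self._mono() - self._anchor_mono)


if __name__ == "__main__":
    clock = AnchoredMonotonicClock()
    first = clock.now()
    second = clock.now()
    print(second >= first)
```

Passing fake wall/mono callables makes the behavior easy to check without touching the real system clock.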