[00:30:14] 3ops-esams: Dear esams@rt.wikimedia.org, No Publication Fee for AASCIT Members - https://phabricator.wikimedia.org/T86500#969539 (10emailbot)
[00:44:36] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:02:17] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[02:06:37] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[02:10:50] !log l10nupdate Synchronized php-1.25wmf13/cache/l10n: (no message) (duration: 00m 01s)
[02:10:54] !log LocalisationUpdate completed (1.25wmf13) at 2015-01-12 02:10:54+00:00
[02:11:02] Logged the message, Master
[02:11:11] Logged the message, Master
[02:17:44] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s)
[02:17:48] Logged the message, Master
[02:17:48] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-12 02:17:48+00:00
[02:17:51] Logged the message, Master
[02:25:37] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[02:43:46] PROBLEM - HHVM queue size on mw1126 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [80.0]
[03:36:02] 3ops-requests, operations: Configure twemproxy to bind a unix domain socket - https://phabricator.wikimedia.org/T83328#969608 (10ori) I made the mode of the UNIX domain socket configurable in . Could we apply this patch and repackage?
[03:54:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jan 12 03:54:34 UTC 2015 (duration 54m 33s)
[03:54:39] Logged the message, Master
[04:05:42] (03CR) 10Andrew Bogott: [C: 032] labs_bootstrapvz: use our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183423 (owner: 10Faidon Liambotis)
[04:14:35] (03PS1) 10Andrew Bogott: Remove the install_sudo script. [puppet] - 10https://gerrit.wikimedia.org/r/184287
[04:16:37] (03CR) 10Andrew Bogott: [C: 032] Remove the install_sudo script. [puppet] - 10https://gerrit.wikimedia.org/r/184287 (owner: 10Andrew Bogott)
[04:23:12] (03CR) 10Andrew Bogott: "This fact seems to not be present on Debian. Have a look on labs instance labs-bootstrapvz-jessie." [puppet] - 10https://gerrit.wikimedia.org/r/183209 (https://phabricator.wikimedia.org/T86297) (owner: 10Faidon Liambotis)
[04:49:53] (03PS6) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742
[04:50:40] (03CR) 10jenkins-bot: [V: 04-1] Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage)
[04:53:57] hrm, that console output is not helpful and my change was trivial
[04:58:50] (03PS7) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742
[04:59:40] (03CR) 10jenkins-bot: [V: 04-1] Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage)
[05:02:43] (03PS8) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742
[05:03:25] (03CR) 10jenkins-bot: [V: 04-1] Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage)
[05:06:11] (03PS9) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742
[05:07:00] (03CR) 10jenkins-bot: [V: 04-1] Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage)
[05:09:49] (03PS10) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742
[05:11:50] yay found the accidental single deleted character
[05:16:56] :)
[05:56:07] andrewbogott:
[05:56:07] root@labs-bootstrapvz-jessie:~# dpkg -L facter |grep ec2.rb
[05:56:07] /usr/lib/ruby/vendor_ruby/facter/ec2.rb
[05:56:29] let's see why it's not working, though :)
[05:57:38] yeah, I didn't dig very deep
[05:58:54] doh
[05:59:06] confine do
[05:59:06] Facter.value(:virtual).match /^xen/
[05:59:06] end
[05:59:18] how incredibly stupid
[05:59:58] https://github.com/puppetlabs/facter/commit/add124f348fcb74576d09a0951c95477ae8e5f9f
[06:00:00] fixed upstream :/
[06:00:10] $ git describe --contains add124f348fcb74576d09a0951c95477ae8e5f9f
[06:00:11] 2.3.0~25^2
[06:00:23] we have 2.2.0
[06:00:40] Date: Thu Oct 30 13:09:41 2014 -0700
[06:00:40] (packaging) Update FACTERVERSION to 2.3.0
[06:00:52] so yeah, too late for the freeze
[06:06:35] Wait, why does it work anywhere then?
[06:06:47] because this was working okay before
[06:06:51] they broke it with 2.1.0
[06:06:55] during some refactoring
[06:06:57] oh
[06:07:05] precise/trusty boxes run facter 1.7.x
[06:07:06] and we're running a new version on debian
[06:07:21] we have newer puppet/facter in Debian
[06:07:37] and I was hoping to backport these eventually to Ubuntu as well to get some of the benefits
[06:07:53] (e.g. facter 2 has "structured facts" instead of simple key/value: you can return an array or a hash)
[06:08:05] but now... meh, we might need facter 2.3
[06:08:35] Is 2.3 packaged anyplace? Or just source?
[06:08:48] it's not, but it wouldn't be too hard
[06:09:01] but I'm contemplating just replacing the ec2 facter for now
[06:09:34] part of their abstraction was to basically make it all modular, so I can write a thin wrapper that works for 2.1/2.2
[06:12:07] It doesn't trouble me if puppet remains broken on that instance for a few days :)
[06:12:17] yeah working on a fix right now
[06:15:05] virtual => physical
[06:15:07] oh great
[06:15:10] virtual is also broken
[06:16:26] File.read("/proc/cpuinfo")
[06:16:28] [...]
[06:16:30] (txt =~ /QEMU Virtual (CPU|Machine)/i) ? true : false
[06:17:22] that instance's model says
[06:17:22] model name : Intel Xeon E312xx (Sandy Bridge)
[06:18:05] I am also having a day of "these developers don't test any use case but their own" :(
[06:18:29] facter is full of wtfs
[06:18:53] their detection of kvm is full of crap
[06:19:01] while systemd for example
[06:19:01] [ 1.872949] systemd[1]: Detected virtualization 'kvm'.
[06:19:46] so, openstack seems to pass
[06:19:47] -cpu SandyBridge,+erms,+smep,+fsgsbase,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
[06:19:51] no clue why
[06:19:54] to qemu
[06:20:47] should 'SandyBridge' instead be something like 'Virtual'?
[06:22:48] RECOVERY - Graphite Carbon on graphite1002 is OK: OK: All defined Carbon jobs are runnning.
[06:26:26] PROBLEM - Graphite Carbon on graphite1002 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
[06:28:37] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:46] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:15] 3operations: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#969686 (10tstarling) Increasing the sysctl "kernel.pid_max" to 4M would probably work as a temporary hack, and would help to mitigate similar errors in other programs. Ideally puppet would check...
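[Editor's note: the two facter checks quoted above (the ec2 fact's `confine` on `/^xen/` and the `virtual` fact's QEMU cpuinfo regex) are short enough to reproduce. The sketch below is plain Ruby, not facter's actual fact API; the method names are illustrative. It shows why both heuristics misfire on a KVM-backed OpenStack guest whose hypervisor passes through a real CPU model string.]

```ruby
# 1. The EC2 fact's confine: the fact only resolves when the "virtual"
#    fact starts with "xen", so KVM-backed OpenStack guests are silently
#    skipped. (Hypothetical helper, mirroring the confine block above.)
def ec2_fact_confined?(virtual_fact)
  !(virtual_fact =~ /^xen/).nil?
end

# 2. The "virtual" fact's KVM heuristic: looks for a QEMU marker in the
#    cpuinfo model name, which fails when the guest sees a passed-through
#    model string such as "Intel Xeon E312xx (Sandy Bridge)".
def qemu_cpuinfo?(txt)
  (txt =~ /QEMU Virtual (CPU|Machine)/i) ? true : false
end

ec2_fact_confined?('xenu')   # => true  (EC2 on Xen: fact resolves)
ec2_fact_confined?('kvm')    # => false (OpenStack on KVM: fact skipped)

qemu_cpuinfo?('model name : QEMU Virtual CPU version 1.0')     # => true
qemu_cpuinfo?('model name : Intel Xeon E312xx (Sandy Bridge)') # => false
```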
[06:45:26] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:48:06] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:12] 3operations: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#969689 (10ori) > Unsure why this didn't trigger sooner since the orphan pid file timestamp was months old and mysqld had been running for three days. The number in the PID file did not correspon...
[06:57:22] (03PS4) 10Yuvipanda: Memoize parsing View definitions into Table objects [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184143
[07:08:41] valhallasw`cloud: ^ I’m still using copy because I’m not sure the Table objects are immutable
[07:24:38] !log on virt1005 and virt1006, ran 'ln -s /usr/bin/qemu-system-x86_64 /usr/bin/kvm' that allows nova to migrate instances between hosts.
[07:24:43] Logged the message, Master
[07:25:15] ?
[07:25:19] no /usr/bin/kvm? how come?
[07:26:54] <_joe_> why not create an alternative, if it's not present?
[07:26:59] <_joe_> (good morning)
[07:27:11] YuviPanda: cache the parse result, create new object every time?
[07:27:23] valhallasw`cloud: it’s parsed twice.
[07:27:28] valhallasw`cloud: terrible, I know
[07:27:38] I'm not sure. _joe_ isn't creating an alternative what I did?
[07:27:46] valhallasw`cloud: but pyparsing seems stupid and doesn’t give me back a ‘tree’ structure. Only a flat list.
[07:27:54] Oh.
[07:28:04] valhallasw`cloud: so there’s parse and there’s a scan
[07:28:23] Openstack used to invoke qemu-kvm, but that's no longer installed by modern qemu packages.
[07:28:46] So I presume that openstack is making the internal decision regarding platform.
[07:28:56] Move scan to fn, wrap parser with that, cache that new fn?
[07:29:12] Or copy the obj, indeed...
[07:29:13] it's the same platform
[07:29:18] qemu-kvm folded into qemu
[07:29:59] but qemu-kvm still ships a /usr/bin/kvm that calls qemu, basically
[07:30:03] at least in Debian
[07:30:07] <_joe_> andrewbogott: no
[07:31:15] heh, qemu-kvm installs a script as /usr/bin/kvm that's just 'exec qemu-system-x86_64' as paravoid said. So I might as well do that.
[07:31:24] Baffled why openstack doesn't install that yet depends on the presence of that exec
[07:32:01] qemu-kvm used to be a fork of qemu
[07:33:55] but they merged, right? Conventional wisdom online is that kvm-qemu is now moot. Which is probably why an OS dev pulled it out.
[07:33:58] valhallasw`cloud: yeah, copying is the least effort option here, I think. I don’t want to make Table immutable as an invariant without some language features to enforce that.
[07:34:01] yeah
[07:34:32] I'll log a bug once I get a good run
[07:34:57] so basically, qemu was just an emulator, kvm got released with a bunch of patches to support its kernelspace equivalent
[07:35:10] at some point kvm got renamed into qemu-kvm
[07:35:23] qemu started getting kvm support but was still a different project
[07:35:41] and it was a mess, too, they frequently did git cross-merges across the two trees
[07:35:52] so commit history was a very big PITA
[07:35:59] I had a thread with upstream back then about this..
[07:36:55] oh and on top of that
[07:37:06] kvm had a different versioning scheme, it was kvm-87, kvm-88 etc.
[07:37:26] when they switched to qemu-kvm, they switched to tracking qemu's releases, so 0.12.x
[07:37:44] also, when they did that switch, kvm's internal tree structure got turned inside-out
[07:37:54] they used to have kvm stuff in their /, and qemu sources under qemu/
[07:38:21] and then to make it more compatible with the regular qemu tree, they switched to / for qemu and /kvm for kvm internals
[07:38:26] It sounds like I shouldn't plan on doing a git bisect on qemu code anytime soon
[07:38:47] oh and they kept releasing kvm-NN releases for quite a while after they created qemu-kvm
[07:39:14] dammit "Live Migration failure: operation failed: migration job: unexpectedly failed" that's helpful
[07:40:59] I'm getting annoyed at facter now
[07:41:20] I fixed things just fine
[07:41:24] but their timeouts are too aggressive
[07:41:35] so the fact works only half of the time
[07:43:06] for the metadata service? It should be super quick
[07:43:16] http://photos1.blogger.com/blogger/3556/3162/1600/FM1P08.jpg
[07:43:32] ("The Fact") ;)
[07:43:50] well 0.2 of a second is also kinda low
[07:45:49] sometimes it takes more for sure
[07:47:03] how does that work exactly, do you know?
[07:47:16] where is this webserver running, how do the VMs talk to it etc.
[07:48:42] The metadata service runs on the network node, I believe. Every VM has a particular IP that routes automatically to the metadata service.
[07:48:57] So you do rest queries on that IP and it can run a few select queries. There's very little info there
[07:49:48] Want me to see if I can find a log for it?
[07:52:38] http://p.defau.lt/?P93idILJnoW6JZXX_HnXYQ
[07:53:09] if you repeat the command shortly, it succeeds almost always
[07:53:33] Is it doing a wget? Or something else?
[07:53:49] this is my test script which is just doing a GET
[07:53:51] There are other ways to access metadata, I think, although I'd be surprised if they're enabled.
[07:54:02] Oh, cribbed from the old fact?
[07:54:03] it's copied from facter but just a barebone
[07:54:09] no, from the new fact
[07:54:24] But it is using the REST interface, clearly.
[07:54:36] http://169.254.169.254/latest/meta-data/
[07:54:47] yeah
[07:56:50] Ah, it's in the api service. OK, looking...
[07:57:02] yeah I'm looking at it too
[07:57:13] I mean, fundamentally facter is wrong
[07:57:16] 0.2 is too low of a timeout
[07:57:42] if I bump it to 0.5, it works all of the time
[07:57:43] Yeah, in the log are entries like "len: 332 time: 0.2181320"
[07:57:59] (03CR) 10Merlijn van Deen: [C: 032] Memoize parsing View definitions into Table objects [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184143 (owner: 10Yuvipanda)
[07:58:05] Those machines are right next to each other, but I guess the api might be busy sometimes...
[07:58:20] (03CR) 10Merlijn van Deen: [V: 032] Memoize parsing View definitions into Table objects [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184143 (owner: 10Yuvipanda)
[07:58:25] that said, 200ms to respond locally to a metadata query is a bit too much
[07:58:29] so if we can optimize it, that'd be nice :)
[07:58:46] It does a db query, which is passed through an orchestration service to virt1000
[07:58:49] where the actual DB is
[07:58:56] So, there's a lot of… stuff… going on there
[07:59:25] None of which should take any noticeable time, but there might be locks and such if the orchestration wasn't designed for speed
[07:59:26] 2015-01-12 07:53:35.837 9238 INFO nova.api.ec2 [req-5fdcabec-a754-479e-a2e8-b1d797f76a69 None None] 0.337231s 10.68.16.107 GET /latest/meta-data/ None:None 200 [Ruby] text/plain text/html
[07:59:41] (03CR) 10Yuvipanda: [C: 032 V: 032] Use absolute imports instead of relative imports [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184139 (owner: 10Yuvipanda)
[07:59:51] (03PS2) 10Yuvipanda: Fix refactoring snafu [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184142
[08:00:02] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix refactoring snafu [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184142 (owner: 10Yuvipanda)
[08:00:11] (03PS5) 10Yuvipanda: Memoize parsing View definitions into Table objects [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184143
[08:00:14] I'm contemplating reverting my ec2id patch and be done with it
[08:00:19] (03CR) 10Yuvipanda: [V: 032] Memoize parsing View definitions into Table objects [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184143 (owner: 10Yuvipanda)
[08:00:31] kinda sucks though, this ec2.rb fact is much more powerful ?
[08:00:33] :/
[08:02:25] If we decide not to switch to facter /now/ then we can get a patch in 2.3 that changes the timeout
[08:02:39] in 2.4 you mean
[08:02:45] 2.3 is already released
[08:02:47] oh, yeah
[08:02:51] yeah we could
[08:02:57] and we could also ship a patched facter internally
[08:03:39] the code is so incredibly stupid
[08:03:45] it tries for 0.2s but 3 times
[08:03:52] So if you run it twice the second run is fast… but then it gets slow again after a few second wait?
[08:03:53] so you get the worst of both worlds
[08:04:07] So, cached someplace but only for a second?
[08:04:10] both a 0.6s delay when the service isn't responsive
[08:04:13] <_joe_> paravoid: something from puppetlabs is incredibly stupid? I'm pretty surprised
[08:04:20] andrewbogott: yeah
[08:04:35] hm
[08:04:52] That doesn't help us
[08:07:07] ok, I have a sort-of stupid & unrelated question
[08:07:21] why do we use the instance id in the first place? :)
[08:07:28] why do we need those silly i-NNN names?
[08:07:37] they are immutable / permanent
[08:07:45] while fqdns can change if you delete / recreate an instance
[08:08:01] paravoid, I think we have the option of running a dedicated metadata service, decoupled from the api service. That /might/ be faster, or it might get throttled by whatever locking is slowing us down now.
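[Editor's note: the retry pattern being criticized above ("it tries for 0.2s but 3 times") can be illustrated with a small sketch. This is plain Ruby under assumed behavior, not facter's actual code: `fetch_with_retries` and the 0.3 s "metadata call" are hypothetical, chosen to match the ~0.2–0.3 s response times quoted in the nova log.]

```ruby
require 'timeout'

# Three attempts, each capped at 0.2 s. Against an endpoint that
# consistently needs ~0.3 s, every attempt times out, so you pay the
# full ~0.6 s wait *and* still get no answer: the worst of both worlds.
def fetch_with_retries(attempts: 3, per_try_timeout: 0.2)
  attempts.times do
    begin
      return Timeout.timeout(per_try_timeout) { yield }
    rescue Timeout::Error
      next # this attempt burned its whole budget; try again
    end
  end
  nil
end

slow_metadata_call = -> { sleep 0.3; 'i-000002d5' } # simulated ~0.3 s response

fetch_with_retries { slow_metadata_call.call }
# => nil: three 0.2 s attempts all fail, whereas one attempt with a
#    0.5 s budget (the bump paravoid mentions) would have succeeded.
fetch_with_retries(attempts: 1, per_try_timeout: 0.5) { slow_metadata_call.call }
# => "i-000002d5"
```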
[08:08:18] <_joe_> YuviPanda: what problem does it present to puppet?
[08:08:26] <_joe_> I mean the puppet master
[08:08:27] cert cleaning mostly
[08:08:41] we could alternatively write a nova plugin that cleans the puppet cert when an instance is deleted
[08:08:45] and then we can use fqdns
[08:08:46] <_joe_> you could make that when destroying an instance
[08:08:58] <_joe_> yeah, extremely-low-priority
[08:09:01] yup.
[08:09:05] <_joe_> but still an annoyance tbh
[08:09:14] yup. both on puppet and salt
[08:10:58] brb
[08:26:44] (03PS1) 10Faidon Liambotis: Fix facter's virtual detection [puppet] - 10https://gerrit.wikimedia.org/r/184291
[08:26:46] (03PS1) 10Faidon Liambotis: base: add ec2_kvm fact, workaround facter 2.2 bug [puppet] - 10https://gerrit.wikimedia.org/r/184292
[08:27:09] the first one is needed nevertheless
[08:27:19] the second one... it does the job, but only for < 0.2s responses :(
[08:29:49] paravoid: does facter automatically use virt-what if present?
[08:29:53] yes
[08:30:04] 'k
[08:30:06] so that's an easy fix
[08:30:12] (03CR) 10Andrew Bogott: [C: 032] Fix facter's virtual detection [puppet] - 10https://gerrit.wikimedia.org/r/184291 (owner: 10Faidon Liambotis)
[08:30:20] although I should probably file a bug report so that it can try systemd-detect-virt first
[08:30:26] but whatever, 12kb binary
[08:34:31] with this new fact… I take it custom facts always override built-in ones?
[08:34:39] There's not a race where sometimes we'll get this one and sometimes the other?
[08:35:20] (03PS1) 10Faidon Liambotis: Split off module locales out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184293
[08:35:22] (03PS1) 10Faidon Liambotis: locales: add locales::all, Debian support, purge [puppet] - 10https://gerrit.wikimedia.org/r/184294
[08:35:24] (03PS1) 10Faidon Liambotis: Split off module debconf out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184295
[08:35:26] (03PS1) 10Faidon Liambotis: Kill generic::upstart_job definition [puppet] - 10https://gerrit.wikimedia.org/r/184296
[08:35:28] (03PS1) 10Faidon Liambotis: Move umask-wikidev.sh to role::deployment::common [puppet] - 10https://gerrit.wikimedia.org/r/184297
[08:36:11] andrewbogott: it's a different "resolution" and the other one is confined to Xen so it doesn't run at all
[08:36:20] (03CR) 10jenkins-bot: [V: 04-1] locales: add locales::all, Debian support, purge [puppet] - 10https://gerrit.wikimedia.org/r/184294 (owner: 10Faidon Liambotis)
[08:36:20] andrewbogott: so... "no"
[08:36:57] Ah, it loads both facts every time?
[08:37:01] That'll work.
[08:37:30] (03PS2) 10Faidon Liambotis: Move umask-wikidev.sh to role::deployment::common [puppet] - 10https://gerrit.wikimedia.org/r/184297
[08:37:32] (03PS2) 10Faidon Liambotis: Kill generic::upstart_job definition [puppet] - 10https://gerrit.wikimedia.org/r/184296
[08:37:34] (03PS2) 10Faidon Liambotis: Split off module debconf out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184295
[08:37:36] (03PS2) 10Faidon Liambotis: locales: add locales::all, Debian support, purge [puppet] - 10https://gerrit.wikimedia.org/r/184294
[08:37:55] (03CR) 10Andrew Bogott: [C: 032] base: add ec2_kvm fact, workaround facter 2.2 bug [puppet] - 10https://gerrit.wikimedia.org/r/184292 (owner: 10Faidon Liambotis)
[08:38:33] (03PS1) 10Yuvipanda: tools: Install some pywikibot dependencies [puppet] - 10https://gerrit.wikimedia.org/r/184298 (https://phabricator.wikimedia.org/T86015)
[08:38:35] andrewbogott: not sure this does you any good, the timeout is going to make puppet do weird things with the configs on half the puppet runs :(
[08:38:47] (03PS2) 10Yuvipanda: tools: Install some pywikibot dependencies [puppet] - 10https://gerrit.wikimedia.org/r/184298 (https://phabricator.wikimedia.org/T86015)
[08:38:53] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+topic:ssh-userkey,n,z
[08:38:57] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+topic:kill-generic,n,z
[08:39:13] request for review of these :)
[08:39:21] *cough* _joe_ *cough*
[08:39:30] I have more nice ones coming thanks to a boring weekend
[08:39:55] (03CR) 10Yuvipanda: [C: 032] tools: Install some pywikibot dependencies [puppet] - 10https://gerrit.wikimedia.org/r/184298 (https://phabricator.wikimedia.org/T86015) (owner: 10Yuvipanda)
[08:40:23] paravoid: same failure on labs-bootstrapvz-jessie, although I think a timeout would lead to the same error.
[08:40:31] Seems like it should work some of the time though
[08:40:35] yeah
[08:41:38] <_joe_> paravoid: I have another topic I'm working on on which I'll ask for review in exchange :P
[08:41:57] that's fair ;)
[08:42:52] andrewbogott: the 2.2 fact is also worse than the 1.7
[08:43:06] so the 3x0.2s runs *will* run in prod as well
[08:43:21] while in 1.7, they did heuristics on the MAC address ranges
[08:43:24] for amazon & openstack
[08:43:31] um… that's not so good.
[08:48:02] paravoid: it's hitting the service correctly (I see the queries on labnet1001) but it times out every time
[08:48:18] Always between .2 and .3, like it's taunting me
[09:06:20] !log Tweak Zuul configuration to pin python-daemon <= 2.0 and deploying tag wmf-deploy-20150112-1. {{bug|T86513}}
[09:06:26] Logged the message, Master
[09:06:32] smells like napalm
[09:07:59] hashar: what does?
[09:09:08] all of france? :-p
[09:10:47] not the most politically-sensitive joke
[09:11:41] (03CR) 10Ori.livneh: [C: 031] Move umask-wikidev.sh to role::deployment::common [puppet] - 10https://gerrit.wikimedia.org/r/184297 (owner: 10Faidon Liambotis)
[09:11:59] ori: you're right, sorry.
[09:13:08] i wasn't offended, but i thought i'd mention it in case it was unintended
[09:13:18] in other news, what's up?
[09:15:39] ori: rcstream in pywikibot has finally been merged!
[09:15:46] oh cool
[09:15:51] i didn't even know that was in progress
[09:16:02] but we're seeing some issues on arwiki (not sure why arwiki specifically), so there might be a followup soon-ish
[09:16:06] <_joe_> good to know
[09:16:23] <_joe_> arwiki is what language?
[09:16:31] * _joe_ smells python unicode drama
[09:16:33] standard arabic, I think
[09:18:27] can someone add me to the wmf-nda group in phabricator?
[09:19:13] Nikerabbit: is that the appropriate group for you? aren't you still staff?
[09:21:30] also, i think the change has to be done in ldap
[09:22:00] ori: what is the definition of "staff"?
[09:23:34] !log Restarting Zuul
[09:23:40] Logged the message, Master
[09:23:44] Nikerabbit: I really don't know what the boundaries are, to be honest -- this isn't my area. If no one responds to your request on IRC, I'd file a request in phab.
[09:24:06] YuviPanda: valhallasw`cloud: that is just Zuul requiring some upgrade :]
[09:24:30] hashar: I know. Just added a sensible project to get it out of #wikimedia-dev ;-)
[09:24:37] ori: thanks for your follow-up to get hhvm installed on the CI boxes :-]
[09:25:03] I asked because I keep hitting access denied links on phabricator and multiple members of my team are already in that group
[09:25:55] WMF-NDA includes a ton of people who are full time WMF employees
[09:26:31] (03PS1) 10Yuvipanda: tools: Install python-unicodecsv [puppet] - 10https://gerrit.wikimedia.org/r/184302 (https://phabricator.wikimedia.org/T86015)
[09:26:47] hashar: cool, is it working and everything?
[09:27:05] (03PS1) 10Faidon Liambotis: Revert "remove custom fact ec2id" [puppet] - 10https://gerrit.wikimedia.org/r/184303
[09:27:10] (03PS1) 10Yuvipanda: tools: Sort list of python dependencies [puppet] - 10https://gerrit.wikimedia.org/r/184304
[09:27:24] Nikerabbit: *nod* you may be right. I think you need someone in ops for the LDAP change though.
[09:27:28] ori: seems like
[09:27:31] (03CR) 10Yuvipanda: [C: 032] tools: Install python-unicodecsv [puppet] - 10https://gerrit.wikimedia.org/r/184302 (https://phabricator.wikimedia.org/T86015) (owner: 10Yuvipanda)
[09:27:41] (03CR) 10Yuvipanda: [C: 032] tools: Sort list of python dependencies [puppet] - 10https://gerrit.wikimedia.org/r/184304 (owner: 10Yuvipanda)
[09:27:44] ori: I will have to figure out how to get the fcgi part to work :]
[09:28:26] (03PS2) 10Faidon Liambotis: Revert "remove custom fact ec2id" [puppet] - 10https://gerrit.wikimedia.org/r/184303 (https://phabricator.wikimedia.org/T86297)
[09:28:32] ori: and Faidon taught me about Debian's unattended upgrade system, so the CI slaves are self-upgrading hhvm now!
[09:28:32] I do have deployment access, and before migration I was asked to use RT (but I was never able to create an account there)
[09:29:39] hashar: the fcgi setup should be mostly straightforward, i think i included an example in my reply to your ops@ thread
[09:29:59] ah yeah: https://gist.githubusercontent.com/atdt/8dcd1ebcb442efd2585f/raw/1ef9f20c49d8525e5867d339546241792b5ca66f/hhvm-mediawiki.conf
[09:30:55] just be sure to have 'include ::apache::mod::proxy_fcgi' somewhere in the puppet manifest
[09:31:35] 3Labs-Team, operations: facter: VM detection incorrect in labs - https://phabricator.wikimedia.org/T78813#969949 (10faidon) 5Open>3Resolved a:3faidon I fixed this today with 272b0f4ba495d4b6fba12069bce3c457da4d46c7, for an entirely different reason, an hour before realizing this existed as a task. Funny!
[09:32:25] ori: yup got that saved somewhere. There is probably no use for fcgi in CI yet anyway.
[09:32:39] aha
[09:33:00] * hashar grabs coffee && nature's call
[09:33:50] valhallasw`cloud: how has adoption been of rcstream generally?
[09:34:34] ori: I know multichill is planning on rewriting some bots to use it (those bots now read recentchanges every N minutes)
[09:34:43] cool
[09:35:15] (03CR) 10Alex Monk: "Shouldn't I also be in the wmf-deployment gerrit group?" [puppet] - 10https://gerrit.wikimedia.org/r/181421 (owner: 10Giuseppe Lavagetto)
[09:35:34] ori: I'm not sure how to solve the situation where the reader is too slow to read results, though. I'm going for the option 'raise warnings, but keep caching results' rather than 'fixed queue length, kick off oldest/newest entries to make space'
[09:36:06] basically letting the bot author solve the issue ;-) which I think is also the sensible place. It can be tricky though, for instance when mediawiki tells you 'hey, it's busy, back off please'
[09:36:39] anyway, time for coffee, bbl.
[09:37:39] not sure what you mean
[09:50:26] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 605
[09:50:50] ori: he needs caffeine
[09:50:59] :)
[09:52:37] ori: while I am in SF, poke me about Zuul
[09:52:53] what about it?
[09:53:24] ori: I noticed a few rants from the last months about how we run useless tests. Could use that to talk a bit with you so I can end up clarifying the Zuul gating system :D
[09:53:40] db1008, that's in frack
[09:55:01] Sure, OK
[09:55:26] RECOVERY - check_mysql on db1008 is OK: Uptime: 7676906 Threads: 2 Questions: 208996132 Slow queries: 54365 Opens: 142017 Flush tables: 2 Open tables: 64 Queries per second avg: 27.224 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:07:47] (03PS4) 10Hashar: Tel-Hai Academic College event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184119 (https://phabricator.wikimedia.org/T85773) (owner: 10Eranroz)
[10:09:26] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184119 (https://phabricator.wikimedia.org/T85773) (owner: 10Eranroz)
[10:10:38] (03CR) 10Hashar: [C: 032] Tel-Hai Academic College event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184119 (https://phabricator.wikimedia.org/T85773) (owner: 10Eranroz)
[10:10:44] (03Merged) 10jenkins-bot: Tel-Hai Academic College event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184119 (https://phabricator.wikimedia.org/T85773) (owner: 10Eranroz)
[10:11:48] !log hashar Synchronized wmf-config/throttle.php: Tel-Hai Academic College event - Bug: T85773 (duration: 00m 07s)
[10:11:50] Logged the message, Master
[10:11:56] (03CR) 10Hashar: "Deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184119 (https://phabricator.wikimedia.org/T85773) (owner: 10Eranroz)
[10:14:06] if anyone knows about Java 8, we got it installed on the CI Jenkins slaves with https://gerrit.wikimedia.org/r/#/c/183222/ :D
[10:14:22] though the package is apparently still experimental, it is required by the wikidata/gremlin.git repo
[10:14:49] (03PS2) 10Giuseppe Lavagetto: puppet: move hiera lookups for the cluster to the actual classes [puppet] - 10https://gerrit.wikimedia.org/r/183879
[10:14:51] (03PS2) 10Giuseppe Lavagetto: puppet: include admin in role classes for mediawiki and cache [puppet] - 10https://gerrit.wikimedia.org/r/183882
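[Editor's note: the two back-pressure options valhallasw`cloud weighs above for slow rcstream readers, "raise warnings, but keep caching results" versus "fixed queue length, kick off oldest entries to make space", can be sketched in a few lines. This is an illustrative Ruby sketch of the second strategy, not rcstream's or pywikibot's actual implementation; the class name is hypothetical.]

```ruby
# A fixed-length event queue that evicts the oldest entry when full,
# trading completeness (old events are lost) for bounded memory use,
# instead of buffering without limit and merely warning the consumer.
class DropOldestQueue
  attr_reader :dropped

  def initialize(max_len)
    @max_len = max_len
    @events = []
    @dropped = 0 # how many events the slow reader has lost
  end

  def push(event)
    if @events.length >= @max_len
      @events.shift # kick the oldest entry to make space
      @dropped += 1
    end
    @events << event
  end

  def pop
    @events.shift
  end
end

q = DropOldestQueue.new(3)
%w[e1 e2 e3 e4 e5].each { |e| q.push(e) }
q.dropped # => 2 ("e1" and "e2" were discarded)
q.pop     # => "e3" (the oldest surviving event)
```

The alternative strategy shifts the cost instead of the data loss: memory grows with the reader's lag, and the warning leaves it to the bot author to catch up or reconnect.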
[10:14:53] (03PS2) 10Giuseppe Lavagetto: mediawiki: move cluster definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/183880
[10:14:55] (03PS2) 10Giuseppe Lavagetto: puppet: use the role keyword for all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/183881
[10:14:57] (03PS1) 10Giuseppe Lavagetto: swift: use roles and other linting [puppet] - 10https://gerrit.wikimedia.org/r/184307
[10:14:59] (03PS1) 10Giuseppe Lavagetto: puppet: include admin in swift roles, not in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/184308
[10:15:01] (03PS1) 10Giuseppe Lavagetto: videoscaler: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/184309
[10:15:03] (03PS1) 10Giuseppe Lavagetto: puppet: use hiera for elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/184310
[10:15:05] (03PS1) 10Giuseppe Lavagetto: puppet: use role for logstash [puppet] - 10https://gerrit.wikimedia.org/r/184311
[10:15:07] (03PS1) 10Giuseppe Lavagetto: puppet: use role, hiera in rcstream [puppet] - 10https://gerrit.wikimedia.org/r/184312
[10:21:48] 3Project-Creators, operations: Create #site-incident tag and use it for incident reports - https://phabricator.wikimedia.org/T85889#970077 (10Qgil)
[10:22:17] <_joe_> !log upgrading HHVM on all appservers
[10:22:23] Logged the message, Master
[10:30:45] 3operations: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#970101 (10JanZerebecki) I think this can be prevented now by changing /etc/init.d/puppet to use `--name` and possibly `--user` in addition to `--pidfile` with `start-stop-daemon`. In a future wh...
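[Editor's note: a minimal sketch of the safeguard discussed on T86482 above, i.e. not trusting a stale PID file unless its PID still belongs to the expected process, the same idea as `start-stop-daemon --name`. This is illustrative plain Ruby assuming Linux's /proc filesystem; `pidfile_matches?` is a hypothetical helper, not puppet's or start-stop-daemon's actual logic.]

```ruby
# Returns true only when the PID file exists, the PID it names is a live
# process, and that process's command name (/proc/<pid>/comm) matches the
# expected one. A months-old PID file whose number was recycled by an
# unrelated program (the T86482 scenario) fails the comm comparison.
def pidfile_matches?(pidfile, expected_comm)
  return false unless File.exist?(pidfile)
  pid = File.read(pidfile).to_i
  comm_path = "/proc/#{pid}/comm"
  return false unless File.exist?(comm_path) # stale: no such process
  File.read(comm_path).strip == expected_comm
end
```

Under this check an init script would refuse to kill the PID from an orphan file unless it really is, say, `mysqld`, rather than signalling whatever process happens to hold that recycled PID today.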
[10:35:34] (03CR) 10Andrew Bogott: [C: 032] ":(" [puppet] - 10https://gerrit.wikimedia.org/r/184303 (https://phabricator.wikimedia.org/T86297) (owner: 10Faidon Liambotis)
[10:36:00] 3operations: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#970107 (10JanZerebecki) But to completely prevent anything like it we would need to change this for all services running on one system.
[10:37:04] (03PS3) 10Filippo Giunchedi: lsearchd: remove lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009)
[10:37:29] (03CR) 10Filippo Giunchedi: "good catch, removed search_pool references from configuration.pp" [puppet] - 10https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi)
[10:39:09] (03CR) 10Giuseppe Lavagetto: [C: 031] Split off module locales out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184293 (owner: 10Faidon Liambotis)
[10:42:05] (03CR) 10Giuseppe Lavagetto: [C: 031] locales: add locales::all, Debian support, purge [puppet] - 10https://gerrit.wikimedia.org/r/184294 (owner: 10Faidon Liambotis)
[10:45:19] <_joe_> !log restarting hhvm on mw1126, stuck in HPHP::StatCache::refresh
[10:45:22] Logged the message, Master
[10:45:56] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.126 second response time
[10:45:56] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 69220 bytes in 0.362 second response time
[10:51:24] (03PS2) 10Filippo Giunchedi: lsearchd: remove udp2log configuration [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009)
[10:52:59] (03CR) 10Filippo Giunchedi: "indeed, updated the commit message actions to take, no ensure = 'absent' support AFAICT" [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi)
[10:53:36] RECOVERY - HHVM queue size on mw1126 is OK: OK: Less than 30.00% above the threshold [10.0] [10:54:07] RECOVERY - HHVM busy threads on mw1126 is OK: OK: Less than 30.00% above the threshold [57.6] [11:02:40] (03CR) 10Filippo Giunchedi: [C: 031] puppet: include admin in swift roles, not in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/184308 (owner: 10Giuseppe Lavagetto) [11:05:59] (03CR) 10Giuseppe Lavagetto: [C: 031] Split off module debconf out of the generic module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/184295 (owner: 10Faidon Liambotis) [11:12:15] (03CR) 10Giuseppe Lavagetto: [C: 031] Kill generic::upstart_job definition [puppet] - 10https://gerrit.wikimedia.org/r/184296 (owner: 10Faidon Liambotis) [11:13:35] (03CR) 10Giuseppe Lavagetto: [C: 031] Move umask-wikidev.sh to role::deployment::common [puppet] - 10https://gerrit.wikimedia.org/r/184297 (owner: 10Faidon Liambotis) [11:17:37] (03CR) 10Filippo Giunchedi: "LGTM, however it seems the nodes won't pick up the right cluster? http://puppet-compiler.wmflabs.org/566/change/184307/html/" [puppet] - 10https://gerrit.wikimedia.org/r/184307 (owner: 10Giuseppe Lavagetto) [11:19:52] (03CR) 10Filippo Giunchedi: [C: 04-1] graphite: introduce local c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908) (owner: 10Filippo Giunchedi) [11:43:26] (03PS1) 10Giuseppe Lavagetto: puppet: use role for ocg services [puppet] - 10https://gerrit.wikimedia.org/r/184325 [11:54:34] <_joe_> paravoid: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:remove_globals,n,z [11:54:42] sec. 
[11:54:50] <_joe_> whenever you can [11:54:53] <_joe_> no rush at all [12:01:57] (03PS3) 10Faidon Liambotis: Move umask-wikidev.sh to role::deployment::common [puppet] - 10https://gerrit.wikimedia.org/r/184297 [12:01:59] (03PS3) 10Faidon Liambotis: Kill generic::upstart_job definition [puppet] - 10https://gerrit.wikimedia.org/r/184296 [12:02:01] (03PS3) 10Faidon Liambotis: Split off module debconf out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184295 [12:02:03] (03PS3) 10Faidon Liambotis: locales: add locales::all, Debian support, purge [puppet] - 10https://gerrit.wikimedia.org/r/184294 [12:02:05] (03PS2) 10Faidon Liambotis: Split off module locales out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184293 [12:02:39] (03CR) 10Faidon Liambotis: [C: 032] Split off module locales out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184293 (owner: 10Faidon Liambotis) [12:03:46] (03CR) 10Faidon Liambotis: [C: 032] locales: add locales::all, Debian support, purge [puppet] - 10https://gerrit.wikimedia.org/r/184294 (owner: 10Faidon Liambotis) [12:04:04] (03CR) 10Faidon Liambotis: [C: 032] Split off module debconf out of the generic module [puppet] - 10https://gerrit.wikimedia.org/r/184295 (owner: 10Faidon Liambotis) [12:05:06] (03CR) 10Faidon Liambotis: [C: 032] Kill generic::upstart_job definition [puppet] - 10https://gerrit.wikimedia.org/r/184296 (owner: 10Faidon Liambotis) [12:05:31] (03CR) 10Faidon Liambotis: [C: 032] Move umask-wikidev.sh to role::deployment::common [puppet] - 10https://gerrit.wikimedia.org/r/184297 (owner: 10Faidon Liambotis) [12:08:32] (03PS2) 10KartikMistry: Use cxserver/deploy in deployment [puppet] - 10https://gerrit.wikimedia.org/r/184217 [12:10:07] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [12:11:31] 3operations: git::clone makes changed files root-only readable - https://phabricator.wikimedia.org/T86527#970557 (10hashar) 3NEW [12:13:37] PROBLEM - 
Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [12:15:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "That would work, but needs change also in the upstart script (either update base_path or the exec)" [puppet] - 10https://gerrit.wikimedia.org/r/184217 (owner: 10KartikMistry) [12:15:36] ^ tin puppet error is just a transient one, puppet runs fine now [12:20:09] <_joe_> !log upgrading HHVM on the API cluster [12:20:15] Logged the message, Master [12:21:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [12:24:53] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#970586 (10faidon) 5Resolved>3Open #HTTPS exists now, but my original report also said: > and another one for an HTTPS-by-default milestone #HTTPS-by-default. It could be argued that this can be a simple tag... [12:25:03] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#970589 (10faidon) p:5Normal>3High [12:26:10] (03PS5) 10Alexandros Kosiaris: Reuse parsoid varnish for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/181613 (https://phabricator.wikimedia.org/T76200) [12:27:18] (03PS13) 10KartikMistry: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 [12:27:48] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce cxserver.eqiad.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/183888 (https://phabricator.wikimedia.org/T76200) (owner: 10Alexandros Kosiaris) [12:28:16] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:29:49] (03CR) 10Alexandros Kosiaris: [C: 032] Reuse parsoid varnish for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/181613 (https://phabricator.wikimedia.org/T76200) (owner: 10Alexandros Kosiaris) [12:33:37] 
PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 2 failures [12:43:17] RECOVERY - Graphite Carbon on graphite1002 is OK: OK: All defined Carbon jobs are running. [12:46:57] PROBLEM - Graphite Carbon on graphite1002 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [13:01:01] (03PS1) 10Alexandros Kosiaris: Followup commit for 2e0ce6b [puppet] - 10https://gerrit.wikimedia.org/r/184338 [13:01:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp3010 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [13:04:50] (03CR) 10Alexandros Kosiaris: [C: 032] Followup commit for 2e0ce6b [puppet] - 10https://gerrit.wikimedia.org/r/184338 (owner: 10Alexandros Kosiaris) [13:06:17] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:07:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp3010 is OK: OK: Less than 1.00% above the threshold [0.0] [13:10:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0] [13:13:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [13:17:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp3007 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [13:19:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp3003 is OK: OK: Less than 1.00% above the threshold [0.0] [13:19:37] PROBLEM - Varnishkafka Delivery Errors on cp3016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 79.591667 [13:22:47] RECOVERY - Varnishkafka Delivery Errors on cp3016 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:24:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp3007 is OK: OK: Less than 1.00% above the threshold [0.0] [13:29:47] RECOVERY - Varnishkafka Delivery Errors per minute
on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [13:30:10] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#970698 (10Tbayer) 3NEW [13:42:02] (03PS1) 10Alexandros Kosiaris: Append the port to apertium url in cxserver [puppet] - 10https://gerrit.wikimedia.org/r/184339 [13:43:14] (03CR) 10Alexandros Kosiaris: [C: 032] Append the port to apertium url in cxserver [puppet] - 10https://gerrit.wikimedia.org/r/184339 (owner: 10Alexandros Kosiaris) [13:52:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [14:00:05] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150112T1400). Please do the needful. [14:00:45] deploy! [14:02:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [14:02:48] DEPLOY ALL THE THINGS [14:02:52] :) [14:08:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [14:14:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp3003 is OK: OK: Less than 1.00% above the threshold [0.0] [14:17:56] wow, MWSearch is undeployed? :) [14:18:53] 3ops-core: Status of ms1004? - https://phabricator.wikimedia.org/T84435#970785 (10ArielGlenn) This is not one of my hosts; it looks like it was a thumb server ages ago. It can be reclaimed for something else. [14:19:05] 3operations, Wikidata, Datasets-General-or-Unknown: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#970787 (10ArielGlenn) Thanks for the patch! I will check it out in the next couple of days. I'm really sorry for the long delay; I've been out for medical reasons and a... [14:21:34] aude: yup :D [14:21:50] :) [14:31:06] mutante, can you complete https://gerrit.wikimedia.org/r/#/c/181421/ ? 
[14:35:42] 3operations, Labs-Team: facter: VM detection incorrect in labs - https://phabricator.wikimedia.org/T78813#970807 (10Gage) Thanks! The above link doesn't work for me ("Access Denied: Restricted Application"), change is https://gerrit.wikimedia.org/r/#/c/184291/ [14:46:36] (03CR) 10JanZerebecki: [C: 031] SSL: Remove RC4, enable 3DES [puppet] - 10https://gerrit.wikimedia.org/r/178555 (owner: 10BBlack) [14:48:19] when trying to do git fetch, i get "authenticity of host gerrit...." can't be established [14:48:27] did something change there? [14:49:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [14:50:49] so [14:51:00] since I am french should I GIGN my changes instead of SWAT them ? [14:51:30] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#970844 (10chasemp) There is also a 'goal' one outside of sprint. Preference? [14:52:22] aude: maybe you started using ipv6? [14:52:44] no, i doubt it [14:53:18] just feels weird to say "yes" but nothing else looks odd to me [14:56:05] looks like ipv6 [14:57:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [14:57:41] hashar: maybe you can RAID them [14:58:02] hashar: want me to deploy echo? [14:58:09] or are you just doing it? [14:58:43] aude: doing it [14:58:46] ok [14:58:47] aude: well I am trying :° [14:58:54] :) [14:59:01] * aude then needs to deploy wikibase stuff [15:01:28] bah the gate and submit job failed [15:01:35] :( [15:01:45] AND Zuul died [15:01:56] if it's about tidy, then overrule [15:02:24] !log restarting Zuul. Deadlocked due to Gerrit database [15:02:26] Logged the message, Master [15:04:10] that is my first extension deploy in age [15:04:29] and that takes a while! 
:] [15:10:56] !log hashar Synchronized php-1.25wmf13/extensions/Echo/tests/phpunit/includes/cache/TitleLocalCacheTest.php: php-1.25wmf14/extensions/Echo/tests/phpunit/includes/cache/TitleLocalCacheTest.php (duration: 00m 05s) [15:10:59] Logged the message, Master [15:11:08] aude: I have pushed my lame patch. Should be good for you now [15:11:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp3009 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [15:11:18] ok [15:12:20] (03PS1) 10Aude: Set useLegacyUsageIndex to false for test wikidata and test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184350 [15:14:46] aude: do you have any extensions / core patches to push as well? [15:14:57] we do [15:15:10] and will need to run scap, since one contains messages [15:15:24] do you need anything before that? [15:15:43] aude: I wanted to push a huge CI change, but I guess that will wait :] [15:15:49] ok :/ [15:15:57] hopefully it won't take very long [15:16:20] I am pushing it :-] It is reasonably tested already [15:16:23] ok [15:16:34] * aude waiting for jenkins [15:16:37] I should stop being paranoid [15:16:59] for wmf branches, I should probably remove the Zend phpunit job [15:17:06] it is useless [15:17:16] heh [15:17:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp3009 is OK: OK: Less than 1.00% above the threshold [0.0] [15:17:29] though we might still rely on Zend [15:17:38] i think on tin, etc [15:17:42] terbium [15:17:52] at the least [15:22:49] 3MediaWiki-Core-Team, operations: Deploy multi-lock PoolCounter change - https://phabricator.wikimedia.org/T85071#970891 (10fgiunchedi) what: * build and upload a new version of poolcounterd debian package * coordinate the poolcounter deployment, perhaps under SWAT? ** namely this involves upgrading the poolcoun... [15:23:11] (03CR) 10Rush: [C: 04-1] "can we talk about these in the ops meeting? 
This structure is not a good idea as the admin module stands now" [puppet] - 10https://gerrit.wikimedia.org/r/183882 (owner: 10Giuseppe Lavagetto) [15:23:51] <_joe_> chasemp: why in ops meeting? [15:23:59] oh I thought you weren't around :) [15:24:01] <_joe_> we don't really have the time [15:24:05] <_joe_> what's bothering you? [15:24:08] Added and populated wbc_entity_usage table on testwiki and testwikidatawiki [15:24:29] _joe_: so the admin module as it stands really doesn't play nicely with embedding it in the roles [15:24:38] <_joe_> why? [15:24:40] as soon as you have two roles with a conflicting user it's going to crap out [15:24:46] <_joe_> yes [15:24:49] because it needs full context to dedupe all users [15:24:59] so that's weird in that we are only going to be able to do this on _some_ nodes [15:25:13] and then there will be a situation where we will have to undo it as soon as roles have the same users [15:25:25] leaving it at node level ensures consistency [15:25:32] <_joe_> in those cases, you declare admin::groups in the hosts/$host.yaml file [15:25:44] I would really prefer to do it one way across the board [15:25:48] <_joe_> there is no reason to include it anywhere but in 'base' [15:25:54] yes there is [15:25:55] !log aude Started scap: Update Wikidata and WikimediaMessages [15:25:59] Logged the message, Master [15:26:00] * aude waits... [15:26:01] <_joe_> chasemp: the end goal is: [15:26:07] <_joe_> "include admin" in base [15:26:16] that won't work [15:26:17] <_joe_> then configure it via hiera either at node level [15:26:20] I think even now [15:26:22] <_joe_> or in the roles [15:26:28] because the analytics roles require explicit ops [15:26:30] so even at this moment [15:26:33] <_joe_> chasemp: how is that remotely possible? [15:26:33] you will have conflict [15:26:46] <_joe_> explicit what? [15:26:54] <_joe_> can you make me a working example? 
[15:26:58] !log Added and populated wbc_entity_usage table on testwiki and testwikidatawiki [15:27:01] Logged the message, Master [15:27:05] <_joe_> I just don't get your point :) [15:27:51] here is an example [15:27:51] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/data/data.yaml;dfab2a45bd68df20cd225de964b99ac4ff3473e4$215 [15:27:55] otto is in roots [15:28:00] and analytics-privatedata-users [15:28:04] so if you move admin to base [15:28:07] and then try to include that role [15:28:08] conflict [15:28:15] it needs full context at the _node level_ [15:28:17] <_joe_> which role? [15:28:17] to dedup users [15:28:29] any role, an analytics role in that case I would imagine [15:28:42] what you are doing cannot be applied consistently in the environment [15:28:47] without rewriting some of the admin module [15:28:49] which is cool [15:28:55] <_joe_> chasemp: that's not how include works in puppet [15:29:18] <_joe_> or are you saying you _must_ do "include admin" AFTER any other inclusion? [15:29:22] no [15:29:29] <_joe_> that is very very wrong [15:29:41] I'm saying if you have two include admin's on the same node with the same user [15:29:43] it craps out [15:29:51] and what you are doing makes that hard to track down and really weird [15:30:02] and it's not going to be possible to put admin in base for this reason [15:30:10] the moment you need more than just 'include admin' on any node with base [15:30:11] <_joe_> "include" stanzas get deduplicated by puppet [15:30:12] it will conflict [15:30:26] well how would that work then when they are not //the same// [15:30:27] <_joe_> what would you need more than "include admin"? [15:30:38] for any group of users who are not included by default?
[15:30:43] /go mara [15:30:43] any node that has more than ops [15:30:45] no [15:30:47] (03PS2) 10Aude: Set useLegacyUsageIndex to false for test wikidata and test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184350 [15:30:57] <_joe_> chasemp: you use hiera for that [15:30:59] <_joe_> we already do that [15:31:23] in this situation you would have an include admin in base [15:31:32] <_joe_> yes [15:31:35] and then an include admin with specific groups from hiera right? [15:31:49] for one node? [15:31:52] <_joe_> and when admin gets included, all its class parameters are searched in hiera [15:31:53] or any node that has base and groups [15:31:57] yes that won't work [15:31:59] <_joe_> in the context of that specific node [15:32:03] I [15:32:05] <_joe_> why [15:32:16] <_joe_> make me a working example I can't solve this way [15:32:17] I don't know how better to explain it [15:32:23] <_joe_> https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/mediawiki/appserver.yaml [15:32:29] <_joe_> see here [15:32:31] <_joe_> for mediawiki nodes [15:32:45] (03CR) 10Manybubbles: [C: 031] contint: install Java 8 on Trusty servers [puppet] - 10https://gerrit.wikimedia.org/r/183222 (https://phabricator.wikimedia.org/T85964) (owner: 10Hashar) [15:32:57] <_joe_> do you have one active node now where it won't work? [15:33:31] <_joe_> just so that I understand your objection [15:33:33] I don't know, but I know even if it will work it only works //until someone defines an inconsistency// and then it all has to be reverted [15:33:58] it creates a really weird edge case that is entirely reasonable that will crap out [15:33:59] <_joe_> what's the inconsistency you mention? someone declaring the class explicitly?
[15:34:03] no [15:34:26] if you put include admin in base [15:34:26] <_joe_> sorry, I'd need a practical example, I mean some sample code to show the problem [15:34:33] well if you read the admin module [15:34:37] and the dedup user logic [15:34:39] it may make more sense [15:34:47] it needs full context at the time of running for a node [15:34:55] <_joe_> what does [15:34:56] for //all users// in order to dedup correctly [15:34:59] puppet [15:35:01] <_joe_> full context mean? [15:35:14] if you have two include admins [15:35:17] and they have different groups [15:35:19] on one node [15:35:23] <_joe_> no that' [15:35:24] and two of those groups include the same user [15:35:26] it dies [15:35:27] <_joe_> s not possible! [15:35:30] !log Enabling test/gate of several extensions together. {{gerrit|180494}} , RFC extensions continuous integration {{bug|T1350}} [15:35:31] well it is [15:35:32] Logged the message, Master [15:35:33] as I wrote it [15:35:34] <_joe_> no. [15:35:38] please try it [15:35:42] better than arguing [15:35:50] it's a duplicate definition [15:35:51] <_joe_> how can you include a class twice with different parameters? [15:35:55] <_joe_> exactly [15:36:20] the user is a duplicate definition [15:36:37] <_joe_> chasemp: I don't understand what I should try [15:36:48] how is that possible if you are saying it works :) [15:36:51] <_joe_> can you show me? :) [15:37:10] <_joe_> because it's obvious for me what should and should not work [15:37:17] <_joe_> so lemme do a paste [15:37:53] need a test env to show you well [15:37:54] but [15:40:10] <_joe_> chasemp: https://phabricator.wikimedia.org/P208 this is what I mean [15:40:31] <_joe_> of course you won't declare the admin class anywhere else [15:40:41] <_joe_> but maybe I am not seeing what you're saying [15:40:50] <_joe_> that's why I'm asking you for an example [15:40:51] if foo-roots and bar-roots have the same user [15:40:53] it errors [15:40:59] or [15:41:07] <_joe_> seriously?
[15:41:19] if admin ops has the same user as foo-roots or bar-roots [15:41:21] it will error [15:41:25] if they include base [15:41:27] on the node [15:41:37] <_joe_> but that will be the same even now right? [15:41:38] yes I'm quite sure of this as I bend over backwards for puppet 2.7 [15:41:50] well so this was pre-3.0 yes [15:42:04] <_joe_> chasemp: hold on a sec [15:42:05] but I haven't tested it and I'm sure it was shit in 2.7 and this particular behavior I don't think is different [15:42:23] my plan was to rewrite this post 3.0 [15:42:37] as hopefully the limitations in 2.7 on virtual resources would be gone (bus) [15:42:39] bugs even [15:43:19] <_joe_> chasemp: see my comment please [15:43:26] <_joe_> to the paste [15:43:32] <_joe_> this is how we do it currently [15:44:04] <_joe_> there is _no_ difference as far as puppet is concerned in declaring admin in these two ways [15:44:34] <_joe_> are you saying that it will choke in the original form but not in the form in the comment? [15:44:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [15:44:50] <_joe_> I can't believe it unless you show me :) [15:44:54] the original form is a bit confusing as [15:45:01] you don't include standard anywhere and it's defined etc [15:45:11] and I'm not sure if #needs foo-roots is shorthand for an admin declaration [15:45:17] or just a note that it needs one in another place in the tree [15:45:18] <_joe_> it's not [15:46:10] <_joe_> chasemp: the point of using hiera is to be able to "include myclass" wherever, and define its parameters separately from code [15:46:53] yes I understand that but I'm saying it doesn't work in this case as the admin module is written in a consistent manner so if you do what you are doing it creates a situation where it will be inconsistent across nodes by definition and any time someone adds a new user to a group [15:46:58] <_joe_> so you declare your
admin::groups in hiera, not in code [15:47:00] they risk creating this dupe user situation [15:47:07] who can add people to the gerrit wmf-deployment group? admins? [15:47:08] yes that will need a rewrite of admin to do it this way [15:47:13] <_joe_> chasemp: they risk already [15:47:17] no they don't [15:47:26] it dedupes the users at run time [15:47:39] <_joe_> chasemp: sorry I got what your doubt is now [15:47:52] <_joe_> hiera won't allow different declarations of parameters [15:48:15] what happens then (seriously asking) [15:48:18] when base has include admin [15:48:19] <_joe_> so say you define admin::groups as 'foo-roots' in the hiera file for one role [15:48:33] <_joe_> and 'bar-roots' in another [15:48:38] <_joe_> and you include both classes [15:49:05] <_joe_> puppet will fail telling you you have conflicting parameter declarations for the class [15:49:24] so in this scheme you can only ever have one role per node? [15:49:28] <_joe_> and you will need to re-declare admin::groups higher in the hierarchy *specifically for that node* [15:49:36] <_joe_> so you don't risk what you stated [15:49:50] <_joe_> nope [15:49:51] because hiera errors sooner than the user dupe [15:49:55] but it's the same basic problem [15:50:07] <_joe_> chasemp: ok a working example [15:50:21] sync common failed on mw1062, something about mathjax and some read only file system errors [15:50:28] known issue? [15:50:30] <_joe_> say we have a node that has both the appserver role and one analytics role [15:50:32] manybubbles, marktraceur, ^d: Who wants to SWAT today? [15:50:33] <_joe_> aude: no [15:50:40] <_joe_> chasemp: ok? [15:50:43] yup [15:51:05] <_joe_> so, in general nodes with the appserver role will have the 'deployment' group [15:51:08] anomie: I'm trying to dig out from missing two days of emails so I'd like to skip today if possible.
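The failure mode _joe_ describes here — two roles binding different values for the same class parameter — can be sketched with a minimal, hypothetical manifest. The class body and role names below are illustrative stand-ins, not the actual operations/puppet code:

```puppet
# Illustrative stand-in for the real admin class.
class admin ($groups = []) {
  notice("managing groups: ${groups}")
}

# Two hypothetical roles, each binding its own value with a
# resource-like ("class { ... }") declaration:
class role::appserver {
  class { 'admin': groups => ['deployment'] }
}
class role::analytics {
  class { 'admin': groups => ['pinkunicorns'] }
}

# Applying both roles to one node aborts compilation:
#   Duplicate declaration: Class[Admin] is already declared
include role::appserver
include role::analytics
```

With plain `include admin` everywhere and the values moved into hiera, the class is declared only once; conflicting role-level hiera bindings then surface as the parameter-conflict error _joe_ mentions, and are resolved by re-declaring `admin::groups` higher in the hierarchy, specifically for that node.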
[15:51:09] <_joe_> (https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/mediawiki/appserver.yaml) [15:51:13] can do if needed [15:51:34] <_joe_> say the analytics role has instead the group of users 'pinkunicorns' [15:51:55] <_joe_> if you include both roles in a node, you get an error because you have conflicting class declarations [15:52:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp3008 is OK: OK: Less than 1.00% above the threshold [0.0] [15:52:14] <_joe_> the solution is to declare the parameters at the node level in hiera [15:52:19] <_joe_> in that specific case [15:52:19] hoo, aude, gi11es: Ping for SWAT in about 8 minutes [15:52:23] anomie, had been planning to offer to do it for the first time, don't have full access yet though (ssh is sorted, gerrit group is not)... [15:52:29] anomie: pong [15:52:29] <_joe_> this all happens well before admin ever gets applied [15:52:31] <_joe_> :) [15:52:36] this I understand [15:52:56] but I disagree with the approach then for the same reason different error [15:52:59] * aude is here (and almost done scapping for earlier stuff) [15:53:01] <_joe_> so you will always be permitted to have one and only one set of parameters for admin [15:53:09] I do not want to apply groups inconsistently in the hierarchy at all [15:53:14] except for the error, which should be looked at [15:53:19] <_joe_> but you won't [15:53:20] after having spent weeks tracking down the weird places people embedded groups [15:53:29] they should all be defined at the same level of abstraction [15:53:38] <_joe_> chasemp: yes, in hiera [15:53:39] if you can't guarantee all things can coexist at the role level [15:53:45] then we should keep it at the node level for all [15:53:48] <_joe_> that's the only place where it gets defined [15:54:04] but it's a different level of abstraction if some things are node level [15:54:06] <_joe_> I don't see why, having a hierarchical lookup is a great advantage [15:54:09]
and some things are role level [15:54:23] and it's all predetermined by what users and what roles [15:54:33] and we have lots of places where multiple roles are included by design [15:54:36] where it will be top level [15:54:42] and some where they don't where it will be embedded [15:54:52] but never can we say for sure which without digging [15:54:57] <_joe_> (also note this is mostly a scholastic discussion, as most roles don't have user groups attached) [15:55:00] and it will always change when adding or removing things [15:55:07] yes but some do [15:55:13] and it's not bad practice [15:55:22] 3ops-network, operations: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#970935 (10Cmjohnson) 3NEW [15:55:24] so we are creating an intentional inconsistency [15:55:28] that will have to be worked around [15:55:33] <_joe_> chasemp: what inconsistency? [15:55:39] <_joe_> I don't understand sorry [15:55:47] <_joe_> I really don't get it [15:55:49] ok so, my perspective comes from [15:55:56] digging through all the code to find out why a user is on a host [15:56:08] and people putting their assignments at node level, role level, in weird classes etc [15:56:11] <_joe_> no just the hiera data for _that_ host [15:56:12] and it was a giant ugly mess [15:56:17] yes just back story [15:56:22] and in that case we came out with [15:56:27] <_joe_> ok I know that was horrible [15:56:30] ok let's put all user assignment at the top level [15:56:38] <_joe_> but trust me this is pretty clean [15:56:40] because it's consistent and easy to see where any user comes from on any node [15:56:45] it's clean but not consistent [15:56:49] and worse [15:56:51] i also see sync-common error for mw1010 (think it might be that instead) [15:56:55] the inconsistency is not consistent [15:56:56] <_joe_> how is it not consistent? [15:57:05] any node could need it applied either way depending on how many roles and what they include [15:57:06] <_joe_> aude: can you paste all those?
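chasemp's dedup worry traces back to how Puppet treats repeated resource declarations. A generic sketch of the distinction (this is standard Puppet virtual-resource behavior, not the actual admin module code, and the username is made up):

```puppet
# Virtual resources: many groups may "mention" the same user, and
# collecting it repeatedly is harmless -- Puppet realizes it once.
@user { 'exampleuser': ensure => present }
User <| title == 'exampleuser' |>
User <| title == 'exampleuser' |>   # still a single User[exampleuser]

# Literal declarations are different: if two separately-included
# classes each contain
#     user { 'exampleuser': ensure => present }
# compilation fails with "Duplicate declaration: User[exampleuser]".
```

Hence chasemp's point that the module needs the full set of requested groups for a node in one place before it can dedupe users correctly.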
[15:57:09] sure [15:57:20] it is otherwise 99% done and ok [15:57:20] in some cases you'll inherit users from a role, and in some cases from node level definitions [15:57:25] but it could change for a node or a role [15:57:29] depending on other roles applied [15:57:38] and if they need embedded users [15:57:43] so some nodes will have admin declarations [15:57:44] <_joe_> no [15:57:44] at top [15:57:48] <_joe_> no [15:57:53] <_joe_> and no [15:57:54] <_joe_> :) [15:58:02] !log aude Finished scap: Update Wikidata and WikimediaMessages (duration: 32m 06s) [15:58:03] k then I'm lost, I thought that was your example [15:58:07] Logged the message, Master [15:58:14] can't have two include admins on a node I thought [15:58:26] another error for mw1062 for wikiversions [15:58:28] <_joe_> you don't need to [15:58:46] http://dpaste.com/0FZSEQV [15:58:56] <_joe_> chasemp: so your concern is you can't track down what users are on a node? [15:59:06] !log logstash not showing any events at all since 2015-01-12T13:58:59.728Z [15:59:10] Logged the message, Master [15:59:16] no my concern it's weirdly implicit and not consistent [15:59:19] <_joe_> it's pretty easy, you just do a hiera lookup of admin::groups in the context of that node :) [15:59:25] and that outweighs the limited niceness in hiera in some cases [15:59:31] (03CR) 10Aude: [C: 032] Set useLegacyUsageIndex to false for test wikidata and test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184350 (owner: 10Aude) [15:59:35] (03Merged) 10jenkins-bot: Set useLegacyUsageIndex to false for test wikidata and test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184350 (owner: 10Aude) [15:59:51] in your example to roles on one node both with include admin errors right? [15:59:56] to=>two [15:59:58] <_joe_> I can understand 'implicit' (which is not, given it's clearly declared in hiera) [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come.
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150112T1600). [16:00:13] * anomie will SWAT, he supposes [16:00:17] gi11es: You're first [16:00:20] <_joe_> chasemp: only if they declare different admin::groups [16:00:27] !log aude Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [16:00:30] Logged the message, Master [16:00:36] <_joe_> if that's the case, the "default" for such roles is not valid [16:00:37] ok so if you have base with 'include admin' and a node includes base [16:00:38] (03PS2) 10Anomie: Disable thumbnail prerendering in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183885 (https://phabricator.wikimedia.org/T76035) (owner: 10Gilles) [16:00:41] and a role with admin with a group [16:00:43] that errors? [16:00:49] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183885 (https://phabricator.wikimedia.org/T76035) (owner: 10Gilles) [16:00:49] <_joe_> no of course [16:01:02] <_joe_> why should it [16:01:07] !log aude Synchronized wmf-config/Wikibase.php: Enable usage tracking on test.wikidata and testwiki (duration: 00m 05s) [16:01:10] <_joe_> the /data/ are all in hiera [16:01:10] Logged the message, Master [16:01:11] * aude is done [16:01:16] it will if the users in the group are dupes [16:01:18] <_joe_> so think about hiera for a moment [16:01:19] at runtime [16:01:25] got same errors though for config patches [16:01:41] <_joe_> chasemp: no you're wrong [16:01:42] <_joe_> sorry [16:01:49] <_joe_> that's not how puppet works [16:01:53] http://dpaste.com/15SNG3X [16:01:58] <_joe_> not how hiera or the include keyword works [16:02:19] <_joe_> but hold on a sec [16:02:28] <_joe_> I'll take a look at aude's problem [16:02:33] thanks [16:02:36] gotta hop off, please don't merge those [16:02:40] until we shake this out [16:02:43] <_joe_> ok [16:03:12] 3ops-network, operations: setup wifi in codfw -
https://phabricator.wikimedia.org/T86541#970953 (10chasemp) [16:03:51] !log elasticsearch on logstash1001 not responding to http requests [16:03:53] <_joe_> !log depooling mw1062, disk errors [16:03:55] Logged the message, Master [16:03:57] Logged the message, Master [16:04:06] <_joe_> aude: the disk is dead [16:04:10] hmm. Slow Jenkins today. Apparently it won't do parallel merging of patches in different repos anymore. [16:04:24] also mw1112 returned [255]: Error reading response length from authentication socket [16:04:34] <_joe_> chasemp: I will prepare a couple of examples for you [16:04:49] sure [16:04:51] please [16:04:55] (03Merged) 10jenkins-bot: Disable thumbnail prerendering in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183885 (https://phabricator.wikimedia.org/T76035) (owner: 10Gilles) [16:04:59] <_joe_> but I think you just miss one piece of the puzzle [16:05:01] idk if mw1062 is the only issue [16:05:22] it's not impossible :) [16:05:23] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable thumbnail prerendering in production [[gerrit:183885]] (duration: 00m 06s) [16:05:25] gi11es: ^ test please [16:05:26] !log restarted elasticsearch on logstash1001 [16:05:26] Logged the message, Master [16:05:28] Logged the message, Master [16:05:40] anomie: testing [16:05:43] but if in some cases things have to be defined at node level and in some cases things //can// be defined in role level and it depends on what is defined [16:05:51] then I'm going to say it should stay consistent and at node level [16:05:58] * anomie had a scap error for mw1062, but sees that's already being discussed [16:05:58] but maybe that's an oversimplication [16:05:59] fwiw, I agree with _joe_ that admin should just be included from base [16:06:03] and let hiera handle its parameters [16:06:12] it's not a matter of should, yes if it works well of course [16:06:21] it will work [16:06:26] RECOVERY - ElasticSearch health check for shards on 
logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 8, timed_out: False, active_primary_shards: 68, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 196, initializing_shards: 3, number_of_data_nodes: 3 [16:06:27] I'm pretty sure it will :) [16:06:55] it will work as long as none of the users are duplicated in any other group applied to a node [16:07:06] which is not sane [16:07:27] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 68, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 197, initializing_shards: 3, number_of_data_nodes: 3 [16:07:28] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 68, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 197, initializing_shards: 3, number_of_data_nodes: 3 [16:07:39] classes can only get included once, so we include it once now [16:07:47] it's just a matter of parameters you give to the class [16:07:54] ok anyways, I gotta jet for a minute we can circle back and joe and I will try to trade examples [16:07:57] whether you give them from the manifest or from hiera, it doesn't matter [16:08:17] 3ops-eqiad: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#970960 (10Joe) 3NEW [16:08:28] in any case "puppet apply --modulepath=... test.pp" can be very useful with abstract modules [16:08:51] <_joe_> paravoid: I _always_ do that [16:08:58] anomie: it works, thanks for the SWAT [16:09:04] aude: Ready for your config change to be SWATted?
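[editor's note] The admin/hiera debate above turns on how hiera resolves `admin::groups`: `include admin` appears once, and the data is looked up per node, most-specific level first. A rough Python model of that lookup behavior (the hierarchy level names and group data here are invented for illustration, not Wikimedia's real hiera data):

```python
# Toy model of a hiera lookup: walk the hierarchy from the most
# specific level (node) to the least specific (common) and return
# the first value found for the key. All data below is hypothetical.

HIERA = {
    "node/mw1010":    {"admin::groups": ["deployers"]},
    "role/mediawiki": {"admin::groups": ["mediawiki-admins"]},
    "common":         {"admin::groups": ["ops"]},
}

def hiera_lookup(key, hierarchy):
    """Return the value of `key` from the most specific level defining it."""
    for level in hierarchy:
        data = HIERA.get(level, {})
        if key in data:
            return data[key]
    return None

# A node-level value wins over the role default:
print(hiera_lookup("admin::groups", ["node/mw1010", "role/mediawiki", "common"]))
# A node with no override falls back to its role:
print(hiera_lookup("admin::groups", ["node/mw1062", "role/mediawiki", "common"]))
```

This is why two roles on one node need not conflict: the class is included exactly once and a single `admin::groups` value is resolved from the data hierarchy, rather than two class declarations being merged at parse time.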
[16:09:11] !log logstash elasticsearch cluster has strange indices dated 2014-01-* and 2015-12-* again [16:09:13] <_joe_> (my file is called "prova.pp", a little italian in the mix" [16:09:13] Logged the message, Master [16:09:30] anomie: ready [16:09:37] we use that word too ;) [16:09:39] (03PS2) 10Anomie: Enable "Other projects sidebar" by default on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183288 (https://phabricator.wikimedia.org/T85971) (owner: 10Tpt) [16:09:47] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183288 (https://phabricator.wikimedia.org/T85971) (owner: 10Tpt) [16:09:51] (03Merged) 10jenkins-bot: Enable "Other projects sidebar" by default on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183288 (https://phabricator.wikimedia.org/T85971) (owner: 10Tpt) [16:10:07] esp. in the context of e.g. theater [16:10:43] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable "Other projects sidebar" by default on frwiki [[gerrit:183288]] (duration: 00m 05s) [16:10:43] aude: ^ Test please [16:10:46] Logged the message, Master [16:10:49] I've been wondering btw whether it makes sense to precompile passwd/group on the puppetmaster [16:10:54] looks good [16:10:55] and use libnss-extrausers [16:11:12] <_joe_> mh [16:11:21] puppet takes a long time to check for and possibly realize all of these users [16:11:32] I did some profiling the other day, it's by far the most called and expensive resource [16:11:33] <_joe_> not a bad idea per-se, but the puppetmasters are already quite hosed [16:11:54] 3ops-eqiad: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#970974 (10Cmjohnson) This server is out of warranty. We have spares disks on-site and will take care of ASAP. [16:11:55] <_joe_> so maybe after we upgrade to trusty/ruby1.9? 
[16:11:57] <_joe_> or jessie [16:12:04] _joe_: Saw a "Error reading response length from authentication socket" for mw1111 when deploying [16:12:22] I was trying to think a way during which we wouldn't have to recompute it (e.g. generate() with each run) [16:12:34] maybe cache it, I don't know [16:12:45] I mean, all of the data is in a yaml file [16:13:17] !log logs on logstash1001 reporting elasticserch connection errors; restarted logstash service [16:13:21] Logged the message, Master [16:13:34] <_joe_> anomie: it's not the first time this happens, not related to some relevant failure anyway [16:13:46] <_joe_> we should investigate that, maybe open a ticket? [16:13:48] logstash is sad and not getting better yet :( [16:14:18] we have these puppet abstractions but the truth is, we never use them [16:14:27] because we refer to one central yaml file [16:14:33] _joe_: I can do what. Which project(s)? [16:14:56] we just care about the "groups" value + the yaml file and that's it, for the most part [16:15:02] <_joe_> anomie: I don't know if we have one project for scap. I'd say scap and operations [16:15:13] <_joe_> paravoid: yeah, basically [16:15:33] scap project is https://phabricator.wikimedia.org/tag/deployment-systems/ [16:16:01] so we could even ship that yaml to hosts and have an Exec that calls "generate-passwd groups foo" [16:16:18] that'd spit out an NSS extrausers file [16:16:20] <_joe_> upon changes [16:16:26] <_joe_> yes [16:16:31] yeah refreshonly => true obviously [16:16:37] (03PS1) 10Aude: Add testwiki to client db list setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184357 [16:16:48] <_joe_> and subscribe [16:16:52] <_joe_> yes [16:16:59] <_joe_> sounds good [16:17:09] something like that, I haven't thought it much [16:17:39] there's really no good reason why we iterate yaml in puppet [16:17:49] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." 
- https://phabricator.wikimedia.org/T86545#970990 (10Anomie) 3NEW [16:17:57] (I can think of, right now) [16:18:01] 3operations: graphite-web logs are not rotated - https://phabricator.wikimedia.org/T86546#970996 (10fgiunchedi) 3NEW a:3fgiunchedi [16:22:33] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#971023 (10bd808) The "Error reading response length from authentication socket." message has been reported intermittently by @reedy and others since the introduction of the sh... [16:22:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [16:27:35] (03PS1) 10Aude: Enable wikibase change dispatcher and pruning for test.wikidata [puppet] - 10https://gerrit.wikimedia.org/r/184360 [16:28:28] !log deleted 2014-01-* and 2015-12-* indices from logstash elasticsearch cluster [16:28:34] Logged the message, Master [16:30:19] !log stop/start graphite-web on tungsten to clear logs [16:30:23] Logged the message, Master [16:31:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [16:33:10] (03PS11) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [16:34:12] woudl someone please review https://gerrit.wikimedia.org/r/184360 ? [16:34:16] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 4.622 second response time [16:34:35] would* [16:34:36] PROBLEM - Disk space on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:34:36] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:34:37] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
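[editor's note] The precompiled-passwd idea paravoid sketches above (ship the central yaml to hosts, have a notified Exec render it into a libnss-extrausers file) could look roughly like this. Everything here is a hypothetical sketch: the `generate_passwd` helper, the data layout, and the uid/gid values are illustrations, not the real `data.yaml` or any existing tool; only the output field layout follows passwd(5) (`name:password:uid:gid:gecos:home:shell`).

```python
# Hypothetical "generate-passwd": take the centrally managed user/group
# data (modeled loosely on a users+groups yaml, hard-coded here) and emit
# passwd(5)-style lines suitable for /var/lib/extrausers/passwd,
# so NSS resolves the users without puppet realizing each one as a resource.

DATA = {  # stand-in for the central yaml; not the real data.yaml
    "users": {
        "alice": {"uid": 2001, "gid": 500, "realname": "Alice"},
        "bob":   {"uid": 2002, "gid": 500, "realname": "Bob"},
    },
    "groups": {
        "deployers": {"members": ["alice", "bob"]},
        "ops":       {"members": ["alice"]},
    },
}

def generate_passwd(data, groups):
    """Emit one passwd line per user belonging to any of `groups`."""
    wanted = set()
    for g in groups:
        wanted.update(data["groups"][g]["members"])
    lines = []
    for name in sorted(wanted):
        u = data["users"][name]
        lines.append(f"{name}:x:{u['uid']}:{u['gid']}:{u['realname']}:/home/{name}:/bin/bash")
    return "\n".join(lines)

print(generate_passwd(DATA, ["ops"]))
```

In puppet terms this would hang off a `file` resource shipping the yaml plus an `exec { 'generate-passwd': refreshonly => true }` subscribed to it, as discussed in the channel; whether the puppetmaster can afford it before the trusty/ruby1.9 upgrade is the open question _joe_ raises.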
[16:34:57] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:37] RECOVERY - DPKG on tungsten is OK: All packages OK [16:35:37] RECOVERY - Disk space on tungsten is OK: DISK OK [16:35:46] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning. [16:36:06] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9352 bytes in 0.352 second response time [16:37:03] (03CR) 10Ottomata: Sync Hive generated TSVs to stat1002 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/184223 (owner: 10QChris) [16:37:08] (03PS2) 10Ottomata: Sync Hive generated TSVs to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/184223 (owner: 10QChris) [16:38:29] (03CR) 10Ottomata: [C: 032] Sync Hive generated TSVs to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/184223 (owner: 10QChris) [16:39:57] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:16] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:26] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:36] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:07] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9352 bytes in 6.580 second response time [16:41:18] ^ still me, tungsten isn't amused even when removing huge files [16:41:27] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning. [16:41:28] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. 
[16:41:36] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time [16:41:37] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [16:42:07] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [16:42:28] (03PS1) 10Ottomata: Don't declare /a/log/webrequest/archive directory [puppet] - 10https://gerrit.wikimedia.org/r/184364 [16:43:03] (03CR) 10BBlack: [C: 04-1] "1) This still needs the Vary-related stuff removed if we're going the vcl_hash route." [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [16:44:10] (03CR) 10Ottomata: [C: 032] Don't declare /a/log/webrequest/archive directory [puppet] - 10https://gerrit.wikimedia.org/r/184364 (owner: 10Ottomata) [16:44:14] (03PS2) 10BBlack: Install mcelog and intel-microcode everywhere [puppet] - 10https://gerrit.wikimedia.org/r/181743 [16:45:05] (03CR) 10BBlack: [C: 032 V: 032] Install mcelog and intel-microcode everywhere [puppet] - 10https://gerrit.wikimedia.org/r/181743 (owner: 10BBlack) [16:45:35] (03PS2) 10Aude: Update client db list setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184357 [16:45:47] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:45:49] (03PS4) 10BBlack: SSL: Remove RC4, enable 3DES [puppet] - 10https://gerrit.wikimedia.org/r/178555 [16:46:36] 3operations: replicate metric traffic in eqiad and codfw - https://phabricator.wikimedia.org/T85908#971086 (10fgiunchedi) p:5Triage>3High [16:46:58] 3operations: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#971087 (10fgiunchedi) p:5Triage>3High [16:47:07] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#971088 (10fgiunchedi) p:5Triage>3High [16:47:15] 3operations: acquire graphite hardware in codfw and eqiad - 
https://phabricator.wikimedia.org/T85907#971089 (10fgiunchedi) p:5Triage>3High [16:49:16] 3Wikimedia-General-or-Unknown, operations: COPYING is served as application/octet-stream - https://phabricator.wikimedia.org/T63903#971092 (10fgiunchedi) @maxsem ideas whether we are (or should) be serving COPYING? [16:50:11] (03PS5) 10BBlack: SSL: Remove RC4, enable 3DES [puppet] - 10https://gerrit.wikimedia.org/r/178555 [16:51:00] (03CR) 10Dzahn: [C: 032] Move notices to -releng from -qa [puppet] - 10https://gerrit.wikimedia.org/r/184199 (https://phabricator.wikimedia.org/T86053) (owner: 10Greg Grossmeier) [16:51:02] (03CR) 10BBlack: [C: 032 V: 032] SSL: Remove RC4, enable 3DES [puppet] - 10https://gerrit.wikimedia.org/r/178555 (owner: 10BBlack) [16:51:05] \o/ [16:51:18] ++ :) [16:51:30] collision lol [16:51:36] can I merge yours? [16:51:40] yes please [16:52:18] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:52:20] 3operations: create pxe-bootable rescue image - https://phabricator.wikimedia.org/T76135#971097 (10fgiunchedi) [16:52:21] 3operations: provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#971098 (10fgiunchedi) [16:53:57] (03PS12) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [16:54:57] RECOVERY - DPKG on labmon1001 is OK: All packages OK [16:55:17] 3ops-core, Multimedia, operations: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#971101 (10Joe) The first HAT imagescaler is reimaged, and basic testing shows it's working fine. I would need suggestions on how/what to test before taking it into rotation. [16:55:32] (03CR) 10Dzahn: "what kind of error do you get and how did you try to clone? 
there shouldn't be a restriction on cloning operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [16:55:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [16:56:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [16:56:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [16:58:42] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#971112 (10chasemp) p:5Triage>3Normal a:3greg [16:58:53] greg-g: Any chance to deploy a CentralAuth patch now-ish? https://gerrit.wikimedia.org/r/183832 [16:59:17] PROBLEM - puppet last run on analytics1018 is CRITICAL: CRITICAL: Puppet has 1 failures [16:59:24] hoo: assume so, I have to run (can't look at it), if others think you're sane, yes [16:59:53] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures [16:59:54] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures [16:59:55] (03CR) 10Dzahn: "just also needed renaming the contact that is defined in the private repo. 
did that just now" [puppet] - 10https://gerrit.wikimedia.org/r/184199 (https://phabricator.wikimedia.org/T86053) (owner: 10Greg Grossmeier) [17:00:04] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 1 failures [17:00:04] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 2 failures [17:00:13] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [17:00:14] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 1 failures [17:00:55] RECOVERY - Varnishkafka Delivery Errors per minute on cp3018 is OK: OK: Less than 1.00% above the threshold [0.0] [17:01:34] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 1 failures [17:02:34] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:02:34] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [17:02:35] !log labmon1001 - purging mlocate package that was status 'rc' [17:02:38] Logged the message, Master [17:03:23] cp4002 - i dont see a failure there ...false alarm? [17:05:03] PROBLEM - Varnishkafka Delivery Errors on cp3004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 983.950012 [17:05:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think the template configuration is too complex as it is now; it's probably worth simplifying a bit." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [17:06:21] (03PS1) 10Calak: Create "autopatrolled", "patroller" and "rollbacker" user group on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) [17:06:25] (03CR) 10jenkins-bot: [V: 04-1] Create "autopatrolled", "patroller" and "rollbacker" user group on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: 10Calak) [17:08:14] RECOVERY - Varnishkafka Delivery Errors on cp3004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:08:32] 3operations: something (reqstats?) puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#971131 (10fgiunchedi) p:5High>3Normal downgrading to normal since we've mitigated the biggest growth. still pending zuul patching which now has over 93k disti... [17:09:04] (03PS1) 10BBlack: do not apply hw packages on virtuals [puppet] - 10https://gerrit.wikimedia.org/r/184372 [17:09:12] 3ops-core, operations: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#971134 (10Joe) Well, installing these servers is not as straightforward as "do a PXE install" - they need some puppet work as well. 
Also a precondition to make them work is having a memcached cluster in codfw as well [17:09:44] RECOVERY - Varnishkafka Delivery Errors per minute on cp3008 is OK: OK: Less than 1.00% above the threshold [0.0] [17:10:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp3009 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0] [17:10:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [17:10:34] CUSTOM - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:10:41] (03CR) 10BBlack: [C: 032] do not apply hw packages on virtuals [puppet] - 10https://gerrit.wikimedia.org/r/184372 (owner: 10BBlack) [17:10:43] (03PS2) 10Calak: Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) [17:10:45] (03CR) 10jenkins-bot: [V: 04-1] Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: 10Calak) [17:10:52] sigh. varnishkafka esams. you are on my naughty list. 
[17:11:54] (03PS13) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [17:11:54] i wonder what happened to "misccommands.cfg" on neon [17:12:14] it's still there but apparently doesnt change anymore when things are changed in puppet [17:12:37] (03CR) 10Giuseppe Lavagetto: "I think your doubts are misguided, see a simple example of how this works here:" [puppet] - 10https://gerrit.wikimedia.org/r/183882 (owner: 10Giuseppe Lavagetto) [17:13:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp3004 is OK: OK: Less than 1.00% above the threshold [0.0] [17:14:04] PROBLEM - Varnishkafka Delivery Errors per minute on cp3006 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0] [17:14:24] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:14:44] RECOVERY - puppet last run on analytics1018 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:15:33] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:16:23] RECOVERY - Varnishkafka Delivery Errors per minute on cp3009 is OK: OK: Less than 1.00% above the threshold [0.0] [17:16:25] CUSTOM - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [17:16:33] (03PS2) 10Ottomata: Remove unused udp2log awk scripts that forwarded to universities [puppet] - 10https://gerrit.wikimedia.org/r/184137 (owner: 10QChris) [17:17:03] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:17:04] PROBLEM - Varnishkafka Delivery Errors on cp3006 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 580.416687 [17:17:44] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:18:22] mutante, 
thanks for adding me to the right group. I noticed some other people are still in there but probably shouldn't be anymore [17:18:24] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:18:34] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:19:44] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [17:20:02] Krenair: i did as well and made T86548 [17:20:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Wait until I am sure of what is happening with apache hard restarts from puppet" [puppet] - 10https://gerrit.wikimedia.org/r/183828 (owner: 10Giuseppe Lavagetto) [17:22:03] (03CR) 10Rush: "Ok thanks man for outlining. A few things, the behavior I worked around previously was an older version of puppet here, and I would need " [puppet] - 10https://gerrit.wikimedia.org/r/183882 (owner: 10Giuseppe Lavagetto) [17:22:21] 3operations: wmf-deployment group has ex-employees - https://phabricator.wikimedia.org/T86548#971157 (10Krenair) bsitu, mwalker, pgehres, rfaulk, sumanah [17:22:22] !log restarted icinga-wm to join -releng [17:22:24] Logged the message, Master [17:22:26] (03PS1) 10Alexandros Kosiaris: Plan forward with cxserver records to support GeoIP [dns] - 10https://gerrit.wikimedia.org/r/184377 [17:23:44] CUSTOM - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [17:24:33] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [17:25:23] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [17:26:39] (03CR) 10Dzahn: "works now after also restarting icinga-wm, tested with a custom notifications which 
arrived in the new -releng channel" [puppet] - 10https://gerrit.wikimedia.org/r/184199 (https://phabricator.wikimedia.org/T86053) (owner: 10Greg Grossmeier) [17:26:56] (03CR) 10Ottomata: [C: 032] Remove unused udp2log awk scripts that forwarded to universities [puppet] - 10https://gerrit.wikimedia.org/r/184137 (owner: 10QChris) [17:27:38] graphite1002 says "CRITICAL: Not all configured Carbon instances are running. " [17:27:56] but where [17:27:56] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#971165 (10aude) also ran into this when deploying a config change earlier today. [17:28:26] (03CR) 10Gage: "Recent changes:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [17:28:53] ottomata, mutante: let's merge https://gerrit.wikimedia.org/r/#/c/169691/ this week [17:31:03] !log hoo Synchronized php-1.25wmf13/extensions/CentralAuth/: Only test passwords once in CentralAuthUser::prepareMigration (duration: 00m 06s) [17:31:05] Logged the message, Master [17:31:36] !log hoo Synchronized php-1.25wmf14/extensions/CentralAuth/: Only test passwords once in CentralAuthUser::prepareMigration (duration: 00m 06s) [17:31:38] Logged the message, Master [17:31:52] !log hoo Synchronized php-1.25wmf13/extensions/CentralAuth/: Only test passwords once in CentralAuthUser::prepareMigration - 2nd try (duration: 00m 07s) [17:31:55] Logged the message, Master [17:32:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3009 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [17:33:01] !log mw1010: rsync: failed to set times on "/srv/mediawiki/.": Read-only file system (30) [17:33:04] Logged the message, Master [17:33:23] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [17:34:51] (03PS1) 10Dzahn: remove disabled search-pool monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/184380 [17:34:54] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [17:34:54] source => "/var/lib/puppet/ssl/private_keys/${fqdn_pem}", [17:34:57] does that even work!? [17:35:08] has this module been tested? [17:35:09] yes [17:35:16] works and tested [17:35:25] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [17:35:45] wtf, this is the first time I've seen this pattern in 5+ years of puppet [17:36:26] /dev/sda1 on / type ext4 (rw,errors=remount-ro) [17:36:28] on mw1010 [17:36:32] (03PS14) 10Alexandros Kosiaris: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [17:36:33] is that wtf as in surprise or displeaseure [17:36:36] would anyone take a look? _joe_ ? [17:36:39] surprise [17:36:43] ok :) [17:37:38] (03CR) 10Alexandros Kosiaris: [C: 032] Plan forward with cxserver records to support GeoIP [dns] - 10https://gerrit.wikimedia.org/r/184377 (owner: 10Alexandros Kosiaris) [17:38:35] (03CR) 10Ori.livneh: "> This still needs the Vary-related stuff removed if we're going the vcl_hash route." [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [17:38:51] hoo: not sure I follow.. what's the problem ? [17:39:05] akosiaris: It's / is read only [17:39:14] and it's pooled [17:39:22] (03CR) 10Faidon Liambotis: [C: 04-1] "The template needs to be cleaned up, probably rewritten from scratch." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [17:40:01] hoo: no it is not akosiaris@mw1010:/$ sudo touch a [17:40:01] akosiaris@mw1010:/ [17:40:14] uh, true [17:40:14] bblack: what's your preference with respect to removing the vary on the debug header vcl code? 
It's redundant but adds an extra layer of safety if anyone ever messes with the vcl_hash stuff, no? I don't mind either way; I can remove it if you like. [17:40:25] akosiaris: accidental changes in config? [17:40:26] it might if errors show up in the fs or disk [17:40:34] but not right now [17:40:43] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [17:41:01] hoo: it says "rw" above, which means readwrite [17:41:10] paravoid: I'm not that clueless :D [17:41:15] Just stupid ;) [17:41:30] _joe_: did you have a chance to look at the twemproxy stuff? [17:41:31] it says remount-ro and not ro :P [17:41:32] 3operations: wmf-deployment group has ex-employees - https://phabricator.wikimedia.org/T86548#971173 (10Chad) It's not really a "wmf" staff group in that sense, just a group that deployers are in. Per IRC, people in this group who don't have access via puppet anymore can be removed though. [17:41:48] <_joe_> ori: nope sorry [17:41:52] errors=remount-ro means remount readonly if errors appear [17:41:57] bummer [17:42:03] RECOVERY - Varnishkafka Delivery Errors per minute on cp3009 is OK: OK: Less than 1.00% above the threshold [0.0] [17:42:06] kart_: cxserver.eqiad.wikimedia.org vs cxserver.wikimedia.org. Just planning for the future [17:42:14] <_joe_> ori: tomorrow I guess [17:42:26] akosiaris: nice :) [17:42:29] kart_: when is this set to be deployed btw ? [17:42:57] 3operations: wmf-deployment group has ex-employees - https://phabricator.wikimedia.org/T86548#971176 (10RobH) irc update discussion: daniel, chad, and I were chatting. Any users in wmf-deployment (gerrit) not in deployers (puppet) should have their rights in gerrit reduced accordingly (removed from wmf-deployme... [17:43:00] akosiaris: we're on hold due to security review, so few more days. 
[17:43:03] _joe_: patch landed upstream: https://github.com/twitter/twemproxy/commit/0dbb3a915d746d8b4fb625c58daf1583968e5ed2 \o/ [17:43:04] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [17:43:09] kart_: OK, thanks [17:43:13] <_joe_> ori: good! [17:43:22] <_joe_> ori: I'll rebuild the package tomorrow [17:43:29] weeeeeee [17:43:29] cool [17:44:06] <_joe_> and the new hhvm package solved the TZ bugs :) [17:44:46] nice [17:45:37] the twemproxy thing is a little pressing just because: [17:45:39] [fluorine:/a/mw-log] $ grep -c 'CONNECTION FAILURE' memcached-serious.log [17:45:39] 1315676 [17:45:43] PROBLEM - Varnishkafka Delivery Errors per minute on cp3010 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [17:45:45] 1.3 million [17:45:54] RECOVERY - Varnishkafka Delivery Errors per minute on cp3018 is OK: OK: Less than 1.00% above the threshold [0.0] [17:45:56] why did that suddenly become a problem? [17:46:05] <_joe_> ori: I know [17:46:40] paravoid: it didn't suddenly, it just means performance is seriously degraded on some api reqs because memcached isn't available [17:47:05] it's only from some hosts (the ones with weight: 20 in pybal) [17:47:05] <_joe_> ori: if we want a quick solution, i'd set net.ipv4.tcp_tw_reuse = 1 [17:47:07] is that because of the ephemeral port starvation? [17:47:12] yes [17:47:29] <_joe_> (not sure if I remembered correctly) [17:47:33] paravoid: akosiaris: So any idea why scap failed on it? [17:47:34] i'd be fine with that, but tuning kernel params is opsy [17:48:03] <_joe_> the socket has the added nice advantage that it will also give us a perf gain [17:48:15] maybe, probably [17:48:23] unless someone has measured it :) [17:48:46] I'm not so sure it will [17:48:52] ori: re: vary, I just don't see the point in additional complexity for someone (probably me!) to have to think about and wonder at 6 months from now. 
If we're doing vcl_hash on it, we don't need the vary hack. [17:49:01] <_joe_> well, it may be irrelevant in our context [17:49:13] bblack: ok, makes sense [17:49:21] I actually checked out the patch this morning and started on cleaning it all up, but then it got complicated and I gave up, because I need to go focus on other things :P [17:49:26] mh... apparently it only failed to set times, so never mind [17:49:26] <_joe_> vcl_hash doesn't work the way we usually expect [17:49:41] but the ephemeral port issue can be easily fixed with a number of ways [17:49:49] well, i landed patches in HHVM and twemproxy and mediawiki for that stupid unix socket thing [17:49:55] let's try it so i don't feel like a complete idiot [17:49:56] increasing the amount of ephemeral ports, for example [17:49:59] lol [17:50:02] <_joe_> eheheheh [17:50:03] sure, let's try it [17:50:03] _joe_: ? [17:50:30] personally, I'd fix the ephemeral port issue first, though, as to have a clear answer of the unix vs. udp effect [17:50:44] rather than unix vs. udp-most-of-the-time [17:50:51] but ymmv, I don't think I'll get involved in this :) [17:51:00] <_joe_> paravoid: makes sense [17:51:25] ^d: the search-pool monitoring, should we remove it? [17:51:31] <_joe_> ori: I want to use the socket, but it requires more testing than just setting some kernel parameter as there are more moving parts [17:51:44] <_joe_> so let's say we do that in two stages [17:52:02] sounds fine to me [17:52:22] <^d> mutante: Where? [17:52:22] _joe_: what did you mean with "vcl_hash doesn't work the way we usually expect"? [17:52:36] wait, do we even use udp? [17:52:48] <_joe_> paravoid: not AFAIR [17:52:49] https://gerrit.wikimedia.org/r/#/c/184380/1/modules/lvs/manifests/monitor.pp because they are CRIT in Icinga but also notifications have been disabled [17:52:55] ^d: [17:53:11] <^d> Ah in lvs. [17:53:25] (03CR) 10Gage: "Fixed style issues pointed out by Faidon."
(038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [17:53:29] <_joe_> sorry, I have a meeting in 10, then another one, then another one [17:53:39] ^d: it creates these: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=search-pool [17:53:40] <_joe_> so, out for today unless in a meeting :) [17:53:41] <^d> mutante: godog had started decom [17:53:56] ^d: ok, gotcha [17:54:02] <^d> I don't see why we couldn't tear it all down. Shouldn't take more than an hour or two [17:54:04] RECOVERY - Varnishkafka Delivery Errors per minute on cp3010 is OK: OK: Less than 1.00% above the threshold [0.0] [17:54:11] cya _joe_ [17:54:54] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=search-pool" [puppet] - 10https://gerrit.wikimedia.org/r/184380 (owner: 10Dzahn) [17:55:16] <^d> https://gerrit.wikimedia.org/r/#/c/183462/ [17:55:20] <^d> mutante: ^ [17:55:51] ^d: ah :) ok, looks like duplicate, thanks [17:56:03] PROBLEM - Varnishkafka Delivery Errors per minute on cp3007 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [17:56:23] PROBLEM - Varnishkafka Delivery Errors per minute on cp3009 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [17:57:09] (03PS3) 10Chad: lsearchd: remove udp2log configuration [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [17:59:52] ^d: indeed, my plan is for tomorrow [18:00:04] bd808: Dear anthropoid, the time has come. Please deploy Wikimania Scholarships app (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150112T1800). 
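The socket-vs-sysctl trade-off discussed above hinges on one property: a client connecting to a local twemproxy over a UNIX domain socket allocates no ephemeral TCP port and leaves no TIME_WAIT entry behind, so it sidesteps port starvation entirely. A minimal sketch of the client side (the socket path is hypothetical, not the actual configured one):

```python
import socket

def connect_local(path="/var/run/nutcracker.sock"):
    """Connect to a local proxy over a UNIX domain socket.

    Unlike a TCP connection to 127.0.0.1, this consumes no ephemeral port,
    which is why the UNIX-socket route avoids the starvation problem the
    kernel-parameter tweak only mitigates. The default path is illustrative.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    return s
```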
[18:00:06] (03PS14) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [18:00:19] <^d> godog: sounds good to me, jw since mutante was wondering about icinga [18:01:12] (03CR) 10Chad: [C: 032] Remove $wgDisableCounters, defunct and true by default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183902 (owner: 10Chad) [18:01:18] !log Applied 2015 schema changes to scholarships database on m2-master [18:01:23] Logged the message, Master [18:02:47] (03Merged) 10jenkins-bot: Remove $wgDisableCounters, defunct and true by default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183902 (owner: 10Chad) [18:03:23] RECOVERY - Varnishkafka Delivery Errors per minute on cp3007 is OK: OK: Less than 1.00% above the threshold [0.0] [18:03:39] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 08s) [18:03:42] Logged the message, Master [18:03:43] RECOVERY - Varnishkafka Delivery Errors per minute on cp3009 is OK: OK: Less than 1.00% above the threshold [0.0] [18:04:25] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 06s) [18:04:28] Logged the message, Master [18:04:53] !log Deployed scholarships at hash a5bc6fd [18:04:56] Logged the message, Master [18:05:17] <^d> Hmm, what's up with mw1201 and mw1010? got pubkey denied on the former and "returned non-zero exit status 12" on the latter. [18:06:30] try it again and see if it works? [18:06:49] ^d: mw1010 fails to set time on files [18:06:58] but that's apparently nothing to worry about [18:07:06] the content synchs fine and ops don't care [18:07:07] "12 Error in rsync protocol data stream" -- http://wpkg.org/Rsync_exit_codes [18:07:36] bd808: rsync: failed to set times on "/srv/mediawiki/.": Read-only file system (30) [18:07:42] got a couple of these on mw1010 [18:07:59] see SAL [18:08:07] ops don't care... 
I would not phrase it that way [18:08:07] yikes [18:08:23] more like we haven't received something to actually debug yet [18:09:15] 3operations: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#971192 (10fgiunchedi) 3NEW [18:09:40] godog: I'd swear there was a ticket for this [18:09:44] read-only file system... beh.. rsync lies there [18:09:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp3010 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [18:10:16] paravoid: I think I failed at my searching :( [18:10:30] ah [18:10:30] https://phabricator.wikimedia.org/T84050 [18:10:32] last sentence [18:11:13] not exactly the same [18:11:16] ah there we go, searched only in #operations [18:11:32] 3operations: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#971203 (10faidon) Also see T84050. [18:11:39] yeah I was about to ask, why not #ops-core? [18:12:10] ok, I am gonna ask.. what gets filed in #operations ? [18:12:15] cause I am not clear on it :-( [18:12:30] me neither tbh [18:12:48] so #operations is good as any since I wasn't sure anyway [18:12:48] <_joe_> akosiaris: by us? Ideally nothing. it's supposed to be used externally to submit tickets to us [18:12:56] so triaging [18:12:57] 3ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10RobH) 3NEW a:3Papaul [18:13:23] that makes sense for me and I am happy with it. I somehow doubt we are all on the same page though [18:13:24] <_joe_> akosiaris: it's what I got from discussions in ops@, and I think it makes sense [18:13:41] (03PS1) 10RobH: setting graphite2001 mgmt ip [dns] - 10https://gerrit.wikimedia.org/r/184384 [18:13:45] I got that too, but I was not sure, hence the question [18:13:55] <_joe_> akosiaris: a group of more than 3 ops not on the same page, how can that happen? :) [18:14:13] how crazy would it be to delete svn.wm.org entirely?
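For reference on the rsync failure dissected above: exit status 12 is rsync's "error in rsync protocol data stream", while the "(30)" inside the message itself is the operating-system errno, which on Linux is EROFS, "Read-only file system":

```python
import errno
import os

# The "(30)" in 'failed to set times on "/srv/mediawiki/.": Read-only file
# system (30)' is a Linux errno, not an rsync exit code.
assert errno.EROFS == 30
print(os.strerror(errno.EROFS))  # prints "Read-only file system" on Linux
```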
[18:14:19] my proposal was to tag all of our tickets with #operations [18:14:23] kill #ops-requests [18:14:26] (03PS7) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [18:14:35] 3ops-core: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#971236 (10fgiunchedi) [18:14:37] and keep e.g. #ops-core to make some tickets more specific [18:14:56] (or #ops-network or whatever) [18:15:09] #operations being the "team" tag and #ops-core being the project [18:15:26] <_joe_> paravoid: the idea of removing it from triaged tickets that are not #ops-requests makes sense when you need to be the one triaging [18:15:38] remove what? [18:15:55] <_joe_> remove the #operations tag [18:16:00] 3ops-codfw: set graphite2001 asset tag mgmt entries - https://phabricator.wikimedia.org/T86555 (10RobH) 3NEW a:3RobH [18:16:04] why would you remove it? [18:16:28] 3ops-codfw: set graphite2001 asset tag mgmt entries - https://phabricator.wikimedia.org/T86555 (10RobH) p:5Normal>3Low [18:16:39] ops-requests doesn't make that much sense anymore in phab, yea, it was more like #incoming in RT and the difference were just the permissions who gets to see tickets [18:16:47] to point out it has been triaged ? 
[18:17:27] <_joe_> because setting priority is not enough sometimes to really having something triaged, and phab can't find tickets that are "only in #operations" [18:17:30] (03CR) 10RobH: [C: 032] setting graphite2001 mgmt ip [dns] - 10https://gerrit.wikimedia.org/r/184384 (owner: 10RobH) [18:17:35] <_joe_> IIRC [18:19:23] PROBLEM - Varnishkafka Delivery Errors per minute on cp3009 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [18:20:23] PROBLEM - Varnishkafka Delivery Errors per minute on cp3007 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [18:21:02] (03PS8) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [18:21:51] !log purging 'mlocate' package from neon as well to fix Icinga DPKG crits [18:21:52] 3ops-core: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#971260 (10fgiunchedi) 3NEW [18:21:54] Logged the message, Master [18:23:54] ACKNOWLEDGEMENT - Varnishkafka Delivery Errors on cp3010 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1268.050049 daniel_zahn see list mail from ottomata [18:25:02] mutante: i'm trying to just disable notifications on all those when they flap right now [18:25:31] i like ACK because then they disappear from the list in the web ui as well [18:25:40] (03CR) 10Ori.livneh: "@bblack: made the changes you requested, cherry-picked on labs, applied successfully on labs varnishes." [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [18:26:37] how about this one: dbproxy1002 - haproxy failover - CRITICAL check_failover servers up 1 down 1 [18:33:50] mutante: if you ack, do they come back when the flap? 
[18:33:52] they* [18:35:04] ACKNOWLEDGEMENT - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: /srv/project 1035190 MB (3% inode=76%): /exp/project/abusefilter-global 1035190 MB (3% inode=76%): /exp/project/account-creation-assistance 1035190 MB (3% inode=76%): /exp/project/analytics 1035190 MB (3% inode=76%): /exp/project/bastion 1035190 MB (3% inode=76%): /exp/project/bots 1035190 MB (3% inode=76%): /exp/project/category-sorting 1035190 MB ( [18:37:06] <_joe_> ottomata: yes [18:40:15] aye, thought so [18:41:16] (03PS1) 10Ottomata: Include hadoop::namenode::standby on analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/184393 [18:47:07] (03CR) 10CSteipp: [C: 031] "Or just remove all of the bug54847 handling." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184136 (owner: 10Hoo man) [18:48:28] !log Ran sync-common on osmium [18:48:36] Logged the message, Master [18:49:04] PROBLEM - DPKG on analytics1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:51:34] RECOVERY - DPKG on analytics1001 is OK: All packages OK [18:54:54] (03PS1) 10Ottomata: Add -nonInteractive flag to namenode -format and namenode -bootstrapStandby commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/184400 [18:55:10] (03CR) 10Ottomata: [C: 032 V: 032] Add -nonInteractive flag to namenode -format and namenode -bootstrapStandby commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/184400 (owner: 10Ottomata) [18:56:13] (03PS2) 10Ottomata: Include hadoop::namenode::standby on analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/184393 [18:59:24] (03PS1) 10Dereckson: New user message extension configuration on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184402 (https://phabricator.wikimedia.org/T76716) [18:59:37] (03CR) 10Dereckson: "Follow-up: change I6a1b1eb41f58dbaccf109a3913e030ded1743deb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/179596 
(https://phabricator.wikimedia.org/T76716) (owner: 10Dereckson) [19:11:14] 3Ops-Access-Requests: Access to stat1003 (statistics-users) for Ananth Ramakrishnan - https://phabricator.wikimedia.org/T85828#971426 (10chasemp) This still needs manager and team lead approval per docs. [19:15:40] 3wikidata-query-service, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#971449 (10GWicke) 3NEW [19:15:43] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [19:16:02] 3Ops-Access-Requests: Access to stat1003 (statistics-users) for Ananth Ramakrishnan - https://phabricator.wikimedia.org/T85828#971456 (10Tnegrin) approved by manager [19:19:55] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [19:24:16] 3wikidata-query-service, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#971491 (10JanZerebecki) Do we want to have the Cassandra and Titan nodes be in the same rack as I assume that query performance is very latency sensitive? [19:24:24] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [19:27:50] 3Wikimedia-General-or-Unknown, WMF-Legal, operations: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#971511 (10LuisV_WMF) I don’t have a lot of time to respond in detail today, but I will try to do so soon. In the meantime, I’m confused - why is Apache acceptable where CC0 is no... 
[19:37:19] 3wikidata-query-service, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#971547 (10Joe) a:3Joe [19:41:34] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:47:30] 3wikidata-query-service, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#971589 (10GWicke) >>! In T86561#971491, @JanZerebecki wrote: > Do we want to have the Cassandra and Titan nodes be in the same rack as I assume that query performance is very latency sensitive? I don'... [19:47:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [19:56:04] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [20:02:47] 3Wikimedia-General-or-Unknown, WMF-Legal, operations: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#971643 (10hashar) >>! In T67270#971511, @LuisV_WMF wrote: > I don’t have a lot of time to respond in detail today, but I will try to do so soon. In the meantime, I’m confused - w... [20:06:30] springle: ping? [20:06:41] springle: got a moment? or too late / early? [20:06:58] springle: mostly checking if dropping a view (unused, mostly) from the public databases is going to cause any issues. [20:14:59] (03CR) 10Alexandros Kosiaris: "For some reason, this change was also (cherry-picked?) on the top of the beta labs puppetmaster's git repo resulting in git rebase errors " [puppet] - 10https://gerrit.wikimedia.org/r/179480 (owner: 10BryanDavis) [20:16:31] ori: hmm, there’s a commit 9c290827589a73e54607c9592c826daeea9f53d8 from you on betacluster puppetmaster that doesn’t seem to be on gerrit… [20:16:36] is that still needed? 
[20:16:57] (03CR) 10Ottomata: [C: 032] Include hadoop::namenode::standby on analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/184393 (owner: 10Ottomata) [20:22:13] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 3 failures [20:22:20] (03PS1) 10Ottomata: Add analytics1001 and 1002 to list of namenode_hosts [puppet] - 10https://gerrit.wikimedia.org/r/184419 [20:24:22] 3Ops-Access-Requests: Provide Toby Negrin and Dan Garry with Google Webmaster Tools access - https://phabricator.wikimedia.org/T85938#971712 (10Dzahn) a:3Dzahn [20:25:02] (03CR) 10Ottomata: [C: 032] Add analytics1001 and 1002 to list of namenode_hosts [puppet] - 10https://gerrit.wikimedia.org/r/184419 (owner: 10Ottomata) [20:25:17] 3ops-network, operations: Connect Apple Airport to mr1-codfw - https://phabricator.wikimedia.org/T86574#971721 (10Cmjohnson) 3NEW a:3Papaul [20:26:14] 3ops-network, ops-codfw, operations: Connect Apple Airport to mr1-codfw - https://phabricator.wikimedia.org/T86574#971721 (10Cmjohnson) [20:27:07] 3Ops-Access-Requests: Provide Toby Negrin and Dan Garry with Google Webmaster Tools access - https://phabricator.wikimedia.org/T85938#971743 (10Dzahn) Mark wrote: "Dan needs full access. You can provide the master password, along with all the other instructions in the file to keep the account secure." The easie... [20:27:12] Deskana: ^ [20:27:45] please see that new file in your home and read the instructions [20:28:50] (03CR) 10Rush: "we talked about this in our weekly. Ariel and Mark are going to follow up here. Thanks for your patience." 
[puppet] - 10https://gerrit.wikimedia.org/r/152724 (owner: 10Hoo man) [20:29:58] (03PS1) 10Ottomata: Remove analytics1001 and 1002 from list of namenodes [puppet] - 10https://gerrit.wikimedia.org/r/184421 [20:30:19] (03CR) 10Ottomata: [C: 032 V: 032] Remove analytics1001 and 1002 from list of namenodes [puppet] - 10https://gerrit.wikimedia.org/r/184421 (owner: 10Ottomata) [20:31:02] 3Ops-Access-Requests: Provide Toby Negrin and Dan Garry with Google Webmaster Tools access - https://phabricator.wikimedia.org/T85938#971765 (10Dzahn) 5Open>3Resolved Dan, please see the new file in your home directory and read the warnings/instructions. [20:31:03] akosiaris: btw, there’s cxserver puppet failures on deployment-cxserver03 for a few days now [20:31:33] I think the user’s homedir was moved or something and puppet’s confused [20:31:37] YuviPanda: yeah, vaguely aware [20:31:44] akosiaris: Thanks for fixing that beta merge conflict. I bet I didn't bother to update the cherry-pick when I made changes in PS5. [20:31:46] akosiaris: want me to poke at it to see if I can fix that? [20:32:02] bd808: we should figure some way of alerting if there’s rebase failures [20:32:04] could you point me to shinken for beta-labs [20:32:06] ? [20:32:14] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 4 failures [20:32:28] YuviPanda: feel free, I was thinking of fixing it tomorrow along with a couple of other minor issues [20:32:38] akosiaris: shinken.wmflabs.org [20:32:41] guest / guest [20:32:44] thanks [20:32:44] username / password [20:32:58] (need to spend more time on it, but so many things to do..) [20:33:10] YuviPanda: The merge script knows when there is a conflict so you'd just need to add something to that that would trigger alerting. [20:33:34] The current logging for the merge script is pretty weak [20:33:39] bd808: right. not exactly sure how to do this, though. uh, send boolean data to graphite?
that sounds incredibly stupid [20:34:00] hmm, perhaps could have some other service that accepts boolean values only [20:34:08] a graphite for boolean values. [20:34:14] and that can be used for booleanish checks [20:34:23] I could probably just use graphte. [20:34:25] *graphite [20:34:39] it's not elegant but a lot of places just shove it into graphite [20:34:45] most of the deploy tracking at like etsy is that way [20:34:56] YuviPanda: can it just tell shinken directly that it failed? [20:36:08] bd808: not really. we have no security in place anywhere, so I’ll either have to figure out some form of auth mechanism, or… something [20:36:10] that would be like passive checks in icinga [20:36:20] hmm, come to think of it, we don’t have any security in place for *graphite* either [20:37:08] why would graphite need authn/z? [20:37:33] dashboard, DoS attacks [20:37:43] dashboard manipulation that is [20:39:22] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#971811 (10chasemp) What about #HTTPS-By-Default as a "goal". This the same color as "sprint" but obviously goals would be different in context, but a sprint is a type of goal..? I think spreading the color gr...
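The "just shove it into graphite" idea above maps onto Graphite's plaintext protocol: one line per datapoint, `<path> <value> <timestamp>\n`, written to the carbon listener (conventionally TCP port 2003). A sketch of reporting a rebase success/failure as a 0/1 gauge; the host and metric name are hypothetical, not anything actually configured in the log:

```python
import socket
import time

def format_metric(path, value, ts):
    # Carbon plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
    return "%s %s %d\n" % (path, value, ts)

def report_rebase_status(ok, host="graphite.example.org", port=2003):
    """Record a puppetmaster-rebase success/failure as a 0/1 gauge.

    Metric name and host are illustrative; a shinken check could then
    alert whenever the most recent value is 0.
    """
    line = format_metric("deployment-prep.puppet_rebase_ok",
                         1 if ok else 0, int(time.time()))
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```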
[20:40:40] (03PS1) 10Dzahn: create shell account for Tilman Bayer [puppet] - 10https://gerrit.wikimedia.org/r/184425 [20:42:54] (03CR) 10Dzahn: "we have https-by-default goal/milestone https://phabricator.wikimedia.org/tag/https-by-default/" [puppet] - 10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [20:43:31] btw graphite https://gerrit.wikimedia.org/r/#/c/181949/ [20:44:26] PROBLEM - Hadoop Namenode - Stand By on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [20:44:26] (03PS2) 10Dzahn: create shell account for Tilman Bayer [puppet] - 10https://gerrit.wikimedia.org/r/184425 (https://phabricator.wikimedia.org/T86533) [20:44:56] PROBLEM - Hadoop Namenode - Stand By on analytics1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [20:48:17] oh pshh posh [20:48:30] these are new namenodes, no worry, ACKing [20:48:48] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#971835 (10Dzahn) step one, to create a shell account for you: https://gerrit.wikimedia.org/r/#/c/184425/ [20:49:19] ACKNOWLEDGEMENT - Hadoop Namenode - Stand By on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode ottomata these are new namenodes, still being configured. [20:49:19] ACKNOWLEDGEMENT - Hadoop Namenode - Stand By on analytics1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode ottomata these are new namenodes, still being configured. [20:49:19] ACKNOWLEDGEMENT - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 4 failures ottomata these are new namenodes, still being configured. 
[20:51:24] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [20:52:14] (03CR) 10Dzahn: [C: 031] "linked ticket needs manager approval but technically this doesn't give access, it prepares access by making an account that can later be a" [puppet] - 10https://gerrit.wikimedia.org/r/184425 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [20:55:11] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#971866 (10Dzahn) approval from Erik on the ticket would be great indeed, thanks [20:55:47] mutante: Thanks, I'll check it out soon. :-) [20:56:52] (03CR) 10Ori.livneh: [C: 031] create shell account for Tilman Bayer [puppet] - 10https://gerrit.wikimedia.org/r/184425 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [20:59:13] 3ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10Papaul) Server racked, Rack table updated, mgmt set-up complete. ge-5/0/1 [21:00:05] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150112T2100). Please do the needful. [21:12:11] !log deployed parsoid version 2cd6fefa [21:12:14] Logged the message, Master [21:17:13] (03PS1) 10Dzahn: planet: add missing Header to vary on protocol [puppet] - 10https://gerrit.wikimedia.org/r/184438 [21:19:39] (03CR) 10Dzahn: [C: 032] "avoid redirect loop (in Chrome), fix as pointed out by hashar" [puppet] - 10https://gerrit.wikimedia.org/r/184438 (owner: 10Dzahn) [21:20:13] mutante: make sure you have mod_header enabled [21:22:17] hashar: yea, i do, but might be from another role on the same box, so better i add it anyways [21:22:49] why just Chrome btw? 
[21:24:44] yea, it's loaded from the BZ role [21:26:20] 3operations, Wikimedia-General-or-Unknown: COPYING is served as application/octet-stream - https://phabricator.wikimedia.org/T63903#971960 (10MaxSem) Hmm. Remove w/COPYING and w/CREDITS dead symlinks Not sure how useful these were to be exposed. But considering they were added on purpose, maybe we... [21:28:49] (03PS1) 10Dzahn: planet: load mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/184441 [21:28:58] (03PS9) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [21:29:00] (03PS1) 10Ori.livneh: Restore the standard hit-for-pass processing in mobile-frontend [puppet] - 10https://gerrit.wikimedia.org/r/184442 [21:35:29] (03CR) 10Dzahn: [C: 032] planet: load mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/184441 (owner: 10Dzahn) [21:38:57] !log Set email for global account "Carol.Christiansen" after having it confirmed by a steward and a dewiki bureaucrat (also based on old OTRS records) [21:39:01] Logged the message, Master [21:42:25] (03PS2) 10Ori.livneh: "Un-disable" xhprof [puppet] - 10https://gerrit.wikimedia.org/r/182992 [21:43:57] (03CR) 10Aaron Schulz: [C: 031] "Un-disable" xhprof [puppet] - 10https://gerrit.wikimedia.org/r/182992 (owner: 10Ori.livneh) [21:44:15] 3operations, ops-network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#972028 (10Papaul) [21:44:17] 3operations, ops-codfw, ops-network: Connect Apple Airport to mr1-codfw - https://phabricator.wikimedia.org/T86574#972026 (10Papaul) 5Open>3Resolved airport is connected to fe-0/0/2 on mr1-codfw [21:50:48] 3ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#972055 (10Cmjohnson) 3NEW a:3Papaul [21:54:19] 3ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#972071 (10Papaul) disk 8, slot 7 failed [22:01:08] (03CR) 10Dzahn: [C: 032] planet: puppetize feedparser.py bug 
workaround [puppet] - 10https://gerrit.wikimedia.org/r/183007 (https://phabricator.wikimedia.org/T47806) (owner: 10Dzahn) [22:03:58] 3ops-codfw: Update Racktables scs-c8 - https://phabricator.wikimedia.org/T86591#972106 (10Cmjohnson) 3NEW a:3Papaul [22:16:02] 3operations: wmf-deployment group has ex-employees - https://phabricator.wikimedia.org/T86548#972148 (10Dzahn) a:3Dzahn [22:18:18] (03PS10) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [22:20:03] (03PS4) 10Reedy: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (https://phabricator.wikimedia.org/T76759) (owner: 10BryanDavis) [22:23:33] (03CR) 10Reedy: [C: 031] monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (https://phabricator.wikimedia.org/T76759) (owner: 10BryanDavis) [22:24:05] ^d, marktraceur, RoanKattouw_away: Any idea who's going to be handling the SWAT deployment today? [22:24:12] Not it. [22:24:17] I should take my name off that list. [22:24:33] <^d> I've been working since 4am, I'm going home and going to bed in ~35m [22:24:33] 3operations: wmf-deployment group has ex-employees - https://phabricator.wikimedia.org/T86548#972175 (10Dzahn) >>! In T86548#971157, @Krenair wrote: > bsitu, mwalker, pgehres, rfaulk, sumanah removed these users and jgonera from the Gerrit group. checked all other users in the group (if they exist in admins.pp... [22:24:48] 3operations: wmf-deployment group has ex-employees - https://phabricator.wikimedia.org/T86548#972176 (10Dzahn) 5Open>3Resolved [22:25:58] ^d, marktraceur, RoanKattouw_away: So far, it's only mobile patches, so I can handle them if no one else is available. [22:26:36] (03CR) 10BryanDavis: "1.25wmf14 will hit group1 tomorrow and make this config change safe to apply." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (https://phabricator.wikimedia.org/T76759) (owner: 10BryanDavis) [22:33:58] (03PS1) 10Dzahn: add tbayer to analytics/statistics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) [22:34:50] (03CR) 10jenkins-bot: [V: 04-1] add tbayer to analytics/statistics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [22:35:34] (03CR) 10Dzahn: "nice jenkins. 22:34:14 Users assigned that do not exist: ['tbayer'] i know, it's created in the change that is a dependency for this" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [22:37:02] uhm.. " echo exec("cd /data/project/wikibugs/wikibugs2 && git pull"); [22:40:49] :) [22:44:57] 3WMF-Legal, Wikimedia-General-or-Unknown, operations: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#972278 (10LuisV_WMF) The basic theory for CC0 is: # The patent grant in Apache is not useful for this sort of larger-than-I-realized, but still pretty minimal, code - we're no... 
[22:50:13] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#972292 (10Manybubbles) 3NEW [22:50:57] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#972292 (10Manybubbles) [23:25:36] (03CR) 10Dzahn: [C: 032] "creating account but not adding to any groups yet, so no access" [puppet] - 10https://gerrit.wikimedia.org/r/184425 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [23:27:05] (03PS2) 10Dzahn: add tbayer to analytics/statistics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) [23:28:40] (03CR) 10Dzahn: "now jenkins likes it and -this- is the actual access request, waiting 3-day period" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [23:30:11] (03CR) 10Dzahn: "ottomata: additional bastion group not needed, right" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [23:32:15] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#972419 (10Dzahn) I have already added the user and key to puppet but didn't add it to any groups yet. So what actually gives access is the second change that adds the user to groups. And this is pending approval and a... [23:38:11] 3MediaWiki-Core-Team, operations: HHVM gets stuck in what seems a deadlock in pcre cache code - https://phabricator.wikimedia.org/T1194#20577 (10bd808) [23:55:56] (03PS2) 10BBlack: Restore the standard hit-for-pass processing in mobile-frontend [puppet] - 10https://gerrit.wikimedia.org/r/184442 (owner: 10Ori.livneh) [23:56:04] (03CR) 10BBlack: [C: 032 V: 032] Restore the standard hit-for-pass processing in mobile-frontend [puppet] - 10https://gerrit.wikimedia.org/r/184442 (owner: 10Ori.livneh)