[00:03:42] Lighting deploy; VE deployment starting (merging cherry-picks now and working on mw-core wmf branch updates) [00:04:36] Krinkle: i'm done, so go ahead [00:05:54] Coren: Does labstore1 serve the same volumes via Gluster and NFS? (Looking at modules/download/manifests/gluster.pp ("fstype => 'glusterfs'") and templates/labsnfs/auto.space.erb ("/datasets -nfsvers=3,ro,ghost labstore1.pmtpa.wmnet:/publicdata-project")). [00:07:17] (03CR) 10Ori.livneh: [C: 032] ensure tmpfile is accessible to dsh subprocess [operations/puppet] - 10https://gerrit.wikimedia.org/r/111375 (owner: 10Ori.livneh) [00:07:38] Coren: ok to puppet-erge your change? [00:07:42] modules/dynamicproxy/templates/proxy.conf [00:08:33] it looks safe, merging. [00:12:38] (03PS2) 10GWicke: Bug 60694: Make the config file path configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 [00:16:17] !log kaldari synchronized php-1.23wmf12/extensions/VectorBeta [00:16:24] Logged the message, Master [00:20:42] !log krinkle synchronized php-1.23wmf11/extensions/VisualEditor 'Idfbbf2e43a7de' [00:20:50] Logged the message, Master [00:21:09] !log krinkle synchronized php-1.23wmf12/extensions/VisualEditor 'I1214378b5452b37' [00:21:16] Logged the message, Master [00:21:41] ori: Sorry, got distracted. [00:22:02] Coren: 'sokay, I merged it. [00:23:43] Done deploying [00:26:53] (03PS1) 10Ori.livneh: Move Git::clone['mediawiki/tools/scap'] to mediawiki::sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/111385 [00:27:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Move Git::clone['mediawiki/tools/scap'] to mediawiki::sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/111385 (owner: 10Ori.livneh) [00:27:18] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Last successful Puppet run was Tue 04 Feb 2014 06:26:26 PM UTC [00:34:43] I may be being really silly right now -- but I cannot access civicrm.wikimedia.org from my desk (but I can from my VPS.) 
[00:34:47] Tracerouting it, the last valid response is from te3-4.co2.as30217.net; the traceroute to aluminium.wikimedia.org (the same host, just a different name) ends at xe-5-2-1.cr1-eqiad.wikimedia.org [00:35:00] so... it's almost like I'm being selectively firewalled at our border [00:35:06] which doesn't make a huge amount of sense [00:35:44] mwalker, does it resolve to the same IP from both? [00:36:09] could also be IPv6 [00:36:58] yep; civicrm is a CNAME of aluminium.wikimedia.org [00:37:02] it doesn't have an IPv6 [00:37:50] ping aluminium.wikimedia.org works from here [00:38:10] so presumably not an issue in the office in general [00:38:35] oh; interesting... nslookup resovles to the same; but ping is not using the right IP for civicrm [00:38:42] * mwalker rummages in his /etc/hosts [00:38:53] ah yes [00:38:56] thanks gwicke! [00:39:18] * mwalker now wonders why I had that statically defined in /etc/hosts [00:39:44] * gwicke ran into that many times in the past [00:39:57] /etc/hosts is dangerous ;) [00:41:32] (03PS1) 10Jeremyb: Dynamic proxy: Serve SSL certificate chain. v2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 [00:41:37] (03PS1) 10Jeremyb: rm root cert from chain [operations/puppet] - 10https://gerrit.wikimedia.org/r/111387 [00:42:19] (03CR) 10Jeremyb: [C: 04-1] "should also figure out what to do with the other star certs. 
idk if they are still in use anywhere (for me to test against)" [operations/puppet] - https://gerrit.wikimedia.org/r/111387 (owner: Jeremyb) [00:43:22] (CR) Jeremyb: "fu in I4fba98a3856f591f64eab30b91ce2f478fc4f271" [operations/puppet] - https://gerrit.wikimedia.org/r/111342 (owner: Tim Landscheidt) [00:51:39] (PS2) Tim Landscheidt: Set correct CA for star.wmflabs.org.pem [operations/puppet] - https://gerrit.wikimedia.org/r/111386 (owner: Jeremyb) [00:51:41] !log schema changes db1047 s1-analytics-slave repl stopped [00:51:49] Logged the message, Master [00:51:55] (CR) Tim Landscheidt: [C: 1] Set correct CA for star.wmflabs.org.pem [operations/puppet] - https://gerrit.wikimedia.org/r/111386 (owner: Jeremyb) [00:54:25] (PS1) Springle: s3 db1027 full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111388 [00:54:51] (CR) Springle: [C: 2] s3 db1027 full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111388 (owner: Springle) [00:54:57] (Merged) jenkins-bot: s3 db1027 full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111388 (owner: Springle) [00:56:14] !log springle synchronized wmf-config/db-eqiad.php 's3 db1027 full steam' [00:56:22] Logged the message, Master [01:07:04] (PS3) Jeremyb: star.wmflabs.org: fix intermediate CA [operations/puppet] - https://gerrit.wikimedia.org/r/111386 [01:08:55] (CR) Tim Landscheidt: "I think this is correct, but after wrangling with bug 52630, I'm not sure. Apparently, Apache wants *only* the intermediate certificate a" [operations/puppet] - https://gerrit.wikimedia.org/r/111387 (owner: Jeremyb) [01:15:11] (CR) Tim Landscheidt: "To avoid breaking too much, we could migrate backwards-compatibly:" [operations/puppet] - https://gerrit.wikimedia.org/r/111387 (owner: Jeremyb) [01:20:31] (CR) Jeremyb: "to clarify I didn't test the change, I just verified that the status quo is wrong." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111387 (owner: 10Jeremyb) [01:38:31] Coren: Does labstore1 serve the same volumes via Gluster and NFS? (Looking at modules/download/manifests/gluster.pp ("fstype => 'glusterfs'") and templates/labsnfs/auto.space.erb ("/datasets -nfsvers=3,ro,ghost labstore1.pmtpa.wmnet:/publicdata-project")). [01:39:19] (03CR) 10Jeremyb: "so, planet does use chained w/ Apache which explains more weirdness in the way it's serving its cert (the server cert is sent in duplicate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111387 (owner: 10Jeremyb) [01:39:35] scfc_de: Yes, NFS3 for compatibilities. Doesn't scale quite right though; the NFS server implementation that's built into gluster is teh sux. [01:43:07] Coren: That clears that up. Thanks! [01:44:45] For that matter, gluster's implementation of glusterfs is also teh sux. [01:49:04] :-) [01:51:48] (03PS1) 10Springle: depool db1034 for schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111392 [01:52:20] (03CR) 10Springle: [C: 032] depool db1034 for schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111392 (owner: 10Springle) [01:52:29] (03Merged) 10jenkins-bot: depool db1034 for schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111392 (owner: 10Springle) [01:54:29] !log springle synchronized wmf-config/db-eqiad.php 's2 depool db1034 for schema changes' [01:54:39] Logged the message, Master [02:01:44] !log LocalisationUpdate failed: git pull of extensions failed [02:01:50] Logged the message, Master [02:07:54] A shell user needs to poke at LocalisationUpdate, I guess. [03:28:18] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Last successful Puppet run was Tue 04 Feb 2014 06:26:26 PM UTC [03:44:29] could the puppet freshness of cp3019 be related to its sub-par kafka networking performance in any way? (outdated config, ..?) 
[03:47:33] (03PS1) 10Springle: s1 repool db1043 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111407 [03:48:06] (03CR) 10Springle: [C: 032] s1 repool db1043 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111407 (owner: 10Springle) [03:48:12] (03Merged) 10jenkins-bot: s1 repool db1043 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111407 (owner: 10Springle) [03:49:42] !log springle synchronized wmf-config/db-eqiad.php 's1 repool db1043 warm up' [03:49:50] Logged the message, Master [03:53:18] (03PS1) 10Springle: depool db1055 for schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111408 [03:53:47] (03CR) 10Springle: [C: 032] depool db1055 for schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111408 (owner: 10Springle) [03:53:53] (03Merged) 10jenkins-bot: depool db1055 for schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111408 (owner: 10Springle) [03:55:07] !log springle synchronized wmf-config/db-eqiad.php 's1 depool db1055 schema changes' [03:55:14] Logged the message, Master [04:08:25] I filed https://bugzilla.wikimedia.org/show_bug.cgi?id=60860 about LocalisationUpdate. [04:17:27] Snaps: early riser :) [04:18:59] (03PS1) 10Ori.livneh: report stats from shell scripts to statsd, not carbon [operations/puppet] - 10https://gerrit.wikimedia.org/r/111409 [04:23:34] * ^d sighs [04:23:54] !log LocalisationUpdate failed: git pull of extensions failed [04:24:02] Logged the message, Master [04:24:32] <^d> what the hell? [04:25:03] <^d> submodules are quite possibly the worst invention ever. [04:29:31] :-) [04:29:58] <^d> Ah, got it. [04:34:28] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [04:52:05] <^d> Gloria: Anyway, the git part of l10n update is fixed now. [04:52:15] <^d> I've got it running. 
I'll close out the bug whenever it finally finishes. [04:54:30] Thanks. :-) [04:56:54] (03CR) 10Ori.livneh: [C: 032] report stats from shell scripts to statsd, not carbon [operations/puppet] - 10https://gerrit.wikimedia.org/r/111409 (owner: 10Ori.livneh) [04:57:14] (03PS2) 10Ori.livneh: Make scap script files symlinks to /srv/scap/bin files [operations/puppet] - 10https://gerrit.wikimedia.org/r/111373 [04:58:57] (03CR) 10Ori.livneh: [C: 032] Make scap script files symlinks to /srv/scap/bin files [operations/puppet] - 10https://gerrit.wikimedia.org/r/111373 (owner: 10Ori.livneh) [05:04:13] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-05 05:04:13+00:00 [05:04:20] Logged the message, Master [05:07:07] !log ori synchronized README 'Ensuring that sync-file works after Ia210f3ced' [05:07:15] Logged the message, Master [05:07:40] !log Last sync-file: connect to host mw1163 port 22: Connection timed out [05:07:48] Logged the message, Master [05:18:47] !log LocalisationUpdate completed (1.23wmf11) at 2014-02-05 05:18:46+00:00 [05:18:53] Logged the message, Master [05:30:12] (03PS1) 10Springle: prepare for s1 master rotation db1056 to db1052 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111412 [05:30:47] (03CR) 10Springle: [C: 032] prepare for s1 master rotation db1056 to db1052 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111412 (owner: 10Springle) [05:30:56] (03Merged) 10jenkins-bot: prepare for s1 master rotation db1056 to db1052 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111412 (owner: 10Springle) [05:32:17] !log springle synchronized wmf-config/db-eqiad.php 'prepare for s1 master rotation db1056 to db1052' [05:32:24] (03PS1) 10Ori.livneh: Delete the scap scripts that have been moved to mediawiki/tools/scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/111414 [05:32:25] Logged the message, Master [05:36:41] (03CR) 10Ori.livneh: [C: 032] Delete the scap scripts that have been moved to 
mediawiki/tools/scap [operations/puppet] - https://gerrit.wikimedia.org/r/111414 (owner: Ori.livneh) [05:45:44] (PS1) Springle: switch s1 master to db1052 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111415 [05:46:40] (CR) Springle: [C: 2] switch s1 master to db1052 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111415 (owner: Springle) [05:46:46] (Merged) jenkins-bot: switch s1 master to db1052 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111415 (owner: Springle) [05:47:31] !log springle synchronized wmf-config/db-eqiad.php 'switch s1 master to db1052' [05:47:37] (PS1) Springle: update pmtpa config for s1 master change [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111416 [05:47:38] Logged the message, Master [05:47:45] (CR) jenkins-bot: [V: -1] update pmtpa config for s1 master change [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111416 (owner: Springle) [05:51:38] (PS2) Springle: update pmtpa config for s1 master change [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111416 [05:51:55] (CR) Springle: [C: 2] update pmtpa config for s1 master change [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111416 (owner: Springle) [05:52:01] (Merged) jenkins-bot: update pmtpa config for s1 master change [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111416 (owner: Springle) [05:52:51] !log springle synchronized wmf-config/db-pmtpa.php 'update pmtpa config for s1 master change' [05:52:59] Logged the message, Master [05:54:00] (PS1) Springle: update dns for s1 master switch [operations/dns] - https://gerrit.wikimedia.org/r/111417 [05:55:12] (CR) Springle: [C: 2] update dns for s1 master switch [operations/dns] - https://gerrit.wikimedia.org/r/111417 (owner: Springle) [06:02:29] (PS1) Springle: update coredb topology for s1 master [operations/puppet] - 
10https://gerrit.wikimedia.org/r/111419 [06:02:49] (03CR) 10Springle: [C: 032] update coredb topology for s1 master [operations/puppet] - 10https://gerrit.wikimedia.org/r/111419 (owner: 10Springle) [06:29:18] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Last successful Puppet run was Tue 04 Feb 2014 06:26:26 PM UTC [06:56:18] I'm going to run a 0-diff scap to test changes to the scap script [06:59:23] !log ori started scap: (no message) [06:59:24] k [06:59:31] Logged the message, Master [06:59:32] evening :) [06:59:43] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: [06:59:45] !log ori finished scap: (no message) (duration: 00m 35s) [06:59:50] Logged the message, Master [06:59:56] 35 seconds! :) [06:59:58] Logged the message, Master [07:00:27] heh [07:15:14] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: [07:15:22] Logged the message, Master [07:51:30] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: [07:51:38] Logged the message, Master [07:53:16] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: [07:53:23] Logged the message, Master [07:54:49] ori! It is midnight :P [07:55:09] five to! [07:57:32] :P [08:11:46] ori: I'd like to debug deployment on a wtp box [08:11:53] any special way to depool one? [08:12:00] or do I just need to stop the daemon? [08:12:17] for parsoid? i'm not sure [08:12:28] I know MW will stop using one if the daemon is stopped [08:13:27] this is why doing rolling upgrades of them works [08:13:40] I'll test on wtp1012 briefly [08:13:56] !log stopping parsoid on wtp1012 shortly [08:14:03] Logged the message, Master [08:15:45] ori: do you have access to RT 5391 ? 
[08:16:19] yes [08:16:42] seems the initial snafu with deploying parsoid with checkout_submodules => true and none of the submodules checked out on tin is what's causing the current issue [08:16:42] these are checks I'd like to have in the frontend so that this is an impossibility [08:16:42] same with checks to see if anything is checked in locally, so that people can't accidentally wipe out security patches [08:17:08] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [08:17:25] !log restart parsoid on wtp1012 [08:17:32] Logged the message, Master [08:17:35] ori: is it a request for tobi negrin access to stat1 stat1002 bast1001.wikimedia.org bastion.wmflabs.org ? [08:18:08] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [08:18:38] matanya: i'm not sure if i can divulge that -- what's the reason for asking? [08:19:00] duplicate of 6768 ori [08:19:26] started working on the latter, and saw all is in place with comments per RT 5391 [08:19:43] I'll PM [08:20:52] hello [08:22:25] hey [08:22:47] morning hashar [08:23:08] RECOVERY - Host labnet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [08:25:18] PROBLEM - Disk space on labnet1001 is CRITICAL: Connection refused by host [08:25:28] PROBLEM - RAID on labnet1001 is CRITICAL: Connection refused by host [08:25:39] PROBLEM - puppet disabled on labnet1001 is CRITICAL: Connection refused by host [08:25:48] PROBLEM - SSH on labnet1001 is CRITICAL: Connection refused [08:33:02] moin [08:34:01] 'lo [08:35:09] so... many... emails [08:36:25] millions? [08:37:06] how do i know what uid to give a user in admins.pp ? [08:37:28] any free number above 1000 ? [08:37:38] PROBLEM - NTP on labnet1001 is CRITICAL: NTP CRITICAL: No response from NTP server [08:39:54] akosiaris paravoid andrewbogott still waiting for approval of: https://wikitech.wikimedia.org/wiki/Puppet_coding#Coding_Style [08:41:31] matanya, works for me! 
[08:41:58] lgtm [08:42:34] once you are fine with it, please remove the draft template [08:44:11] OK, another no-op scap [08:44:30] !log ori started scap: (no message) [08:44:59] Logged the message, Master [08:47:41] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: [08:47:43] !log ori finished scap: (no message) (duration: 03m 31s) [08:47:49] Logged the message, Master [08:47:56] Logged the message, Master [08:49:48] RECOVERY - SSH on labnet1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [08:52:48] paravoid: to handle mails flood : press delete. [08:52:51] that works pretty well [08:58:38] RECOVERY - puppet disabled on labnet1001 is OK: OK [08:59:18] RECOVERY - Disk space on labnet1001 is OK: DISK OK [08:59:28] RECOVERY - RAID on labnet1001 is OK: NRPE: Unable to read output [08:59:38] RECOVERY - DPKG on labnet1001 is OK: All packages OK [09:00:07] (03PS1) 10TTO: Add local interwiki for metawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111426 [09:02:44] hashar: mind opening /operations/debs/fabric for me? [09:03:08] matanya: in meeting right now [09:03:13] can you mail / bug fill it please? [09:06:43] yes [09:10:47] (03CR) 10TTO: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111426 (owner: 10TTO) [09:16:55] (03CR) 10TTO: [C: 031] "Yes, let's get rid of this stupid bit of code. I'll put my hand up as another person who has fallen for this silly trick in the past, and " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/110926 (owner: 10Hashar) [09:18:59] (03CR) 10TTO: "Basically "PHP_SAPI !== 'cli'" means "if we are not executing on the command line". So, for example, you don't output HTTP headers in a CL" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/110926 (owner: 10Hashar) [09:20:06] matanya: by the way, I am not sure why we need a deb for Fabric, isn't it in Ubuntu already ? 
[09:20:18] matanya: http://packages.ubuntu.com/search?keywords=fabric [09:20:28] matanya: so we can most probably backport a more recent version [09:20:28] old stuff. https://rt.wikimedia.org/Ticket/Display.html?id=6766 [09:23:10] matanya: replied on ticket [09:23:29] I am not sure how well it is going to backport on Precise though [09:23:40] there is not that many dependencies http://packages.ubuntu.com/trusty/fabric [09:25:05] yeah, i thought of taking the ppa and make it fut us [09:25:15] but your approach is better [09:30:18] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Last successful Puppet run was Tue 04 Feb 2014 06:26:26 PM UTC [09:30:25] matanya: albeit might take a bit more time to complete [09:30:36] yes [09:30:41] matanya: the maintainer might already have completed the package work for 1.8 [09:31:04] http://packages.qa.debian.org/f/fabric.html list the git repository used for packaging [09:31:13] which is on github \O/ https://github.com/lamby/pkg-fabric [09:31:56] and Fabric 1.7 wont work with paramiko below 1.10.0 :( https://github.com/lamby/pkg-fabric/issues/2 [09:32:03] and https://github.com/lamby/pkg-fabric/issues/2 [09:32:33] BTW hashar https://gerrit.wikimedia.org/r/#/c/90684/ is my answer what you were looking for? [09:32:50] yeah more or less :-D [09:32:59] will let diederick figure it out with someone :] [09:33:37] I like the require Class['python-foo'] [09:33:50] but I don't thinks ops like having myriad of class wrapping packages [09:34:13] stdlib has a package definition that takes care of duplicate packages [09:34:22] but our std lib version is too old and does not have the define :-( [09:34:48] ensure_packages('python-foo') [09:35:09] we do need to update the std lib [09:35:40] (03CR) 10Hashar: "Another approach is using the ensure_packages() define from puppet stdlib. 
Our version of stdlib is too old though and does have that def" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90684 (owner: 10Diederik) [09:35:59] updating stdlib is going to have a bunch of side effects though :-( [09:43:45] fabric? [09:44:05] why would we add fabric? [09:45:34] (03CR) 10Alexandros Kosiaris: [C: 032] Adding ability to set ulimit nofiles, increased to 8192 by default for kafka server [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/110590 (owner: 10Ottomata) [09:49:34] (03CR) 10Alexandros Kosiaris: "Per our discussion yesterday about populating /etc/security/limits.{conf,d} with needed files, it won't be needed. That ulimit is executed" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/110590 (owner: 10Ottomata) [09:49:39] paravoid: potential idea would be to replace scap/dsh with Fabric [09:50:08] I guess Bryan (bd808) can use the pip version on labs if he wants to experiment [09:50:21] no [09:50:29] the first order of business is to produce a deployment script that is maximally concise and properly abstracted, but identical to scap in behavior. the sympathy between scap and fabric is as follows: [09:51:04] scap is composed of a suite of scripts that are invoked either as standalone or as part of a larger workflow [09:51:54] fabric represents this well: you have functions that are registered as @tasks that can either be called using the fab command-line tool or invoked as a function in the code of other tasks [09:53:41] it also has a notion of host groups. and it doesn't add an additional service; it is dsh-like in that it simply opens concurrent ssh connections. 
but it is smart about collating input and lets you define a strategy for dealing with failures on individual nodes [09:53:55] finally, it also provides a suite of helper functions for writing prompt-driven console utilities [09:54:30] these aren't so sophisticated that they couldn't be reproduced, but that's a lot of boilerplate code that is already implemented for us [09:54:51] ori: can you copy paste that on https://wikitech.wikimedia.org/wiki/Fabric ? [09:55:52] I mostly like to avoid making decisions early in a design process that are hard to back out from if they turn out to be wrong [09:56:25] fabric is just syntactic sugar on top of paramiko, which is just an ssh2 library for python [09:56:46] there is no software to install on target nodes; it's the same as dsh, just easier to orchestrate from python code. [09:57:32] i could, sure [09:57:35] I'd very much veto adding yet another remote execution system at this point [09:57:44] it's not yet another remote execution system [09:58:30] yes it is [09:58:39] it's executing a shell command via SSH. it's just implemented in python rather than by shelling out to a subprocess to do it. [09:58:52] I know, I've used it [09:59:30] I'd also would like very much to avoid designing anything at this point that uses SSH [09:59:51] well, that's a lie, SSH with personal accounts that is [10:00:10] it's not designing anything; it's a line-for-line rewrite of scap [10:00:18] oh come on [10:01:02] and in any case, introducing fabric, even if I agreed with it, shouldn't come to our attention by a request to open a gerrit repo to host it in a random irc conversation [10:01:24] bryan suggested it, and i told him to file an RT ticket [10:01:31] i presume that's how matanya encountered it [10:01:54] where I was left was that bryan would collect requirements before making any decisions [10:02:08] is this complete? do we have a list of requirements now? [10:02:17] 'deploy mediaiki'? 
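The properties attributed to Fabric in the messages above (functions registered as tasks that are also callable from other tasks, host groups, concurrent SSH connections, collated per-host results and a failure strategy) can be sketched roughly as follows. This is not Fabric's API and not scap's code; it is a standard-library illustration, and the names TASKS, HOST_GROUPS, run_on_group, sync_file, deploy and the remote command sync-one-file are all hypothetical.

```python
import concurrent.futures
import shlex
import subprocess

# Hypothetical task registry and host groups; names are illustrative only.
TASKS = {}
HOST_GROUPS = {"apaches": ["mw1001", "mw1002"]}


def task(func):
    """Register func by name so a CLI could look it up; it stays a plain function."""
    TASKS[func.__name__] = func
    return func


def ssh_run(host, command):
    """Run command on host through the system ssh client; return (exit_code, output)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def run_on_group(hosts, command, runner=ssh_run, max_workers=8):
    """Run command on all hosts concurrently and collate one result per host."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, host, command): host for host in hosts}
        for future in concurrent.futures.as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # e.g. the runner itself could not start
                results[host] = (255, str(exc))
    return results


def failed_hosts(results):
    """Return the sorted hosts whose command exited non-zero."""
    return sorted(host for host, (code, _) in results.items() if code != 0)


@task
def sync_file(path, hosts, runner=ssh_run):
    """Example task: ask every host to pull one file, then report failures.

    'sync-one-file' is a made-up remote command used only for illustration.
    """
    command = "sync-one-file " + shlex.quote(path)
    return failed_hosts(run_on_group(hosts, command, runner=runner))


@task
def deploy(paths, hosts, runner=ssh_run):
    """A larger workflow that calls another task as an ordinary function."""
    return {path: sync_file(path, hosts, runner=runner) for path in paths}
```

Because the runner is injectable, the failure handling can be exercised without any SSH access by passing a stub that raises for one host; a caller could then decide whether to retry, skip or abort based on the returned list of failed hosts.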
[10:02:24] i'm a bit irritated by this conversation, to be honest [10:02:41] I specifically objected to this simplistic requirement during the mediawiki quarterly meeting [10:02:43] it's a bit frustrating to have this conversation; the we're-really-gonna-do-it-this-time deadline for git-deploy was the eqiad migration in 2012. it has never been used to deploy mediawiki. i have never had it work for any purpose ever without at least some degree of manual intervention. [10:02:49] we have way way more than just mediawiki [10:03:20] gabriel has raised a thread with ops about deploying parsoid with debs [10:03:30] yes, ori this is how i found it out [10:03:31] otto & nik want to deploy jars [10:03:38] and we have tons of other use cases [10:03:51] we're not going to use 4 different deployment systems just to push code to a bunch of boxes [10:04:03] hope i didn't cause trouble with this question [10:04:07] want to fix or even replace git-deploy? I'm all for it [10:04:07] it's not a deployment system [10:04:28] but let's not just add tools into the mix [10:04:38] i can remove dsh [10:05:20] dsh is a distributed shell that is invoked from the command-line; fabric is a distributed shell that is invoked from python. if we're switching from one to the other, the latter is a better fit. that's all. 
[10:05:34] I think at this point it's either fix git-deploy/trebuchet or replace it with something that can work for its other use cases as well [10:05:56] and I thought the process we were following was collecting requirements *across the org* [10:05:59] not just from platform [10:06:08] being more general is better of course [10:06:08] I pointed otto to bryan the other day, specifically for this [10:06:12] i don't know what process anyone was following [10:06:22] all i know is that it produced junk that doesn't work [10:06:22] that's what I heard in that meeting [10:06:34] two or three weeks ago [10:06:48] in front of the whole platform team [10:07:25] it sounded like an okay plan to me, a bit slower than I would have liked but nevertheless analytical and structured [10:08:11] OK, look, I don't need fabric [10:08:19] thanks, etc. [10:19:32] calm down ori :) [10:19:42] i probably should [10:20:13] I replied to RT #6766 with my thoughts [10:21:54] i can commit to replacing DSH [10:23:52] but i really can live without it. it'll just mean some grotesque usage of the subprocess module, but oh well. [10:24:18] i've explained before that i don't see this as skipping requirement-gathering; it is requirement-gathering [10:24:40] I must have missed this explanation [10:24:47] the requirements are currently buried under several layers of bashisms [10:24:54] are we only using dsh atm? [10:25:00] and no one has a particularly concise description of all it is that scap does [10:25:03] and salt [10:25:36] my take is that a properly abstracted python rewrite that doesn't try to improve on scap's behavior, but instead just tries to articulate it concisely, will produce code that will double as a document [10:25:59] how is that requirement-gathering and not implementation? 
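For reference, the "usage of the subprocess module" mentioned above, i.e. shelling out to dsh from Python, might look something like the sketch below. It assumes dsh's -g (group), -c (concurrent) and -M (prefix output with the machine name) options; the function names and the group name in the usage note are hypothetical, and this is not taken from the scap scripts.

```python
import subprocess


def build_dsh_command(group, remote_command):
    """Build the argv for a concurrent dsh run against a named group."""
    return [
        "dsh",
        "-g", group,   # hosts come from the dsh group definition
        "-c",          # run on all hosts concurrently
        "-M",          # prefix each output line with the machine name
        "--", remote_command,
    ]


def collate_output(text):
    """Group 'host: line' output (as produced with -M) by host."""
    per_host = {}
    for line in text.splitlines():
        host, sep, rest = line.partition(": ")
        if not sep:
            continue  # not in 'host: output' form, e.g. a blank line
        per_host.setdefault(host, []).append(rest)
    return per_host


def run_dsh(group, remote_command):
    """Run dsh and return (dsh exit code, stdout grouped by host, raw stderr)."""
    proc = subprocess.run(
        build_dsh_command(group, remote_command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, collate_output(proc.stdout), proc.stderr
```

A call such as run_dsh("some-group", "uptime") would return the grouped output. Note that per-host success or failure has to be inferred from text here, and stderr is kept raw because its lines cannot be reliably attributed to a host, which is part of what makes this approach awkward compared with structured per-host results.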
[10:26:38] here what will happen: you'll rewrite all of scap, you'll start deploying mediawiki with it and you/robla will call it a complete project that you can't devote more resources in the next quarter about it [10:26:47] gabriel will keep poking us about deploying with debs [10:27:09] and the rest of the org will keep wasting their time to work around trebuchet's bugs & limitations [10:27:18] okay, let's suppose that this is how it plays out, even though robla hasn't given me this task, i am doing this with my characteristic amount of coordination [10:27:37] the end result is: hundreds of lines of bash replaced with concise python code [10:27:43] and dsh replaced with paramiko [10:27:48] how is that in any way worse? [10:28:27] each of the things i am proposing to introduce (python, fabric) is replacing something uglier that i can commit to eradicating [10:28:54] it's worse in the sense that you/bryan/others could spend an equal or similar amount of time and have something that covers other use cases as well [10:29:07] read my comments on https://gerrit.wikimedia.org/r/#/c/110904/ [10:29:08] and remove the need to support code that noone is maintaining internally etc. [10:29:42] that's all I'm saying [10:30:07] you should also consider the history -- we have several stabs at the problem that started out from general requirements because after all mediawiki just needs some files rsynced, how hard could it be? [10:30:40] I am considering the history, that's why I'm making this a bigger deal [10:31:03] but then it turns out that swapping out a system that sees almost daily use to deploy our primary application stack isn't that simple, especially when it is written in a language that likes to explode for obscure reasons [10:31:26] Apropos history, is there some write-up of the reasoning behind scap? For example, why aren't we using Puppet itself? 
[10:31:34] I don't see where we disagree here :) [10:32:02] so see my three-point plan in the commit message [10:32:05] copy-pasted: [10:32:07] The plan: [10:32:07] - Convert remaining scap scripts to Python. [10:32:07] - Break up procedural code into well-factor functions. [10:32:09] - Stop to reassess. [10:33:15] i'm pretty close to being able to delete 'scap' (as in: the primary entry-point script, not the whole system) [10:33:45] that's what Ryan said a few months ago [10:33:59] challenge accepted :) [10:34:31] it's not that I don't believe you can make it [10:34:38] anyways, i'm sorry for getting cranky about this -- i know you're asking questions in good faith [10:35:35] but i just resent the pattern that we have of allowing the first pass at a problem to be governed entirely by personal whimsy, and then we get burnt and become arch-conservatives with respect to the second proposal, even when it's orders of magnitude more cautious [10:36:52] I think you're misreading the situation [10:37:01] the deployment of salt is pretty gross at the moment. it involves having puppet modules that execute python code that sets redis values that are used as salt grains. and it doesn't work. and we can't say exactly under what conditions salt cmd.run reports failures, etc. 
[10:37:25] and yet your proposal doesn't fix those issues :) [10:37:42] i have a different proposal for that, as you know [10:38:22] but you think it's relative to this discussion to mention it [10:38:27] but it's not part of the same proposal [10:38:52] there's an RT saying "install fabric, we're going to use it in scap" and says nothing about salt, or git-deploy/trebuchet [10:39:03] it's not my RT [10:39:06] while it's clear that you resent both of these tools [10:39:15] and I'm totally fine with exploring alternatives [10:39:36] but let's do _that_, consider them as alternatives, not as tools that will exist permanently in parallel [10:40:49] sure, i mentioned earlier that one of the selling points of fabric is that it doesn't require installing software on all nodes, just tin, and it doesn't require a full stack of applications to work, just a couple of python libs [10:41:15] I mostly like to avoid making decisions early in a design process that are hard to back out from if they turn out to be wrong [10:42:34] but from my perspective, we don't even need that [10:42:51] i'll just stick to shelling out to dsh for now [10:43:12] I think you should make a plan and you should form a proposal that incorporates your ideas about fabric, salt and your criticism of scap & git-deploy, among others [10:43:30] now, can one tell me how a uid for a user is decided in admins.pp? 
[10:43:34] but i am really being very honest when i say that i don't have a definite idea of what a deployment system should look like [10:43:56] that's why i've imposed severe constraints on what i'm allowing myself to do at the moment [10:44:05] which is to rearticulate scap in a different idiom [10:44:21] you have a good understanding of the problem and ideas though, and you should explore them further before asking for stuff to be deployed in production [10:44:54] I for one would be happy to discuss ideas with you as you know [10:45:15] note how we have another proposal that came to the list today (debs) [10:45:19] bryan said: > We could also use Fabric (http://fabfile.org) instead of Dancer's shell. This would have the added benefit of providing a way to get structured success/failure responses from each host in each batch [10:45:45] i gave a breakdown of what i took to be the plusses and minuses, and urged that we reconsider salt: [10:45:47] "salt is theoretically faster because it has a daemon running on each host and it uses a lightweight protocol to distribute tasks. On the other hand, it is not yet a mature platform, and it has not benefited from the years of intense scrutiny that have gone into ensuring OpenSSH is secure. On the other other hand, it is already installed on the cluster." [10:45:51] Going back to the history: Has there been ever a discussion about the deployment process? Or was it just: That's what we got, let's build on that? [10:45:51] (that includes replacing reprepro) [10:46:13] scfc_de: we've gone in circles a lot. it's not an easy problem. [10:46:13] scfc_de: I don't believe there was something concrete, no. that's what I'm asking for right now [10:46:21] and then i concluded: "...basically, I think we should stick to shelling out to dsh, even though it's horrible and gross, until we've finished Pythonizing the remaining scripts. We'll then be in a better position to choose." [10:46:35] what would you like us to do, ori? 
[10:46:47] rewrite morebots [10:46:49] :D [10:46:55] what would you reply to gabriel? [10:47:46] that i'm not convinced, because debs are fundamentally oriented toward a single machine, not distributed service-oriented architectures [10:47:48] one deployment system per wmf group? [10:48:10] no, it's better to have one, of course [10:48:44] but in my experience the best way to tackle problems is by starting from where the constraints are tightest [10:49:01] should I just install fabric for you to work cowboy-style on a scap alternative, install mini-dinstall for gabriel and +2 his debs, have aotto/nik deploy jars via some other adhoc method and deploy all the rest with some half-assed git-deploy/trebuchet thing? [10:49:24] does this sound like a solid plan to you? [10:49:25] it's easy to fit a deployment system to eventlogging because it's simple and not mission-critical, you could have an erlang app that retrieves it from bittorrent and it would still work [10:49:37] it seems the deployment process across projects would benefit from a well-organized RFC. but what ori's arguing for, I think, is to also help map the problem space a bit more by creating a version of our existing primary toolset (scap) that's modeled to be easier to inspect and analyze to define the requirements for a more holistic approach [10:49:50] Eloquence drive-by!
[10:49:52] :) [10:49:57] haha, hi Eloquence [10:50:28] paravoid: not sure you have to re-invent the wheel [10:50:32] wikitext is too complicated and i don't use VE out of solidarity with the proletariat [10:50:38] so i can't write RFCs [10:50:49] twitter had the same issue and they invented murder [10:50:58] you can put it in google docs and piss off everyone in one go ;-) [10:51:09] heh [10:51:12] we've discussed murder before, it was part of the initial git-deploy design [10:51:22] i've often thought of murder when considering git-deploy [10:51:31] Eloquence: I'm not stopping anyone from prototyping [10:51:41] the discussions is about deploying it in production :) [10:51:47] discussion* [10:51:49] re: should I just install fabric for you to work cowboy-style on a scap alternative [10:51:50] and why was it rejected [10:51:55] i don't see my name on that request [10:51:57] you can install fabric in labs alright [10:52:07] and my comments are mostly urging caution [10:52:14] prototypes are good and contribute greatly to a discussion [10:53:01] you're the one arguing passionately about this, sorry if it was misdirected [10:53:19] and ok, cowboy-style was a bit too strong, I apologize [10:53:29] hey, i was flattered [10:53:32] "be bold" style :) [10:54:27] either the FSF or GNU has identified bittorrent sync as a software stack for which no adequate OSS exists [10:54:40] and has encouraged people to design an alternative [10:54:45] if i were really being bold i'd think about that [10:54:46] why so ori ? [10:55:05] i used it at my company for some time [10:55:15] and dropped it [10:55:17] 'cause it's not free [10:55:35] scfc_de: during the MediaWiki summit week, all of mwcore wikimedia team + other folks had a 3-hour meeting to talk about the deployment process [10:55:48] the protocol isn't free ? [10:56:12] hashar: Got a link to the minutes? [10:56:21] scfc_de: erik and rob were around. We have a bunch of items to talk about and prioritize.
Items are listed at https://www.mediawiki.org/wiki/Development_and_Deployment_Process/Review20140122-Notes [10:57:00] scfc_de: I had a talk with greg-g yesterday about that list. Is going to prioritize them somehow by involving other people (like me) [10:57:05] ori: to close this thread, let's circle back with bryan who's the original requestor and is supposed to spearhead the deployment discussion I'd say [10:57:22] fine by me [10:57:30] scfc_de: from there we should get a roadmap more or less and get directors to assign resources to the identified tasks. Or get some tasks dismissed entirely :-D [10:57:32] let's have a discussion e.g. tonight/your tomorrow maybe [10:57:58] as I said, I already pointed aotto to him yesterday, and we now have a mail from gabriel in our inboxes [10:58:25] so we (ops) we have to respond to that request, and I'd like to coordinate with bryan on that [10:59:00] sure, but i can taste the blood of scap [10:59:08] compare https://git.wikimedia.org/blob/mediawiki%2Ftools%2Fscap.git/19ad5344bd67cf98ed280ad6da974254b7eb7c52/bin%2Fscap to https://git.wikimedia.org/blob/mediawiki%2Ftools%2Fscap.git/19ad5344bd67cf98ed280ad6da974254b7eb7c52/bin%2Fscappy [10:59:21] the former is longer because it folds a bunch of things that exist as php or shell scripts [11:00:01] hashar: But that's more about problems with the whole deployment process (humans & Co.), and not the technical side (scap & Co.), isn't it? [11:00:05] i'm hoping to be at the point where we can git rm scap ; git mv scappy scap in the next 24 hrs or so [11:00:18] the behavior is the same [11:00:56] scfc_de: we have both humans behavior issues and tooling issues [11:01:06] scfc_de: one sure thing is: don't blame the tools :-D [11:01:34] the tool (i.e. 
scap) mostly reflects the fact that we never sit down together to rethink our deployment process [11:02:01] scfc_de: scap is a shell script that more or less got invented ~10 years ago and which we amended along the years to add a bunch of monkey patches to it [11:02:24] it calls scap-1, which calls .. scap-2 :) [11:02:26] ori: tbh, I think you're being a bit too aggressive on schedule; I also don't see any considerable improvements in that python script [11:02:38] ok, sure, it's python and not shell, and does a few things in a smarter way [11:02:56] but it's just a script that shells out to other scripts [11:02:57] scfc_de: end result is a bit of a mess and does not match our current needs (i.e. easy/safe deployment and the ability to copy all the l10n / code while attempting to switch wikis very fast) [11:03:14] paravoid: it is now, but then those other scripts will be rewritten [11:03:22] and then shelling out will be replaced with function invocation [11:03:54] and code will be laid out by common functionality and arranged to form a module [11:05:03] hashar: I assumed that; but I would be interested why for example we don't build upon Puppet. From an outside perspective, having several ways to deploy and roll back releases sounds like more work than necessary. [11:05:37] scfc_de: there are two issues with puppet: 1) only ops can merge in there 2) it executes every half hour [11:05:42] i mean, i'm importing a shell script in get_config(), it's a borderline crime against humanity. but once i replace the other calls to source mw-deployment-vars.sh, get_config could be replaced with anything [11:06:28] I have a pile of mails waiting for me, and it's way past your bedtime I think [11:06:36] what else is new? :) [11:06:36] lol [11:06:39] scfc_de: so you can't really push code on all servers at roughly the same time since you need to get puppet to execute on each host. Though we could use a script that would force run puppet on all application servers at the same time.
But that is not fast enough and I am not sure the puppet master will handle the sudden burst of requests (up to 400) [11:06:42] off you go ori [11:07:08] !kick ori sleep deprivation fix [11:07:16] well, it's a quorum [11:07:20] mark: can you please glance at https://gerrit.wikimedia.org/r/#/c/108314/ [11:07:29] ori: you should relocate on this side of earth :-] [11:07:35] even erik is there nowadays! [11:07:35] i suppose i should go. anyways, paravoid, thanks for hearing me out [11:07:46] i don't think we disagree on anything [11:07:52] except the bedtime thing, but w/e [11:07:56] good night! [11:08:07] I'm all for having another discussion, preferably with Bryan, who we're kinda leaving out despite being explicitly tasked for this :) [11:08:13] hashar: Sure there are pros and cons; that's why I'm interested if there has been any evaluation. For example, if the burst is problematic, we could add mirroring puppetmasters. Etc. [11:08:35] and I'm okay with prototypes and replacements and crazy ideas in general [11:09:27] i can't think of doing anything more conservative than what i'm currently doing other than not doing anything at all, so i object to the characterization........but i'm really off to bed [11:09:33] scfc_de: puppet isn't very good at deploying in general. it is good at verifying the server looks the way you want it to look [11:09:41] scfc_de: yup possibly. But we don't really want to use puppet for deployment. + it needs merge rights on the repository, which we can't grant easily [11:10:02] (I didn't mean that /this/ is a crazy idea, I'm just trying to express that it's not conservatism that drives me to this discussion) [11:10:10] scfc_de: i (and anyone from mwcore wikimedia team) can't merge changes in operations/puppet.git for example [11:10:18] paravoid: it's my wit! [11:10:26] and personal charm [11:10:31] sleep well ori! [11:10:36] heh [11:10:45] * matanya puts ori to bed [11:11:06] should i close the light too?
[11:11:08] * hashar sends all European folks to bed so ori has no one to talk with [11:13:07] Well, it is 11 UTC [11:13:18] ori is finally ready for bed [11:13:19] hashar: "we don't really want to use puppet for deployment" = that's not a very good argument :-). If someone from ops needs to +2 a change, I think that might be easier to guarantee in a deployment window than maintaining a whole software stack. But as I said, I only have an outside view on that and no hard data. [11:14:19] scfc_de: well add in that you can't really run puppet on all machines at the same time + that would make ops the bottleneck (no offense there opsen :D) [11:14:21] OTZ = Ori Time Zone = UTC+12:00 [11:17:05] hashar: How does scap solve that? [11:17:18] in a nasty way [11:17:29] we rsync the material to some rsync proxies [11:17:46] then ask (using dsh) chunks of application servers to rsync from those proxies [11:18:13] and hope for the best [11:18:13] dsh lets you run commands on target servers in parallel [11:18:25] So "dsh -g app-servers sudo puppetd -tv" should work, shouldn't it? [11:18:37] no, you would kill the master [11:19:16] matanya: That's why I was suggesting mirroring puppet masters. [11:19:23] (In lieu of the rsync proxies.) [11:19:44] so you don't really solve the problem, just replace it with some other workaround [11:20:41] Streamline the process, don't use multiple tools, standardize. [11:20:51] we would like to get rid of the need to use several tools to deploy [11:21:57] you still need dsh and puppet in your way [11:22:39] moreover, the unix way would be very upset with using puppet for this :P [11:22:42] Puppet's already there; dsh sounds a lot less complicated than scap :-).
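The fan-out described above (rsync the material to a few proxies, then have chunks of application servers pull from those proxies via dsh) can be sketched as a planning step. This is a minimal Python sketch under stated assumptions, not scap's actual code: the host and proxy names are hypothetical, and the round-robin proxy choice stands in for whatever selection scap really does (the log mentions a separate find-nearest-rsync script for that).

```python
from itertools import islice


def chunked(hosts, size):
    """Yield successive lists of at most `size` hosts."""
    it = iter(hosts)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def plan_sync(app_servers, proxies, batch_size):
    """Return a list of batches; each batch is a list of (host, proxy) pairs.

    Assumption: proxies are assigned round-robin within a batch. A real
    deployment tool would run one batch at a time and collect per-host
    success/failure before moving on.
    """
    plan = []
    for batch in chunked(app_servers, batch_size):
        plan.append(
            [(host, proxies[i % len(proxies)]) for i, host in enumerate(batch)]
        )
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for batch in plan_sync(["mw1", "mw2", "mw3"], ["proxy-a", "proxy-b"], 2):
        print(batch)
```

Batching bounds how many hosts hit the proxies at once, which is the same concern raised about pointing every application server at a single puppet master simultaneously.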
[11:22:55] puppet is completely unsuitable for software deployments like these [11:23:07] it runs only once every half hour, so you can't do a quick deployment [11:23:22] also it's harder to check whether it completed on every box at all [11:23:24] not to talk about a revert [11:23:25] often it doesn't [11:23:52] and it's not even good or efficient at deploying large directories [11:24:22] puppet is ok for deploying a few files if you don't care about timing and such, but for anything more you need some deployment system or package management [11:25:13] mark: We already were past "dsh sudo puppetd ..."; we also have already checks on successful Puppet runs. As I understand it, scap takes a SHA1 and deploys that. Puppet could be given a SHA1 and then deploy that to a directory. [11:25:43] we have checks on successful puppet runs after like an hour [11:25:45] and not even those are reliable [11:26:02] and some of them take minutes to be done [11:26:44] puppet's orchestration is generally regarded as pretty weak [11:26:52] it's fine for eventual consistency [11:26:59] but that's not nearly good enough here [11:30:03] also the fact that if a puppet run on any host breaks for any unrelated reason, that would just halt software deployments completely on that box until it gets fixed [11:36:25] (03PS1) 10Odder: Add Apple Touch icon for Labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111434 [11:53:30] (03PS1) 10Matanya: nrpe: move carbon checks to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/111436 [11:55:54] mark: I'm still catching up on backscroll, but meanwhile: https://dpaste.de/oivj <- after a pxe boot, interfaces look roughly correct to me. You? 
[11:58:15] jizz, 26 pending reviews :/ [11:58:44] (03PS21) 10Matanya: site: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109507 [12:08:23] (03PS6) 10Matanya: webserver: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 [12:14:53] (03CR) 10Alexandros Kosiaris: [V: 032] nrpe: move carbon checks to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/111436 (owner: 10Matanya) [12:15:12] meh... wrong click [12:15:19] (03CR) 10Alexandros Kosiaris: [C: 032] nrpe: move carbon checks to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/111436 (owner: 10Matanya) [12:15:39] there.. 25 now :-) [12:15:47] LOL thanks akosiaris [12:18:44] andrewbogott: yes that looks correct [12:18:52] strange that it wouldn't work in the previous install then... [12:18:59] yeah [12:19:11] So, I want to take the stuff that was on eth1 and move it to eth4... [12:19:18] Or split it and put one on eth4 and one on eth5? [12:19:40] we'll want to do vlan tagging anyway [12:19:52] as it might save us from headaches later [12:19:59] and the switch might even be setup for that now [12:20:07] but sure, both eth4 and eth5 are connected and can work [12:20:12] we can experiment right now [12:20:13] sure, but… I still want to use the 10g interface, right? So eth1.1102 should be eth4.1102? [12:20:19] absolutely [12:20:34] 'k [12:21:09] andrewbogott: so what I now wonder about neutron as well... i think it changes the api, doesn't it? [12:21:17] so will it just work with our openstack extensions in mediawiki on wikitech? [12:22:17] akosiaris: need your help a sec. since icinga is not accessible to me, i don't know how uses check_ram.sh, can you please have a look a sec? [12:22:19] I'm not sure. We will surely have to make some changes, but I think a lot of the networking stuff happens passively due to other nova calls. [12:22:40] We don't make a whole lot of network calls directly. 
[12:23:05] *who [12:23:17] Also, fortunately all the apis are mediated through the same front end (keystone) so switching over won't require writing any new backend code. [12:23:22] ok [12:23:42] i just worry about the multiple tenants with shared network [12:24:16] I'm sort of expecting to have to add a 'create a new network' call to the logic that creates a project. [12:24:24] Oh, well, depending on which model we settle on [12:25:47] (03CR) 10Zhuyifei1999: [C: 031] "As per" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [12:26:08] (03PS1) 10Andrew Bogott: Use the new 10g network adapters for neutron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111439 [12:26:56] oh, wait, that's surely wrong, one moment... [12:27:22] ok, better. [12:27:36] (03PS2) 10Andrew Bogott: Use the new 10g network adapters for neutron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111439 [12:30:54] (03PS1) 10Matanya: icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 [12:31:18] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Last successful Puppet run was Tue 04 Feb 2014 06:26:26 PM UTC [12:31:45] tim's find-nearest-rsync doesn't check the $success retval of Net::Ping->ping, only $rtt [12:32:02] which is 0 if, say, the hostname couldn't be resolved [12:32:15] which is true for, say, 'mw40' on eqiad [12:32:43] making the hostname that couldn't be resolved always win, since it's hard to beat 0 rtt [12:33:25] (03CR) 10Andrew Bogott: [C: 032] Use the new 10g network adapters for neutron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111439 (owner: 10Andrew Bogott) [12:33:30] i'm not sure why it's not making everything always fail.
there is some logic to failover to using tin that is activating somewhere, i guess [12:35:37] i've even seen ping return negative RTT [12:35:44] i thought that was kind of awesome [12:36:16] (03PS2) 10Matanya: spamd: change to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/110931 [12:43:38] it's when the remote host senses that it's overdue for a ping [12:44:04] icmp telepathy [12:45:52] (03CR) 10Byfserag: "Per discussion, please add zh-my and zh-mo." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [12:46:42] (03CR) 10Byfserag: [C: 031] Add variant rewrites for zhwikivoyage [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [12:52:45] (03PS1) 10Ori.livneh: find-nearest-rsync: don't pick unreachable hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/111442 [12:52:55] and on that note [12:53:27] mark: during ntpadjust? :) [12:53:37] probably ;) [12:56:58] (03PS1) 10Andrew Bogott: Set br-ex by hand because Augeas hates it. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111443 [12:57:11] mark, how can I delete a tagged interface? (have the eth1.* stuff left over now.) [12:59:45] first remove it in /etc/network/interfaces [13:00:09] (03PS2) 10Andrew Bogott: Set br-ex by hand because Augeas hates it. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111443 [13:00:16] then use 'vconfig' [13:00:44] it allows you to remove tags from interfaces, check the help for exact syntax, i forget [13:01:01] you may need to do an "ip link set dev eth1.1102 down" first [13:02:14] yep, they were already down. Worked, thanks. [13:02:36] (03CR) 10Andrew Bogott: [C: 032] Set br-ex by hand because Augeas hates it. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111443 (owner: 10Andrew Bogott) [13:02:43] (03PS1) 10Matanya: sudoers: remove two files, seems not to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111444 [13:05:21] PROBLEM - Host labnet1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:47] hm... [13:21:46] ok, mark, a few questions. First -- in ifconfig, should the loopback interface be reflecting the fact that we pointed it at 208.80.155.255? https://dpaste.de/n2ZA [13:22:12] ok [13:22:13] second: aren't we still missing one IP? I feel like I should have three interfaces each with an IP, right now the 'data' interface doesn't have one. [13:22:16] so first of all, stop using ifconfig ;) [13:22:21] third: What's icinga on about? [13:22:24] always use ip addr show [13:22:31] ok :) [13:22:51] ah, with ip addr show, lo looks good. [13:22:57] good ;) [13:23:04] as for icinga... [13:23:32] uhm [13:23:46] I can ssh into the box, and also ping things from it. [13:23:51] yeah [13:24:51] icinga has a different ip for labnet1001 [13:24:59] it's configured for 10.64.22.11 [13:25:03] while dns resolves to .13 [13:25:50] 13 seems right -- that's eth0 [13:26:17] where does .11 come from? [13:26:46] that's what br-ex is pointed to -- the 'external' interface [13:28:19] yeah, leslie added .11 to dns as 'labnet1001-ext' [13:28:55] ok, but labnet1001.eqiad.wmnet should not resolve to that [13:28:55] but [13:28:56] it's possible that puppet picket that ip as the primary ip [13:28:56] and therefore announced it to icinga with that [13:29:03] check facter [13:29:04] s/picket/picked/ [13:29:39] but I think it will already be fixed by the next puppet run [13:29:47] assuming 10.64.22.11 won't be on eth0 then [13:30:00] presumably it will be on eth4 or eth5 [13:30:23] it's on br-ex which is bridged to eth5 [13:30:24] scfc_de: any way other than grep -r "filename" to find where files are used? 
[13:30:44] ok [13:30:53] i don't -think- the next puppet run will assume that ip as primary [13:31:04] i'm doing a puppet clean up, and some files are not called from anywhere, does that mean they are not used for sure? [13:31:06] if it does we should fix that [13:32:03] hm, does facter have a log someplace? Or is everything just in the syslog? [13:32:36] yeah, facter says "ipaddress => 10.64.22.11" [13:32:40] I know not why :( [13:32:53] hmm [13:32:59] It also says 'ipaddress_eth0 => 10.64.20.13' [13:33:18] Maybe should just ignore icinga until we're done messing around with this :) [13:33:20] we also have, in realm.pp: [13:33:21] # Determine the site the server is in [13:33:21] if $::ipaddress_eth0 != undef { [13:33:21] $main_ipaddress = $ipaddress_eth0 [13:33:21] } elsif $::ipaddress_bond0 != undef { [13:33:21] $main_ipaddress = $ipaddress_bond0 [13:33:23] } else { [13:33:25] $main_ipaddress = $ipaddress [13:33:27] } [13:33:34] although that doesn't get used for icinga I think [13:34:40] interfaces => br_ex,br_int,eth0,eth1,eth2,eth3,eth4,eth5,eth4_1102,eth5_1122,lo,ovs_system [13:34:50] I think it just uses the first ip of the first interface due to alphabetic order [13:34:52] which is kind of annoying [13:34:58] i'd say, yeah, ignore for now, let's look at it later [13:35:49] matanya: Not necessarily; it depends on the file, if it is for example read by some package or another WMF repo. Do you have an example? [13:36:30] (03PS1) 10BBlack: Move JP, KP, KR traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111445 [13:36:31] yes scfc_de files/searchqa/lib/searchqa.pm [13:37:03] (03CR) 10BBlack: [C: 032 V: 032] Move JP, KP, KR traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111445 (owner: 10BBlack) [13:37:06] ok. So, back to question two… shouldn't 'data' aka eth4.1102 have an ip? 
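The realm.pp fallback quoted above, and the guess that facter's plain `ipaddress` fact follows the alphabetically first interface, can be mirrored in a few lines. This is a Python sketch for illustration only: the fact values are the ones quoted in this log, the function name is hypothetical, and the alphabetical-order behaviour is the hypothesis voiced in the channel, not something verified against facter.

```python
def main_ipaddress(facts):
    """Mirror the realm.pp order: eth0 first, then bond0, then plain ipaddress."""
    for key in ("ipaddress_eth0", "ipaddress_bond0", "ipaddress"):
        value = facts.get(key)
        if value is not None:
            return value
    return None


# Fact values as reported for labnet1001 in the log.
facts = {
    "ipaddress": "10.64.22.11",       # what icinga ended up with
    "ipaddress_eth0": "10.64.20.13",  # what DNS resolves to
}

# Interface list as printed by facter in the log.
interfaces = (
    "br_ex,br_int,eth0,eth1,eth2,eth3,eth4,eth5,"
    "eth4_1102,eth5_1122,lo,ovs_system"
).split(",")

if __name__ == "__main__":
    print(main_ipaddress(facts))  # the realm.pp logic prefers the eth0 address
    print(sorted(interfaces)[0])  # alphabetically first interface name
```

With these inputs the realm.pp-style lookup yields the eth0 address, while an alphabetical pick lands on br_ex, which is consistent with icinga having been handed the bridge's address.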
[13:37:22] yes [13:37:47] and it should be an ip in the "virt-guests" subnet [13:37:49] although this is assuming the old labs setup [13:37:50] it may be different with the multiple networks in neutron [13:37:57] ok, and… is there currently a virt-guests subnet? [13:38:03] i'll check [13:38:27] (03PS1) 10Hashar: contint: slave-scripts are deployed via git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/111446 [13:38:29] (03PS1) 10Hashar: contint: fix slave-scripts deployment on labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/111447 [13:38:56] matanya: That's used by files/searchqa/bin/analyse_test_results, for example. ".pm" is the extension for Perl modules => "use searchqa;" loads searchqa.pm. [13:39:16] it's configured on the switch [13:39:19] let me check subnets in dns... [13:39:35] thanks scfc_de [13:39:47] andrewbogott: ; 10.68.16.0/24 - labs-instances1-b-eqiad [13:40:09] that assumes 256 instances max though [13:40:11] seems tight [13:40:16] let's see what we have in tampa right now [13:41:06] we have 400 some instances already. [13:42:13] yes [13:42:49] ; 10.4.0.0/21 - guest VMs subnet [13:42:53] that's a /21 [13:42:56] 2048 hosts [13:44:42] At some point I wrote "10.68.0.0/20 - eqiad labs fixed IPs (allocated range)" and "10.68.16.0/24 eqiad labs fixed IPs (actual used range as all hosts are in row B) " Not sure if Leslie dictated that to me or what. [13:45:07] well /24 will be too tight either way [13:45:11] yep [13:45:19] man that zonefile is a mess [13:45:22] we really ought to clean that up [13:46:03] so this really depends on what we're gonna do with multiple networks or not [13:46:10] do you think you could experiment with this /24 [13:46:15] and assume we'll redo this for the final install? [13:46:29] i can reallocate this but it doesn't make sense until we know what we're gonna do exactly [13:46:33] we may need an even bigger range than that [13:46:47] sure, it'll be a while before we have >256 instances. 
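The subnet sizes compared above can be checked with Python's standard `ipaddress` module. This is a small sketch; the helper names are hypothetical, and "usable" here simply subtracts the network and broadcast addresses, which is why the round figures quoted in the channel (256 and 2048) are slightly higher than what instances could actually use.

```python
import ipaddress


def usable_hosts(cidr):
    """Number of assignable addresses, excluding network and broadcast."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


def fits(cidr, instances):
    """True if `instances` hosts fit into the subnet."""
    return usable_hosts(cidr) >= instances


if __name__ == "__main__":
    for cidr in ("10.68.16.0/24", "10.4.0.0/21"):
        net = ipaddress.ip_network(cidr)
        print(cidr, net.num_addresses, usable_hosts(cidr), fits(cidr, 400))

    # The two eqiad ranges quoted from the zonefile notes do not nest:
    used = ipaddress.ip_network("10.68.16.0/24")
    allocated = ipaddress.ip_network("10.68.0.0/20")
    print(used.subnet_of(allocated))
```

The /24 cannot hold the 400-odd instances mentioned, while a /21 can. The last check also shows that 10.68.16.0/24 lies just outside 10.68.0.0/20 as written, which may be worth a second look when the ranges are reallocated.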
[13:46:55] at this very moment you can assume that 10.68 is labs in eqiad [13:47:14] well we should change this before we actually migrate instances other than testing [13:47:20] So, I should assign eth4.1102 to 10.68.16.0? [13:47:25] * andrewbogott nods [13:47:32] .1 [13:47:38] .0 is the "network address" [13:48:19] (03PS1) 10Andrew Bogott: Assign eth4.1102 ('data') an IP. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111448 [13:48:38] and you should probably reserve a few more addresses [13:48:42] say we'll get more network nodes, 2-4 [13:48:49] so labnet1001 is .1, labnet1002 is .2... [13:49:03] it's not really important but it's intuitive [13:49:18] actually don't worry about it, this is testing, it's not gonna stay in that /24 ;) [13:52:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 minor comment, LGTM otherwise. I see no reason not to merge it ASAP. objections ?" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [13:54:04] (03PS1) 10Andrew Bogott: Allocate 10.68.16.1 for yet another labnet1001 interface. [operations/dns] - 10https://gerrit.wikimedia.org/r/111449 [13:54:48] mark, have a look at that last one? [13:54:52] I'll be back in 5… [13:54:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 pedantic comment, LGTM otherwise" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/110931 (owner: 10Matanya) [13:56:46] Pick one 5 [s]econds, [m]inutes, [d]ays, [M]onths, [y]ear] : [13:57:07] (03CR) 10Mark Bergsma: Allocate 10.68.16.1 for yet another labnet1001 interface. (031 comment) [operations/dns] - 10https://gerrit.wikimedia.org/r/111449 (owner: 10Andrew Bogott) [13:57:07] !log compressing Jenkins console logs on gallium.wikimedia.org using gzip -9 [13:57:16] Logged the message, Master [13:57:25] andrewbogott: that works, but I think i'd stick to the common convention for routers, i.e. 
put interface address before hostname with a dot [13:57:36] !log remove XO-Level3 avoided as-path from cr1/2-eqiad despite no ticket reply; seems to work now [13:57:44] Logged the message, Master [13:59:37] (03PS3) 10Matanya: spamd: change to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/110931 [14:00:46] (03CR) 10Matanya: "one more minor from me." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [14:03:16] if someone has a few spare minutes, could use three merges for contint project: https://gerrit.wikimedia.org/r/#/q/is:open+topic:contint,n,z [14:03:27] impacts labs instances [14:03:58] one culprit is https://gerrit.wikimedia.org/r/111447 which uses git::clone() to deploy some repositories on labs because git-deploy can't be used on labs instances [14:04:44] mark, I should make that change in both files, right? [14:05:03] yes [14:05:13] look at any of the cr1-eqiad etc stuff in the zonefiles for examples [14:05:53] (03PS2) 10Andrew Bogott: Allocate 10.68.16.1 for yet another labnet1001 interface. [operations/dns] - 10https://gerrit.wikimedia.org/r/111449 [14:06:09] essentially labnet1001 is just an advanced router [14:06:33] mark, ^ ? [14:06:48] whoah, wait, that's not right. Hang on... [14:09:18] (03PS3) 10Andrew Bogott: Allocate 10.68.16.1 for yet another labnet1001 interface. [operations/dns] - 10https://gerrit.wikimedia.org/r/111449 [14:09:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] icinga: remove check_ram.sh doesn't seem to be used anywhere (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 (owner: 10Matanya) [14:10:35] (03CR) 10Alexandros Kosiaris: [C: 032] spamd: change to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/110931 (owner: 10Matanya) [14:10:44] (03CR) 10Andrew Bogott: [C: 032] Assign eth4.1102 ('data') an IP. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111448 (owner: 10Andrew Bogott) [14:10:54] !log ms-be1002/sdd: megacli -DiscardPreservedCache, -CfgEachDskRaid0, puppet run [14:11:02] Logged the message, Master [14:11:49] hm, akosiaris, if you're doing puppet-merge, go ahead and include my patch as well. [14:12:36] andrewbogott: done [14:12:39] thx [14:14:24] (03PS2) 10Matanya: icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 [14:15:02] (03CR) 10jenkins-bot: [V: 04-1] icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 (owner: 10Matanya) [14:15:36] ottomata: cp3019 puppet freshness? [14:15:58] (03PS3) 10Matanya: icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 [14:16:52] (03CR) 10Mark Bergsma: [C: 031] Allocate 10.68.16.1 for yet another labnet1001 interface. [operations/dns] - 10https://gerrit.wikimedia.org/r/111449 (owner: 10Andrew Bogott) [14:17:16] good morning paravoid! [14:17:17] :) [14:17:30] hi [14:17:52] (03PS4) 10Matanya: icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 [14:19:17] well, puppet freshness is better than flapping delivery error [14:19:17] hm [14:20:28] rtt is more normal now too [14:20:29] hm [14:20:46] matanya: could you chime in on https://rt.wikimedia.org/Ticket/Display.html?id=6773 ? [14:21:33] so, Snaps and I spent a bunch of time trying to understand why rtt would be different for cp3019 than for other hosts [14:21:37] didn't find anything [14:21:38] drdee_: replied on ticket [14:22:10] so, I reverted my change on cp3019 that just upped max buffer size, and instead increased batch size so that it would try to batch about once a second, just like it does on mobiles right now [14:22:17] thanks matanya ! 
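The batching change described above (send when the batch is full or when about a second has passed, whichever comes first) can be sketched as follows. This is a simplified Python illustration, not how varnishkafka or librdkafka implement it: the class, its parameter names and the injectable clock are assumptions, and the time limit is only checked when a new message arrives, whereas a real producer would also flush from a timer.

```python
import time


class Batcher:
    """Collect messages and flush on size or age, whichever comes first."""

    def __init__(self, max_messages, max_wait_s, send, clock=time.monotonic):
        self.max_messages = max_messages
        self.max_wait_s = max_wait_s
        self._send = send      # callable that receives a list of messages
        self._clock = clock    # injectable so the logic can be tested
        self._buf = []
        self._started = None

    def add(self, msg):
        if not self._buf:
            # The age of a batch is measured from its first message.
            self._started = self._clock()
        self._buf.append(msg)
        full = len(self._buf) >= self.max_messages
        stale = self._clock() - self._started >= self.max_wait_s
        if full or stale:
            self.flush()

    def flush(self):
        if self._buf:
            self._send(self._buf)
            self._buf = []


if __name__ == "__main__":
    batcher = Batcher(max_messages=3, max_wait_s=1.0, send=print)
    for line in ["req1", "req2", "req3", "req4"]:
        batcher.add(line)
    batcher.flush()  # push out whatever is left
```

Sizing the batch to roughly one second of traffic keeps the number of round trips per second low, which matters more as the round-trip time to the brokers grows.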
[14:22:32] cp3019 does about 6K reqs / sec, so I upped the batch size to that, or every second, whichever comes first [14:22:32] np drdee_ as you can see not much is left [14:23:11] great! how did you create that task list? [14:23:31] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:23:45] paravoid: did it for me [14:23:50] k [14:24:01] looks like syslog output [14:24:24] matanya: Re https://gerrit.wikimedia.org/r/111440/, are there really no checks for free memory at the moment? [14:24:52] i didn't find anyone using it [14:25:19] Hmmm. Possible, but strange. [14:26:44] matanya: Re "#after absent everywhere, remove entire resource as it is not used", if you put a "TODO: " before that it's easier to find. [14:27:24] yeah, next time [14:27:32] interface::ip { 'openstack::data_interface': interface => 'eth4.1102', address => '10.68.16.1', prefixlen => '24' } <- why does augeas hate that? [14:27:39] It does everything I want, then errors thereafter :( [14:28:21] oh, hm... [14:32:36] (03PS1) 10Andrew Bogott: Assign eth4.1102 ('data') an IP within the 'tagged' resource. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111451 [14:33:41] (03PS2) 10Andrew Bogott: Assign eth4.1102 ('data') an IP within the 'tagged' resource. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111451 [14:34:00] mark, does it matter that ^ no longer specifies a prefixlen? [14:35:19] yeah I do think you need that [14:35:41] so, what is currently creating that interface, br-ex? [14:36:53] It's been interface::tagged for a while, I'm just adding the ip. [14:37:03] (03PS5) 10Matanya: icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 [14:37:28] So, since interface::tagged doesn't take a prefixlen argument… [14:37:49] Maybe I can just set the actual address to 10.68.16.1/24?
[14:38:39] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: remove check_ram.sh doesn't seem to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 (owner: 10Matanya) [14:40:19] checking [14:40:39] it takes a netmask argument [14:40:52] /24 is equivalent to netmask => "255.255.255.0" [14:40:59] sorry for the inconsistency [14:41:40] (03PS1) 10Matanya: icinga: remove check_ram.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/111452 [14:41:49] (03PS3) 10Andrew Bogott: Assign eth4.1102 ('data') an IP within the 'tagged' resource. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111451 [14:42:19] andrewbogott: btw, if you hit too many issues with getting _puppet_ to do what you want, I think you could circumvent puppet (for now) and do some things manually [14:42:27] first priority I think is to see if we can get neutron to do what we need at all [14:42:34] yeah, I'm already doing that in a couple of cases. [14:42:36] if we can confirm that it does, we can worry about how to do that in puppet later [14:42:37] ok :) [14:42:59] But on the off-hand chance that it works, I want to be able to do it a second time! [14:43:05] sure [14:43:19] it's good to do that where you can easily, just don't let it stop you [14:43:29] (03CR) 10Andrew Bogott: [C: 032] Assign eth4.1102 ('data') an IP within the 'tagged' resource. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111451 (owner: 10Andrew Bogott) [14:43:41] augeas can be a real bitch to get it to do what you need, and in the end it's just editing a config file [14:44:05] i spent quite some time getting those interface:: definitions right, and they will certainly not work well for all cases [14:45:36] (03CR) 10Ottomata: [C: 031] webserver: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/110454 (owner: 10Matanya) [14:46:04] (03CR) 10Matanya: [C: 04-1] "Please don't merge until 12.2.2014. 
after absent from all servers" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111452 (owner: 10Matanya) [14:47:23] I don't remember, which box do dns patches get merged on? [14:48:47] andrewbogott: ns{0,1,2}, anyone you like more [14:48:55] 'k thx [14:50:58] (03CR) 10Andrew Bogott: [C: 032] Allocate 10.68.16.1 for yet another labnet1001 interface. [operations/dns] - 10https://gerrit.wikimedia.org/r/111449 (owner: 10Andrew Bogott) [14:57:20] ottomata: did that fix it? [15:00:12] Snaps: i'm not 100% sure if that fixed, all i know is that is all I have changed and it isn't happening now [15:00:16] hmm, lets revert it and see? [15:00:41] ottomata: cp3019 also complained about puppet freshness, related? [15:01:03] yes, because I turned off puppet [15:01:05] to make the change [15:01:08] without it being reverted [15:01:10] ah :) [15:01:35] so now you're suggesting to revert to the same values the other hosts use and see if the problems reoccur? [15:02:06] if someone has a few spare minutes, could use three merges for contint project: https://gerrit.wikimedia.org/r/#/q/is:open+topic:contint,n,z :-D [15:02:53] ottomata: it does make sense that a larger batch size helps when the roundtrip is high though [15:04:32] (03PS1) 10Andrew Bogott: Added support for the metadata secret. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111453 [15:04:49] yes, but it doesn't make sense as to why they other hosts dont' have that problem [15:04:51] (03CR) 10jenkins-bot: [V: 04-1] Added support for the metadata secret. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111453 (owner: 10Andrew Bogott) [15:05:35] hashar: write a bot to do that [15:05:39] ottomata: yeah, so the root cause is still unknown. I wish there was a way to get detailed stats for a specific tcp socket in linux. [15:06:51] Snaps: there is [15:06:51] (03PS2) 10Andrew Bogott: Added support for the metadata secret. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111453 [15:06:52] use ss [15:08:06] matanya: sweet, didn't know that! [15:08:22] (03CR) 10Andrew Bogott: [C: 032] Added support for the metadata secret. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111453 (owner: 10Andrew Bogott) [15:08:34] Snaps: glad i can be useful :) [15:10:40] ottomata: ss -mite '( dport = :9092 )' [15:12:02] ottomata: ^ that's what we want when things are going badly, and compare that to one of the good hosts [15:12:09] interesting! [15:12:09] ok [15:12:17] I will reenable puppet on cp3019, let's see what happens [15:12:54] RECOVERY - Puppet freshness on cp3019 is OK: puppet ran at Wed Feb 5 15:12:51 UTC 2014 [15:14:38] Snaps: next time just ask :P [15:27:19] (03PS1) 10Ottomata: Parameterizing queue_buffering_max_ms and batch_num_messages [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/111455 [15:40:08] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [15:40:09] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [15:40:10] PROBLEM - check_mysql on payments1003 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [15:40:44] ^^ I'm aware of this [15:44:46] matanya: I found out where check_ram is used: On Labs :-). See #wikimedia-labs-nagios.
[15:45:08] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [15:45:09] RECOVERY - check_mysql on payments1003 is OK: Uptime: 254 Threads: 5 Questions: 4304 Slow queries: 143 Opens: 868 Flush tables: 1 Open tables: 63 Queries per second avg: 16.944 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [15:45:09] RECOVERY - check_mysql on payments1004 is OK: Uptime: 251714 Threads: 3 Questions: 35274 Slow queries: 357 Opens: 1369 Flush tables: 1 Open tables: 61 Queries per second avg: 0.140 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [15:49:08] (03CR) 10Tim Landscheidt: "Turns out this is used by Icinga Labs; cf. #wikimedia-labs-nagios or http://icinga.wmflabs.org/icinga/." [operations/puppet] - 10https://gerrit.wikimedia.org/r/111440 (owner: 10Matanya) [15:50:02] (03PS1) 10Tim Landscheidt: Revert "icinga: remove check_ram.sh doesn't seem to be used anywhere" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111457 [15:50:08] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [15:50:44] (03PS2) 10Tim Landscheidt: Revert "icinga: remove check_ram.sh doesn't seem to be used anywhere" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111457 [15:51:45] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "icinga: remove check_ram.sh doesn't seem to be used anywhere" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111457 (owner: 10Tim Landscheidt) [15:51:55] (03CR) 10Ottomata: [C: 032 V: 032] Adding parameterization for open file descriptor ulimit [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/110592 (owner: 10Ottomata) [15:55:08] RECOVERY - check_mysql on payments1002 is OK: Uptime: 975 Threads: 3 Questions: 12828 Slow queries: 97 Opens: 912 Flush tables: 1 Open tables: 44 Queries per second avg: 13.156 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [15:59:46] (03PS1) 10Ottomata: Upping nofiles ulimit for 
production kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/111458 [16:00:48] thanks scfc_de [16:14:46] (03PS1) 10Phuedx: Enable the GettingStarted extension on non-enwiki wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 [16:14:53] (03PS1) 10Andrew Bogott: Set up three separate neutron rolls: controller, netnode, compute. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111461 [16:23:55] (03PS2) 10Andrew Bogott: Set up three separate neutron rolls: controller, netnode, compute. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111461 [16:26:09] (03CR) 10Andrew Bogott: [C: 032] Set up three separate neutron rolls: controller, netnode, compute. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111461 (owner: 10Andrew Bogott) [16:34:42] Coren: https://etherpad.wikimedia.org/p/labs_migration [16:34:58] PROBLEM - MySQL Slave Delay Port 3306 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:34:59] PROBLEM - MySQL Slave Running Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:34:59] PROBLEM - Disk space on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:08] PROBLEM - MySQL Slave Running Port 3306 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:08] PROBLEM - MySQL Idle Transactions Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:08] PROBLEM - DPKG on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:08] PROBLEM - MySQL Idle Transactions Port 3307 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:47] (03PS1) 10Andrew Bogott: Added metadata-service. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111462 [16:35:48] RECOVERY - MySQL Slave Delay Port 3306 on labsdb1003 is OK: OK replication delay 0 seconds [16:35:49] RECOVERY - MySQL Slave Running Port 3308 on labsdb1003 is OK: OK replication [16:35:49] RECOVERY - Disk space on labsdb1003 is OK: DISK OK [16:35:58] RECOVERY - MySQL Slave Running Port 3306 on labsdb1003 is OK: OK replication [16:35:58] RECOVERY - MySQL Idle Transactions Port 3308 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for 0 seconds [16:35:58] RECOVERY - MySQL Idle Transactions Port 3307 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for 0 seconds [16:35:58] RECOVERY - DPKG on labsdb1003 is OK: All packages OK [16:42:24] (03CR) 10Andrew Bogott: [C: 032] Added metadata-service. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111462 (owner: 10Andrew Bogott) [16:49:25] (03CR) 10GWicke: Bug 60694: Make the config file path configurable (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [16:50:39] (03PS3) 10GWicke: Bug 60694: Make the config file path configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 [16:51:19] (03CR) 10GWicke: Bug 60694: Make the config file path configurable (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [16:55:18] paravoid, coren: https://rt.wikimedia.org/Ticket/Display.html?id=6752 (Which I closed prematurely, you should reopen.) [16:55:57] andrewbogott: {{done}}. [16:58:49] andrewbogott: I see you already did the ensure => absent in puppet. Do we know how long we keep the class usually? [16:59:15] Coren, I don't know, but there are some pretty old ones in there. Maybe forever? [16:59:32] Might make sense to keep the UID around anyways. [16:59:35] (03PS1) 10Andrew Bogott: Best to assign a value to neutronconfig before passing it. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111465 [17:01:32] (03CR) 10Andrew Bogott: [C: 032] Best to assign a value to neutronconfig before passing it. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111465 (owner: 10Andrew Bogott) [17:08:22] mark, if you don't feel strongly about gre vs. vlan then I may opt for gre for now, because it has fewer moving parts. [17:08:31] (now that I've actually read the instructions) [17:08:53] if it's entirely internal to neutron then I don't really care [17:14:52] (03PS1) 10Ottomata: Removing references to metrics.wikimedia.org puppetization [operations/puppet] - 10https://gerrit.wikimedia.org/r/111466 [17:15:39] (03CR) 10Ottomata: [C: 032 V: 032] Removing references to metrics.wikimedia.org puppetization [operations/puppet] - 10https://gerrit.wikimedia.org/r/111466 (owner: 10Ottomata) [17:17:13] Is $::ipaddress what it sounds like? [17:17:53] andrewbogott: if you are ever curious about a facter variable [17:17:55] you can run [17:17:57] facter ipaddress [17:17:59] or whatever var [17:18:00] on the host [17:18:01] and see the value [17:18:11] or just [17:18:11] facter [17:18:13] with no args [17:18:16] and see the list of all of them [17:18:36] (or maybe you already know that!) [17:21:09] ottomata: yeah, I guess I was worried that that wasn't straight from factor, but surely it is. [17:22:18] (03CR) 10Ottomata: [C: 032 V: 032] Upping nofiles ulimit for production kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/111458 (owner: 10Ottomata) [17:26:27] (03PS1) 10Andrew Bogott: Pass the data interface ip down to the vhost plugin config. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111469 [17:28:13] (03PS1) 10Ottomata: Using unified diff output for puppet-merge when showing submodule diffs [operations/puppet] - 10https://gerrit.wikimedia.org/r/111470 [17:30:10] (03CR) 10Andrew Bogott: [C: 032] Pass the data interface ip down to the vhost plugin config. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111469 (owner: 10Andrew Bogott) [17:33:16] (03PS1) 10Andrew Bogott: Simple typo fix: s/ip_address/ipaddress [operations/puppet] - 10https://gerrit.wikimedia.org/r/111471 [17:34:48] (03CR) 10Andrew Bogott: [C: 032] Simple typo fix: s/ip_address/ipaddress [operations/puppet] - 10https://gerrit.wikimedia.org/r/111471 (owner: 10Andrew Bogott) [17:38:53] (03PS1) 10Andrew Bogott: There's no neutron-plugin-openvswitch-agent service on the controller. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111473 [17:42:08] (03CR) 10Andrew Bogott: [C: 032] There's no neutron-plugin-openvswitch-agent service on the controller. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111473 (owner: 10Andrew Bogott) [17:58:53] (03PS2) 10Ottomata: Using unified diff output for puppet-merge when showing submodule diffs [operations/puppet] - 10https://gerrit.wikimedia.org/r/111470 [17:58:58] (03CR) 10Ottomata: [C: 032 V: 032] Using unified diff output for puppet-merge when showing submodule diffs [operations/puppet] - 10https://gerrit.wikimedia.org/r/111470 (owner: 10Ottomata) [18:03:32] (03PS1) 10Andrew Bogott: Redefine auth_uri in the keystone_auth_token section. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111477 [18:05:36] (03CR) 10Andrew Bogott: [C: 032] "It quashes a warning!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111477 (owner: 10Andrew Bogott) [18:15:06] (03CR) 10Krinkle: "Can't tell if you did, but did you run createTxtFileSymlinks.sh? If that makes the same change then that's good. If it makes other changes" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111139 (owner: 10Jforrester) [18:18:52] greg-g, deploying zero & bits [18:19:31] kk [18:22:25] dr0ptp4kt, the last patch in brion's at @tin:/a/common/docroot/bits/WikipediaMobileFirefoxOS [18:23:42] yurik, do you mean "is"? 
that's the one that needs to be updated to the tip of master [18:24:38] yurik, smiling right now because i see timo just emailed the eng list about submodule updates [18:25:30] * brion murmurs [18:25:34] i have been summoned! [18:26:46] * Nemo_bis hides the magic lamp [18:32:54] (03CR) 10Alexandros Kosiaris: [C: 032] Bug 60694: Make the config file path configurable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [18:33:37] akosiaris, hopefully this works for production [18:34:28] (03PS1) 10Yurik: Updated FirefoxOS app [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111487 [18:34:45] (03CR) 10Yurik: [C: 032 V: 032] Updated FirefoxOS app [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111487 (owner: 10Yurik) [18:38:06] mark, the neutron components are now communicating enough that I can run a command and be told that I'm violating a security policy. [18:38:19] So I'm declaring victory for now and going to sleep :) [18:38:26] !log yurik synchronized docroot/bits/WikipediaMobileFirefoxOS [18:38:33] Logged the message, Master [18:38:39] dr0ptp4kt, ^ [18:38:56] yurik, k, will try to reinstall on phone [18:40:42] yurik, ff's app marketplace is broken with an xml parsing error. okay, i guess i'll be trying the simulator instead [18:41:07] dr0ptp4kt, should i revert? 
[18:41:18] yurik, no...it's their entire marketplace [18:44:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] etherpad: convert into a module (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [18:47:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] site: lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/109507 (owner: 10Matanya) [18:48:45] (03PS2) 10Jforrester: Add visualeditor-default.dblist to the noc list of files [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111139 [18:49:10] !log yurik synchronized php-1.23wmf11/extensions/ZeroRatedMobileAccess/ [18:49:18] Logged the message, Master [18:49:30] dr0ptp4kt, ^ [18:51:47] yurik, seems the ff app deployed just fine. thanks! i'll check the api [18:52:08] (03PS22) 10Matanya: site: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109507 [18:52:21] dr0ptp4kt, there is an annoying bug - an empty array is shown as an empty list (known issue of the api) [18:52:30] (03CR) 10Cmcmahon: "I am still seeing Bug 60694 in beta labs 30 minutes after this was merged." [operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [18:52:42] we can fix it with adding "disabled:true" or something like that [18:53:41] yurik, yeah...i don't want to go up to the core json processing and start returning empty {} for fear of breaking other stuff. realistically, any consumer of the system can just error out safely upon receiving an array instead of an object, though, too. [18:55:18] dr0ptp4kt, that's why i said we can return some default value "enabled:false" that will force it to be an object [18:57:47] !log yurik synchronized php-1.23wmf12/extensions/ZeroRatedMobileAccess/ [18:57:56] Logged the message, Master [19:03:23] yurik, recall the original patch did just that :) granted, the inbound parameter format is different [19:03:55] dr0ptp4kt, hehe, who knew ;) [19:04:52] yurik! 
alright, well, the api does seem to be working in production. hoping to get some feedback on the X-Zero-Rated: 1 header so we can get that other guy in production maybe with an LD if there's a window for it. but MaxSem needs to apply his wizardry to this. [19:14:01] (03CR) 10Jforrester: "Didn't, does now." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111139 (owner: 10Jforrester) [19:37:34] (03PS1) 10Dzahn: remove grosley from dhcp,dsh and backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/111505 [19:38:21] mutante: can you help with https://bugzilla.wikimedia.org/show_bug.cgi?id=60902 please :) [19:39:52] (03Abandoned) 10Matanya: icinga: remove check_ram.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/111452 (owner: 10Matanya) [19:43:03] matanya: should be done [19:43:35] mutante: thanks. and a question please, how is a uid assgined to a user in admins.pp? [19:44:17] matanya: nowadays, you look it up in labs LDAP [19:44:30] do i have access to that? [19:44:38] before: you looked up the highest one and added 1 [19:44:50] no. formey [19:45:17] well, you have, if you are on the same box with that labs user [19:45:24] ok so what is Sahar Massachi 's uid ? [19:45:29] dunno [19:45:42] what's the labs user name [19:46:24] sahar [19:46:27] maybe labsconsole could expose that information [19:46:28] i think [19:47:07] no, ehm, i don't see a user with that name [19:47:25] https://rt.wikimedia.org/Ticket/Display.html?id=6767 [19:47:56] yea, requirement: needs to create labs user [19:48:14] and then tell us which [19:48:58] and dunno know about the key confirmation thing [19:49:12] gpg signed would be cool, heh [19:49:15] ok, i'll replay on ticket [19:49:19] thanks [19:49:48] matanya: bugzilla login still works, right [19:50:01] someone took it, not access anymore to rt [19:50:02] after that merge (pssstt.. 
:) [19:50:29] yes, it does :) [19:50:30] because it's an access request [19:50:33] thanks a lot for that [19:50:47] great, don't tell everybody, haha [19:51:04] means? why access is taking my ability to view? [19:51:22] https://office.wikimedia.org/wiki/RT [19:51:38] don't have access to there too ... :P [19:51:49] * matanya is the most accessless steward :) [19:52:19] i don't know how it works [19:52:29] office access for NDA volunteer [19:52:47] and bo ken to ask [19:52:50] *no [19:53:20] well, wiki admins ? [19:53:27] matanya: I can relate to that... rt has always been a black hole for me too... [19:53:47] RT isnt the issue, office wiki is [19:53:50] I have access to RT hoo [19:53:50] right now [19:54:26] or request that to be pasted on wikitech [19:54:27] i dont care [19:54:37] it's documenting all the permissions [19:54:50] mutante: Oh... ok, I've never wanted much from there, except that some stuff on wikitech links to it [19:55:04] but I usually assume that to be non important if it's on officewiki :P [19:55:35] whatever [19:55:41] i spent quite some time pasting all the permissions, even the SQL dump [19:55:50] make me put it on piublic, suggest changes, kthx [19:56:36] and ops-requests has always been public [19:57:13] mutante: you mean ops requests on rt? [19:57:23] yes [19:57:55] Last time I wanted access to rt, everyone was like: That's totally easy, but I don't know how... and then I gave up [19:58:23] did you request it on a ticket or was that just IRC? [19:58:35] mutante: Just over here [19:58:52] the best way to request access to RT is to use it :) [19:59:05] you already have an autogenerated user by just mailing it [19:59:17] if you need more permissions, request them and we'll handle it [20:00:57] Someone at ... 
requested a password reset for you on https://rt.wikimedia.org/ [20:00:57] Your new password is: [20:00:59] matanya: hoo, same info in public [20:01:03] https://wikitech.wikimedia.org/wiki/RT [20:01:04] it's empty, literally [20:01:23] know that by heart [20:02:03] yes, known issue, just for some users [20:02:16] please just mail and can fix it for you [20:03:15] gwicke: :) [20:03:37] mutante: need my email? hoo@online.de [20:04:49] Requests: ops-requests@rt.wikimedia.org [20:08:59] greg-g, hm? [20:09:12] gwicke: re the cluster-wide atomicity [20:09:16] in that etherpad [20:09:27] ah, yeah ;) [20:09:42] this Einstein guy was right [20:10:18] mutante: Send a dummy mail to that... and now? [20:10:40] * Sent [20:10:49] gwicke: my long term goal for that issue is: have the decision on which version to serve be based not on wiki, but on user group [20:10:56] s/goal/pipe dream/ [20:11:44] hoo: now your request is handled as a ticket like any other [20:12:19] hoo: i suppose it was password request and asking for permissions to core-ops and you have NDA, right [20:12:31] Very sparse ticket, though... just send foo and bar :P [20:12:34] Yep, I'm NDAed [20:13:19] greg-g: would have to think about caching once we move towards serving the same cached HTML to logged-in users too [20:13:53] hoo: then that should be just fine, just a little patience and we'll process it. you'll get mail when it's touched [20:14:09] mutante: Ah, great :) Thanks [20:14:12] gwicke: in the deploy tool, you mean? [20:15:32] greg-g, more generally how this would interact with caching [20:15:52] * greg-g nods [20:15:54] yeah [20:16:30] hey, anybody on the channel able to do a cache flush of stuff under http://bits.wikimedia.org/WikipediaMobileFirefoxOS/ ? 
it's not an emergency or anything, but a ui bug will be able to be marked resolved once a couple of the assets are updated [20:17:27] (in the cache) [20:20:07] (03CR) 10Dzahn: [C: 032] remove grosley from dhcp,dsh and backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/111505 (owner: 10Dzahn) [20:29:01] ottomata: udplogging question [20:29:33] yessssss ask me! [20:29:36] in manifests/search.pp there is class server($indexer=false, $udplogging=true) [20:30:02] and the var is used in templates/lucene/lsearch.log4j.erb [20:30:33] yes [20:30:35] but the var itself is called from the class config in the .pp file [20:31:00] so this won't work in puppet 3 unless the var is moved to the config class [20:31:15] or hackish ugly use of lookup.var [20:31:48] is it possible to move that var to the config class in some way that won't break everything? [20:32:16] oof so ugly [20:32:24] as hell [20:32:42] hm [20:32:42] sure [20:32:44] easiest thing [20:32:51] and we are deprecating lucene anyway [20:32:53] make lucene::config take parameters too [20:32:57] and pass the variables down [20:33:27] I usually prefer to do it the opposite way from how it is here [20:33:31] e.g. class config($udplogging = true) [20:33:40] class lucene($all, $parameters) { do some config files } [20:34:03] not sure i got it [20:34:06] class lucene::server { Class['lucene'] -> Class['lucene::server'] start the server } [20:34:07] but [20:34:11] if you want fewer changes [20:34:13] which I think you do [20:34:15] yes [20:34:16] exactly [20:34:23] class config($udplogging = true) [20:34:31] and then in lucene::server [20:34:32] instead of [20:34:37] include lucene::server [20:34:37] do [20:34:52] class { 'lucene::server': udplogging => $udplogging } [20:34:59] sorry [20:35:04] s/server/config/ [20:35:19] yeah, that makes sense [20:35:20] class { 'lucene::config': udplogging => $udplogging } (instead of include lucene::config) [20:35:22] yeah [20:35:43] actually [20:36:03] I think it is always true, BTW [20:36:10]
as far as I can tell there is nothing that uses the lucene::config class [20:36:22] you could probably scrap it and just put the stuff from it directly in lucene::server [20:36:28] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:56] best way [20:37:18] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.48 ms [20:37:30] oo, i will merge your erbium change now too [20:37:42] thanks [20:37:55] oh, i need to add @ to the variable too, right? [20:38:05] in the erb file [20:38:13] (03PS3) 10Matanya: emery: RT #6143 move two logs to erbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/110382 [20:38:20] (03CR) 10Ottomata: [C: 032 V: 032] emery: RT #6143 move two logs to erbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/110382 (owner: 10Matanya) [20:38:29] yeah should [20:38:46] Tim-away, 38 PHP Fatal error: LuaSandboxFunction::call() [luasandboxfunction.call]: PANIC: unprotected [20:38:47] error in call to Lua API (not enough memory) in /usr/local/apache/common-local/php-1.23wmf11/extensions/Scribunto/engines/LuaSandbox/En [20:38:49] gine.php on line 158 [20:39:00] oh , hm matanya [20:39:03] https://gerrit.wikimedia.org/r/#/c/110382/3/templates/udp2log/filters.erbium.erb [20:39:08] did you mean to turn glam-nara back on on erbium? 
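Editorial sketch of the lucene::config suggestion from the 20:32-20:36 exchange: give the config class its own parameter and declare it resource-style from lucene::server, so the template reads a variable from its own class scope. This is a minimal illustration, not the actual change in Gerrit 111520. The class and parameter names come from the chat; the file path and class bodies are assumptions made for illustration only.

```puppet
# Hypothetical sketch: class bodies and the file path are assumed.
class lucene::config ($udplogging = true) {
    # The template can now read the value locally as <%= @udplogging %>
    # instead of reaching into another class's scope.
    file { '/etc/lsearch.log4j':  # placeholder path, not taken from the log
        content => template('lucene/lsearch.log4j.erb'),
    }
}

class lucene::server ($indexer = false, $udplogging = true) {
    # Resource-like declaration instead of "include lucene::config",
    # so the parameter is passed down explicitly.
    class { 'lucene::config':
        udplogging => $udplogging,
    }
}
```

As noted in the chat, if nothing else uses lucene::config, folding its contents directly into lucene::server would avoid the extra parameter altogether.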
[20:39:41] yes ottomata [20:39:50] ok [20:39:58] multichill asked for it to be transfered [20:43:45] (03PS1) 10Matanya: lucene: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111520 [20:47:38] (03CR) 10Ottomata: [C: 032 V: 032] Parameterizing queue_buffering_max_ms and batch_num_messages [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/111455 (owner: 10Ottomata) [20:49:29] mutante / Jeff_Green regarding grosley, you need to remove it from the backup misc script [20:50:04] nope, and that script can be deleted too [20:50:07] matanya: 6166 [20:50:21] drdee: 4196 [20:50:58] Jeff_Green: i'll push a remove patch [20:51:21] matanya: thanks! [20:52:13] Jeff_Green: any other script needs to be removed, if we are already here? :) [20:54:24] anything that isn't installed on aluminium can be removed [20:54:52] Jeff_Green: that class is on aluminium [20:54:56] i just removed it [20:55:28] offhost_backups? [20:55:36] (03PS1) 10Ottomata: [DO NOT MERGE] Setting batch_num_messages to 6000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/111523 [20:55:37] yes [20:55:55] that's fine. we're in the process of migrating off of aluminium [20:55:59] (03PS1) 10Matanya: offhost_backups: remove, per jeff green on IRC and [operations/puppet] - 10https://gerrit.wikimedia.org/r/111528 [20:56:07] Jeff_Green: ^^ [20:56:12] just don't make puppet actually remove the file or the cron job [20:57:25] it won't be removed, just not managed by puppet [20:57:37] to remove i would set it to absent [20:57:42] yep. 
i'll merge in a sec [20:58:06] (03CR) 10Ottomata: [C: 04-1] lucene: puppet 3 compatibility fix: fully qualify variable (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111520 (owner: 10Matanya) [20:58:08] (03CR) 10Jgreen: [C: 032 V: 031] offhost_backups: remove, per jeff green on IRC and [operations/puppet] - 10https://gerrit.wikimedia.org/r/111528 (owner: 10Matanya) [20:58:29] done [20:59:19] thanks [20:59:25] thank you [21:00:01] (03CR) 10Ori.livneh: [C: 04-1] "But not all logs contain private data.. and it has definitely been the case that running zgrep -c over a span of several months has helped" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111127 (owner: 10Chad) [21:01:40] (03PS2) 10Matanya: lucene: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111520 [21:02:34] (03CR) 10Ori.livneh: "Andrew, see " [operations/puppet] - 10https://gerrit.wikimedia.org/r/110943 (owner: 10Andrew Bogott) [21:03:24] (03PS3) 10BryanDavis: logstash: Improve filters [operations/puppet] - 10https://gerrit.wikimedia.org/r/110483 [21:03:33] (03CR) 10Ori.livneh: [C: 032 V: 032] logstash: Improve filters [operations/puppet] - 10https://gerrit.wikimedia.org/r/110483 (owner: 10BryanDavis) [21:04:21] ori: Thanks. I'll go force the puppet run on logstash1001 to make sure my patch works [21:05:05] * bd808 sees that puppet is already running there [21:05:27] bd808: yeah, already running on logstash* [21:05:37] bd808: done on all 3 [21:05:42] Do you do that with salt? [21:06:37] yes [21:07:01] neat [21:07:20] puppet-run() { ssh palladium -t -- salt "${*}" cmd.run "'puppetd -tv'" } [21:08:29] ori: logstash looks good. I'm seeing the new "normalized_message_untrimmed" tag show up on events [21:08:54] nice! [21:15:36] (03CR) 10Nemo bis: "The request for granularity made me think of this comment by Chris "It may be nice to somehow flag private data". 
(03CR) 10Odder: [C: 031] "If the privacy policy says 90 days, then it's 90 days. End of story." [operations/puppet] - 10https://gerrit.wikimedia.org/r/111127 (owner: 10Chad) [21:20:16] (03PS23) 10Matanya: site: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109507 [21:22:35] ugh [21:27:44] (03PS2) 10Hashar: contint: fix slave-scripts deployment on labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/111447 [21:27:46] (03PS2) 10Hashar: contint: slave-scripts are deployed via git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/111446 [21:32:14] (03PS1) 10Hashar: contint: on slave labs, install tox from pip [operations/puppet] - 10https://gerrit.wikimedia.org/r/111536 [21:34:33] (03CR) 10Hashar: "Tox would let us run python tests in labs instances. Examples usages are pywikibot and the various analytics utilities." [operations/puppet] - 10https://gerrit.wikimedia.org/r/111536 (owner: 10Hashar) [22:11:17] !log stopping puppet on analytics1021. Trying to get it to catch up on replica lag [22:11:25] Logged the message, Master [22:22:22] (03PS1) 10Ottomata: Adding Replica-MaxLag to Ganglia kafka view [operations/puppet] - 10https://gerrit.wikimedia.org/r/111617 [22:23:15] (03CR) 10Ottomata: [C: 032 V: 032] Adding Replica-MaxLag to Ganglia kafka view [operations/puppet] - 10https://gerrit.wikimedia.org/r/111617 (owner: 10Ottomata) [22:29:35] (03CR) 10GWicke: "See bug 60694." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111350 (owner: 10GWicke) [22:33:43] (03PS1) 10Matanya: mwlib: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/111619 [22:43:22] (03PS1) 10Dzahn: remove grosley (decom) [operations/dns] - 10https://gerrit.wikimedia.org/r/111621 [22:44:21] (03PS2) 10Dzahn: remove grosley (decom) [operations/dns] - 10https://gerrit.wikimedia.org/r/111621 [22:46:16] (03CR) 10Dzahn: "this is all pretty obvious, besides grosley was IN MX 20 for donate-lb and is gone now" [operations/dns] - 10https://gerrit.wikimedia.org/r/111621 (owner: 10Dzahn) [22:51:40] (03CR) 10LuisVilla: "To be clear, neither the current nor draft privacy policies say 90 days." [operations/puppet] - 10https://gerrit.wikimedia.org/r/111127 (owner: 10Chad) [22:54:40] (03CR) 10Dzahn: "looks all good except spaces/tabs, sigh i know there is probably a "retab mail.pp" change somewhere in the queue as well" [operations/puppet] - 10https://gerrit.wikimedia.org/r/110915 (owner: 10Matanya) [22:55:41] don't sound so poor :) [22:57:58] path conflict :p [23:00:30] (03PS2) 10Matanya: mail: change mailman check to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/110915 [23:00:57] this is becuase i do to many patches which conflict [23:01:09] half of my patches need rebase [23:01:41] (03PS3) 10Dzahn: mail: change mailman check to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/110915 (owner: 10Matanya) [23:02:34] narrfff [23:02:52] yeah, you ruined my rebase :) [23:03:22] or i did ruin yours? 
:D [23:03:35] * matanya is puzzled [23:03:52] some kind of race :p [23:04:09] i had to do manual rebase because of the path conflict [23:04:20] i solved that too i think [23:04:24] git rebase --continue etc [23:04:29] yeah [23:04:37] and i did wrong in one file [23:04:55] (03PS2) 10Ori.livneh: find-nearest-rsync: don't pick unreachable hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/111442 [23:04:55] so git reset to my version [23:05:03] (03CR) 10Tim Starling: [C: 032] find-nearest-rsync: don't pick unreachable hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/111442 (owner: 10Ori.livneh) [23:05:12] thanks [23:05:22] ori: you are serious about getting rid of dsh and scap [23:05:26] today [23:06:05] i'm serious about getting rid of some boobytrapped, crusty bash code in favor of python code that is easier to read and reason about [23:06:27] (03PS4) 10Dzahn: mail: change mailman check to use nrpe::monitor_service [operations/puppet] - 10https://gerrit.wikimedia.org/r/110915 (owner: 10Matanya) [23:06:27] sounds great! thanks :) [23:06:45] but i'm just clearing cobwebs, bd808 is going to actually modify it to match evolving requirements [23:07:45] and/or find a path to a completely different system [23:08:04] * ori nods [23:08:27] yeah, I created some flame about it earlier today, by accident [23:08:34] (03CR) 10Tim Starling: [V: 032] find-nearest-rsync: don't pick unreachable hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/111442 (owner: 10Ori.livneh) [23:09:48] mutante: while kicking jenkins' bottom, mind explaining what akosiaris meant in this comment? https://gerrit.wikimedia.org/r/#/c/107567/12/manifests/role/etherpad.pp [23:11:07] matanya: You just got a discussion started. I should actually thank you for that. [23:11:48] yeah well :) [23:16:54] matanya: you'll have to ask him, he says $etherpad_host is used somewhere before it's defined, but ...
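Editorial sketch of the ordering concern about $etherpad_host: a role class can resolve the realm-dependent value first and only then pass it to the module class, so the variable is assigned before it is referenced. This is a hedged illustration, not the content of Gerrit change 107567. The role class name and both hostnames are placeholders; only $etherpad_host, $::realm and the etherpad_host parameter appear in the log.

```puppet
# Hypothetical sketch: class name and hostnames are placeholders.
class role::etherpad {
    # Pick the realm-specific value before anything refers to it.
    case $::realm {
        'labs':       { $etherpad_host = 'etherpad.labs.example.org' }
        'production': { $etherpad_host = 'etherpad.example.org' }
        default:      { fail('unknown realm, should be labs or production') }
    }

    # Only now hand the resolved value to the module class.
    class { 'etherpad':
        etherpad_host => $etherpad_host,
    }
}
```

An if/else on $::realm would work just as well; the case form is shown only because its default branch makes an unexpected realm fail loudly.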
[23:18:02] doesn't look unclear to a novice like me, just move the if $::realm == 'labs' { etc. before etherpad_host => $etherpad_host, ? :)
[23:18:13] (or vice versa)
[23:19:44] no, i think he meant remove the call in the class and change in the condition to etherpad_host => $etherpad_host
[23:19:49] but not sure
[23:20:17] oh, yea, of course
[23:20:19] what Nemo_bis said
[23:20:24] and it's just style
[23:20:34] because what he said "not procedural"
[23:21:07] but in the role class, just do the $realm thing first
[23:25:03] so mutante the if should be before calling the class? thank Nemo_bis
[23:25:12] yea
[23:25:46] oh, now i understand
[23:25:51] or case
[23:26:08] yeah, when i changed it, it jumped to my face
[23:26:17] case $::realm { whatever you think is nicer
[23:26:30] yea, it doesn't actually matter, but for humans it's easier that way
[23:26:39] i don't like case
[23:27:25] i like it having a default, 'default': { fail('unknown realm, should be labs or production')
[23:27:28] shrug
[23:27:48] well, i can do it to make you happy
[23:28:00] don't, make Alex happy on that one
[23:28:21] it's not important
[23:28:38] already changed
[23:32:13] (PS13) Matanya: etherpad: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107567
[23:33:37] I'm going to start a no-op scap to test changes to the scap script; will be done before VE's LD.
[23:33:58] * bd808 gets popcorn
[23:34:40] !log ori started scap: (no message)
[23:34:48] Logged the message, Master
[23:37:14] (CR) Tim Starling: Replace easter egg by a more explaining message (2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110926 (owner: Hashar)
[23:42:04] (CR) Matanya: "changed if to case.
to make dzahn happy :)" [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya)
[23:42:56] (PS5) Dzahn: mail: change mailman check to use nrpe::monitor_service [operations/puppet] - https://gerrit.wikimedia.org/r/110915 (owner: Matanya)
[23:44:35] (CR) Dzahn: [C: 2] mail: change mailman check to use nrpe::monitor_service [operations/puppet] - https://gerrit.wikimedia.org/r/110915 (owner: Matanya)
[23:45:04] ori: WARNING: Revision range includes commits from multiple committers!
[23:45:07] yay:)
[23:45:10] thanks for that
[23:45:22] glad you like it :)
[23:45:25] but, actually
[23:45:34] well, this is a different case
[23:45:41] it is indeed multiple people
[23:45:46] but on a single gerrit change
[23:45:50] wiki style, you know
[23:46:02] as opposed to somebody forgetting to puppet-merge
[23:46:24] yeah, I still think it's useful to have a concise summary of the revision range that fits on the screen
[23:47:07] yea, PgUp anyways, and the warning is good
[23:48:01] maybe it's "multiple owners" vs. "multiple committers"
[23:48:15] in gerrit lingo
[23:49:02] i'd be happy to change it if you have a firm opinion
[23:49:10] just let me know what seems most accurate
[23:49:45] if it has multiple owners, (more than 1 patch set), it should be like "WARNING, somebody forgot to puppet-merge before you, are you sure"
[23:50:04] it's not always that
[23:50:06] if it has multiple committers, but just 1 gerrit id, it should just be like "
[23:50:13] sometimes people just merge other people's patches
[23:50:16] "multiple people worked on this one"
[23:50:58] PROBLEM - mailman on sodium is CRITICAL: NRPE: Command check_mailman not defined
[23:51:03] do we not have an rsync host in tampa?
[23:51:05] sigh, knew it
[23:51:07] timing issue
[23:51:14] re: mailman check
[23:51:50] i broke it?
[23:51:57] yes and no
[23:52:02] it's a timing thing
[23:52:10] it removed the old checkcommand
[23:52:14] before it has the new one
[23:52:33] (CR) Krinkle: [C: 2] Add visualeditor-default.dblist to the noc list of files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111139 (owner: Jforrester)
[23:52:39] strictly i should have asked you to do that separate
[23:52:45] adding new check, and removing old check
[23:52:48] (Merged) jenkins-bot: Add visualeditor-default.dblist to the noc list of files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111139 (owner: Jforrester)
[23:53:07] annoying
[23:54:50] mutante: why didn't it happen when alex merged https://gerrit.wikimedia.org/r/#/c/110931/ ?
[23:56:50] matanya: lucky or smarter about the order of running puppet on monitoring host and monitored host
[23:57:17] neon takes longer to add the new command than the monitored host takes to remove the local NRPE command
[23:58:07] i could have stopped puppet agent first there, then waited for neon, then re-enabled etc
[23:58:55] whatever
[23:59:04] emery is ready for shutdown
[23:59:23] last log move was merged by otto earlier.
[23:59:31] cool
[23:59:32] see 6143
[23:59:41] will do
[23:59:59] watching neon first, it added the new command now
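
(Editor's note) The mailman check timing issue described above can be illustrated with a small model. This is only a sketch under stated assumptions, not how puppet or NRPE are implemented: each step records which NRPE commands the monitored host defines and which command the monitoring host (neon) asks for. `check_mailman` is taken from the alert in the log; the new command name `check_mailman_new` and the helper `failing_steps` are hypothetical.

```python
# Toy model of the "Command check_mailman not defined" window.
OLD = "check_mailman"      # command name from the alert in the log
NEW = "check_mailman_new"  # hypothetical name, not from the log


def failing_steps(steps):
    """Return labels of steps where the monitoring host requests an
    NRPE command that the monitored host does not define."""
    return [
        label
        for label, host_commands, monitor_command in steps
        if monitor_command not in host_commands
    ]


# One change that swaps the command everywhere. The monitored host
# runs puppet before the monitoring host, so there is a gap.
single_change = [
    ("before merge", {OLD}, OLD),
    ("host ran puppet, neon has not", {NEW}, OLD),
    ("neon ran puppet", {NEW}, NEW),
]

# Two separate changes: add the new check first, remove the old later.
split_change = [
    ("before merge", {OLD}, OLD),
    ("host defines both commands", {OLD, NEW}, OLD),
    ("neon switched to new command", {OLD, NEW}, NEW),
    ("host dropped old command", {NEW}, NEW),
]

if __name__ == "__main__":
    print(failing_steps(single_change))
    print(failing_steps(split_change))
```

In this model the single change has one failing step and the split change has none, which matches the remark about adding the new check and removing the old check separately. Stopping the puppet agent on the monitored host until neon has caught up would have the same effect as reordering the steps.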