[00:00:03] 3operations: analyze Bugzilla access logs - https://phabricator.wikimedia.org/T86859#977836 (10Dzahn) 3NEW [00:00:04] RoanKattouw, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150115T0000). [00:01:55] marktraceur: you don't wanna be a part of the afternoon swat, right? [00:05:46] looks like there's nothing on the list anyway [00:06:30] greg-g: I have one thing to deploy for mobile, but I can do it myself [00:23:12] kaldari, thanks for deploying the Thanks fix. [00:26:56] superm401: NP [00:29:34] greg-g: Affirmative [00:31:08] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 24 data above and 8 below the confidence bounds [00:31:40] marktraceur: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=140961&oldid=140960 [00:32:07] ehh Warning: Failed connecting to redis server at fluorine.eqiad.wmnet: Connection timed out [00:32:17] greg-g: We should really have a template that has that list instead of copy/pasting everything [00:33:04] marktraceur: hmm, not bad, but then the archive isn't accurate as it'll retroactively update (on an accidental edit or some such) [00:34:38] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures [00:34:45] greg-g: So use {{subst}} [00:39:42] greg-g: Just tried to do a git fetch on tin and got the following warning: The authenticity of host '[gerrit.wikimedia.org]:29418 ([2620:0:861:3:208:80:154:81]:29418)' can't be established. RSA key fingerprint is dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1. [00:39:55] greg-g: should I continue? I’ve never seen that before. 
[00:40:12] when fetching on tin [00:42:20] Yeah, I think so [00:42:25] Some people seem to have that issue, some don't [00:43:34] I had to purge it from my ~/.ssh/known_hosts on tin [00:43:43] AFAIR [00:43:55] ugh, http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 is a mess of servers [00:47:55] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=fluorine.eqiad.wmnet&r=day&z=default&jr=&js=&st=1421282814&v=169860&m=mem_free&vl=KB&ti=Free%20Memory&z=large [00:48:17] ori: a bit worrisome...maybe that explains the xenon->redis warnings in hhvm.log? [00:48:52] 3Scrum-of-Scrums, operations: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#978214 (10bd808) I got poked today as a #scrum-of-scrums pingback to see why this is still open and marked as blocked by #ops. Here's my $0.02: Wikitech is using hetdeploy today as a result of wor... [00:51:13] there are like 2 clients usually connected to the redis server at any time...not much packet loss either [00:52:39] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [00:58:34] (03PS1) 10MaxSem: Fix variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185106 [00:58:57] I wonder if the subscriber died [00:58:59] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [00:59:28] (03PS2) 10MaxSem: Fix variable name to prevent warnings in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185106 [01:00:27] Is wikitechwiki down? 
[01:00:54] weee (Cannot access the database: Too many connections (208.80.154.18)) [01:04:01] (CR) MaxSem: [C: 2] Fix variable name to prevent warnings in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/185106 (owner: MaxSem) [01:06:14] (Merged) jenkins-bot: Fix variable name to prevent warnings in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/185106 (owner: MaxSem) [01:06:42] springle: any idea on wikitech? [01:07:22] kaldari: sorry, was in a hiring meeting :/ [01:07:39] marktraceur: ok, I'm mostly unfamiliar with subst use [01:08:35] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/185106 (duration: 00m 06s) [01:09:24] !log kaldari Synchronized php-1.25wmf15/extensions/Thanks/: syncing Thanks for wmf15 (duration: 00m 05s) [01:09:51] !log kaldari Synchronized php-1.25wmf14/extensions/Thanks/: syncing Thanks for wmf14 (duration: 00m 07s) [01:15:38] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 4 failures [01:18:32] (Cannot access the database: Too many connections (208.80.154.18)) on wikitech again [01:21:58] Coren, ^^^ [01:22:12] Bleh. Again? [01:22:30] That thing has been bouncing up and down for days. [01:24:23] Krenair: That should have fixed it. [01:24:49] Coren, just give wikidev the ability to push to that server directly? if we can already break wikipedia, wikitech is nothing in comparison :P [01:25:35] Not something I can do unilaterally, but I agree with the principle - and it's already been discussed. [01:25:43] +1 [01:26:45] what else runs on that host other than wikitech? 
[01:27:06] LDAP and Lab's puppetmaster [01:27:31] Well, also lots of nova stuff though I expect you included that implicitly in "wikitech" [01:27:53] There's also a salt master iirc [01:33:48] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [01:38:39] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [01:43:16] 3operations: Unbreak Xenon - https://phabricator.wikimedia.org/T86872#978413 (10ori) 3NEW a:3faidon [01:43:27] (03CR) 10Ori.livneh: "The app servers now fail to reach fluorine's redis instance, which is where xenon profiling data is aggregated. This is a bit annoying -- " [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [01:49:30] (03CR) 10Jforrester: [C: 031] New Search is default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185076 (owner: 10Amire80) [01:53:15] !log raise mysql max_connections to 250 on virt1000. lots of nova persistent connections, little activity [01:53:20] Logged the message, Master [01:54:44] (03CR) 10Manybubbles: [C: 031] New Search is default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185076 (owner: 10Amire80) [01:54:59] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [01:55:10] (03CR) 10Manybubbles: "It was new. Now its just default." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/185076 (owner: 10Amire80) [01:58:08] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:00:28] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 56 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 54, utimed_out: False, uactive_primary_shards: 57, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 115, uinitializing_shards: 2, unumber_of_data_nodes: 3} [02:00:48] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 56 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 54, utimed_out: False, uactive_primary_shards: 57, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 115, uinitializing_shards: 2, unumber_of_data_nodes: 3} [02:20:35] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s) [02:20:40] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-15 02:20:40+00:00 [02:20:41] Logged the message, Master [02:20:47] Logged the message, Master [02:21:55] (03PS1) 10Springle: depool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185115 [02:22:52] (03CR) 10Springle: [C: 032] depool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185115 (owner: 10Springle) [02:22:56] (03Merged) 10jenkins-bot: depool db1051 db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185115 (owner: 10Springle) [02:23:53] !log springle Synchronized wmf-config/db-eqiad.php: depool db1051 db1056 (duration: 00m 05s) [02:23:58] Logged the message, Master [02:26:18] !log tin mw1084 sync-file failed socket error, manual sync-common [02:26:22] Logged the message, Master [02:26:25] (03PS1) 10Jforrester: Beta Features: Disable the Compact 
Personal Bar feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 [02:27:08] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:09] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:12] (03CR) 10Jforrester: [C: 04-1] "Not just yet; this needs to be announced first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 (owner: 10Jforrester) [02:27:43] (03PS2) 10Jforrester: Beta Features: Remove the (now default) New Search feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185076 (owner: 10Amire80) [02:28:37] (03CR) 10Jforrester: "Disabling it everywhere is I6deedfd; probably going to happen next week or that thereafter. In this case, is it worth doing this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [02:29:48] PROBLEM - HHVM busy threads on mw1119 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4] [02:30:09] PROBLEM - HHVM queue size on mw1119 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [02:32:36] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s) [02:32:40] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-15 02:32:40+00:00 [02:32:43] Logged the message, Master [02:32:47] Logged the message, Master [02:34:58] PROBLEM - HHVM queue size on mw1119 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [80.0] [02:35:48] PROBLEM - HHVM busy threads on mw1119 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [86.4] [02:50:51] (03PS1) 10Tim Starling: Enable TitleBlacklist cache update logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185118 [02:55:59] (03CR) 10Tim Starling: [C: 032] Enable TitleBlacklist cache update logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185118 (owner: 10Tim Starling) [02:56:04] (03Merged) 
10jenkins-bot: Enable TitleBlacklist cache update logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185118 (owner: 10Tim Starling) [03:03:34] !log tstarling Synchronized wmf-config/InitialiseSettings.php: TitleBlacklist log (duration: 00m 06s) [03:03:40] Logged the message, Master [03:23:46] (03PS1) 10Springle: upgrade db1051 db1056 trusty mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/185120 [03:24:39] (03CR) 10Springle: [C: 032] upgrade db1051 db1056 trusty mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/185120 (owner: 10Springle) [03:49:55] 3operations: Puppet Trebuchet provider compares refname with commit sha1 and does NOT refresh the git repo! - https://phabricator.wikimedia.org/T77002#978597 (10ori) Can't debug this at the moment because Puppet is broken on the relevant host: ``` Error: Could not retrieve catalog from remote server: Error 400... [04:15:53] (03PS1) 10Mattflaschen: Enable 'copyedit' GettingStarted suggestions for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185122 (https://phabricator.wikimedia.org/T86590) [04:17:35] (03CR) 10Mattflaschen: [C: 04-1] "Should get a once-over from Robmoen or Phuedx before merge. Also, when this is deployed, it will need a script to be run (populate_catego" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185122 (https://phabricator.wikimedia.org/T86590) (owner: 10Mattflaschen) [04:35:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jan 15 04:35:40 UTC 2015 (duration 35m 39s) [04:35:48] Logged the message, Master [04:38:16] 3ops-core: Support SPDY - https://phabricator.wikimedia.org/T35890#978624 (10konklone) This seems like a smart thing to prioritize for the [[ https://phabricator.wikimedia.org/tag/https-by-default/ | HTTPS-by-default ]] tag, since it has such drastic front-end speed improvements for multiplexing resources. I've... 
[04:40:24] 3ops-core: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#978635 (10konklone) Have you looked at SSLMate for a CA (reseller)? https://sslmate.com/ I've been using them at home and at work (https://myra.treasury.gov uses a cert by them), and their CLI based approach to certificate... [04:56:18] !log upgrade db1051 db1056 trusty [04:56:22] Logged the message, Master [06:06:10] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [06:21:07] !log ldap mass modification: Changing everyone with shell set to sillyshell to /bin/bash [06:21:13] Logged the message, Master [06:25:31] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:25:43] 3ops-core: Support SPDY - https://phabricator.wikimedia.org/T35890#978736 (10faidon) We've made a conscious decision to prioritize our HTTPS scalability work and turn on SPDY (or rather, HTTP/2.0) very shortly after. You could argue it's part of the same series of steps or a separate step, but in the end, it doe... [06:28:50] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:10] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:15] YuviPanda: what is sillyshell? [06:29:41] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:10] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:47] hmm [06:33:26] (03CR) 10Lydia Pintscher: "No that's fine from my side then. Not necessary to disable it individually if it is going to be turned off globally." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [06:33:47] kart_: oh, see https://phabricator.wikimedia.org/T86668 [06:37:58] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#978747 (10yuvipanda) antimony, our svn server, doesn't actually seem to have LDAP configured, or ssh accessible from outside the cluster. [06:39:24] (03PS1) 10Yuvipanda: labs: Don't override shell for all users [puppet] - 10https://gerrit.wikimedia.org/r/185127 (https://phabricator.wikimedia.org/T86668) [06:39:28] (03PS1) 10Faidon Liambotis: xenon: add ferm rule for redis too [puppet] - 10https://gerrit.wikimedia.org/r/185128 (https://phabricator.wikimedia.org/T86872) [06:39:30] paravoid: ^ CR / +1? [06:40:19] (03CR) 10Faidon Liambotis: [C: 032] xenon: add ferm rule for redis too [puppet] - 10https://gerrit.wikimedia.org/r/185128 (https://phabricator.wikimedia.org/T86872) (owner: 10Faidon Liambotis) [06:40:23] gimme a sec [06:42:05] 3operations: Unbreak Xenon - https://phabricator.wikimedia.org/T86872#978754 (10faidon) 5Open>3Resolved FWIW, I wasn't actually the one to either author or push this through, I was merely the one to split it up so that it can be easily reverted in case it had unintended consequences (like it did). In any cas... 
[06:44:34] man the ldap manifests are crap [06:45:24] lots of labsy infrastructure things could use re-dos [06:46:02] you don't say :) [06:46:19] :) [06:46:22] (03PS2) 10Faidon Liambotis: labs: Don't override shell for all users [puppet] - 10https://gerrit.wikimedia.org/r/185127 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [06:46:30] (03CR) 10Faidon Liambotis: [C: 032] labs: Don't override shell for all users [puppet] - 10https://gerrit.wikimedia.org/r/185127 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [06:46:32] * YuviPanda has been doing some slowwwly [06:46:40] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:01] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:20] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:51] merged [06:48:23] paravoid: thanks! [06:52:04] (03CR) 10Legoktm: [C: 031] Beta Features: Disable the Compact Personal Bar feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185116 (owner: 10Jforrester) [06:54:38] 3Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#978769 (10Qgil) [06:56:13] 3ops-core: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#978774 (10faidon) We've evaluated a bunch of certificate vendors & solutions (most of them, I'd say) and personally, I've been aware of SSLMate since its inception. Unfortunately we have a number of unique requirements that... 
[07:09:41] !log changed ldap mwdeploy user shell to /bin/bash to match puppet [07:09:45] Logged the message, Master [07:12:15] <_joe_> good morning [07:12:42] <_joe_> YuviPanda: sillyshell is gone? [07:12:53] _joe_: yup, yup [07:12:57] <_joe_> \o/ [07:14:23] thoughts on what could be causing https://phabricator.wikimedia.org/T86883 [07:14:28] I suppose that needs a ferm rule [07:14:29] somewhere [07:14:36] what ports does trebuchet need/ [07:14:38] ? [07:15:01] <_joe_> YuviPanda: mmmh no idea [07:15:21] hmm, ok [07:17:09] hmm, and why is beta’s mwdeploy explicitly set to have home as /var/lib/mwdeploy? [07:17:15] when prod has it at /home/mwdeploy [07:17:31] it’s pretty empty [07:17:45] * YuviPanda checks on an mw prod host [07:17:57] yup, empty [07:19:34] !log set home of mwdeploy to /home/mwdeploy in LDAP [07:19:40] Logged the message, Master [07:24:17] ori: do you know which ports I need to open up for trebuchet to work? [07:28:50] hmm, wqat [07:29:06] setting homedir on ldap doesn’t seem to reflect on labs hosts... [07:29:37] at least for mwdeploy [07:31:21] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: puppet fail [07:32:13] <_joe_> what do you mean? [07:32:20] for trebuchet? what do you mean? trebuchet uses http (not https) for git fetch / ls-remote / etc. and salt for coordination [07:33:08] so 4505-4506 for salt and 80 on the deployment master [07:33:16] ori: https://phabricator.wikimedia.org/T86883. [07:33:32] ori: hmm, so I guess 80 on deployment-bastion should be opened up. [07:33:34] * YuviPanda checks [07:33:43] _joe_: oh, I mean: [07:33:47] LDAP: [07:33:48] homeDirectory: /home/mwdeploy [07:34:01] getent: [07:34:01] mwdeploy:x:603:603:mwdeploy:/var/lib/mwdeploy:/bin/bash [07:34:33] <_joe_> YuviPanda: ok, maybe nscd cache? [07:34:49] <_joe_> I don't know a thing of our ldap/pam setup in labs [07:34:57] * _joe_ hates ldap [07:35:25] _joe_: oh, wait. that worked. 
I thought I needed to restart the nslcd, which is the LDAP daemon, rather than nscd? [07:35:41] <_joe_> oh, trusty, yeah [07:36:01] hmm, restarting nslcd didn’t have any effect, though. but restarting nscd did. [07:39:50] <_joe_> user cache used to be in nscd up to precise - but I never checked if something changed in trusty [07:40:07] <_joe_> because I escaped the ldap hell [07:42:03] heh [07:42:10] icinga's a christmas tree again [07:42:15] _joe_: looks like it’s still in nscd [07:42:23] rather than nslcd [07:42:26] <_joe_> paravoid: srsly? it was so nice yesterday [07:42:51] osmium has puppet disabled, that's probably ori? [07:43:02] three mwNNNN-related alerts [07:43:04] yeah, only for the past couple of hours though [07:43:05] <_joe_> I have no idea [07:43:09] re: osmium [07:43:18] <_joe_> paravoid: sigh, it's been down for 5 hours [07:43:22] ori: icinga says 3 :P [07:43:29] <_joe_> why did no one take a look? [07:43:31] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:43:34] what has? [07:43:34] who are you going to trust, me or a perl script? [07:43:42] <_joe_> paravoid: mw1119 [07:43:44] mw1119 [07:43:46] right [07:44:14] <_joe_> ori: lol [07:44:20] well 5h is not much, it's not exactly business hours anywhere but in sean's place and he's busy enough :) [07:44:33] and I mean, it's not a critical fault, it's just one out of a hundred servers [07:44:48] <_joe_> paravoid: last time it was 12 hours [07:45:01] <_joe_> paravoid: it's basically "time since I last checked icinga" [07:45:15] !log re-enabled puppet on osmium. i disabled it three hours ago to debug zhwiki key errors in memcached-serious.log. 
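A note on the stale home directory discussed above: LDAP already returned `/home/mwdeploy` while `getent` still reported `/var/lib/mwdeploy` until nscd was restarted. A minimal Python sketch of that comparison, assuming the LDAP value and the `getent passwd` line have already been fetched as strings. The helper names `parse_passwd_line` and `home_is_stale` are hypothetical and not part of any tool mentioned in the log.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str
    uid: int
    gid: int
    gecos: str
    home: str
    shell: str


def parse_passwd_line(line: str) -> PasswdEntry:
    """Parse one line in passwd(5) format, as printed by `getent passwd <user>`."""
    name, _password, uid, gid, gecos, home, shell = line.strip().split(":")
    return PasswdEntry(name, int(uid), int(gid), gecos, home, shell)


def home_is_stale(ldap_home: str, getent_line: str) -> bool:
    """Return True when the locally resolved home differs from the LDAP value."""
    return parse_passwd_line(getent_line).home != ldap_home


# Values quoted in the log: LDAP was updated, getent still showed the old home.
getent_line = "mwdeploy:x:603:603:mwdeploy:/var/lib/mwdeploy:/bin/bash"
print(home_is_stale("/home/mwdeploy", getent_line))
```

A mismatch like this right after an LDAP change points at a local cache rather than the directory itself; in the log it was restarting nscd, not nslcd, that cleared it.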
[07:45:21] Logged the message, Master [07:45:40] iridium (phab) has puppet disabled as well [07:45:49] 13 hours ago, no SAL entry [07:45:50] (the zhwiki key errors have been around for a while; they're pretty isolated) [07:45:53] <_joe_> ori: oh thanks [07:46:06] <_joe_> I was thinking about those yesterday [07:46:09] a while as in probably a year or more [07:46:11] <_joe_> what is causing them? [07:46:39] i asked tim if he knew and he didn't, but he suspected they may be exceeding memcache's size limit for values [07:46:46] I hate it when people disable puppet with no disable message AND no SAL entry [07:46:46] causing some deserialization failure [07:47:07] <_joe_> !log restarted HHVM on mw1119, all threads stuck in a lock for HPHP::RequestInjectionData::onSessionInit [07:47:11] Logged the message, Master [07:47:17] <_joe_> this is a first [07:47:28] interesting [07:47:31] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [07:47:31] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 115372 bytes in 0.383 second response time [07:48:11] <_joe_> ori: I suspect most of these failures lie deeper in the hhvm code for multithread locking [07:48:39] <_joe_> there is some special race condition where a mutex is left stale, and eventually all threads get to it and are stuck [07:48:42] these failures as in the RequestInjectionData lockup you just logged? if so, yes, probably; but if you mean the zhwiki thing it predates hhvm [07:48:56] <_joe_> what I just logged [07:49:01] yeah [07:50:13] ori: issues resolved when I opened up 80, btw. Thanks! [07:50:20] * YuviPanda goes to find proper place in puppet to put those rules in [07:50:22] paravoid: what would you say to installing 'at' globally? 
if it was available then i'd write a bash script that (a) wouldn't let you get away with not providing a reason; (b) logged it; and (c) gave you the option of disabling puppet for a pre-set amount of time [07:50:38] (a) and (b) don't depend on at but i'm not motivated to half-fix it [07:50:45] <_joe_> mmmh [07:51:13] nah, yet another daemon running everywhere for a silly humans-misbehaving reason [07:51:17] <_joe_> I think the last part is a bit risky [07:51:27] and yeah, that last part is risky [07:51:30] we do have an icinga alert [07:51:45] I'd rather take it up with my team so that people do check the icinga page [07:51:54] we could emulate it by having some cron.hourly script check some file [07:51:59] it = at [07:52:00] I can tell you from experience that very few people actively look at alerts now [07:52:03] that's a big problem [07:52:30] I've been the manual icinga ping quite a few times already [07:53:01] there's a lot of noise [07:53:08] no there isn't [07:53:13] unhandled alerts are 25 now [07:53:14] the varnishkafka delivery errors and the puppetmaster o'clock puppet failures [07:53:34] and most of them are "handled" [07:54:01] the varnishkafka delivery errors otto is working on, I've pinged him numerous times [07:54:15] and it's an effect of the same problem isn't it [07:54:22] people not caring about alerts [07:54:43] it's OK for a service that is non-critical and in-development to flap or struggle, but alert noise just trains people not to care [07:54:50] RECOVERY - HHVM busy threads on mw1119 is OK: OK: Less than 30.00% above the threshold [57.6] [07:54:51] RECOVERY - HHVM queue size on mw1119 is OK: OK: Less than 30.00% above the threshold [10.0] [07:54:55] it has an area effect [07:55:05] no, that's not okay either [07:55:05] <_joe_> ori: or, having someone else caring and pinging you [07:55:09] <_joe_> or solving things [07:55:26] we have the ability to acknowledge alerts [07:55:33] if it's in-development, ack the damn thing [07:55:40] <_joe_> 
but well, I forget to ack alerts sometimes [07:55:49] some tooling for that wouldn't kill anyone either [07:55:52] or leave a comment [07:55:53] it shouldn't be harder than an !ack [07:55:58] <_joe_> at least the ones I didn't cause [07:56:01] it should be doable from irc [07:56:11] yes, I've said so before :) [07:56:21] (PS1) Yuvipanda: deployment: Open up port 90 on trebuchet deployment hosts [puppet] - https://gerrit.wikimedia.org/r/185129 (https://phabricator.wikimedia.org/T86883) [07:56:24] but I doubt this is what's stopping people [07:56:31] I think it's a mentality problem [07:56:35] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1062 is CRITICAL: Host mw1062 is not in mediawiki-installation dsh group Giuseppe Lavagetto Someone removed this server from dsh (faulty disk) [07:56:35] notmyproblem [07:56:43] <_joe_> paravoid: exactly [07:56:50] <_joe_> it's not just for icinga btw [07:57:16] <_joe_> but I think we'll have some time to talk about this in SF [07:57:32] ok, anyone have any ideas how to kick logstash's es? [07:57:45] <_joe_> paravoid: nope :/ [07:57:53] logstash1002's ES died I think [07:57:55] <_joe_> paravoid: apart from restarting it you mean? [07:58:06] also problem for betalabs - the firewall change stopped trebuchet from working yesterday. [07:58:06] <_joe_> oh ok [07:58:18] <_joe_> so just restarting it should do? [07:58:22] (PS2) Yuvipanda: deployment: Open up port 80 on trebuchet deployment hosts [puppet] - https://gerrit.wikimedia.org/r/185129 (https://phabricator.wikimedia.org/T86883) [07:58:24] <_joe_> I'll take a look [07:58:27] well java is running [07:58:37] also, I’m not sure how trebuchet on tin works. 
I don’t see a ferm rule to open up port 80 [07:58:56] tin has no firewall [07:59:01] oh, right [07:59:04] just saw taht [07:59:05] *that [08:00:08] oh we have an $srange now [08:00:10] I didn't know [08:00:34] <_joe_> [2015-01-15 08:00:19,108][DEBUG][action.admin.cluster.health] [logstash1002] no known master node, scheduling a retry [08:00:41] <_joe_> this is _not_ good [08:00:42] heh, I didn’t either. :D mutante opened up the rsync ports yesterday [08:00:54] <_joe_> (logstash1002) [08:01:13] (03CR) 10Yuvipanda: [C: 032] "Shall unify the production / labs roles later" [puppet] - 10https://gerrit.wikimedia.org/r/185129 (https://phabricator.wikimedia.org/T86883) (owner: 10Yuvipanda) [08:01:22] your change is wrong in a very subtle manner [08:01:41] guh [08:01:54] $srange doesn't take an array [08:01:58] it takes a ferm-formatted string [08:02:13] for labs, $deployable_networks is just a string containing "10.0.0.0/8" [08:02:24] for prod, it refers to $::network::constants [08:02:57] which is an array [08:03:35] thanks for the ferm rule for xenon [08:03:37] a ferm array is "(element1 element2 ...)" [08:03:46] i just set out to do that, missed your change earlier [08:03:52] <_joe_> !log restarted ES on logstash1002, not joining the cluster [08:03:55] Logged the message, Master [08:04:20] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 3, timed_out: False, active_primary_shards: 57, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 166, initializing_shards: 2, number_of_data_nodes: 3 [08:04:20] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 3, timed_out: False, active_primary_shards: 57, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 166, 
initializing_shards: 2, number_of_data_nodes: 3 [08:04:24] ori: the irony is that if you hadn't added the ferm rule for :80 when you created that manifest [08:04:35] Beta-Cluster: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#978863 (yuvipanda) NEW [08:04:48] I would have seen it, because I did check for xenon [08:04:59] paravoid: right, but this is only for labs, so would be ok now. needs more careful consideration when unifying / applying it to tin [08:05:00] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 3, timed_out: False, active_primary_shards: 57, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 166, initializing_shards: 2, number_of_data_nodes: 3 [08:05:01] I saw ferm in the manifest and thought "oh ok, someone already thought of ferm/xenon" [08:05:02] yeah [08:05:28] btw, that's a great example of an awful alert ^ [08:05:33] YuviPanda: it's a bit counter-intuitive though [08:05:52] Beta-Cluster: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#978870 (yuvipanda) [08:05:53] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#978871 (yuvipanda) [08:06:03] yes it is, it's even worse for prod ES [08:06:07] <_joe_> ori: and you didn't see the first version of the OCG ones :) [08:06:14] which has more members in the ES cluster [08:07:11] <_joe_> which was one check for everything and with extremely sensible alerts so that you'd get "FIRE! EVERYTHING BURNING!" and a list of 100 check results without any highlight of the failing one [08:07:23] paravoid: true. 
I guess should add array support in [08:07:30] well it's actually even worse [08:07:33] in ferm::service for saddr [08:07:38] $deployable_networks = '10.0.0.0/8' [08:07:53] $deployable_networks = $::network::constants::deployable_networks [08:08:02] so is the code that references this supposed to expect a string or an array? [08:08:13] why isn't it "= [ 10.0.0.0/8' ]" [08:08:29] paravoid: $deployable_networks doesn’t seem to have actually been used anywhere... [08:08:37] can/should I do $deployable_networks.join(', ') in a template? [08:08:40] heh, figures [08:08:43] oh [08:08:44] it is [08:09:22] allow from <%= Array(@deployable_networks).join(' ') %> [08:09:39] manifests/role/phabricator.pp: srange => inline_template('(<%= @mail_smarthost.map{|x| "@resolve(#{x})" }.join(" ") %>)'), [08:09:46] manifests/role/requesttracker.pp: srange => inline_template('(<%= @mail_smarthost.map{|x| "@resolve(#{x})" }.join(" ") %>)'), [08:09:55] manifests/role/analytics/zookeeper.pp: srange => '($ANALYTICS_NETWORKS)', [08:09:58] manifests/role/logging.pp: srange => '$ALL_NETWORKS', [08:10:02] oh god the inconsistencies [08:10:41] Array(@deployable_networks)? [08:10:42] ughh [08:11:03] <_joe_> I'm looking at site.pp since yesterday [08:11:12] <_joe_> it seems no one uses the same paradigm there [08:11:29] <_joe_> I can tell who authored what there without git blame :) [08:12:15] heh [08:12:29] <_joe_> it's kinda depressing [08:12:32] _joe_: I wonder if eventually we can replace site.pp with a yaml file of sorts. [08:12:41] and then have the same structure be in use for labs too. [08:12:54] it'd be nice, although the role() function kind of makes this more difficult now [08:12:59] <_joe_> YuviPanda: you mean an ENC? [08:13:03] same ENC for labs and prod [08:13:03] yeah [08:13:10] <_joe_> paravoid: not really, I guess [08:13:16] well, not the same. 
but similar [08:13:19] <_joe_> we just need some little hack :) [08:13:36] <_joe_> also, I kinda love having nodes declared in a file [08:13:42] <_joe_> if it's well maintained [08:14:09] ‘declared’ as in the current way or in a potential-future-data-file way? [08:14:28] <_joe_> in the current way [08:14:31] <_joe_> but well, better [08:14:31] hmm [08:15:04] heh, there’s like, 2 package {} definitions in site.pp [08:16:14] * YuviPanda goes afk for food and things [08:17:16] it's amazing how habituated i've become to crazy people yelling on the street [08:17:21] this city does bad things to people [08:17:39] <_joe_> ori: you've never been to Rome, I guess [08:17:56] <_joe_> I have an average 2 arguments while walking my daughter to school [08:18:04] i have! i loved it actually. [08:18:15] <_joe_> oh yeah, the city center :) [08:18:22] ori: _joe_ you never been to Mumbai :D [08:18:26] <_joe_> (and I have to say, I used to love rome too) [08:18:45] <_joe_> kart_: nope :) But I've heard tales [08:19:04] like every other dumb american college student i walked around pretending i was marcello mastroianni in la dolce vita [08:19:16] i'm sure you have to shovel people like that out of your way :P [08:19:20] <_joe_> ori: oh you were one of those [08:19:20] * YuviPanda remembers at least 2 WMF employees standing on one side of a road in mumbai, looking terrified at being asked to cross it [08:19:20] but i enjoyed it [08:19:49] <_joe_> ori: well Anita Ekberg passed away 2 days ago [08:19:53] i saw! sad. [08:20:19] YuviPanda: i had a taste of that in cairo once, not sure how it compares [08:20:42] traffic doesn't stop ever, and there aren't even traffic lights to ignore [08:20:45] <_joe_> I have a friend living in Lagos, he says it's like 10x istanbul [08:21:20] <_joe_> which is the city with the worst traffic I drove in [08:21:56] ori: hmm, there are… some traffic lights, but not enough, so you’ve to just cross if you don’t want to be standing there for minutes at a time. 
[08:22:01] * YuviPanda should go to cairo at some point [08:22:15] <_joe_> you should come to Rome at some point too! [08:22:19] _joe_: I should! [08:22:25] _joe_: Visa requirements for the EU suckass [08:22:41] YuviPanda: ask me how it sucks :) [08:22:54] <_joe_> YuviPanda:really? that's bad :/ [08:22:56] _joe_: Wikimania 2016 nearby Rome? [08:23:01] _joe_: they give you your visa from $dayFlightLands until $dayFlightLeaves [08:23:07] <_joe_> kart_: uh? no idea [08:23:24] rather than ‘6 months’ or ‘1 year, with max 6 months’ or something like that [08:23:35] <_joe_> kart_: wow on the como lake? [08:23:38] _joe_: so that means for every trip I’ve to go through the visa process. [08:23:38] <_joe_> not "near" [08:24:06] _joe_: ah. [08:24:07] <_joe_> kart_: it's a /beautiful/ place, though [08:24:08] whereas for the US I don’t have to (10 year viistor visa). UK has a 6 month visa. [08:24:25] <_joe_> YuviPanda: ok [08:24:26] also it’s somewhat expensive. anywhere between 50% to 25% extra on the ticket costs, the visa alone [08:24:35] <_joe_> oh my [08:24:37] !log Ran kafka leader re-election to bring analytics1021 back into the set of leaders [08:24:38] <_joe_> srsly? [08:24:40] Logged the message, Master [08:24:42] yup, yup [08:24:54] _joe_: expensive, yes. [08:24:59] _joe_: oh, and it’s single entry. multiple entry ones are much harder to get. [08:25:09] _joe_: so I can’t do EU -> UK -> EU, for example. [08:25:12] <_joe_> YuviPanda: I know a lot of people from India working here [08:25:28] <_joe_> YuviPanda: that is because the UK didn't sign the Schengen treaty [08:25:32] _joe_: when I wanted to do that, the EU folks (Swiss office) told me I can’t get a multiple entry EU visa without a UK visa. [08:25:45] _joe_: and the UK visa folks told me I can’t get a visa at all without a EU visa [08:26:07] so only way to do EU -> UK -> EU is 1. get a single entry EU visa, 2. get a UK visa, 3. apply again for a multiple entry EU visa [08:26:12] YuviPanda: why? 
I had EU rejected and UK visa. [08:26:24] kart_: because my flights were from Switzerland to the UK [08:26:28] oh. multiple. [08:26:40] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2747.66627607 [08:27:04] _joe_: yup, yup. still, getting multiple entry EU visas easier would’ve been nic.er [08:27:05] *nicer [08:28:25] anyway, food for real now [08:36:25] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#978886 (yuvipanda) [08:36:26] Beta-Cluster: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#978885 (yuvipanda) [08:36:39] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (yuvipanda) [08:36:40] Beta-Cluster: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#978863 (yuvipanda) [09:40:33] apergos: are you back? [09:40:41] yes! [09:40:44] :) [09:40:47] and I will be coming to SF [09:40:50] \o/ [09:40:52] how are you feeling? [09:41:01] not perfect today but good enough [09:41:16] :D [09:41:19] (PS1) Yuvipanda: deployment: Unify salt_masters role for prod / labs [puppet] - https://gerrit.wikimedia.org/r/185137 [09:41:56] (CR) Yuvipanda: [C: -2] "Not actually sure how including the labs role on virt1000 worked, since it referred to instanceproject." [puppet] - https://gerrit.wikimedia.org/r/185137 (owner: Yuvipanda) [09:50:12] Analytics, operations, ops-core: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#978960 (fgiunchedi) on the re-architecturing, I think (newer?) nginx versions can write logs to a pipe so that might be a quick win without patching nginx (?) [09:51:32] apergos: so, I'm a bit confused regarding the roles of role::mirror and modules/download [09:51:35] are they related? 
[09:52:12] lemme look at them [09:53:17] no not necessarily [09:54:11] what is role::mirror? [09:54:21] at this point role/mirror.pp can go, actually [09:54:29] we don't serve media via rsync to anyone [09:54:48] so just pull it from any host that claims to have it [09:54:52] I set up mirrors.wm.org so it started being all confusing :) [09:54:57] ah :-) [09:57:11] modules/dataset includes role::mirror::common [09:57:28] that's just another include (for vm) and a Package['rsync'] [09:57:36] so we can inline it [09:57:47] however it's included in a bunch of places within the dataset module [09:58:15] (03CR) 10Yuvipanda: "Works fine on betalabs, after hiera change https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=141013&oldid=1408" [puppet] - 10https://gerrit.wikimedia.org/r/185137 (owner: 10Yuvipanda) [09:58:26] <_joe_> how do I create a new label in phabricator? [09:58:41] _joe_: you have to file a new task against the project-creators project [09:58:42] _joe_: Create a new project [09:58:57] <_joe_> I am creating tickets for the codfw setup of the appserver stack [09:59:16] <_joe_> and it's quite a bunch of tickets [09:59:35] <_joe_> it would be nice to see them grouped more than "ops-core" alone [09:59:53] apergos: so where would that belong? the dataset module seems a bit complicated [10:00:02] {"deployment_config": {"parent_dir": "/srv/deployment", "redis": {"db": 0, "host": "-deploy.eqiad.wmflabs", "port": 6379}, "servers": {"eqiad": "-deploy.eqiad.wmflabs"}}} [10:00:03] on virt1000 [10:00:07] so that probably never worked [10:00:17] now to see why it was included there at all [10:01:22] 3operations, ops-core: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#978970 (10Joe) [10:02:26] 3operations, ops-core: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#978973 (10Joe) 3NEW [10:03:04] mirrors.wm? 
no it shouldn't go in datasets, that's separate from repos, web servers and such [10:03:24] no, forget mirrors.wm, that's an entirely different thing [10:03:49] I'm asking about all the "include role::mirror::common" under modules/dataset [10:03:52] oh, I see; I missed several of your messages [10:04:27] 3operations, ops-core: Setup jobrunners cluster in codfw - https://phabricator.wikimedia.org/T86889#978980 (10Joe) 3NEW [10:04:30] lemme see [10:06:05] (03CR) 10Yuvipanda: "Ok, so the virt1000 change was introduced in I53f1bb02b2555dd87649aa2ea26de2353ad44939 but that seems wrong, since it is including the *de" [puppet] - 10https://gerrit.wikimedia.org/r/185137 (owner: 10Yuvipanda) [10:06:19] 3operations, ops-core: Setup imagescalers cluster in codfw - https://phabricator.wikimedia.org/T86890#978986 (10Joe) 3NEW [10:06:40] (03PS2) 10Yuvipanda: deployment: Unify salt_masters role for prod / labs [puppet] - 10https://gerrit.wikimedia.org/r/185137 (https://phabricator.wikimedia.org/T86885) [10:06:59] (03CR) 10Yuvipanda: "I'll remove the virt1000 includes in another patch. This patch as such should be a noop" [puppet] - 10https://gerrit.wikimedia.org/r/185137 (https://phabricator.wikimedia.org/T86885) (owner: 10Yuvipanda) [10:07:46] I should write a small tool that lets me find out which hosts have a particular role applied [10:07:53] paravoid: why not just have modules/dataset/manifests/common.pp and call the class 'dataset::common' and include it that way? [10:07:53] or a particular puppetVar set [10:08:01] so I don’t have to do a commandline ldap query every time [10:09:20] <_joe_> YuviPanda: I have been thinking about it for quite some time [10:09:27] !log aude Synchronized php-1.25wmf14/extensions/Wikidata: fix noexternallanglinks bug (duration: 00m 13s) [10:09:33] <_joe_> the only way is to keep a catalog of compiled nodes [10:09:34] Logged the message, Master [10:09:42] _joe_: in labs, I meant. 
[10:09:45] <_joe_> and grep through it in some (probably expensive) way [10:09:49] <_joe_> oh, ok [10:09:49] the whole dataset module is... ugh [10:09:59] why does it need all of these subclasses? [10:10:03] _joe_: but yeah, much more complicated for prod. [10:10:16] labs is trivial. Just LDAP. [10:10:20] because we can enable/disable them independently per host [10:10:41] had to do that in the past repeatedly [10:11:28] operations, ops-core: Setup videoscalers cluster in codfw - https://phabricator.wikimedia.org/T86891#978995 (Joe) NEW [10:13:03] operations, ops-core: Setup the api appservers cluster in codfw - https://phabricator.wikimedia.org/T86892#979006 (Joe) NEW [10:15:30] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 728 [10:17:14] apergos: ok, next question [10:17:55] operations, ops-core: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#979013 (Joe) NEW [10:18:07] is (role::)download::mediawiki used? [10:18:10] doesn't seem so? 
[10:19:53] operations, ops-core: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#979019 (Joe) NEW [10:20:30] RECOVERY - check_mysql on db1008 is OK: Uptime: 7937607 Threads: 1 Questions: 209577904 Slow queries: 56083 Opens: 146629 Flush tables: 2 Open tables: 64 Queries per second avg: 26.403 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:20:40] operations, ops-core: Setup videoscalers cluster in codfw - https://phabricator.wikimedia.org/T86891#979034 (Joe) [10:20:41] operations, ops-core: Setup the api appservers cluster in codfw - https://phabricator.wikimedia.org/T86892#979033 (Joe) [10:20:42] operations, ops-core: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#979032 (Joe) [10:20:43] operations, ops-core: Setup imagescalers cluster in codfw - https://phabricator.wikimedia.org/T86890#979035 (Joe) [10:20:44] operations, ops-core: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#979037 (Joe) [10:20:45] operations, ops-core: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#979038 (Joe) [10:20:47] operations, ops-core: Setup jobrunners cluster in codfw - https://phabricator.wikimedia.org/T86889#979036 (Joe) [10:20:48] operations, ops-core: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#979019 (Joe) [10:26:53] paravoid: no, that predates the releases server, it can go [10:28:14] (PS1) Faidon Liambotis: releases: fold ::backup into role class [puppet] - https://gerrit.wikimedia.org/r/185145 [10:28:16] (PS1) Faidon Liambotis: Remove role::mirror classes [puppet] - https://gerrit.wikimedia.org/r/185146 [10:28:18] (PS1) Faidon Liambotis: Remove download::mediawiki [puppet] - https://gerrit.wikimedia.org/r/185147 [10:28:20] reviews on slightly saltish (but noop) patch welcome: https://gerrit.wikimedia.org/r/#/c/185137/ [10:28:20] (PS1) 
Faidon Liambotis: Rename download::wikimedia to just "dumps" [puppet] - https://gerrit.wikimedia.org/r/185148 [10:28:22] apergos: can you review? [10:28:29] sure [10:28:45] * YuviPanda goes afk for food [10:29:52] (CR) jenkins-bot: [V: -1] Rename download::wikimedia to just "dumps" [puppet] - https://gerrit.wikimedia.org/r/185148 (owner: Faidon Liambotis) [10:31:26] (PS2) Faidon Liambotis: Rename download::wikimedia to just "dumps" [puppet] - https://gerrit.wikimedia.org/r/185148 [10:35:54] apergos: [10:35:58] url.redirect = ( "^/other/(iOS|PlayBook|win8|android)(|/.*)$" => "http://releases.wikimedia.org/mobile/$1$2", [10:36:01] "^/(other/)?mediawiki(|/.*)$" => "http://releases.wikimedia.org/mediawiki/$2" ) [10:36:08] are those redirects for download.mediawiki.org? download.wikimedia.org? [10:36:45] * kart_ is looking for akosiaris :( [10:37:39] ops-codfw: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#979075 (Joe) NEW [10:38:33] paravoid: the changesets look good to me [10:39:59] ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979084 (Joe) NEW [10:40:31] paravoid: those are redirects for download.wikimedia.org, i.e. the datasets host, which used to serve those [10:40:48] the links were published around so... [10:42:19] operations, ops-core: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#979090 (Joe) NEW [10:42:48] ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979099 (Joe) [10:42:49] operations, ops-core: Check that the redis roles can be applied in codfw, set up puppet. 
- https://phabricator.wikimedia.org/T86898#979098 (Joe) [10:42:50] ops-codfw: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#979100 (Joe) [10:42:51] operations, ops-core: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#979097 (Joe) [10:49:48] (PS1) Faidon Liambotis: dumps: switch from lighttpd to nginx [puppet] - https://gerrit.wikimedia.org/r/185151 [10:49:51] apergos: ^^ [10:50:01] looking [10:50:17] (CR) Faidon Liambotis: [C: 2] releases: fold ::backup into role class [puppet] - https://gerrit.wikimedia.org/r/185145 (owner: Faidon Liambotis) [10:50:59] (CR) Faidon Liambotis: [C: 2] Remove role::mirror classes [puppet] - https://gerrit.wikimedia.org/r/185146 (owner: Faidon Liambotis) [10:51:08] (CR) Faidon Liambotis: [C: 2] Remove download::mediawiki [puppet] - https://gerrit.wikimedia.org/r/185147 (owner: Faidon Liambotis) [10:51:20] (CR) Faidon Liambotis: [C: 2] Rename download::wikimedia to just "dumps" [puppet] - https://gerrit.wikimedia.org/r/185148 (owner: Faidon Liambotis) [10:51:59] apergos: why is role::dumps included in both ms1001 & dataset1001? [10:52:17] DNS points to just dataset1001 [10:53:04] since we do not any longer have a dataset2 in tampa, ms1001 is the fallback [10:53:20] manual fallback? [10:53:22] when we have a dataset3x in codfw then ms1001 can be repurposed [10:53:30] yes, manual [10:53:33] ok [10:54:40] can you review/merge the nginx change and babysit it? [10:55:02] it'll also need an apt-get remove --purge lighttpd [10:58:14] (PS2) Faidon Liambotis: dumps: switch from lighttpd to nginx [puppet] - https://gerrit.wikimedia.org/r/185151 [10:59:57] just eyeballing it, the nginx switch looks ok [11:01:01] I can do that in 30 minutes; I need to get cat-sitting arranged sometime this afternoon, so let me do that now and be here to babysit afterwards [11:09:50] (CR) Phuedx: [C: 1] "This LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/185122 (https://phabricator.wikimedia.org/T86590) (owner: 10Mattflaschen) [11:12:26] 3Wikimedia-Labs-Infrastructure, operations, Beta-Cluster: Change mwdeploy homeDirectory field in LDAP from /home/mwdeploy to /var/lib/mwdeploy - https://phabricator.wikimedia.org/T86903#979157 (10hashar) 3NEW [11:13:20] (03PS1) 10Giuseppe Lavagetto: snapshot: unify node declarations, use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/185152 [11:13:22] (03PS1) 10Giuseppe Lavagetto: virt: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/185153 [11:13:24] (03PS1) 10Giuseppe Lavagetto: services: create sca role [puppet] - 10https://gerrit.wikimedia.org/r/185154 [11:14:01] (03CR) 10Hashar: "I suspect it changed the mwdeploy user homedir as a side effect, but that might be unrelated." [puppet] - 10https://gerrit.wikimedia.org/r/185127 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [11:16:19] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#979167 (10hashar) I suspect https://gerrit.wikimedia.org/r/#/c/185127/ changed the mwdeploy user homedir as a side effect, but that might be unrelated. Beta cluster scap is broken: T86901 I ha... [11:41:44] 3operations, MediaWiki-extensions-CentralNotice: Use the appropriate content models for CentralNotice scripts and CSS - https://phabricator.wikimedia.org/T86904#979200 (10He7d3r) 3NEW [11:44:56] anyone having idea on: https://phabricator.wikimedia.org/T86847 [11:45:29] you only thought about this now!? [11:45:35] (related to: https://gerrit.wikimedia.org/r/#/c/183888/) [11:45:45] paravoid: no. from last night. [11:45:51] poked akosiaris :) [11:53:47] (03PS1) 10Amire80: Consistent hyphen in beta features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185155 [11:54:16] paravoid: can you help here? 
[11:54:47] (CR) ArielGlenn: [C: 2] dumps: switch from lighttpd to nginx [puppet] - https://gerrit.wikimedia.org/r/185151 (owner: Faidon Liambotis) [11:54:52] no, wait for akosiaris [11:57:27] paravoid: sure [11:59:37] PROBLEM - HTTP on ms1001 is CRITICAL: Connection refused [12:00:00] <_joe_> apergos: I guess this is you right? [12:00:11] yep [12:01:15] (PS1) ArielGlenn: fix typo in dumps nginx conf [puppet] - https://gerrit.wikimedia.org/r/185156 [12:01:21] ops-core: Build a new HHVM package - https://phabricator.wikimedia.org/T86906#979239 (Joe) NEW a:Joe [12:01:37] apergos: just one of the two? [12:01:39] ops-core: Build a new HHVM package - https://phabricator.wikimedia.org/T86906#979248 (Joe) [12:01:45] just one [12:01:48] damn! [12:02:17] (CR) ArielGlenn: [C: 2] fix typo in dumps nginx conf [puppet] - https://gerrit.wikimedia.org/r/185156 (owner: ArielGlenn) [12:03:20] ops-core: Build a new HHVM package - https://phabricator.wikimedia.org/T86906#979239 (Joe) [12:03:20] (PS1) KartikMistry: WIP: Use SSL in cxserver config [puppet] - https://gerrit.wikimedia.org/r/185157 [12:09:20] (CR) KartikMistry: "We need sslkey: and cert: to fill this." 
[puppet] - 10https://gerrit.wikimedia.org/r/185157 (owner: 10KartikMistry) [12:15:10] (03PS1) 10ArielGlenn: fixup rewrite syntax in dumps nginx conf [puppet] - 10https://gerrit.wikimedia.org/r/185158 [12:15:46] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [12:16:14] (03CR) 10ArielGlenn: [C: 032] fixup rewrite syntax in dumps nginx conf [puppet] - 10https://gerrit.wikimedia.org/r/185158 (owner: 10ArielGlenn) [12:18:56] RECOVERY - HTTP on ms1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 0.069 second response time [12:33:56] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:57:07] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail [13:04:33] apergos: so, besides those, everything ok? [13:04:45] yes, looks good so far [13:04:48] awesome [13:04:55] thanks for doing those [13:05:03] there's more coming :P [13:05:09] not for dataset/dumps though [13:12:28] paravoid: any idea when akosiaris will be around? [13:12:37] no [13:15:44] kart_: Other opsen may be able to help, depending on what you want doing [13:16:27] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:22:03] Reedy: yep. alex is around now. [13:32:14] 3Wikidata, wikidata-query-service, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#979383 (10Lydia_Pintscher) [13:42:49] 3Wikimedia-Labs-Infrastructure, operations, Beta-Cluster: Change mwdeploy homeDirectory field in LDAP from /home/mwdeploy to /var/lib/mwdeploy - https://phabricator.wikimedia.org/T86903#979419 (10Reedy) I blame @yuvipanda ``` 07:19 YuviPanda: set home of mwdeploy to /home/mwdeploy in LDAP ``` [14:04:53] 3Wikidata, wikidata-query-service, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#979455 (10mark) Different rows for availability is one thing. 
We need to think about how to distribute the service to two different data centers as well, and should build this out in both fro... [14:12:56] (03PS1) 10Faidon Liambotis: Remove dns::recursor::statistics, unused [puppet] - 10https://gerrit.wikimedia.org/r/185170 [14:12:58] (03PS1) 10Faidon Liambotis: mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 [14:13:00] (03PS1) 10Faidon Liambotis: Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 [14:13:02] (03PS1) 10Faidon Liambotis: icinga/tendril: switch crt to public key [puppet] - 10https://gerrit.wikimedia.org/r/185173 [14:13:04] (03PS1) 10Faidon Liambotis: certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 [14:13:35] anyone willing to review? most of them are trivial :) [14:20:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:21] (03PS1) 10QChris: Allow inserting to external Hive partitions, as we're on HDFS anyways [puppet/cdh] - 10https://gerrit.wikimedia.org/r/185176 [14:27:32] aw shit [14:28:25] (03CR) 10Filippo Giunchedi: [C: 031] Remove dns::recursor::statistics, unused [puppet] - 10https://gerrit.wikimedia.org/r/185170 (owner: 10Faidon Liambotis) [14:28:31] paravoid: ye I'll take a look [14:28:50] that's actually wrong :) [14:28:54] it's used [14:29:34] http://nescio.esams.wikimedia.org/pdns/ [14:29:38] look at that [14:30:32] can someone please fix gitblit on antimony? [14:31:25] paravoid: indeed, does ganglia export all of that too? [14:31:31] yes [14:31:34] and more [14:31:55] (03CR) 10Hoo man: [C: 04-1] "Abandon in favour of I6deedfd ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [14:32:05] aude: yeah, restarting it [14:32:09] thanks [14:32:15] paravoid: hehe 301 to ganglia? 
[14:32:15] (03PS2) 10Faidon Liambotis: Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 [14:32:17] (03PS2) 10Faidon Liambotis: icinga/tendril: switch crt to public key [puppet] - 10https://gerrit.wikimedia.org/r/185173 [14:32:19] (03PS2) 10Faidon Liambotis: certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 [14:32:21] (03PS2) 10Faidon Liambotis: Remove dns::recursor::statistics [puppet] - 10https://gerrit.wikimedia.org/r/185170 [14:32:23] (03PS2) 10Faidon Liambotis: mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 [14:33:44] (03PS3) 10Faidon Liambotis: Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 [14:33:46] (03PS3) 10Faidon Liambotis: icinga/tendril: switch crt to public key [puppet] - 10https://gerrit.wikimedia.org/r/185173 [14:33:48] (03PS3) 10Faidon Liambotis: certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 [14:33:50] (03PS3) 10Faidon Liambotis: Remove dns::recursor::statistics [puppet] - 10https://gerrit.wikimedia.org/r/185170 [14:33:52] (03PS3) 10Faidon Liambotis: mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 [14:34:44] (03CR) 10Faidon Liambotis: [C: 032] Remove dns::recursor::statistics [puppet] - 10https://gerrit.wikimedia.org/r/185170 (owner: 10Faidon Liambotis) [14:36:18] paravoid: is the switch away from http digest in 185171 tracked somewhere already? [14:36:26] no [14:37:30] (03CR) 10Filippo Giunchedi: [C: 031] mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 (owner: 10Faidon Liambotis) [14:37:32] ack [14:37:54] (03CR) 10Filippo Giunchedi: [C: 031] Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 (owner: 10Faidon Liambotis) [14:37:57] <_joe_> oooh we are done with lighty? 
[14:38:01] almost [14:38:07] there's just toollabs now [14:38:14] which has user-supplied configs apparently [14:38:22] (03CR) 10Se4598: "If it is OK to wait some weeks (announcing etc.), then yes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [14:38:36] but yeah we're done for all prod cares [14:38:39] <_joe_> well, I think yuvi killed it from the newer tools I guess [14:38:49] I'm also preparing patchsets to kill cert.pp :) [14:38:53] module etc. [14:39:03] also half of mail.pp [14:39:58] (03CR) 10Filippo Giunchedi: [C: 031] icinga/tendril: switch crt to public key [puppet] - 10https://gerrit.wikimedia.org/r/185173 (owner: 10Faidon Liambotis) [14:40:13] <_joe_> please leave something I can complain about hanging around [14:40:37] dns.pp can probably can go next, it's simplified now [14:40:46] (03CR) 10Alexandros Kosiaris: [C: 031] "Yup, you are right Daniel, let's do it then" [puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [14:40:50] <_joe_> if you keep up this pace, by 2016 there will be little left [14:41:09] someone should do ganglia/ganglia_new [14:41:13] and swift/swift_new [14:41:15] these never work... 
[14:41:22] <_joe_> the latter is easier [14:41:32] <_joe_> but more critical [14:41:49] mail.pp I'll finish [14:42:05] then manifests/ will basically be just misc analytics crap [14:42:08] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59705 bytes in 0.213 second response time [14:42:31] (03PS4) 10Faidon Liambotis: Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 [14:42:33] (03PS4) 10Faidon Liambotis: icinga/tendril: switch crt to public key [puppet] - 10https://gerrit.wikimedia.org/r/185173 [14:42:35] (03PS4) 10Faidon Liambotis: certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 [14:42:37] (03PS4) 10Faidon Liambotis: mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 [14:42:54] <_joe_> and ganglia [14:43:03] I said so above :) [14:43:06] <_joe_> I'll hijack mark sooner or later [14:43:08] (03CR) 10Faidon Liambotis: [C: 032 V: 032] icinga/tendril: switch crt to public key [puppet] - 10https://gerrit.wikimedia.org/r/185173 (owner: 10Faidon Liambotis) [14:43:12] <_joe_> and do that [14:43:26] I already killed generic the other day, not sure if you saw [14:43:30] what do you need to hijack me for? [14:43:34] (03CR) 10Filippo Giunchedi: [C: 031] certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 (owner: 10Faidon Liambotis) [14:43:35] <_joe_> :) [14:43:42] <_joe_> mark: ganglia/ganglia_new [14:43:55] why am I in your way for that? 
;) [14:44:02] and iptables.pp [14:44:27] <_joe_> you're not in my way at all [14:44:45] you need to make ganglia_newer_now_with_systemd [14:44:52] <_joe_> I'm just not sure I got everything that has changed [14:45:23] ganglia_new2 would work too [14:45:29] it's just moving from multicast and distributed aggregators to unicast and a central set of dedicated aggregators [14:45:43] and for the latter, upstart jobs are used [14:46:00] i guess the manifests are slightly nicer too, but already outdated by today's standards ;) [14:46:32] <_joe_> ok, my question would be, do we need to set up aggregators in eqiad as well, right? [14:46:39] yes [14:46:42] every dc probably [14:46:44] <_joe_> and ulsfo if I'm not mistaken [14:46:48] yes [14:46:52] hiera based would be good too [14:49:37] any objections with me messing with lists' web server? [14:49:55] anyone wanna do an additional review just to triple check? [14:51:07] 3Scrum-of-Scrums, operations: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#979542 (10Andrew) I have no objection to allowing deployment access, at this point. What I'd /really/ like to do, though, is have wikitech hosted on the normal cluster rather than on virt1000. I'... [14:51:12] <_joe_> paravoid: I have some packaging work to finish, I can take a look afterwards [14:51:27] <_joe_> wikitech on the normal cluster?!? [14:51:55] <_joe_> when is Moritz going to begin? 
:P [14:52:41] Ι was skeptical about wikibugs in #-operations but now I am loving it [14:53:02] just my random thought for today :-) [14:54:00] (03PS5) 10Faidon Liambotis: Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 [14:54:02] (03PS5) 10Faidon Liambotis: certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 [14:54:04] (03PS5) 10Faidon Liambotis: mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 [14:54:20] 3Scrum-of-Scrums, operations: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#979547 (10Joe) We want it to be hosted on a separate host from the puppetmaster, we don't want it to be on the mediawiki cluster at all. Wikitech offers, if any, a completely different surface of a... [14:55:31] 3Scrum-of-Scrums, operations: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#979548 (10Andrew) That's fine with me as well. [14:58:27] <_joe_> akosiaris: I love it too [15:01:21] (03PS6) 10Faidon Liambotis: Kill webserver::static, now unused [puppet] - 10https://gerrit.wikimedia.org/r/185172 [15:01:22] (03PS6) 10Faidon Liambotis: certs: kill create_combined_cert [puppet] - 10https://gerrit.wikimedia.org/r/185174 [15:01:24] (03PS6) 10Faidon Liambotis: mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 [15:01:27] (03PS1) 10Alexandros Kosiaris: Add HTTPS support to parsoid varnishes [puppet] - 10https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847) [15:02:03] why sni? [15:03:43] role::cache::ssl::misc should suffice, no? [15:04:16] (03CR) 10Faidon Liambotis: [C: 04-1] "This only needs *.wikimedia.org, not the whole set of SNI certificates, no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847) (owner: 10Alexandros Kosiaris) [15:05:56] (03CR) 10Faidon Liambotis: [C: 032] mailman: switch from lighttpd to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/185171 (owner: 10Faidon Liambotis) [15:06:12] fatal: Unable to create '/var/lib/git/operations/puppet/.git/refs/remotes/origin/production.lock': File exists. [15:06:15] argh [15:06:33] that's for strontium only [15:06:50] how can I forcibly resync it easily? [15:09:11] from palladium? it should be enough to ssh as gitpuppet to strontium to pull the trigger [15:09:15] (03CR) 10Alexandros Kosiaris: "Indeed. I just tried to reuse the same class to DRY. I can obviously create a new class that only does the unified (for older browsers) an" [puppet] - 10https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847) (owner: 10Alexandros Kosiaris) [15:09:22] <_joe_> paravoid: puppet-merge on strontium [15:09:46] ok, what _joe_ said worked [15:09:46] paravoid: I usually pull the trigger... [15:09:53] akosiaris: role::cache::ssl::misc should work [15:09:59] no need for a new class [15:10:04] misc ? [15:10:08] ew.. [15:10:11] well misc is just *.wikimedia.org [15:10:32] and planet and wmfusercontent [15:10:57] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to parse template apache/sites/lists.wikimedia.org.erb: [15:11:09] and it does not have the sni certs from what I see [15:11:10] I love how we have CI to save us from these [15:11:50] puppet compiler would have saved you :P [15:12:50] so, paravoid role::cache::ssl::misc is missing sni and has wmfuserontent.org and planet.wikimedia.org certificates as an extra [15:12:53] preferences ? [15:13:02] new class :) [15:13:09] with just *.wikimedia.org [15:13:24] hehe, just what I was trying to avoid. 
OK will do
[15:13:48] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: puppet fail
[15:15:18] (PS1) Faidon Liambotis: mailman: fix $ssl_settings for Apache [puppet] - https://gerrit.wikimedia.org/r/185183
[15:15:43] (CR) Faidon Liambotis: [C: 2] mailman: fix $ssl_settings for Apache [puppet] - https://gerrit.wikimedia.org/r/185183 (owner: Faidon Liambotis)
[15:16:34] Analytics, operations, ops-core: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#979566 (mark)
[15:19:57] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[15:20:46] (PS7) Faidon Liambotis: Kill webserver::static, now unused [puppet] - https://gerrit.wikimedia.org/r/185172
[15:20:48] (PS7) Faidon Liambotis: certs: kill create_combined_cert [puppet] - https://gerrit.wikimedia.org/r/185174
[15:20:50] (PS1) Faidon Liambotis: mailman: add mod headers to Apache [puppet] - https://gerrit.wikimedia.org/r/185184
[15:21:19] (CR) Faidon Liambotis: [C: 2 V: 2] mailman: add mod headers to Apache [puppet] - https://gerrit.wikimedia.org/r/185184 (owner: Faidon Liambotis)
[15:22:14] (CR) Faidon Liambotis: [C: 2] Kill webserver::static, now unused [puppet] - https://gerrit.wikimedia.org/r/185172 (owner: Faidon Liambotis)
[15:22:59] (CR) Faidon Liambotis: [C: 2] certs: kill create_combined_cert [puppet] - https://gerrit.wikimedia.org/r/185174 (owner: Faidon Liambotis)
[15:27:01] root 10853 0.0 0.2 109484 19212 ? Ss 2014 1:42 python /usr/local/sbin/grain-ensure contains trebuchet_master tin.eqiad.wmnet
[15:27:04] root 15281 0.0 0.2 109480 19208 ?
Ss 2014 1:44 python /usr/local/sbin/grain-ensure contains trebuchet_master tin.eqiad.wmnet
[15:27:08] sigh
[15:27:15] that salt grain logic is *so* crazy
[15:27:19] I was looking at it last week
[15:27:36] it used to be a lot crazier but ori cleaned it up a bit
[15:27:41] it's still crazy as shit though
[15:27:59] (PS2) Alexandros Kosiaris: Add HTTPS support to parsoid varnishes [puppet] - https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847)
[15:28:24] the original plan was for puppet to run a script, that contacts the salt master, that contacts the salt minion and tells it to...
[15:28:33] write a value to a file
[15:29:11] <_joe_> wat?
[15:29:22] basically, echo 'trebuchet_master: tin.eqiad.wmnet' > /etc/salt/grains
[15:29:30] I swear
[15:29:50] <_joe_> so a file resource in puppet
[15:29:52] <_joe_> :)
[15:29:53] ori changed the python scripts to use some salt hacky internals that tell it to not contact the master
[15:30:11] but it still loads the whole of salt, execing about a dozen times to collect all kinds of facts first
[15:30:32] on clean systems that just include standard, it takes about 25-30% of the total puppet run in run time
[15:30:55] well, no, it's a definition that you can call from multiple places to set multiple different key/values
[15:31:02] and salt doesn't support a grains.d
[15:31:39] so we'd have to either hack it with an Exec resource, or do concat::fragment magic
[15:31:56] now, why a DNS server needs to know the trebuchet_master... beats me
[15:32:01] * _joe_ flees in horror
[15:32:06] anyone brave enough to review https://gerrit.wikimedia.org/r/185181?
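The grains problem described above (several callers each setting one key/value in a single /etc/salt/grains file, without loading salt) can be sketched in a few lines. This is only an illustration, not the grain-ensure script being discussed: the function name `ensure_grain` is hypothetical, and it assumes the file holds flat `key: value` lines rather than arbitrary YAML.

```python
import os
import tempfile


def ensure_grain(path, key, value):
    """Set `key: value` in a flat grains file; return True if the file changed.

    Assumption: the file contains only top-level `key: value` lines,
    so no YAML parser (and no salt import) is needed.
    """
    try:
        with open(path) as fh:
            lines = fh.read().splitlines()
    except FileNotFoundError:
        lines = []

    wanted = f"{key}: {value}"
    prefix = f"{key}:"
    found = False
    changed = False
    out = []
    for line in lines:
        if line.startswith(prefix):
            if found:
                changed = True  # drop duplicate entries for the same key
                continue
            found = True
            if line != wanted:
                changed = True
            out.append(wanted)
        else:
            out.append(line)
    if not found:
        out.append(wanted)
        changed = True

    if changed:
        # Write to a temp file in the same directory, then rename over the
        # original, so readers never see a half-written file.
        # Note: mkstemp creates the file with mode 0600.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            fh.write("\n".join(out) + "\n")
        os.replace(tmp, path)
    return changed
```

Because the function reports whether anything changed and leaves other keys alone, it can be called once per key from different places, which is the property the shared definition needs.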
[15:32:21] <_joe_> you said concat::fragment
[15:32:28] <_joe_> I'd use augeas instead
[15:32:31] well, or something :)
[15:32:34] it's a yaml file
[15:32:36] <_joe_> file_line
[15:32:39] augeas doesn't have a yaml lens
[15:32:51] what a shame
[15:32:52] because someone may want to deploy something via trebuchet on the DNS server :P
[15:33:07] <_joe_> yeah it was a statement like "I'd rather hang myself with a barb wire than use X"
[15:33:26] I wonder if we can use hiera for this somehow
[15:33:31] <_joe_> augeas being the equivalent of hanging yourself with a barb wire
[15:33:34] I'd be fine with a simple rope ... no need for blood
[15:33:44] augeas is nice
[15:33:52] augeas sucks
[15:33:56] <_joe_> lol
[15:34:03] mark: have you seen how I switched our grub stuff into augeas?
[15:34:09] no
[15:34:17] <_joe_> paravoid: that was awesome
[15:34:26] <_joe_> and well, augeas is nice, until it
[15:34:27] we have grub stuff?
[15:34:27] well, it sucks less than other stuff, but anything complex and you are done
[15:34:29] I needed to change it a bit for jessie, because jessie's default install doesn't add "splash"
[15:34:34] <_joe_> becomes like a plague
[15:34:53] <_joe_> akosiaris: I maintained a whole apache config with augeas
[15:34:57] <_joe_> a very complex one
[15:35:04] <_joe_> so I still have PTSD
[15:35:04] mark: yes, for setting up grub-over-serial and setting the elevator to deadline
[15:35:05] heh
[15:35:07] I did the same with postgres
[15:35:11] _joe_: that seems a recipe for disaster
[15:35:15] my head still hurts
[15:35:23] you're doing it wrong :(
[15:35:38] I know !!!
[15:35:45] it felt like a good idea at the time
[15:35:45] mark: https://gerrit.wikimedia.org/r/#/c/178897/
[15:36:13] I blame postgres for not being able to include files in the pg_hba.conf
[15:36:19] ah right
[15:36:36] so what's wrong with that paravoid?
[15:36:44] with what?
[15:36:48] that use of augeas
[15:36:52] nothing
[15:37:05] those old sed commands were originally mine
[15:37:09] you just said augeas is nice, so I pointed out a very recent example of me switching something you wrote to augeas
[15:37:12] and originate from our original move to ubuntu in 2006 :)
[15:37:22] used to be in debian postinst scripts at the time
[15:37:34] and moved into puppet well before augeas was available, heh
[15:37:41] and yet if you see the changeset, it's still only possible with jessie+
[15:37:48] (augeas 1.3.0)
[15:37:57] that's a shame
[15:38:11] what was the issue?
[15:38:19] the lens is missing
[15:38:25] ok
[15:38:26] the shellvars_list one
[15:38:28] shellvars exists
[15:38:32] but it's a PITA to work with grub
[15:39:13] (PS1) Giuseppe Lavagetto: [WMF] New package with additional patches and fixes to the ini files and to the upstart/init scripts [debs/hhvm] - https://gerrit.wikimedia.org/r/185187
[15:40:05] there's this too https://github.com/reidmv/puppet-module-yamlfile
[15:40:22] ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (Papaul) The SSDs are not installed into the system yet. The reason being that the adapter brackets that I have on-site are 2.5 and the new system adapter brackets are 3.5.
[15:40:39] <_joe_> paravoid: I was about to propose something like our php_ini() function
[15:40:43] using http://adrienthebo.github.io/puppet-filemapper/#How_it_works
[15:40:53] our php_ini isn't enough
[15:41:00] as we include yaml settings from all over the place
[15:41:08] salt grains that is
[15:41:11] <_joe_> yes right
[15:41:55] facter's api is much better
[15:42:51] <_joe_> not sure if it can add values to an existing hash/array (re: yamlfile)
[15:43:10] <_joe_> but this one or a slightly modified version should work
[15:45:59] <_joe_> yeah, it doesn't deep-merge as we would expect
[15:48:54] it's not that
[15:49:00] the problem is that you need ruby code on the agent
[15:49:07] php_ini and friends run on the master
[15:49:28] <_joe_> file_line as well, yes
[15:49:42] !log many frack host package updates and reboots
[15:49:46] Logged the message, Master
[15:50:21] anomie: Ping for SWAT in 10 minutes
[15:50:23] anomie: Ok
[15:50:29] (CR) Alexandros Kosiaris: [C: 2] Add HTTPS support to parsoid varnishes [puppet] - https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847) (owner: Alexandros Kosiaris)
[15:50:42] (CR) Ottomata: "I want to wait on this until I upgrade Hadoop. The current version of Hue that we have won't let me properly set a redirect to force SSL." [dns] - https://gerrit.wikimedia.org/r/180471 (owner: Dzahn)
[15:53:10] operations, ops-core: Policy and method for keeping packages on hosts up to date - https://phabricator.wikimedia.org/T86925#979597 (Gage) NEW
[15:55:13] grrrrrr. icinga
[15:55:29] what?
[15:55:45] paged for thulium, which is marked down for maintenance
[15:58:13] MediaWiki-ResourceLoader, operations, MediaWiki-Core-Team: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#979615 (Nemo_bis) > So we think this is inherent to the design of our deploy mechanisms? If so, this report should be moved rom M...
[15:59:14] operations, ops-core: Policy and method for keeping packages on hosts up to date - https://phabricator.wikimedia.org/T86925#979618 (Gage) My idea: run our own apt repo ("precise-wmf" or etc.) into which admins manually accept new versions of packages, similar to how packages are promoted from Testing to Stabl...
[15:59:30] operations, ops-core: Policy and method for keeping packages on hosts up to date - https://phabricator.wikimedia.org/T86925#979620 (Joe) Just to be sure I got this right: are you proposing of running unattended upgrades on our production cluster? I could second that (but with a grain of salt and some cautio...
[15:59:43] MediaWiki-ResourceLoader, operations, MediaWiki-Core-Team: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#979621 (matmarex)
[16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150115T1600).
[16:00:11] Not it
[16:00:14] * anomie begins SWAT
[16:00:28] <^d> Was going to say, you have the only thing
[16:00:57] MediaWiki-ResourceLoader, operations, MediaWiki-Core-Team: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#979631 (Jdforrester-WMF) It's an over-optimisation in ResourceLoader which creates a bug exposed by the way we deploy.
[16:01:09] !log Elasticsearch cluster for logstash has indices for events dated 2015-12-* again
[16:01:14] Logged the message, Master
[16:01:19] he's got a patch
[16:01:20] yeah
[16:01:34] operations, ops-core: Policy and method for keeping packages on hosts up to date - https://phabricator.wikimedia.org/T86925#979632 (Gage) Yes, unattended upgrades of manually-approved updates. It sounds alarming, but to me it's not as alarming as the current situation.
[16:01:37] <^d> bd808: Don't tell me 2015 is over already.
[16:01:45] <^d> I mean time flies, but damn.
[16:01:51] <^d> I thought we just celebrated the new year.
[16:01:53] <_joe_> bd808: I restarted ES on logstash1002 this morning
[16:01:56] are those just dated that way?
[16:01:59] <_joe_> it wasn't able to find the master
[16:02:04] lame
[16:02:41] _joe_: ok. node restarts was what resurrected the bad indices the last time too
[16:03:08] <_joe_> bd808: sorry about that, but from my LS/ES experience it was the right thing to do
[16:03:20] <_joe_> should I have done something else/something more?
[16:03:26] manybubbles: At some point the syslog log input went a bit nuts and created weird event timestamps
[16:03:36] _joe_: nope, you did the right thing
[16:03:37] sad
[16:04:05] there is just some bad data in the cluster that apparently doesn't want to disappear
[16:04:20] <_joe_> jgage: unattended upgrades are a powerful tool, but it would need some careful planning I guess
[16:04:29] the only problem it causes is using up a bunch of file handles
[16:06:48] yeah, careful planning is what i'm hoping to stimulate by opening that task :)
[16:07:31] !log anomie Synchronized php-1.25wmf14/extensions/FlaggedRevs/api/actions/ApiReview.php: SWAT: Fix FlaggedRevs action=review for binary flagging [[gerrit:185180]] (duration: 00m 07s)
[16:07:33] anomie: ^ test please
[16:07:36] Logged the message, Master
[16:08:27] <_joe_> jgage: your idea however wouldn't work as well.
[16:08:55] <_joe_> because there are a few softwares we definitely don't want to upgrade automagically in any case
[16:09:01] !log Deleted 2015-12-* indices from logstash elasticsearch cluster
[16:09:04] Logged the message, Master
[16:09:06] <_joe_> like hhvm on the appservers, or mariadb on the databases
[16:09:25] anomie: Can't test, I don't have the user right needed.
[16:09:29] anomie: Oh well then.
[16:09:31] <_joe_> or varnish on the cache hosts
[16:09:40] * anomie is done with SWAT
[16:09:48] yeah
[16:10:34] but there must be a way to avoid having 141 packages out of date on prod hosts
[16:10:47] <_joe_> yes, read the link I posted there
[16:10:54] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[16:11:04] <_joe_> the unattended upgrades mechanism can be tuned on a per-server basis
[16:11:14] <_joe_> not to update critical packages unattended
[16:11:23] <_joe_> not sure if someone here hates it
[16:11:41] no, that won't work
[16:11:42] any chance you can comment on my hiera blocker this week? :)
[16:11:52] <_joe_> paravoid: why?
[16:12:07] unattended upgrades will upgrade daemons and postinsts call invoke-rc.d which would randomly restart daemons at random times
[16:12:32] and at the same time, upgraded libraries (e.g. libssl) would not restart the daemons that use them
[16:12:54] leaving us seemingly vulnerable despite appearances
[16:12:57] <_joe_> this latter part is of course true
[16:13:02] we need to write something better than this
[16:13:42] ops-core: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#979659 (mark) Also, loading iptables on these high performance boxes can have an impact on network performance, so we should hope to avoid it for that reason too.
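The library point raised in the unattended-upgrades discussion above (an upgraded libssl does not restart the daemons still running the old copy) can be checked on Linux by looking for shared objects that a process still has mapped although the file on disk has been replaced. The sketch below is an illustration only, not a tool anyone in the channel mentions; the names `deleted_libs` and `stale_processes` are hypothetical, and it relies on /proc/<pid>/maps marking such mappings with a trailing "(deleted)".

```python
import os
import re

# Matches a mapped path containing ".so" that the kernel reports as deleted.
_DELETED_LIB = re.compile(r"\s(/\S+\.so[^\s]*) \(deleted\)$")


def deleted_libs(maps_text):
    """Return the set of shared-library paths that are mapped but deleted on disk."""
    libs = set()
    for line in maps_text.splitlines():
        match = _DELETED_LIB.search(line)
        if match:
            libs.add(match.group(1))
    return libs


def stale_processes(proc_root="/proc"):
    """Map pid -> deleted libraries still mapped by that process (Linux only)."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as fh:
                libs = deleted_libs(fh.read())
        except OSError:
            continue  # process exited, or we lack permission to read it
        if libs:
            result[int(entry)] = libs
    return result
```

Run as root after a package upgrade, `stale_processes()` would list the daemons that still need a restart, which leaves the decision of when to restart them to a human or a scheduler rather than to a postinst script.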
[16:16:19] (PS1) Alexandros Kosiaris: LVS configuration for Parsoid/CXServer HTTPS [puppet] - https://gerrit.wikimedia.org/r/185194
[16:16:38] (PS2) QChris: Allow to insert to external Hive partitions, as we're on HDFS anyways [puppet/cdh] - https://gerrit.wikimedia.org/r/185176
[16:18:07] (CR) Ottomata: [C: 2] Allow to insert to external Hive partitions, as we're on HDFS anyways [puppet/cdh] - https://gerrit.wikimedia.org/r/185176 (owner: QChris)
[16:19:57] Jeff_Green: there's a bunch of icinga alerts for frack hosts
[16:20:08] backup4001, barium, payments1004
[16:20:13] looking
[16:21:33] ottomata: an1003 has kafkatee in a weird dpkg state and an acknowledged icinga alert saying so... for 97 days
[16:21:37] can we fix that?
[16:22:10] (PS1) Andrew Bogott: Mount /var/log on lvm but not /var [puppet] - https://gerrit.wikimedia.org/r/185195
[16:22:18] paravoid: https://gerrit.wikimedia.org/r/#/c/185194/ I am not 100% sure about this. Anything I have missed ?
[16:22:52] bgp no?
[16:23:12] ah, because http does that
[16:23:23] yup
[16:23:30] paravoid, interesting, will check it out. analytics1003 can probably just be decommissioned, it is a cisco
[16:24:55] RECOVERY - DPKG on analytics1003 is OK: All packages OK
[16:25:00] hm, just removed kafkatee.
[16:25:02] heh
[16:25:02] das fine
[16:25:09] (CR) coren: [C: -1] "Slight error (detail in comment)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/185195 (owner: Andrew Bogott)
[16:25:53] paravoid, as for that analytics1002 hadoop namenode alert + ack, I am writing an email now to schedule a namenode migration, hopefully tomorrow morning
[16:26:11] experimented in labs and wrote up process yesterday: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Migrating_to_new_HA_NameNodes
[16:26:21] how about amssq42 varnishkafka delivery errors for 7d21h?
[16:26:22] :)
[16:27:09] where do you see that?
[16:27:15] going to host..
[16:27:19] hm
[16:27:22] per minute.
[16:27:34] ah unknown
[16:27:35] (CR) Alexandros Kosiaris: [C: 2] LVS configuration for Parsoid/CXServer HTTPS [puppet] - https://gerrit.wikimedia.org/r/185194 (owner: Alexandros Kosiaris)
[16:28:44] (PS2) Andrew Bogott: Mount /var/log on lvm but not /var [puppet] - https://gerrit.wikimedia.org/r/185195
[16:29:04] operations, ops-core: Policy and method for keeping packages on hosts up to date - https://phabricator.wikimedia.org/T86925#979695 (Gage) 08:08 < _joe_> jgage: your idea however wouldn't work as well. 08:08 < _joe_> because there are a few softwares we definitely don't want to upgrade automagically in any cas...
[16:29:16] (CR) Alexandros Kosiaris: [C: -1] "LGTM, minor comment about port" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/184904 (https://phabricator.wikimedia.org/T76949) (owner: Jforrester)
[16:30:20] ops-codfw: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#979697 (Joe) a:RobH
[16:30:25] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:30:52] ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979699 (Joe) a:RobH
[16:31:06] rbf?
[16:31:09] (CR) coren: [C: 1] "That should work."
[puppet] - https://gerrit.wikimedia.org/r/185195 (owner: Andrew Bogott)
[16:31:35] (CR) Andrew Bogott: [C: 2] Mount /var/log on lvm but not /var [puppet] - https://gerrit.wikimedia.org/r/185195 (owner: Andrew Bogott)
[16:33:58] (CR) Jforrester: Re-use Parsoid Varnishes for citoid too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/184904 (https://phabricator.wikimedia.org/T76949) (owner: Jforrester)
[16:34:00] (PS2) Jforrester: Re-use Parsoid Varnishes for citoid too [puppet] - https://gerrit.wikimedia.org/r/184904 (https://phabricator.wikimedia.org/T76949)
[16:35:21] ops-network: Possible bandwidth saturation on bits esams - https://phabricator.wikimedia.org/T86749#979708 (faidon) Open>Resolved a:faidon
[16:35:38] ops-network: Possible bandwidth saturation on bits esams - https://phabricator.wikimedia.org/T86749#975518 (faidon) Resolved>Invalid
[16:37:19] ottomata: so... why not upgrade to jessie instead of trusty? :)
[16:37:36] paravoid, i asked this question back in november or something
[16:37:38] might be a bit bleeding edge
[16:37:41] "should I not upgrade to trusty then...?"
[16:37:58] all the nodes are already trusty anyway, except for these two (and the kafka nodes)
[16:38:01] ah okay
[16:38:13] also :( http://archive-primary.cloudera.com/cdh5/debian/
[16:38:17] vs: http://archive-primary.cloudera.com/cdh5/ubuntu/
[16:39:06] paravoid, amssq42 logster is not working, because it is trusty, and i think there is not a logster package for trusty? at least not an up to date one? on it...
[16:39:35] what's the vs.?
[16:39:40] oh,
[16:39:50] only wheezy dist in debian
[16:40:00] well, yes, jessie hasn't been released yet :)
[16:40:10] it's all java though, it will probably work
[16:40:13] oh, it hasn't?
(hm)
[16:40:15] ya most likely
[16:40:21] but if you're all trusty now it doesn't matter
[16:40:28] we can talk about this in a year or whatever
[16:40:30] i'm actually running the precise packages on the trusty nodes currently
[16:40:31] yeah.
[16:41:03] (PS1) KartikMistry: Use https for ContentTranslation [mediawiki-config] - https://gerrit.wikimedia.org/r/185196
[16:41:19] akosiaris: ^
[16:44:14] <_joe_> < paravoid> it's all java though, it will probably work
[16:44:23] <_joe_> this quote has been saved for future mocking
[16:44:30] haha
[16:45:02] hahahah
[16:46:07] (PS2) KartikMistry: Do not use protocol for ContentTranslation [mediawiki-config] - https://gerrit.wikimedia.org/r/185196
[16:48:19] Is the Foundation using Ceph or Swift for file storage? Wikitech-l pages on the subjects haven't been updated in a while and was wondering what the status was...
[16:49:22] greg-g: https://gerrit.wikimedia.org/r/185196 is a very simple change in the configuration for the ContentTranslation extension. It does not work on many browsers without it. When is the best time to deploy it ?
[16:50:27] <_joe_> Lcawte: swift
[16:50:42] akosiaris: doit, easy/simple enough
[16:50:52] kart_: ^
[16:50:54] greg-g: cool, thanks!
[16:50:55] (CR) Chad: [C: 2] Do not use protocol for ContentTranslation [mediawiki-config] - https://gerrit.wikimedia.org/r/185196 (owner: KartikMistry)
[16:51:00] (Merged) jenkins-bot: Do not use protocol for ContentTranslation [mediawiki-config] - https://gerrit.wikimedia.org/r/185196 (owner: KartikMistry)
[16:51:15] looks like chad beat you
[16:51:24] ah
[16:51:26] :)
[16:51:29] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 06s)
[16:51:41] Logged the message, Master
[16:51:54] ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979745 (RobH) So these are dual cpu (4 core) systems with 32GB of memory.
My spare systems in codfw are slightly better, but will work. Allocating: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32 GB Memory, (2) 500GB Disk...
[16:52:00] :-)
[16:52:14] ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979746 (RobH) p:Triage>Normal
[16:52:45] ops-codfw: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#979748 (RobH)
[16:53:05] Thanks akosiaris greg-g !
[16:53:15] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:53:26] and ^d too!
[16:53:33] hm
[16:54:33] <^d> yw
[16:55:44] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[17:01:34] ops-core: Place ms1004 server back into the pool - https://phabricator.wikimedia.org/T86933#979759 (faidon) NEW a:RobH
[17:16:43] (PS1) Andrew Bogott: Add rootdelay=N to the kernel settings in debian images. [puppet] - https://gerrit.wikimedia.org/r/185199
[17:17:52] (CR) Andrew Bogott: [C: 2] Add rootdelay=N to the kernel settings in debian images. [puppet] - https://gerrit.wikimedia.org/r/185199 (owner: Andrew Bogott)
[17:23:56] #puppet is now a project?
[17:24:36] ?
[17:25:00] snide remark but kinda true: "Isn't everything you guys do puppet?"
[17:25:17] :)
[17:25:24] https://phabricator.wikimedia.org/tag/puppet/
[17:25:31] heh
[17:25:53] (and it's a "tag" not a "project", according to its color/icon, which is better yeah)
[17:26:08] move along
[17:26:10] :)
[17:26:16] nothing to see here
[17:26:37] I'll create a tag #code next
[17:27:19] akosiaris: :) :)
[17:34:19] ops-codfw: es2010 raid degraded - https://phabricator.wikimedia.org/T85978#979849 (Cmjohnson)
[17:34:21] ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#979850 (Cmjohnson)
[17:37:45] another side comment: why do we give +v to anyone who connects with an unmasked account from the sf office's ip space?
[17:38:11] which ends up making the voiced people not opsen and the opsen aren't voiced
[17:41:14] (PS1) Andrew Bogott: Set rootdelay=90 [puppet] - https://gerrit.wikimedia.org/r/185204
[17:42:27] (PS2) Andrew Bogott: Set rootdelay=20 [puppet] - https://gerrit.wikimedia.org/r/185204
[17:43:49] ops-codfw: label & setup drac/basic setings for rbf2001 & rbf2002 - https://phabricator.wikimedia.org/T86940#979871 (RobH) NEW a:Papaul
[17:44:10] (CR) Andrew Bogott: [C: 2] Set rootdelay=20 [puppet] - https://gerrit.wikimedia.org/r/185204 (owner: Andrew Bogott)
[17:44:54] (PS1) RobH: setting mgmt info for codfw rbf servers [dns] - https://gerrit.wikimedia.org/r/185207
[17:45:34] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:22] (CR) RobH: [C: 2] setting mgmt info for codfw rbf servers [dns] - https://gerrit.wikimedia.org/r/185207 (owner: RobH)
[17:47:51] ops-core: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979890 (RobH)
[17:51:35] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59671 bytes in 6.638 second response time
[17:55:26] ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (Papaul) Received 2.5 to 3.5 adapters for the system
[17:56:31] ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#979907 (Papaul) Received 2 disks for the system.
[17:57:12] ops-eqiad: Rack and setup graphite1001 - https://phabricator.wikimedia.org/T86939#979910 (RobH)
[17:57:43] operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#979913 (Jgreen) p:Triage>Normal
[18:15:56] PROBLEM - SSH on labstore1001 is CRITICAL: Connection timed out
[18:15:57] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[18:16:05] paravoid: I have an odd issue all of a sudden; only one of the two configured IPs on the labstore1001 bond interface actually responds to pings.
[18:16:06] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[18:16:06] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[18:16:06] paravoid: Nevermind, seems to have been arp pollution.
[18:17:55] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host
[18:18:05] PROBLEM - dhclient process on labstore1001 is CRITICAL: Connection refused by host
[18:18:05] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host
[18:18:14] PROBLEM - configured eth on labstore1001 is CRITICAL: Connection refused by host
[18:18:25] PROBLEM - puppet last run on labstore1001 is CRITICAL: Connection refused by host
[18:18:44] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host
[18:21:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[18:23:21] !log deployed patches for T85349 T85850 T86711
[18:23:29] Logged the message, Master
[18:33:24] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown
[18:37:19] bd808: https://gerrit.wikimedia.org/r/185217
[18:37:24] no grrit-wm because labs downtime
[18:37:45] Coren: why did you switch us to ".crt" instead of ".pem" when you did the localssl switch?
[18:37:53] er, localcerts
[18:37:59] localssl is a different thing :)
[18:38:04] YuviPanda: +1
[18:38:30] Because ubuntu management of the symlink farm demands it; that was discussed at length at the time though mostly on IRC
[18:38:44] what symlink farm?
[18:38:46] (It ignores all non *.crt filename)
[18:39:15] where?
[18:39:27] that's for /var/lib/ca-certificates I assume
[18:39:32] I'm asking about /etc/ssl/localcerts
[18:39:35] No, /etc/ssl/certs/
[18:39:52] Oh, why localcerts also uses .crt?
[18:39:56] yes
[18:40:13] RECOVERY - DPKG on labstore1001 is OK: All packages OK
[18:40:15] root@copper:~# ls /etc/ssl/certs/*.crt
[18:40:15] /etc/ssl/certs/ca-certificates.crt
[18:40:15] root@copper:~# ls /etc/ssl/certs/*.pem |wc -l
[18:40:15] 168
[18:40:17] fwiw
[18:40:22] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[18:40:22] Consistency, nothing else. At the time, pretty much hated .crt but agreed that having some .crt and some .pem was worse
[18:40:43] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:40:58] Yeah, the .pem symlinks have been hacked in to avoid breaking prod.
[18:41:02] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient
[18:41:02] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical
[18:41:12] I don't understand a thing :)
[18:41:33] paravoid: Sorry, I'm *really* distracted atm so I'm probably being very unclear.
[18:42:02] we used to have certificates under /etc/ssl/certs/foo.pem
[18:42:10] Give me an hour or two to make sure that my filesystem switch is working fine and I'll try to give you a saner summary?
[18:42:13] you moved them to /etc/ssl/localcerts/foo.crt
[18:42:31] Yes, because /etc/ssl/certs/ is a linkfarm automanaged by debian.
[18:42:34] the localcerts change is very good, I'm wondering why the filesystem extension change
[18:42:34] RECOVERY - configured eth on labstore1001 is OK: NRPE: Unable to read output
[18:42:49] s/debian/ubuntu/ Might be debian too, but I know only for sure that it is with Ubuntu
[18:43:15] it's not a linkfarm exclusively, but it's the CA store and individual certificates do not belong there, so that change is definitely good
[18:49:55] I'll have to refresh my memory, but I know there was an issue with using .pem for some of them, and after a lot of discussion with Rob and others at the time we settled on using .crt as the normal "default"; it *had* to do with the CA store and what it expected + some desire for consistency but it's a while ago and I'll need to look back at context with a clearer head to recall the rest.
[18:55:48] bd808: I’ll merge that patch now, although it won’t have any effect
[18:56:20] YuviPanda: cool beans. greg-g will be happy to see it fixed
[18:56:27] yup yup
[18:58:31] which?
[18:59:46] !log this works?
[18:59:50] well, apparently not
[18:59:52] Logged the message, Master
[18:59:59] ottomata: where is /etc/ssl/certs/hue.cert coming from?
[19:00:27] it is a self generated cert, and will be removed after I upgrade to CDH 5.3 and can force SSL and use misc-web-lb ssl proxy
[19:00:39] that's why i haven't merged the DNS change for hue.wikimedia.org yet
[19:00:41] ok
[19:08:23] YuviPanda: it did
[19:20:53] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: puppet fail
[19:29:13] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[19:49:22] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[19:49:32] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out.
[19:49:52] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out.
[19:57:51] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:57:51] !log aborting labs filesystem move (not enough contiguous free space) and postponing until new shelf
[19:57:52] Coren: Icopying that over to SAL
[19:57:52] Ah, morebots timed out. How sad.
[19:57:52] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms
[19:58:44] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 125, initializing_shards: 0, number_of_data_nodes: 3
[20:01:21] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 115, initializing_shards: 4, number_of_data_nodes: 3
[20:02:00] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 115, initializing_shards: 4, number_of_data_nodes: 3
[20:03:58] !log logstash redis queue backlog 384k events and climbing; likely related to the elasticsearch cluster flapping
[20:04:05] Logged the message, Master
[20:06:56] ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#980172 (RobH) NEW
[20:12:04] YuviPanda: actually I'd rather us move to the other direction and keep system users out of /home...
[20:12:16] hmm
[20:13:54] bd808: ^
[20:14:01] jgage: logstash cluster is really unhappy :( I think the increase in events caused by my partial conversion of MW to log via redis has pushed it over the edge
[20:14:11] it was in /var/lib first, and then it went to /home...
[20:14:15] YuviPanda: don't ^ me! ;)
[20:14:17] I should dig up to see why
[20:14:26] bd808: :P who should I poke about deployment-y things?
[20:14:51] Ori did it when he setup the shared ssh-agent I think (moved the homedir)
[20:15:00] hmm
[20:16:00] https://github.com/wikimedia/operations-puppet/commit/802c7568a627cea53438425a0c450f93fd22e273
[20:16:16] bd808 :(
[20:16:25] hmm, right
[20:17:03] paravoid: keeping system users out of /home has the additional advantage of being much simpler to do things around with in labs, since /home is shared...
[20:17:05] jgage: yeah... not sure how to fix it at the moment other than power through and finish the conversion so we can stop sending duplicate events via udp2log and redis
[20:17:17] not sure why ori switched mwdeploy's
[20:17:17] ops-requests, WMF-Design: optoutresearch@ list, add recipient - https://phabricator.wikimedia.org/T86551#980185 (Aklapper) Let's try and see what happens :)
[20:18:39] ops-eqiad: Rack and setup graphite1001 - https://phabricator.wikimedia.org/T86939#980188 (Cmjohnson) graphite is racked, ssds installed, dns completed, update switch (ge-4/0/6), racktables completed and dhcpd file update.
Needs Partitioning Install [20:18:53] jgage: These patches would greately decrease the load -- https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:logstash,n,z -- but the one I -1'd needs to wait on https://gerrit.wikimedia.org/r/#/c/185210/ [20:19:08] We could push https://gerrit.wikimedia.org/r/#/c/185210/ out today though [20:20:17] Everything is plugged up right now because of the cluster freaking out and changing the master (probably an OOM) [20:21:29] <^d> Anyone able to merge a .bashrc for me? https://gerrit.wikimedia.org/r/#/c/184783/ [20:21:37] There are ~300K events sitting in the 3 redis queues [20:22:30] bd808 ok i will merge 185210, looks simple enough. the irc bot is silent due to labs maintenance. [20:23:40] oh hm i thought it was a puppet change [20:23:48] i don't know how to push mediawiki changes [20:24:06] labs maintenance is done [20:24:10] I should restar the bot [20:25:50] hello, grrrit-wm [20:25:53] (PS2) BryanDavis: logstash: Update apache2 parsing pattern [puppet] - https://gerrit.wikimedia.org/r/184112 [20:25:59] (PS9) BryanDavis: logstash: parse json encoded hhvm fatal errors [puppet] - https://gerrit.wikimedia.org/r/179759 [20:26:13] (PS1) BryanDavis: Switch all wikis to monolog logger [mediawiki-config] - https://gerrit.wikimedia.org/r/185210 [20:26:24] (PS1) Cmjohnson: Adding dns for graphite1001 [dns] - https://gerrit.wikimedia.org/r/185212 [20:26:26] (CR) Cmjohnson: [C: 2] Adding dns for graphite1001 [dns] - https://gerrit.wikimedia.org/r/185212 (owner: Cmjohnson) [20:26:50] (PS1) Cmjohnson: Revert "Adding dns for graphite1001 ---added old and bad change" [dns] - https://gerrit.wikimedia.org/r/185213 [20:27:02] (Abandoned) Cmjohnson: Revert "Adding dns for graphite1001 ---added old and bad change" [dns] - https://gerrit.wikimedia.org/r/185213 (owner: Cmjohnson) [20:27:59] (PS1) Cmjohnson: Fixing a change that I didn't meant to push
involving Dickson [dns] - https://gerrit.wikimedia.org/r/185216 [20:28:03] (CR) Cmjohnson: [C: 2] Fixing a change that I didn't meant to push involving Dickson [dns] - https://gerrit.wikimedia.org/r/185216 (owner: Cmjohnson) [20:28:07] (PS1) Yuvipanda: beta: Fix mwdeploy's ssh key path to point to correct path [puppet] - https://gerrit.wikimedia.org/r/185217 (https://phabricator.wikimedia.org/T86901) [20:28:13] (CR) BryanDavis: [C: 1] beta: Fix mwdeploy's ssh key path to point to correct path [puppet] - https://gerrit.wikimedia.org/r/185217 (https://phabricator.wikimedia.org/T86901) (owner: Yuvipanda) [20:28:37] (PS1) Dzahn: misc-varnish: add annual.wm.org -> zirconium [puppet] - https://gerrit.wikimedia.org/r/185218 (https://phabricator.wikimedia.org/T599) [20:28:43] (PS2) Dzahn: misc-varnish: add annual.wm.org -> zirconium [puppet] - https://gerrit.wikimedia.org/r/185218 (https://phabricator.wikimedia.org/T599) [20:29:02] (CR) Dzahn: [C: 2] "annual.wikimedia.org is an alias for misc-web-lb.eqiad.wikimedia.org."
[puppet] - https://gerrit.wikimedia.org/r/185218 (https://phabricator.wikimedia.org/T599) (owner: Dzahn) [20:29:06] (PS2) Yuvipanda: beta: Fix mwdeploy's ssh key path to point to correct path [puppet] - https://gerrit.wikimedia.org/r/185217 (https://phabricator.wikimedia.org/T86901) [20:29:15] (PS1) BryanDavis: Filter apache2, hhvm and kernel logs from logstash feed [puppet] - https://gerrit.wikimedia.org/r/185221 [20:29:21] oh, it replays history [20:29:34] (PS3) Yuvipanda: beta: Fix mwdeploy's ssh key path to point to correct path [puppet] - https://gerrit.wikimedia.org/r/185217 (https://phabricator.wikimedia.org/T86901) [20:29:51] (PS1) BryanDavis: Exclude most udp2log messages from logstash [puppet] - https://gerrit.wikimedia.org/r/185222 [20:29:53] (PS2) BryanDavis: Exclude most udp2log messages from logstash [puppet] - https://gerrit.wikimedia.org/r/185222 [20:29:59] (CR) BryanDavis: [C: -1] "Should not merge until all wikis are converted to Monolog logging via I250ecfceb5084db574a20d771ec38b8a7ef6674a in operations/mediawiki-co" [puppet] - https://gerrit.wikimedia.org/r/185222 (owner: BryanDavis) [20:30:09] (CR) Yuvipanda: [C: 2] beta: Fix mwdeploy's ssh key path to point to correct path [puppet] - https://gerrit.wikimedia.org/r/185217 (https://phabricator.wikimedia.org/T86901) (owner: Yuvipanda) [20:30:18] (PS1) Yuvipanda: Followup to I2c7599bc2c032300c1bbaeed898b89614a16931f [puppet] - https://gerrit.wikimedia.org/r/185226 [20:30:22] (CR) Yuvipanda: [C: 2] Followup to I2c7599bc2c032300c1bbaeed898b89614a16931f [puppet] - https://gerrit.wikimedia.org/r/185226 (owner: Yuvipanda) [20:31:41] (PS1) Cmjohnson: Adding dhcpd for graphite1001 [puppet] - https://gerrit.wikimedia.org/r/185228 [20:31:43] (CR) Cmjohnson: [C: 2] Adding dhcpd for graphite1001 [puppet] - https://gerrit.wikimedia.org/r/185228 (owner: Cmjohnson) [20:31:51] (CR) Se4598: "inline question" (1 comment)
[puppet] - https://gerrit.wikimedia.org/r/185217 (https://phabricator.wikimedia.org/T86901) (owner: Yuvipanda) [20:32:07] (PS1) Ottomata: Update cdh module with change to all allow insert into hive external tables [puppet] - https://gerrit.wikimedia.org/r/185231 [20:32:11] (PS2) Ottomata: Update cdh module with change to all allow insert into hive external tables [puppet] - https://gerrit.wikimedia.org/r/185231 [20:32:13] (CR) Ottomata: [C: 2 V: 2] Update cdh module with change to all allow insert into hive external tables [puppet] - https://gerrit.wikimedia.org/r/185231 (owner: Ottomata) [20:32:22] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:32:31] (CR) Gage: [C: 2] Switch all wikis to monolog logger [mediawiki-config] - https://gerrit.wikimedia.org/r/185210 (owner: BryanDavis) [20:32:33] (Merged) jenkins-bot: Switch all wikis to monolog logger [mediawiki-config] - https://gerrit.wikimedia.org/r/185210 (owner: BryanDavis) [20:32:35] (PS1) Dzahn: Apache microsite for the WMF annual report [puppet] - https://gerrit.wikimedia.org/r/185235 (https://phabricator.wikimedia.org/T599) [20:32:37] (PS2) Dzahn: Apache microsite for the WMF annual report [puppet] - https://gerrit.wikimedia.org/r/185235 (https://phabricator.wikimedia.org/T599) [20:32:41] (PS3) Dzahn: Apache microsite for the WMF annual report [puppet] - https://gerrit.wikimedia.org/r/185235 (https://phabricator.wikimedia.org/T599) [20:34:04] bd808, greg-g: I +2'd https://gerrit.wikimedia.org/r/185210 for mediawiki-config but perhaps I should have left that for a swat deployer? [20:34:55] jgage: are you comfortable deploying it? :) [20:35:07] no :) [20:35:18] then yeah, should wait :) [20:35:22] well, i've never done one before [20:35:38] ok, can i undo my +2? it says Status: merged [20:35:38] it's a blast!
[20:35:51] will need a revert commit [20:35:59] ok [20:36:11] (PS1) Gage: Revert "Switch all wikis to monolog logger" [mediawiki-config] - https://gerrit.wikimedia.org/r/185238 [20:36:17] wait :P [20:36:42] * jgage waits [20:36:47] i'm ok deploying it if you like [20:36:52] thank you [20:37:02] shall i go ahead now? [20:37:06] ori: I trust you [20:37:27] * ori takes a screenshot [20:37:32] you saw it here, folks [20:37:38] haha [20:38:09] #instantregret [20:38:20] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [20:38:22] !log ori Synchronized wmf-config/InitialiseSettings.php: I250ecfceb: Switch all wikis to monolog logger (duration: 00m 05s) [20:38:28] Logged the message, Master [20:38:49] (CR) Dzahn: [C: 2] Apache microsite for the WMF annual report [puppet] - https://gerrit.wikimedia.org/r/185235 (https://phabricator.wikimedia.org/T599) (owner: Dzahn) [20:38:52] thanks ori [20:39:24] np [20:39:32] ops-core: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#980209 (RobH) NEW a:RobH [20:39:33] (Abandoned) Gage: Revert "Switch all wikis to monolog logger" [mediawiki-config] - https://gerrit.wikimedia.org/r/185238 (owner: Gage) [20:39:46] ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#980218 (RobH) [20:39:47] blargh Duplicate declaration: File[/srv/org] [20:39:48] ops-core: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#980217 (RobH) [20:42:24] (PS1) Dzahn: annualreport: remove duplicate dir declaration [puppet] - https://gerrit.wikimedia.org/r/185240 (https://phabricator.wikimedia.org/T746) [20:42:35] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: puppet fail [20:43:22] ops-eqiad: please check db1051 and db1056 for 2.5" disk brackets - https://phabricator.wikimedia.org/T86788#980225 (Cmjohnson) a:Cmjohnson>RobH
[20:43:34] (CR) Dzahn: [C: 2] annualreport: remove duplicate dir declaration [puppet] - https://gerrit.wikimedia.org/r/185240 (https://phabricator.wikimedia.org/T746) (owner: Dzahn) [20:44:56] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:44:59] Is it possible to attach a file to a phab task? [20:45:12] andrewbogott: yes, drag & drop it [20:45:17] hm [20:45:19] into the comment field [20:45:26] PROBLEM - Host mw1062 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:07] andrewbogott: far right button in the toolbar in Comment form, but it will only say to drag & drop it [20:49:20] mutante: thanks, drag-and-drop worked fine [20:49:39] <^d> Also if you've already uploaded a file to Phab you can include it in further ones by using the {F123} identifier in curly brackets. [20:51:45] RECOVERY - Host mw1062 is UP: PING WARNING - Packet loss = 73%, RTA = 0.90 ms [20:52:01] (PS1) Mforns: Add mforns to receive icinga alerts [puppet] - https://gerrit.wikimedia.org/r/185242 [20:54:16] PROBLEM - dhclient process on mw1062 is CRITICAL: Connection refused by host [20:54:16] PROBLEM - DPKG on mw1062 is CRITICAL: Connection refused by host [20:54:26] PROBLEM - Apache HTTP on mw1062 is CRITICAL: Connection refused [20:54:36] PROBLEM - HHVM processes on mw1062 is CRITICAL: Connection refused by host [20:54:36] PROBLEM - HHVM rendering on mw1062 is CRITICAL: Connection refused [20:54:36] PROBLEM - salt-minion processes on mw1062 is CRITICAL: Connection refused by host [20:54:45] PROBLEM - Disk space on mw1062 is CRITICAL: Connection refused by host [20:54:45] PROBLEM - nutcracker process on mw1062 is CRITICAL: Connection refused by host [20:54:46] PROBLEM - nutcracker port on mw1062 is CRITICAL: Connection refused by host [20:55:05] PROBLEM - configured eth on mw1062 is CRITICAL: Connection refused by host [20:55:06] PROBLEM - RAID on mw1062 is CRITICAL: Connection refused by host [20:58:35]
(CR) BryanDavis: "Dependency is merged and deployed." [puppet] - https://gerrit.wikimedia.org/r/185222 (owner: BryanDavis) [21:00:10] (Abandoned) BryanDavis: Filter apache2, hhvm and kernel logs from logstash feed [puppet] - https://gerrit.wikimedia.org/r/185221 (owner: BryanDavis) [21:00:29] (CR) Ori.livneh: "Needs rebase." [puppet] - https://gerrit.wikimedia.org/r/185222 (owner: BryanDavis) [21:02:29] (PS3) BryanDavis: Exclude most udp2log messages from logstash [puppet] - https://gerrit.wikimedia.org/r/185222 [21:02:52] (PS1) Mforns: Add mforns to analytics shinken alerts [puppet] - https://gerrit.wikimedia.org/r/185245 [21:03:02] wow. my spelling is horrible [21:03:13] (CR) Ori.livneh: "Is there a task in phab for migrating the remaining UDP-logged log channels?" [puppet] - https://gerrit.wikimedia.org/r/185222 (owner: BryanDavis) [21:03:16] ori: you changed home of mwdeploy from /var/lib to /home [21:03:20] ori: any particular reason? [21:03:39] don't remember, would need to look [21:03:40] paravoid (and me) were thinking we should keep system users off /home, esp. for labs with shared /home [21:03:45] ori: it was in https://github.com/wikimedia/operations-puppet/commit/802c7568a627cea53438425a0c450f93fd22e273 [21:04:02] message doesn’t metnion a reason for the homedir change, only for the shell change [21:04:23] oh, it's for keyholder, so that mwdeploy can have an ~/.ssh [21:04:29] (PS4) BryanDavis: Exclude most udp2log messages from logstash [puppet] - https://gerrit.wikimedia.org/r/185222 [21:04:50] /var/lib/mwdeploy/.ssh is still ~/.ssh :) [21:04:51] /var/lib is ugly anyway, you know i oppose making our configuration ugly because of indefensible limitations in labs [21:05:02] ori: There isn't a phab task yet but I can make some [21:05:11] why is it ugly?
[21:05:22] I'm not a big fan of /home for system users [21:05:35] PROBLEM - Host mw1062 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:01] .ssh is not variable state [21:06:09] especially given that systemd is in everyone's future [21:06:13] https://phabricator.wikimedia.org/T86903 <-- /home vs /var/lib task [21:06:16] RECOVERY - Host mw1062 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [21:06:25] (.ssh is going away soon anyway) [21:06:28] systemd treats /home special for security restrictions (i.e. keep system daemons from being able to read/write the "user" home mount) [21:06:34] and dotfiles/dot dirs are horrible everywhere, but outside of /home even worst [21:06:36] worse [21:06:51] !log deployed parsoid version 2fdf9298 [21:06:56] .ssh isn't gonna live for long [21:07:00] bd808: please do and include it in the commit [21:07:01] Logged the message, Master [21:07:39] https://github.com/gdnsd/gdnsd/blob/master/sysd/gdnsd.service.tmpl#L31 <- example, gdnsd using that file can't see /home at all, so that non-escalating exploits of the daemon don't leak user-private data, etc [21:07:44] change it back if you need to, but make sure it works [21:08:04] ori: The annoying thing is going to be getting the redis password into those 3 apps, but we can figure it out [21:08:22] Maybe we should remove the password from the logstash redis instances [21:10:32] operations, Beta-Cluster, Wikimedia-Labs-Infrastructure: Change mwdeploy homeDirectory field in LDAP from /home/mwdeploy to /var/lib/mwdeploy - https://phabricator.wikimedia.org/T86903#980386 (hashar) >>! In T86903#980010, @yuvipanda wrote: > It's /home/mwdeploy in prod, should be /home/mwdeploy in beta. Yea... [21:10:39] bd808: do the logging apps use redis's publish command to send logs?
[21:11:58] ori: They do an rpush -- https://github.com/Seldaek/monolog/blob/master/src/Monolog/Handler/RedisHandler.php#L48 [21:12:20] * greg-g not so secretly wants a separate virtualized cluster just for "beta cluster" that is not the same as WMF Labs [21:12:35] greg-g: +1 [21:12:55] greg-g: why? [21:13:22] bd808: why would it be tricky to get the password into those three apps? [21:13:23] more control, less variability due to other labs issues [21:13:33] more control of what? [21:14:06] paravoid: correct me if I'm wrong, but wouldn't this discussion of /home/ vs /var/lib/ be moot if /home/ weren't shared among all labs users? we'd just do what prod did and not worry about it [21:14:13] ori: It just means they need to dig around in puppet to find the password. It shouldn't be too hard I guess [21:14:28] no [21:14:53] bd808: 'include passwords::redis'? [21:15:03] greg-g: It would be moot if YuviPanda hadn't removed the patch that kept it at the old location ;) [21:15:05] I hadn't even realized there was a beta issue when I commented on that [21:15:35] I'm not a big fan of /home for system users [21:15:37] bd808: and manually changed the home on ldap :) [21:15:39] mwdeploy is not exactly a system user [21:15:42] scap relies on it having a login shell [21:15:46] so, looking at https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep, my naive assessment is all of the special paths listed there could be removed [21:15:47] it says system => true [21:16:12] (PS1) BBlack: txstatsd s/upstart/systemd/ on jessie+ [puppet] - https://gerrit.wikimedia.org/r/185247 [21:16:24] possibly that should be removed [21:16:28] ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#980424 (Cmjohnson) Moved the servers...but I made a few changes on the servers to be moved. please see below wmf3093 - dbproxy1004 wmf3094 dbproxy1005 wmf3099 -dbproxy1006 wmf3111 -dbproxy1007 wmf3110...
[21:16:48] ops-eqiad: please check db1051 and db1056 for 2.5" disk brackets - https://phabricator.wikimedia.org/T86788#980428 (Cmjohnson) No, they do not have the 2.5 disk slots. These are later gen R510's and would have the 2.5" disks in the back like the R720's [21:16:50] (CR) jenkins-bot: [V: -1] txstatsd s/upstart/systemd/ on jessie+ [puppet] - https://gerrit.wikimedia.org/r/185247 (owner: BBlack) [21:17:23] (CR) BryanDavis: "> Is there a task in phab for migrating the remaining UDP-logged log channels?" [puppet] - https://gerrit.wikimedia.org/r/185222 (owner: BryanDavis) [21:18:02] (PS2) BBlack: txstatsd s/upstart/systemd/ on jessie+ [puppet] - https://gerrit.wikimedia.org/r/185247 [21:18:06] puckfuppet [21:18:22] greg-g: my gut tells me that 95% of the beta hacks could go if people stopped taking shortcuts [21:18:40] hashar has always taken the easy route of "if $::realm == 'labs' { do special stuff }" [21:19:32] I used to -1/-2 them like hell in the past (like circa 2y ago) and after this got escalated high-up I was told to stop slowing him down [21:20:12] in all fairness, it's a tradeoff between fast and loose and properly rearchitecting stuff to make them capable of running in multiple environments (QA, staging, prod) [21:20:52] paravoid: yeah, it's tough, I don't blame any one person/group, we did what we thought was the best thign at the time, now we're reassessing and we will work together :) to change for the good [21:21:04] but there is a pattern of trying to work around prod [21:21:17] (CR) BBlack: [C: 2] "(perhaps not ideal factoring, but I'm not sure I trust the initsystem fact either, given that some systems might have both going forward.." [puppet] - https://gerrit.wikimedia.org/r/185247 (owner: BBlack) [21:21:20] ...for reasons that I can't inherently blame people for [21:21:47] for great justice?
[21:21:48] given the trade off decision that was made ("done soon" + "one-ish person") [21:22:20] it wasn't really expedient in hindsight [21:22:27] it has been a huge time-sink for several people [21:22:31] including, ultimately, hashar [21:22:37] so i don't buy that [21:23:00] bblack: what do you mean both? [21:23:14] ori: well, hindsight:20/20 etc [21:23:31] ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#980462 (Cmjohnson) Adding that the spare server page was updated to reflect my changes https://wikitech.wikimedia.org/wiki/Server_Spares#Dell_PowerEdge_R610.2C_Single_Intel_Xeon_E5640.2C_32GB_Memory.2C_D... [21:23:33] it didn't have near the dependency by other engineering teams before as it does now [21:23:35] well, foresight in faidon's case [21:23:42] is it a goal to remove module beta? [21:23:46] PROBLEM - NTP on mw1062 is CRITICAL: NTP CRITICAL: No response from NTP server [21:23:58] paravoid: might have both installed, like trusty does now? I checked and the initsystem fact detects that correctly for trusty+jessie, but it's fs path checks and such, they may not be valid everywhere. [21:24:11] s/it [21:24:13] bblack: and "provider =>" is redundant, puppet knows best [21:24:21] really? [21:24:24] yeah [21:24:50] are you recommending I trust $::initsystem instead of looking at os_version too then, or not? [21:25:09] well, I introduced $::initsystem for this exact purpose so... I guess I'm biased :) [21:25:22] I wrote that fact myself :) [21:25:32] ok ok :p it's a style thing. when in doubt I trust nothing I don't explicitly specify myself!
[21:26:00] the proliferation of provider => upstart in our tree is because of a puppet 2.7 bug btw [21:26:01] ops-eqiad: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#980487 (Cmjohnson) Disk replaced...need to reinstall [21:26:11] the upstart provider extends the debian provider, and the debian provider extends init, and all of the upstart methods start with is_upstart? (do something) : super [21:26:19] not sure how systemd fits [21:26:30] 'provider => upstart' can just go [21:26:59] puppet will prefer the systemd units or upstart services over sysv init, if both are found [21:27:07] both systemd+init.d or upstart+init.d [21:27:11] hmm, I don’t think there’s a specific bug about /home vs /var/lib for mwdeploy. [21:27:13] * YuviPanda makes one [21:28:16] (CR) Dzahn: [C: 2] "user already existed in private repo, so good to go" [puppet] - https://gerrit.wikimedia.org/r/185242 (owner: Mforns) [21:28:20] greg-g: I'm not convinced this has stopped in any way, btw [21:28:36] e.g. https://phabricator.wikimedia.org/T84853#941955 [21:28:40] 3 weeks ago [21:29:16] patch was proposed against ops/puppet; I downvoted it as an incorrect solution and gross hack; patch gets applies to the betalabs puppetmaster [21:29:23] I mean, I don't care [21:29:58] but if these things happen, we can't exactly be responsible for the state of beta [21:30:17] and have icinga alerts echoed here, as was recently proposed [21:31:03] paravoid: well, to make this less one-sided -- [21:31:11] Deployment-Systems, MediaWiki-Core-Team, operations, Release-Engineering: Update servers in scap rsync proxy pool - https://phabricator.wikimedia.org/T1342#980513 (greg) >>! In T1342#974944, @Reedy wrote: > https://gerrit.wikimedia.org/r/#/c/184817/ was merged yesterday, we good here?
[21:31:12] i've been pretty ferocious about beta kludges [21:31:24] but i think opsen would be more empathetic if they spent a week without root [21:31:34] maybe once a year, as a regular exercise. [21:32:07] (CR) Dzahn: "@mforns: i noticed there was a typo in your email address. your contact already existed in the private repo but the email address was "mfr" [puppet] - https://gerrit.wikimedia.org/r/185242 (owner: Mforns) [21:32:07] Or used puppet in labs themselves [21:32:29] Or ... *crazy talk* maintained beta [21:32:34] how is what I described a root problem? [21:32:59] both of those are reasonable suggestions that I think we should talk about in the context of the "harmonizing ops/beta puppet practices" (or whatever we're calling it) [21:33:18] "synergy" [21:33:19] if this was a good patch that we delayed merging, I'd understand it [21:33:25] when you don't have root, you run into all sorts of walls, and you end up with poorer optics into how production is set up, so it's not always your fault that you don't have the full picture about why things are the way that they are. [21:33:27] but this was a bad patch [21:33:47] this was applied on the beta puppetmaster *after* it was downvoted [21:33:50] that's true ori, but it's also the reason for an ops +1 need :) [21:33:57] (PS1) BBlack: Improve txstatsd-for-jessie patch 331bc33f [puppet] - https://gerrit.wikimedia.org/r/185255 [21:34:05] paravoid: ^ better? [21:34:21] at the end of the day, antoine needed hhvm installed on the CI slaves so we could runs tests that the entire org depends on [21:34:25] there is a vicious circle here: keep people at arm's length from production, then when they write changes that are blithely insensitive to how production is configured, say "see?
that's why we need to keep them at arm's length from production" [21:34:40] (CR) Faidon Liambotis: [C: 1] Improve txstatsd-for-jessie patch 331bc33f [puppet] - https://gerrit.wikimedia.org/r/185255 (owner: BBlack) [21:34:55] s/installed/upgraded/ [21:35:41] (CR) BBlack: [C: 2] Improve txstatsd-for-jessie patch 331bc33f [puppet] - https://gerrit.wikimedia.org/r/185255 (owner: BBlack) [21:35:49] (CR) Mforns: "@Dzahn Thanks!" [puppet] - https://gerrit.wikimedia.org/r/185242 (owner: Mforns) [21:36:35] I wouldn't mind if someone walked us through how beta is used really, the why and the what etc [21:36:40] I'm not sure what we're discussing [21:36:50] I have a rough idea based on activity I see but certainly not from non-ops perspective [21:36:51] I'm responding to the suggestion that beta should have its own prod-like cluster [21:36:59] I don't think this would change any of the above [21:37:15] I don't think beta's issues are Labs' fault, I think it's beta's fault [21:37:30] I'm also responding to the suggestion that beta alerts should be printed here and alerting ops [21:37:49] paravoid: yeah, topic ballooned [21:38:07] operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#980532 (yuvipanda) NEW [21:38:11] part of it was also about using ensure => latest, or not [21:38:12] chasemp: would love to chat about that next week if you want [21:38:24] I enjoy pie charts fwiw [21:38:25] it's hard for me to care about alerts when changes I've explicitly said are not okay are applied [21:38:28] I need to make quarterly review slides right now (this is a welcomed distraction, but not good for me long term...)
[21:38:30] ^ well sometimes it's because ops was imprudent in making a change without being basically-sensitive to beta, but also sometimes it's because beta is unpredictably-different and someone who knows beta better will have to fix [21:38:37] chasemp: what about pie with some charts? [21:38:41] and we have enough noise here already [21:38:45] paravoid: I understand [21:38:57] paravoid: so, let's try to figure out where each other are coming from [21:39:04] and work from there [21:39:48] operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#980545 (yuvipanda) I couldn't really summarize what @faidon and @ori said (about /var/lib vs /home), so would be great if either of them could comment. I personally do not care as long as i... [21:39:50] that's beautiful, greg-g [21:39:52] i cried a little [21:39:57] lol [21:40:29] ops are from mars, relengs are from venus [21:40:32] greg's inviting us to the parking lot for beers metaphorically [21:40:40] pretty sure his beta peeps are going to jump us tho [21:40:42] and take our wallets [21:40:47] metaphorically [21:40:49] chasemp: fwiw https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit [21:40:54] bitcoin wallets, and not metaphorically! [21:40:58] that's how to test a change on beta puppetmaster [21:41:22] was sent there yesterday, and worked [21:42:44] oh well, so much for trying to get people to understand why others make the decisions they make. Let's just assume laziness and stupidness and see how far that gets us.
[21:43:02] now we are talking, we'll evolve a two party system [21:43:07] and fight each other over token issues [21:43:15] greg-g: +1 [21:43:22] fwiw, I'm not assuming anything [21:43:23] operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#980584 (BBlack) In the coming world of systemd, it seems to be an implicit assumption for security that homedirs are for human users only, which could include cron/daemon -like things of tod... [21:43:25] except s/stupidness/stupidity/ [21:43:37] ori: see, I'm so stupid I fail at grammar :) [21:43:41] for me figuring our how beta is used would be really helpful [21:43:45] this https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#How_do_I_get_my_code_on_the_beta_cluster.3F [21:43:59] beta reviews was one of the first things I was assigned to work at the foundation and I think I have enough experience to have an opinion that is not based on assumptions [21:44:01] chasemp: definitely will clear that up for you in-person next week, pie or not [21:44:03] bblack: i imagine you saying that in a movie trailer voice... "in a world... where the init daemon is systemd... one man... " [21:44:06] is good but that's the what really, I don't know the why from anyone else's perspective I think that's just my ignorance [21:44:09] greg-g: yeah sounds good [21:44:25] and I'm certainly not implying laziness or stupidness [21:45:00] ori: :P [21:45:06] If I could parlay for greg, I think he is saying "hey I was never here before, so I'm here now let's do things better" [21:45:14] chasemp: in my case the why was because a prod change caused an issue in beta, so to see if it fixes it before merging in prod [21:48:18] chasemp got my point basically: let's reset the conversation and the decisions. without that we'll continue to point at past examples (even if just 2 weeks ago) and never get forward.
We all want better things for this, so now let's really figure out how to get there. [21:48:51] which, sounds like: A) separate cluster isn't intrinsicly better [21:49:06] B) other things we should talk about :) [21:49:09] let's just merge our teams, fixed. /me hides [21:49:15] well if you are a hoarder and you buy a new house, you just have more room to hoard [21:49:23] C) I'm tired and need to get back to making slides and updating quarterly goals wiki pages and all that fun manager stuff [21:49:24] that's my guess on why a separate cluster isn't a silver bullet [21:49:26] (PS1) BBlack: Support systemd for varnishkafka::instance [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185289 [21:49:35] chasemp: /me nods, touche [21:50:15] bblack: that could probably be a @-type systemd service btw [21:50:21] but... future iteration I guess :) [21:50:41] not important now [21:51:22] well I don't want to refactor everything right now anyways, I'm trying to minimize my risk here too :) [21:51:39] nod [21:52:14] did you build all packages already?
[21:52:25] you'd think provider-aware puppet Services would implicitly know to require the corresponding File for the initfile anyways [21:52:55] paravoid: mostly, they didn't really need any changes to work, esp since the upstart stuff was in puppet anyways [21:53:08] awesome [21:53:20] I rebuilt them with temp commits, I didn't merge those yet, but they're just version bumps for jessie to avoid ~jessie1 or whatever [21:54:05] (CR) BBlack: [C: 2] Support systemd for varnishkafka::instance [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185289 (owner: BBlack) [21:54:13] (CR) BBlack: [V: 2] Support systemd for varnishkafka::instance [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185289 (owner: BBlack) [21:54:32] ops-eqiad: please check db1051 and db1056 for 2.5" disk brackets - https://phabricator.wikimedia.org/T86788#980610 (RobH) Open>Resolved Thanks, updated rt ticket, resolving this task. [21:57:08] (PS1) BBlack: modules/varnishkafka => 5488770a [puppet] - https://gerrit.wikimedia.org/r/185317 [21:57:37] jgage, ori: now the redis queues are at 1.5M events each (4.5M total). [21:57:37] (CR) BBlack: [C: 2 V: 2] modules/varnishkafka => 5488770a [puppet] - https://gerrit.wikimedia.org/r/185317 (owner: BBlack) [21:58:10] submodules are so annoying [21:58:18] +1 [21:58:28] jgage, ori: So we either need to apply the udp2log throttle or ... something else ...
to turn the tide [21:58:29] yay I broke the submodule [21:58:32] incoming puppetspam :P [22:00:40] (PS1) BBlack: syntax bugfix for 5488770a [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185318 [22:00:53] (CR) BBlack: [C: 2 V: 2] syntax bugfix for 5488770a [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185318 (owner: BBlack) [22:01:01] git suckmodule [22:01:48] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [22:01:49] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: puppet fail [22:01:49] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: puppet fail [22:01:58] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: puppet fail [22:01:59] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: puppet fail [22:02:09] (PS1) BBlack: modules/varnishkafka => 034904d7 [puppet] - https://gerrit.wikimedia.org/r/185319 [22:02:19] (CR) BBlack: [C: 2 V: 2] modules/varnishkafka => 034904d7 [puppet] - https://gerrit.wikimedia.org/r/185319 (owner: BBlack) [22:02:28] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: puppet fail [22:02:39] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [22:02:49] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail [22:02:50] merged, should stop the spam, unless there's another error [22:03:18] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [22:03:59] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: puppet fail [22:03:59] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [22:03:59] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [22:03:59] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [22:04:08] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: puppet fail [22:04:19] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [22:04:39] PROBLEM - puppet last run on cp1049 is
CRITICAL: CRITICAL: puppet fail
[22:04:48] Who wants to be my hero and review/merge https://gerrit.wikimedia.org/r/#/c/185222/ ? 4.5M backed up events in the logstash redis queue at least in part because they are all dupes of events coming in via udp2log and logstash is processing both
[22:04:58] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail
[22:05:05] blargh, expect more spam I guess
[22:05:18] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[22:05:19] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: puppet fail
[22:05:29] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[22:05:42] paravoid: on jessie: Error: /Stage[main]/Txstatsd/Service[txstatsd]: Could not evaluate: Could not find init script for 'txstatsd'
[22:05:43] (PS1) Dzahn: sshd: don't use NIST key exchange protocols [puppet] - https://gerrit.wikimedia.org/r/185321
[22:05:49] PROBLEM - puppet last run on cp1039 is CRITICAL: CRITICAL: puppet fail
[22:05:50] ^ this may be from removing provider => systemd
[22:06:37] maybe the spam did get stopped, and those were just the last ones still invoking puppet at the time
[22:06:59] yeah I think so, one checked out ok manually
[22:07:08] (PS2) Dzahn: sshd: don't use NIST key exchange protocols [puppet] - https://gerrit.wikimedia.org/r/185321
[22:07:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[22:08:10] I'm gonna float another test through prod gerrit, easiest way to check (for the provider thing)
[22:08:44] !log restarted elasticsearch on logstash1003; died from OOM
[22:08:50] Logged the message, Master
[22:10:38] (PS1) BBlack: See if provider attribute fixes for systemd [puppet] - https://gerrit.wikimedia.org/r/185322
[22:11:48] (CR) BBlack: [C: 2] See if provider attribute fixes for systemd [puppet] - https://gerrit.wikimedia.org/r/185322 (owner: BBlack)
[22:12:38] ops-core,
Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#980647 (RobH) p:High>Triage a:RobH>None
[22:13:06] yeah that fixes it (whether or not it's the best fix for it, I'm not sure)
[22:15:20] (PS1) Dzahn: sshd: use Chacha20-poly1305,AES-CGM ciphers [puppet] - https://gerrit.wikimedia.org/r/185325
[22:15:25] (CR) Faidon Liambotis: [C: -2] "I haven't tested this but I'm suspecting that merging this will lock us out of half of the fleet, so -2." [puppet] - https://gerrit.wikimedia.org/r/185321 (owner: Dzahn)
[22:15:51] (PS1) BBlack: Add provider to varnishkafka::instance Service [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185326
[22:16:00] bblack: so, bad advice?
[22:16:21] only half-bad. upstart doesn't mind the lack of provider at all, just jessie puppet+systemd does.
[22:17:01] anyways it's still cleaner this way with "provider => $::initsystem" :)
[22:17:11] (PS2) Dzahn: sshd: use Chacha20-poly1305,AES-CGM ciphers [puppet] - https://gerrit.wikimedia.org/r/185325
[22:17:26] mutante: be really really careful
[22:17:35] this article is using latest-and-greatest openssh
[22:17:52] I would be very surprised if e.g.
precise supports Chacha20
[22:18:38] also, we still have those 3x I pushed changes that touched two of them today and filed a bug to decom the third one
[22:19:13] ;)
[22:19:28] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[22:19:45] (PS2) BBlack: Add provider to varnishkafka::instance Service [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185326
[22:19:48] wheee to getting rid of lucid
[22:19:49] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[22:19:54] paravoid: ok, i was totally not going to merge that, but wanted to have the gerrit comments:) thankx
[22:20:05] (CR) BBlack: [C: 2 V: 2] Add provider to varnishkafka::instance Service [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/185326 (owner: BBlack)
[22:20:09] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[22:20:18] mutante: I didn't think you would, but better safe than sorry
[22:20:25] sure, yep
[22:20:41] grr puppet
[22:20:46] defaultfor :osfamily => [:archlinux]
[22:20:47] defaultfor :osfamily => :redhat, :operatingsystemmajrelease => "7"
[22:20:48] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[22:21:07] (PS1) BBlack: modules/varnishkafka => a2c085d6 [puppet] - https://gerrit.wikimedia.org/r/185328
[22:21:09] (CR) Matanya: [C: 1] sshd: use Chacha20-poly1305,AES-CGM ciphers [puppet] - https://gerrit.wikimedia.org/r/185325 (owner: Dzahn)
[22:21:17] one page said: Ciphers [email protected],[email protected],[email protected],aes256-ctr because it expected the ciphers to be mail addresses :p
[22:21:21] (CR) BBlack: [C: 2 V: 2] modules/varnishkafka => a2c085d6 [puppet] - https://gerrit.wikimedia.org/r/185328 (owner: BBlack)
[22:21:39] RECOVERY - puppet last run on cp3022
is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[22:21:41] stupid puppet
[22:21:49] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[22:22:08] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[22:22:19] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[22:22:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:22:48] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[22:22:59] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[22:22:59] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[22:22:59] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[22:23:09] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[22:23:19] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[22:23:38] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[22:23:39] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[22:24:18] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:24:18] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:24:19] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[22:24:57] ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#980672 (RobH) NEW a:akosiaris
[22:25:12] ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#980684 (RobH)
[22:26:10] ops-core: backup old blog server/holmium with bacula - server will be wiped post backup - https://phabricator.wikimedia.org/T86975#980672 (RobH) Please note if there is a backup length/duration limitation, so I can pass it along to the communication folks (just seems like something they should be aware of.)
[22:29:54] (PS1) Dzahn: sshd: set Message Authentication Code ciphers [puppet] - https://gerrit.wikimedia.org/r/185329
[22:32:11] operations, Wikimedia-Site-requests: Rename zh-min-nan -> nan - https://phabricator.wikimedia.org/T30442#980726 (jayvdb)
[22:33:12] bblack: so it's funny
[22:33:27] my initsystem fact is better than what puppetlabs is doing
[22:33:41] they pick the provider based on $::operatingsystem
[22:34:10] and they do the wrong thing for Debian of course
[22:35:15] Wikimedia-Interwiki-links, operations: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915#980740 (jayvdb) Some analysis of nan at T86915
[22:36:27] https://tickets.puppetlabs.com/browse/PUP-2023
[22:36:29] bleh
[22:36:47] (CR) Dzahn: [C: 2] Add mforns to analytics shinken alerts [puppet] - https://gerrit.wikimedia.org/r/185245 (owner: Mforns)
[22:36:52] current master is gonna be crappy for ubuntu 15.04 too, which is systemd
[22:37:13] puppet-merge issue?
[22:37:19] jouncebot: next
[22:37:19] In 0 hour(s) and 22 minute(s): Wikimania Scholarships app (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150115T2300)
[22:37:41] bblack: error: Unable to append to .git/logs/refs/remotes/origin/production: Permission denied
[22:41:09] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:41:18] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[22:52:39] how do you merge tickets in phab?
[23:09:24] !log Updated scholarships.wikimedia.org to d598e0d
[23:12:07] (PS1) Ori.livneh: Don't omit file-scope from xenon stacks [mediawiki-config] - https://gerrit.wikimedia.org/r/185333
[23:12:10] ^ TimStarling
[23:13:15] why are closures omitted?
[23:14:54] TimStarling: includes actually contain the file name in the 'function' key, so they can be aggregated. '{closure}' frames do not, though I think they have file / line keys that could be added to the name to differentiate them. It's just a bit of work I haven't done yet.
[23:17:19] why do you have to differentiate them?
[23:18:43] I suppose you could infer the identity of the closure by the frames above and below it
[23:26:31] greg-g: scholarships deploy went well. App works and will go live automatically 2015-01-19T00:00Z
[23:26:59] (PS2) Ori.livneh: Don't omit file-scope and closures from xenon stacks [mediawiki-config] - https://gerrit.wikimedia.org/r/185333
[23:27:26] (CR) Ori.livneh: [C: 2] Don't omit file-scope and closures from xenon stacks [mediawiki-config] - https://gerrit.wikimedia.org/r/185333 (owner: Ori.livneh)
[23:27:30] (Merged) jenkins-bot: Don't omit file-scope and closures from xenon stacks [mediawiki-config] - https://gerrit.wikimedia.org/r/185333 (owner: Ori.livneh)
[23:27:30] bd808: good deal
[23:29:29] morebots is out?
[23:29:42] !log morebots is dead
[23:30:17] gj ori
[23:30:18] ;)
[23:30:33] * greg-g has to run, won't be here for swat
[23:33:25] jouncebot: next
[23:33:25] In 0 hour(s) and 26 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150116T0000)
[23:43:49] (PS1) Dzahn: WIP: add port forwarding to ferm [puppet] - https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713)
[23:48:34] (PS2) Dzahn: WIP: add port forwarding to ferm [puppet] - https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713)
[23:55:09] (PS3) Dzahn: WIP: add port forwarding to ferm [puppet] - https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713)
[23:55:38] (PS4) Dzahn: WIP: add port forwarding to ferm [puppet] - https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713)