[00:00:20] subbu: cool. let me know if you need anything
[00:00:35] i think the reason it didn't work for me in the office the other day is likely because it was taking some time to restart + wifi connectivity was pretty bad which i interpreted as "dsh is not working for me".
[00:00:43] ok, thanks.
[00:00:44] subbu: remember to !log
[00:00:49] yes, will do.
[00:00:59] (also, it was nice seeing you in SF.)
[00:01:44] indeed, me as well .. it is always good to join in in 'meatspace' discussions.
[00:02:09] yeah
[00:13:01] !log restarted parsoid service on the parsoid cluster to free up leaked memory on several processes (seems to have happened in the 21:30 - 22:30 UTC on 31st Jan time frame)
[00:13:08] Logged the message, Master
[00:17:39] YuviPanda, I'll hang around for 15 odd mins and head out for dinner .. and will check ganglia once more once I am back.
[00:17:50] subbu: cool
[00:18:29] ori, YuviPanda but, my initial hunch is that we hit another pathological parsing scenario y'day .. given the more or less identical spike seen on all nodes within a 60 min or less timeframe.
[00:19:10] subbu: have you checked logstash / logs to see if that is the case?
[00:19:31] subbu: also the previous time this happened we put a temp. workaround in place. is that still in place or have we fixed the underlying issues?
[00:19:32] YuviPanda, i'll have to dig through the logs to find out more.
[00:19:44] no, i fixed it and it went out next deploy.
[00:19:57] subbu: yeah, it can wait till after dinner / tomorrow, I guess.
[00:20:05] " We should investigate the problem on that title and make our list handling more robust -- now fixed with gerrit-183158 (tracked at T85744)" [00:20:20] from https://wikitech.wikimedia.org/wiki/Incident_documentation/20150103-Parsoid [00:20:29] ah, nice [00:28:40] * subbu heads out [00:38:05] 3Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007212 (10yuvipanda) @Joe I think bash is a terrible language when you aren't just executing a sequence of commands (with or without piping). My personal guideline is to rewrite to python w... [00:47:32] 3operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1007214 (10Chmarkine) It seems dumps is now using the default cipher suite of nginx and SSL 3.0 is still enabled. https://www.ssllabs.com/ssltest/analyze.html?d=dumps.wikimedia.org... [00:54:52] RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself [00:57:50] (03PS1) 10Yuvipanda: dumps: Strengthen ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/188015 (https://phabricator.wikimedia.org/T74072) [00:59:18] (03PS2) 10Yuvipanda: dumps: Strengthen ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/188015 (https://phabricator.wikimedia.org/T74072) [01:13:02] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:22] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:42] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:42] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:42] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:43] PROBLEM - puppet last run on mw1056 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:51] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: 
Puppet has 1 failures [01:13:51] PROBLEM - puppet last run on mw1079 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:52] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:11] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:22] PROBLEM - puppet last run on mw1057 is CRITICAL: CRITICAL: Puppet has 1 failures [01:29:12] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:29:21] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [01:29:21] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [01:29:22] RECOVERY - puppet last run on mw1079 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:29:42] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:29:42] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:29:52] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [01:30:02] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:30:21] RECOVERY - puppet last run on mw1056 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [01:30:22] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [01:31:31] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [02:01:45] 3Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1007226 (10Springle) My fault. I left pt-table-checksum running between sanitarium and labsdb. 
Normally a painless process except when I stupidly have it monitor the wrong db1069 instance for replag.
[02:04:53] hey springle
[02:05:06] springle: unrelated to labsdb, but mysql on wikitech has been dying almost every day now.
[02:05:16] springle: think you’ll have time to take a look?
[02:07:05] Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1007228 (Springle) @yuvipanda, godog and I have previously theorized about pushing replag (and all other stats) from the tendril log tables to graphite. Whatever the solution, we should probably also solve T87209 by switching...
[02:07:41] YuviPanda: can do
[02:07:46] springle: thanks!
[02:07:51] springle: let me open up a ticket
[02:08:13] heh still /a
[02:08:46] operations, Labs: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1007230 (yuvipanda) NEW
[02:08:55] springle: virt1000 is a nasty box.
[02:09:02] needs to die at some point in the not too far away future
[02:09:55] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 04s)
[02:10:02] Logged the message, Master
[02:11:02] !log LocalisationUpdate completed (1.25wmf14) at 2015-02-02 02:09:59+00:00
[02:11:05] Logged the message, Master
[02:12:18] virt100 has been OOM killing mysqld
[02:12:24] virt1000*
[02:13:12] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[02:13:17] yeah, that’s what I was suspecting too
[02:13:23] it’s been having OOM problems for a while now
[02:13:42] YuviPanda: do you know anything about a 'novaold' db on virt1000? has it been touched, or deleted manually, recently
[02:13:57] springle: hmm, nope. andrewbogott_afk would know
[02:14:13] but it sounds like a backup of nova db during OpenStack migrations
[02:14:20] springle: is that causing problems?
[02:15:54] not as such, but it's throwing errors that make it look like something has pulled *.ibd files out from under mysqld
[02:17:58] hmm, or upgraded incorrectly perhaps.
it's 5.5.40, which is recent [02:19:12] springle: all of these are very much possible, yeah [02:19:17] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s) [02:19:23] I haven’t touched that machine at all, so will have to wait for andrewbogott_afk [02:19:23] Logged the message, Master [02:20:24] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-02 02:19:21+00:00 [02:20:28] Logged the message, Master [02:32:52] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [02:32:59] (03PS1) 10Yuvipanda: puppetmaster: Send cherry-picked commit count stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/188017 (https://phabricator.wikimedia.org/T87616) [02:33:10] (03PS2) 10Yuvipanda: puppetmaster: Send cherry-picked commit count stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/188017 (https://phabricator.wikimedia.org/T87616) [02:33:44] (03PS3) 10Yuvipanda: puppetmaster: Send cherry-picked commit count stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/188017 (https://phabricator.wikimedia.org/T87616) [02:36:44] (03PS1) 10Yuvipanda: puppetmaster: Try to rebase from upstream even if it hasn't changed [puppet] - 10https://gerrit.wikimedia.org/r/188018 [02:41:43] (03PS4) 10Yuvipanda: puppetmaster: Send cherry-picked commit count stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/188017 (https://phabricator.wikimedia.org/T87616) [02:42:11] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Send cherry-picked commit count stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/188017 (https://phabricator.wikimedia.org/T87616) (owner: 10Yuvipanda) [02:43:38] (03PS2) 10Yuvipanda: puppetmaster: Try to rebase from upstream even if it hasn't changed [puppet] - 10https://gerrit.wikimedia.org/r/188018 [02:44:49] !log virt1000 mysqld restart, shrink buffer pool [02:44:55] Logged the message, Master [02:45:21] virt1000 is way over extended 
[02:45:42] yeah
[02:45:45] moving wikitech off might help
[02:45:47] Why is the DB even local?
[02:45:51] the OOM events are caused by java ballooning, making kernel choose mysqld for sacrifice
[02:45:54] wait
[02:45:56] java?
[02:46:17] hoo: it’s used by openstack too, but yeah, no need for the db to be local...
[02:46:39] oh
[02:46:40] opendj
[02:46:44] should move to openldap
[02:47:35] java, apache, and mysqld have all triggered OOM according to dmesg
[02:47:40] at different times
[02:48:13] yeah, need to disentangle the many things that box does
[02:48:23] salt master, puppetmaster, opendj, wikitech
[02:48:27] and nova controller
[02:51:46] Beta-Cluster, operations: Set up an alert for unmerged changes in deployment-prep - https://phabricator.wikimedia.org/T87616#1007244 (yuvipanda)
[02:51:47] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007242 (yuvipanda) Open>declined On second thoughts, a lot of the complexity in that bash script can go away, so I'm just going to do that instead. Still think complicated bash scr...
[02:52:59] If the DB isn't overly large, or has too much load it should be rather easy to move it to a m* or so
[02:55:13] technically true. idk about network setup though; virt1000 can't access some stuff by design
[02:55:24] (PS1) Yuvipanda: puppetmaster: Throw away all local uncommited changes [puppet] - https://gerrit.wikimedia.org/r/188019
[02:55:32] hoo: springle oh yeah, that. virt1000 is firewalled off.
[02:55:33] as well
[02:55:42] we can open up that hole, though.
[02:56:23] really it isn't the DB that is the issue either imo. the memcached process uses more resident mem, and java way more virt. mysqld is just the most visible impact when oom chooses it
[02:56:48] wow... it has its own memcached...
[02:56:49] memcached could be dialed back actually.
2G limit, 700M resident [02:57:00] 3Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007246 (10yuvipanda) (simplification done in https://gerrit.wikimedia.org/r/188019) and https://gerrit.wikimedia.org/r/#/c/188018/ [02:57:34] and opendj should be moved to openldap and on its own machine... [02:59:39] (03PS3) 10Yuvipanda: puppetmaster: Try to rebase from upstream even if it hasn't changed [puppet] - 10https://gerrit.wikimedia.org/r/188018 [03:00:12] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Try to rebase from upstream even if it hasn't changed [puppet] - 10https://gerrit.wikimedia.org/r/188018 (owner: 10Yuvipanda) [03:00:21] (03PS2) 10Yuvipanda: puppetmaster: Throw away all local uncommited changes [puppet] - 10https://gerrit.wikimedia.org/r/188019 [03:03:54] (03CR) 10Yuvipanda: [C: 032] "Checked deployment-prep and integration project puppetmasters for local uncommited changes, none found." [puppet] - 10https://gerrit.wikimedia.org/r/188019 (owner: 10Yuvipanda) [03:09:15] (03CR) 10Yuvipanda: "un-cherry-picking from beta." [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [03:09:48] (03CR) 10Yuvipanda: "un-cherry-picking from beta." [puppet] - 10https://gerrit.wikimedia.org/r/179759 (owner: 10BryanDavis) [03:12:09] 3operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1007249 (10Chmarkine) I scanned all the public IP addresses in dns/templates/wikimedia.org, and found four SSL 3.0 enabled servers: ``` dataset1001 208.80.154.11 ms1001 208.80.15... [03:21:10] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1007265 (10Chmarkine) [03:36:15] 3operations, Labs: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1007295 (10Springle) virt1000 is way over extended. 
The kernel OOM killer is choosing mysqld, at different times triggered by memory spikes from java, apache2, and mysqld itself. I reduced -- the unpuppetized -- InnoDB... [03:44:02] 3operations, Labs: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1007303 (10Springle) Also, not causing outages but indicative of something funny happening recently; heaps of 'novaold' errors: ``` 150202 2:44:02 [ERROR] Cannot find or open table novaold/volumes from the internal data... [03:45:26] ori: hi [03:45:38] ori: what exactly was b17ec514abfb17766691770ded34dd9293b2b6bf (local commit in deployment-salt from you) trying to do? [03:47:04] (03PS1) 10Ori.livneh: vbench: report time on-CPU, not total time profiler was on [puppet] - 10https://gerrit.wikimedia.org/r/188023 [03:47:14] YuviPanda: i'll take a look if you review :P [03:47:24] such meatpuppetry [03:47:32] meatpuppetry? [03:47:37] * ori doesn't get it [03:47:47] ori: oh, enwiki term, I guess [03:47:55] like sockpuppet, but with real people [03:48:09] oh [03:48:13] (03PS2) 10Yuvipanda: vbench: report time on-CPU, not total time profiler was on [puppet] - 10https://gerrit.wikimedia.org/r/188023 (owner: 10Ori.livneh) [03:48:53] YuviPanda: i'm not sure; we can kill it [03:49:14] (03CR) 10Yuvipanda: [C: 032] vbench: report time on-CPU, not total time profiler was on [puppet] - 10https://gerrit.wikimedia.org/r/188023 (owner: 10Ori.livneh) [03:49:22] thanks [03:49:46] ori: cool. will we know if something barfs? [03:50:20] let me know when you delete it and i'll watch a puppet run on deployment-eventlogging [03:50:38] ori: deleted [03:50:49] thanks [03:54:36] ori: anything borked? [03:54:53] ori: the eventlogging puppet code could use a fixup, in general :) [03:58:28] sorry running now [03:58:33] what about it? [03:58:34] file bugs! [03:59:04] ori: will do! it works, but seems a bit… messy? 
Definitely not ori-standards, since it was written so long ago :) [03:59:13] $log_dir in particular, for example [03:59:22] it’s set in the eventlogging class, but actually used primarily only in the role [03:59:41] I shall file bugs once I’m more sure what bugs are there to be filed [04:00:51] YuviPanda: it was literally the first puppet i have ever written [04:04:02] ori: yeah, I’m aware :) [04:08:06] ori: I just hold you to higher standards now. [04:08:16] am off to eat breakfast, will keep an eye on IRC tho [04:08:17] brb [04:09:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Feb 2 04:08:14 UTC 2015 (duration 8m 13s) [04:09:23] Logged the message, Master [05:22:33] ori: I assume beta’s EL is ok? [05:22:40] YuviPanda: yes [05:22:46] \o/ cool. [05:22:47] thanks [05:22:53] local hacks down to two! [05:22:58] * YuviPanda goes to fix one more [05:40:36] (03PS4) 10Yuvipanda: deployment: Unify salt_masters role for prod / labs [puppet] - 10https://gerrit.wikimedia.org/r/185137 (https://phabricator.wikimedia.org/T86885) [05:41:58] (03CR) 10Yuvipanda: [C: 032] "Fixed for deployment-prep, let's see if this breaks anything else." [puppet] - 10https://gerrit.wikimedia.org/r/185137 (https://phabricator.wikimedia.org/T86885) (owner: 10Yuvipanda) [05:49:01] hello [05:49:04] what's this room about? 
[05:52:42] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail [05:59:30] 3Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#1007440 (10yuvipanda) [05:59:31] 3Beta-Cluster: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#1007439 (10yuvipanda) 5Open>3Resolved [06:12:23] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:28:11] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:21] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:42] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:42] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:52] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:51] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:32] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:42] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:45:51] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:47:22] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:13:49] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - 
https://phabricator.wikimedia.org/T655#1007566 (10Qgil) a:5LuisV_WMF>3Qgil [08:14:52] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:15:00] sigh [08:15:01] not this again [08:15:03] *again [08:15:32] that machine is OOMing to death [08:15:42] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 4.179 second response time [08:16:21] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 61 failures [08:16:29] andrewbogott_afk: opendj seems to be the biggest culprit [08:19:35] !log restarted opendj, pdns, apache, mysql on virt1000 [08:20:32] (03CR) 10Florianschmidtwelzow: "> Think this is in wrong place." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187815 (https://phabricator.wikimedia.org/T85815) (owner: 10Phuedx) [08:25:13] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:21] PROBLEM - configured eth on virt1000 is CRITICAL: NRPE: Unable to read output [08:25:59] !log restarted apache2 on virt1000 [08:26:13] RECOVERY - configured eth on virt1000 is OK: NRPE: Unable to read output [08:27:02] !log started mysql on virt1000 [08:27:11] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.026 second response time [08:28:07] !log hi [08:28:12] sigh, morebots [08:39:22] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [08:53:32] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:28:18] 3Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#1007858 (10hashar) [09:48:43] (03CR) 10Hashar: contint: install Java 8 on Trusty servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183222 (https://phabricator.wikimedia.org/T85964) 
(owner: 10Hashar) [09:49:16] (03PS3) 10Hashar: contint: install Java 8 on Trusty servers [puppet] - 10https://gerrit.wikimedia.org/r/183222 (https://phabricator.wikimedia.org/T85964) [09:49:18] (03PS1) 10Hashar: contint: migrate to require_package() [puppet] - 10https://gerrit.wikimedia.org/r/188034 [09:49:52] (03CR) 10Hashar: "The patch switching to require_package() is https://gerrit.wikimedia.org/r/#/c/188034/" [puppet] - 10https://gerrit.wikimedia.org/r/183222 (https://phabricator.wikimedia.org/T85964) (owner: 10Hashar) [09:50:15] (03PS2) 10Filippo Giunchedi: fix check_elasticsearch CRITICAL output [puppet] - 10https://gerrit.wikimedia.org/r/186418 [09:50:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fix check_elasticsearch CRITICAL output [puppet] - 10https://gerrit.wikimedia.org/r/186418 (owner: 10Filippo Giunchedi) [09:50:57] (03PS2) 10Filippo Giunchedi: graphite: explicit install python-twisted-core [puppet] - 10https://gerrit.wikimedia.org/r/187683 (https://phabricator.wikimedia.org/T85909) [09:51:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: explicit install python-twisted-core [puppet] - 10https://gerrit.wikimedia.org/r/187683 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [09:54:12] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [10:10:12] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 2 failures [10:15:11] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 2 failures [10:18:02] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187914 (https://phabricator.wikimedia.org/T87104) (owner: 10Glaisher) [10:19:23] (03CR) 10Hashar: [C: 032] Add www.doria.fi to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187914 (https://phabricator.wikimedia.org/T87104) (owner: 10Glaisher) [10:19:28] (03Merged) 10jenkins-bot: Add www.doria.fi to $wgCopyUploadsDomains [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/187914 (https://phabricator.wikimedia.org/T87104) (owner: 10Glaisher) [10:20:12] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 2 failures [10:20:30] !log hashar Synchronized wmf-config/InitialiseSettings.php: Add www.doria.fi to $wgCopyUploadsDomains {{bug|T87104}} https://gerrit.wikimedia.org/r/#/c/187914/ (duration: 00m 07s) [10:20:55] (03CR) 10Hashar: "Deployed on production:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187914 (https://phabricator.wikimedia.org/T87104) (owner: 10Glaisher) [10:21:53] where is morebots [10:21:55] seriously [10:23:39] !log hey [10:24:49] LoginError: (, {u'result': u'WrongPass'}) [10:24:49] :( [10:25:04] grr [10:25:12] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 2 failures [10:25:16] hashar: try restarting it again? [10:25:25] yuvipanda: wikitech auth is dead [10:25:34] I can't login either via the web interface [10:26:07] hashar: I just restarted keystone [10:26:20] didn't !log because no point yet [10:26:22] better :] [10:26:53] hashar: yeah :) [10:26:57] !log foo [10:27:27] it isn't back [10:27:30] yet [10:27:38] !log foo [10:27:56] !log foo [10:28:01] Logged the message, Master [10:28:06] \O/ [10:28:16] !log Synchronized wmf-config/InitialiseSettings.php: Add www.doria.fi to $wgCopyUploadsDomains {{bug|T87104}} https://gerrit.wikimedia.org/r/#/c/187914/ (duration: 00m 07s) [10:28:25] morebots is a mess [10:28:25] I am a logbot running on tools-exec-11. [10:28:25] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [10:28:25] To log a message, type !log . [10:28:37] !log Synchronized wmf-config/InitialiseSettings.php: Add www.doria.fi to $wgCopyUploadsDomains {{bug|T87104}} https://gerrit.wikimedia.org/r/#/c/187914/ (duration: 00m 07s) [10:28:37] !log been restarting pdns, opendj, apache, mysql, keystone left and right on virt1000 all day. 
[10:28:45] Logged the message, Master [10:29:08] TwitterError: [{u'message': u'Status is over 140 characters.', u'code': 186}] [10:29:10] seriously [10:29:24] !log Synchronized wmf-config/InitialiseSettings.php: Add www.doria.fi to $wgCopyUploadsDomains {{bug|T87104}} (duration: 00m 07s) [10:29:37] !log logme please [10:29:42] Logged the message, Master [10:30:00] haha, morebots [10:30:11] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 2 failures [10:31:30] yuvipanda: sigh, what's there that explodes? [10:31:40] godog: virt1000? [10:31:41] EVERYTHING [10:31:46] godog: basically there's too much going on in that box [10:31:50] godog: and it OOMs all the time [10:31:59] godog: and kills one of these things, and everything explodes [10:32:09] joy [10:32:17] killing mysql/apache kills puppetmaster and wikitech. [10:32:28] keystone kills logging in to wikitech [10:32:34] time to split it on more boxes [10:32:35] opendj / pdns kills DNS [10:32:37] yeah [10:32:48] I'm going on Vacation from end of this week though :( [10:34:59] (03Abandoned) 10Hashar: import LogFormat s from apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/162541 (owner: 10Jeremyb) [10:35:12] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 96 seconds ago with 0 failures [10:36:33] yuvipanda: found some bit rotting patch from you https://gerrit.wikimedia.org/r/#/c/176909/2 :D [10:36:40] atop log retention tweaking [10:36:50] heh [10:37:13] (03Abandoned) 10Yuvipanda: base: Make atop log retention configurable [puppet] - 10https://gerrit.wikimedia.org/r/176909 (https://phabricator.wikimedia.org/T71605) (owner: 10Yuvipanda) [10:42:47] (03PS2) 10Hashar: Add Dan Duvall to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/183062 (https://phabricator.wikimedia.org/T85936) (owner: 10John F. 
Lewis) [10:43:25] !log hoo Synchronized php-1.25wmf15/extensions/Wikidata/: Update Wikibase: Fixes for UsageTracking and the anon edit warning (duration: 00m 12s) [10:43:28] Logged the message, Master [10:43:33] (03CR) 10Hashar: [C: 031] "Rebased, the patch now just include dduvall since others have been added in different patches." [puppet] - 10https://gerrit.wikimedia.org/r/183062 (https://phabricator.wikimedia.org/T85936) (owner: 10John F. Lewis) [10:44:13] Works, yay [10:44:35] !log hoo Synchronized php-1.25wmf14/extensions/Wikidata/: Update Wikibase: Fixes for UsageTracking and the anon edit warning (duration: 00m 14s) [10:44:37] Logged the message, Master [10:45:01] (03PS2) 10Hoo man: Exempt Item and Property namespaces from ConfirmEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186828 (https://phabricator.wikimedia.org/T86453) [10:46:14] (03CR) 10Hoo man: [C: 032] Exempt Item and Property namespaces from ConfirmEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186828 (https://phabricator.wikimedia.org/T86453) (owner: 10Hoo man) [10:47:06] 3ops-eqiad, operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#1007964 (10fgiunchedi) moving to ops-eqiad [10:47:27] 3ops-eqiad, operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#1007966 (10fgiunchedi) a:5fgiunchedi>3RobH [10:49:28] * hoo looks suspiciously at jenkins... hung? [10:52:09] godog: hey! think you'll have some spare cycles at the end of this month to try set up a swift cluster on beta? [10:52:11] running on VMs [10:53:59] (03CR) 10Hoo man: [V: 032] "Passed the checks according to Zuul, but for some reason that doesn't make it into gerrit..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/186828 (https://phabricator.wikimedia.org/T86453) (owner: 10Hoo man) [10:54:48] !log hoo Synchronized wmf-config/Wikibase.php: Exempt Item and Property namespaces from ConfirmEdit (duration: 00m 07s) [10:54:54] Logged the message, Master [10:55:46] yuvipanda: not 100% sure but it could be, this would be T64835 I take it? [10:59:34] godog: yeah [11:18:29] (03CR) 10Hashar: [C: 031] beta: allow defining the web user. [puppet] - 10https://gerrit.wikimedia.org/r/187688 (owner: 10Giuseppe Lavagetto) [11:18:54] (03CR) 10Hashar: [C: 031] maintenance: allow choosing the web user [puppet] - 10https://gerrit.wikimedia.org/r/187687 (owner: 10Giuseppe Lavagetto) [12:16:19] (03PS1) 10Yuvipanda: beta: Add check for cherry-picked commits [puppet] - 10https://gerrit.wikimedia.org/r/188037 (https://phabricator.wikimedia.org/T87616) [12:16:53] (03PS2) 10Yuvipanda: beta: Add check for cherry-picked commits [puppet] - 10https://gerrit.wikimedia.org/r/188037 (https://phabricator.wikimedia.org/T87616) [12:17:18] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Add check for cherry-picked commits [puppet] - 10https://gerrit.wikimedia.org/r/188037 (https://phabricator.wikimedia.org/T87616) (owner: 10Yuvipanda) [12:18:34] (03PS1) 10Yuvipanda: beta: Followup for I9a288ccf6dadaff1ecfe3f2e931f43dae79cdd6c [puppet] - 10https://gerrit.wikimedia.org/r/188038 [12:18:40] (03CR) 10jenkins-bot: [V: 04-1] beta: Followup for I9a288ccf6dadaff1ecfe3f2e931f43dae79cdd6c [puppet] - 10https://gerrit.wikimedia.org/r/188038 (owner: 10Yuvipanda) [12:18:51] (03PS2) 10Yuvipanda: beta: Followup for I9a288ccf6dadaff1ecfe3f2e931f43dae79cdd6c [puppet] - 10https://gerrit.wikimedia.org/r/188038 [12:19:05] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Followup for I9a288ccf6dadaff1ecfe3f2e931f43dae79cdd6c [puppet] - 10https://gerrit.wikimedia.org/r/188038 (owner: 10Yuvipanda) [12:24:52] 3Beta-Cluster, operations: Minimize differences between beta and production (Tracking) - 
https://phabricator.wikimedia.org/T87220#1008115 (10yuvipanda) [12:24:53] 3Beta-Cluster, operations: Set up an alert for unmerged changes in deployment-prep - https://phabricator.wikimedia.org/T87616#1008111 (10yuvipanda) 5Open>3Resolved YESSSS http://shinken.wmflabs.org/service/deployment-salt/Long%20lived%20cherry-picks%20on%20puppetmaster alerts if there has been more than 1 c... [12:33:21] 3Beta-Cluster, operations: Move scap puppet code into a module - https://phabricator.wikimedia.org/T87221#1008119 (10yuvipanda) It is in a module now (yay!), but needs some more tweaks before it can replace beta::scap [12:58:39] hey mark__ you around ? [12:58:51] can you take a look at https://gerrit.wikimedia.org/r/#/c/186938/ ? [12:59:00] should be almost good to go [13:34:39] (03PS3) 10Yuvipanda: Made the deployment-mx talk back to deployment wiki on receiving a bounce [puppet] - 10https://gerrit.wikimedia.org/r/186938 (owner: 1001tonythomas) [13:36:17] (03CR) 10Yuvipanda: [C: 032] Made the deployment-mx talk back to deployment wiki on receiving a bounce [puppet] - 10https://gerrit.wikimedia.org/r/186938 (owner: 1001tonythomas) [14:13:19] 3ops-eqiad, operations: Rack and setup graphite1001 - https://phabricator.wikimedia.org/T86939#1008213 (10fgiunchedi) 5Open>3Resolved machine up and running, pending T85909 for migration [14:33:35] (03Draft1) 10Filippo Giunchedi: graphite: move to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/188035 (https://phabricator.wikimedia.org/T85909) [14:34:11] <_joe_> \o/ [14:34:14] (03CR) 10Filippo Giunchedi: [C: 04-1] "pending reviews and merge" [dns] - 10https://gerrit.wikimedia.org/r/188035 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [14:34:17] (03Draft1) 10Filippo Giunchedi: graphite: move to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188036 (https://phabricator.wikimedia.org/T85909) [14:34:23] (03CR) 10Filippo Giunchedi: [C: 04-1] "pending reviews and merge" [puppet] - 
10https://gerrit.wikimedia.org/r/188036 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [14:34:38] <_joe_> +2 merged [14:34:55] :) [14:35:33] (03PS2) 10Filippo Giunchedi: graphite: format /var/lib/carbon [puppet] - 10https://gerrit.wikimedia.org/r/187690 (https://phabricator.wikimedia.org/T85909) [14:35:53] hehe [14:43:29] 3operations, Wikimedia-OTRS: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1008248 (10Jgreen) [14:58:00] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1008256 (10fgiunchedi) changes 188035 and 188036 should be enough to change traffic over from tungsten to graphite1001, the plan is to merge those and wait for dns and puppet to propagate and traffic to move to graphite1001... [15:00:27] (03PS3) 10Giuseppe Lavagetto: base::resolving: get rid of the global domain_search variable [puppet] - 10https://gerrit.wikimedia.org/r/185912 [15:02:23] (03CR) 10Anomie: "> un-cherry-picking from beta." [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [15:04:20] (03CR) 10Yuvipanda: "Basically, https://phabricator.wikimedia.org/T87616. Patches should be tested on beta-labs before being applied on prod, and not just be l" [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [15:10:16] (03CR) 10Anomie: "So it's no longer recommended to test puppet changes in Beta Labs for more than a few hours? That'll make it hard to test things like "doe" [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [15:13:51] (03CR) 10Yuvipanda: "Good point. 6h is just a random number I pulled out of nowhere :) Perhaps 1day is better?" 
[puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [15:14:50] <_joe_> yuvipanda: make it a week :) [15:14:55] <_joe_> it's a reasonable time [15:15:05] _joe_: well, if I make it a week I want to track *individual* commits :) [15:15:08] rather than total ones [15:15:19] to make sure that no *single* commit is there for more than a week [15:15:35] <_joe_> yuvipanda: yeah [15:15:56] since otherwise, overlapping commit tests would make the check always red [15:17:34] <_joe_> that was basically the point I tried to make with you [15:17:44] <_joe_> but I had no time to elaborate, so my bad :) [15:18:15] _joe_: :) [15:18:53] _joe_: I think having it be stricter for a few weeks would do good? [15:19:39] <_joe_> not sure [15:20:12] <_joe_> I think the main problem was cultural, but this way you'll just piss people off because of our (ops) failure to review patches in a semi-timely manner [15:20:44] _joe_: heh, of course, not having things there forever comes with a ops expectation of merging things :) [15:20:46] <_joe_> so I'd say you should ping about this in the ops meeting [15:20:52] _joe_: which perhaps is more plausible with me available [15:21:00] _joe_: yeah, I will! [15:21:06] _joe_: I also have a meeting with RelEng right before that [15:21:15] <_joe_> about what? [15:21:18] <_joe_> beta? 
[15:21:50] <_joe_> https://coreos.com/blog/etcd-2.0-release-first-major-stable-release/ [15:21:53] <_joe_> whoa [15:22:15] <_joe_> well, time to refurbish the thing and write a twisted client [15:22:30] _joe_: yeah, about beta [15:22:47] <_joe_> yuvipanda: /nick luckypanda [15:22:54] yuvipanda: woo, beta :p [15:23:30] :D [15:31:01] (03CR) 10Anomie: Publish article to Main namespace for cawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186358 (owner: 10KartikMistry) [15:34:48] (03CR) 10BBlack: [C: 031] dumps: Strengthen ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/188015 (https://phabricator.wikimedia.org/T74072) (owner: 10Yuvipanda) [15:35:04] bblack: ty! [15:44:16] (03PS1) 10BBlack: nginx ssl tuning: CPU binding, -accept_mutex [puppet] - 10https://gerrit.wikimedia.org/r/188056 [15:47:20] (03PS3) 10BBlack: fix git::clone umask issues T87843 [puppet] - 10https://gerrit.wikimedia.org/r/187331 [15:47:36] (03CR) 10BBlack: [C: 032 V: 032] fix git::clone umask issues T87843 [puppet] - 10https://gerrit.wikimedia.org/r/187331 (owner: 10BBlack) [15:47:51] (03PS3) 10BBlack: add central systemctl daemon-reload exec [puppet] - 10https://gerrit.wikimedia.org/r/187701 [15:47:58] (03CR) 10BBlack: [C: 032 V: 032] add central systemctl daemon-reload exec [puppet] - 10https://gerrit.wikimedia.org/r/187701 (owner: 10BBlack) [15:48:15] bblack: I saw those two, sorry for not reviewing them earlier :) [15:48:24] bblack: any idea why those umask issues suddenly started appearing? [15:49:03] (03CR) 10Yuvipanda: "I must stress that this is not a mandate of any sort (yet, at least), and cannot become one without strong commitments from ops / releng. 
" [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [15:49:28] bblack: as for the second, akosiaris was raising the point today that it's going to make testing modules locally a pita :/ [15:49:34] but I couldn't think of a better idea [15:49:37] I'm pretty sure the proximal cause is the "umask 077" in the puppet-run script. And for some of these things, it may have taken weeks to really notice the fallout since things aren't always being re-created (it's not like umask would cause immediate chmod of existing things) [15:50:37] paravoid: yeah I don't see any way around the daemon-reload exec, unless the scope of puppet's "service" type expanded to know/care about managing init files internally [15:50:47] marktraceur: You going to SWAT today? [15:52:50] (03PS2) 10BBlack: nginx ssl tuning: CPU binding, -accept_mutex [puppet] - 10https://gerrit.wikimedia.org/r/188056 [15:53:27] bblack: we can do Service { require => Exec['systemctl-daemon-reload' } but that won't help much, as we still need the notify for File [15:54:10] it's not like the command is harmful anyways [15:54:14] also I think that adding a require later at some specific service may override this, which is a bit counterintuitive [15:54:35] nah, it's more of a how to keep code clean rambling :) [15:55:01] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail [15:55:15] anyways, how does it interfere with testing? [15:55:40] well right now you can test a module locally, without needing base [15:55:44] with --noop [15:56:08] where should I schedule https://gerrit.wikimedia.org/r/#/c/186242/ to ? [15:56:24] I guess I don't know that workflow, but this would hardly be the first time something relies on base, no? that's kind of the point of base [15:56:46] tonythomas: poke greg-g and Reedy maybe? [15:56:52] I don't think we have strict dependencies on base so far [15:57:01] it's not a huge problem, you can always fake it [15:57:07] greg-g: seems to be away. 
[15:57:15] tonythomas: yup, it's very very early in SF [15:57:16] Reedy: you around to deploy https://gerrit.wikimedia.org/r/#/c/186242/ ? [15:57:16] I know I have strict dependencies on values from base in other modules already [15:57:26] oh. wrong time :) [15:57:29] values from facter recipes in base, I mean [15:59:03] paravoid: in any case, how are you testing modules locally usually? the only reliable way I had to test complex things was puppet-compiler, but it's borked for some time now :( [15:59:17] puppet apply --modulepath=... test.pp [15:59:28] -v --noop, usually [16:00:21] on a machine which is of the right target type / platform, where you've manually checked out your unmerged change to a local directory, I guess? [16:00:56] yeah [16:01:07] well depends on what I'm building/what kind of testing I want to do [16:02:04] (03CR) 10BryanDavis: "> Who owns logstash from the ops side of things?" [puppet] - 10https://gerrit.wikimedia.org/r/179759 (owner: 10BryanDavis) [16:05:50] paravoid: now that I look at actually using it, I guess I should provide a no-op version of the exec for non-systemd as well, otherwise in places the service require has to be conditional as well [16:05:54] ? [16:06:20] in which case, yes, I may as well do Service { require => Exec['systemctl-daemon-reload'] } [16:06:40] (and then not have to do the no-op and sprinkle the require everywhere) [16:07:14] yeah, good point [16:07:43] I *think* if you do a Service { require => foo } and then a service { 'test': require => bar }, foo won't be required, though [16:07:47] so that'd be a big catch [16:07:56] oh, right :( [16:08:05] most of our services have their own requirements already [16:08:13] so wait [16:08:29] a (systemd) service will always require the File /etc/systemd/..., right? 
[16:08:45] or subscribe, but yeah [16:09:20] hrm [16:09:23] jouncebot: next [16:09:24] the problem is without the strict ordering between daemon-reload and the service, puppet may do "systemctl restart foo.service" before the daemon-reload [16:09:31] which will restart it with the old config [16:10:12] we could have a systemd::service (or something) that contains the File resource and issues a daemon-reload [16:10:45] individual Services will require this anyway so it won't be counter-intuitive [16:10:51] and only use it for the ones where we're overriding the package's file? [16:11:08] yeah, when we're deploying files under /etc/systemd [16:11:41] (i.e. either overriding a package unit, or provisioning one that isn't packaged at all) [16:11:47] !log added and populated wbc_entity_usage table for wikidatawiki [16:11:55] Logged the message, Master [16:12:43] (03CR) 10BryanDavis: "I'm all for Ops being more responsive to beta, but breaking functionality in beta just because there is no ops response is a bit short sit" [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [16:12:46] but what does that do for services like ganglia that run under both? should systemd::service also handle the non-systemd case gracefully? [16:12:59] or do we conditionalize service -vs- systemd::service? 
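The ordering concern raised here (the service restart must come after the daemon-reload, which must come after the unit file changes) can be illustrated with a small dependency check. This is only a sketch in Python rather than Puppet, and it does not model how Puppet actually schedules resources; the resource names and the `respects` helper are hypothetical.

```python
# Hypothetical resource names; the unit-file name is a stand-in, not taken from the log.
UNIT_FILE = "File[foo unit file]"
RELOAD = "Exec[systemctl-daemon-reload]"
SERVICE = "Service[foo]"


def respects(order, requires):
    """Return True if every resource appears after all of its prerequisites."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[dep] < position[name]
        for name, deps in requires.items()
        for dep in deps
    )


# Loose graph: the service only depends on the file; nothing ties it to the reload.
loose = {UNIT_FILE: set(), RELOAD: {UNIT_FILE}, SERVICE: {UNIT_FILE}}

# Strict graph: file -> daemon-reload -> service restart.
strict = {UNIT_FILE: set(), RELOAD: {UNIT_FILE}, SERVICE: {RELOAD}}

bad_order = [UNIT_FILE, SERVICE, RELOAD]   # restart happens before the reload
good_order = [UNIT_FILE, RELOAD, SERVICE]

print(respects(bad_order, loose))    # True: the loose graph permits the bad order
print(respects(bad_order, strict))   # False: the strict graph rules it out
print(respects(good_order, strict))  # True
```

With the loose graph, an apply order that restarts the service with the old unit definition is still considered valid, which is the failure mode described in the discussion.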
[16:13:42] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:14:15] sorry ganglia's a poor example, txstatsd is better [16:14:17] https://github.com/wikimedia/operations-puppet/blob/production/modules/txstatsd/manifests/init.pp [16:15:09] arguably txstatsd should sub rather than req its init file, but regardless [16:16:13] we could fix this with stages, too [16:16:41] do a special define that encloses the File definition for systemd unit files, which notifies a daemon-reload, and have both run in a stage before main [16:17:00] and then just subscribe the service to the file like normal, with the service being in main stage [16:17:15] hmm [16:17:23] that could work indeed [16:17:28] I see we have manifests/stages.pp, I've never looked at it here before [16:17:46] mark: hey! [16:17:53] mark: who is responsible for logstash from the ops side? [16:17:56] I think I was about to kill this :) [16:17:59] or is there anyone at all doing that now? [16:18:07] yuvipanda: no one officially, jgage was doing most of it, aotto a bit [16:18:11] hmm [16:18:25] bblack: I'm toast right now, too tired [16:18:25] am thinking who would be responsible for patches like https://gerrit.wikimedia.org/r/#/c/173336/14 [16:18:34] can't think straight, really :) [16:18:39] it was on beta forever, I uncherry-picked it today [16:18:41] so, ttyl I guess [16:18:48] get some sleep :) [16:19:23] (03CR) 10Yuvipanda: "Good point on removing the cherry-picks causing less attention to be paid to these, not more." [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [16:19:35] (03PS1) 10BBlack: Revert "add central systemctl daemon-reload exec" [puppet] - 10https://gerrit.wikimedia.org/r/188057 [16:19:38] bblack: I have a few LVS / Varnish questions when you have some time. [16:19:40] swat still ongoing, or done? 
[16:19:48] (03CR) 10BBlack: [C: 032 V: 032] Revert "add central systemctl daemon-reload exec" [puppet] - 10https://gerrit.wikimedia.org/r/188057 (owner: 10BBlack) [16:20:20] yuvipanda: sure [16:21:01] yuvipanda: just +2 it. It works in beta and won't make the current logstash problems any worse [16:21:30] bd808: hmm, if I haven't determined someone to own logstash by end of ops meeting today, I'll +2 both of them. deal? [16:21:44] wfm [16:21:49] bd808: cool [16:22:09] have we made a hardware ticket yet? [16:22:21] bd808: nope, I should uncookielick it since I'm going on vacation at end of this week :( [16:23:00] bblack: so, what I want is to simulate LVS for beta somehow. We can't use LVS itself, since OpenStack networking doesn't support it for some reason (I'm not sure which, mark investigated this a while ago) [16:23:04] hi [16:23:31] yuvipanda: what's the purpose of simulating LVS on beta? to test puppet LVS config? [16:24:13] bblack: so my current idea is to: 1. setup DNS for *.svc.beta.eqiad.wmflabs to point to a instance, 2. have that instance run nginx / haproxy, that just proxies back to the appropriate pools (mw, parsoid, etc), 3. have varnish hit that one instance with a domain name, so the host header gets sent and we can use that to route things through [16:24:21] bblack: no, the purpose is to have 'pools' [16:24:41] so that requests aren't routed to a machine when it's already failing because it is restarting HHVM, for example. [16:24:49] and also for us to easily pool / depool machines, etc. [16:24:56] nginx has passive health checks, which would be good enough for this [16:25:06] greg-g: I had the same question :) [16:25:07] it's a difference from prod, and a SPOF, but it's very self contained. [16:25:42] bblack: so a couple of questions: 1. is this insane? 2. will varnish send a Host: header? If so will it be the *.svc.beta.eqiad.wmflabs one, or will it be whatever the user requested? 
[16:25:46] yuvipanda: that sounds sane, configuring nginx or haproxy to do the balancing in place of LVS. The potential issue is that the balancing won't be as transparent as it is with LVS... [16:26:42] you'll lose the "client" real IP, and have an extra X-F-F entry, etc [16:27:00] whereas the real machines behind the real LVS get the fiction that the traffic reached them directly [16:27:04] right [16:27:13] greg-g: saw your email. we still have time left in the morning swat ? [16:27:15] but how will the Host: header work? [16:27:23] ( I might be sleeping in the af one :( ) [16:27:27] <_joe_> yes [16:27:30] the proxy machine needs it to be set to, say, mwapi.svc.beta.eqiad.wmflabs so it knows which one to proxy back to [16:27:37] but the backend machine needs it to be set to en.wikipedia.beta.wmflabs.org [16:27:39] <_joe_> that's http [16:27:40] yuvipanda: your point (3) about varnish I don't get. varnish is behind all this, not in front of it, right? [16:27:42] so it knows it should serve up english wiki [16:28:58] maybe we're talking at cross-purposes here about different LVS layers [16:29:05] (03CR) 10Nikerabbit: Publish article to Main namespace for cawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186358 (owner: 10KartikMistry) [16:29:14] are you trying to simulate our front edge LVS, or the appserver pool LVS? [16:29:21] (or both?) [16:30:25] the actual flow of a request from the internet to an appserver (assuming no cache hits, and ignoring all kinds of subtleties) is basically LVS -> nginx/varnish -> LVS -> appserver [16:30:28] anomie: did swat happen? [16:30:33] I don't see jouncebot taling [16:30:35] +k [16:31:00] bblack: appserver pool LVS [16:31:04] sorry, shoudl've been clearer [16:31:21] oh in that case, I don't think you need to hack anything [16:31:41] varnish already supports multiple backends, we just happen to define it in prod as a single backend which is an LVS service IP [16:31:45] greg-g: No. 
I pinged marktraceur since he has patches in it, and he didn't reply. And the other patch had an outstanding comment that was just replied to. [16:31:48] but you could just put several backends directly into varnish [16:32:00] will it stop sending requests to an instance if it stops responding? [16:32:12] Nikerabbit: I can do your patch if you're ready [16:32:19] anomie: do you have time to deploy a change for tonythomas? tonythomas: can you put the change on the deploy calendar? [16:32:20] it should [16:32:34] bblack: ah, I see. [16:32:39] let me dig up the config template bits [16:32:41] greg-g: I could do that too [16:32:50] anomie: I am [16:32:57] greg-g: adding now [16:32:59] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186358 (owner: 10KartikMistry) [16:33:15] (03CR) 10BryanDavis: "I can see the point of nuke-from-orbit resets in beta and prod, but for a puppet n00b testing with their own self-hosted puppetmaster iter" [puppet] - 10https://gerrit.wikimedia.org/r/188019 (owner: 10Yuvipanda) [16:33:22] (03Merged) 10jenkins-bot: Publish article to Main namespace for cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186358 (owner: 10KartikMistry) [16:33:26] yuvipanda: look in modules/varnish/templates/vcl/wikimedia.vcl.erb [16:33:27] <_joe_> anomie: are you SWATting? [16:33:38] <_joe_> if so, please tell me if you see errors on mw1018 [16:33:43] you basically want labs-specific definitions of the probe/backend/pool stuff there [16:34:00] err probe/backend/director [16:34:08] oh wow [16:34:09] I see [16:34:23] !log anomie Synchronized wmf-config: SWAT: Have ContentTranslate publish article to Main namespace for cawiki [[gerrit:186358]] (duration: 00m 07s) [16:34:27] Logged the message, Master [16:34:28] _joe_: No error seen [16:34:34] Nikerabbit: ^^^ Test please [16:34:39] <_joe_> anomie: great to hear! 
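The idea discussed here, listing several backends directly and relying on health probes so that a non-responding instance stops receiving requests, can be sketched as a toy round-robin director. This is a Python illustration only, not Varnish's implementation or VCL; the `Director` class and the backend names are hypothetical.

```python
from itertools import count


class Director:
    """Round-robin over backends, skipping any that a probe marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = {backend: True for backend in self.backends}
        self._ticks = count()

    def record_probe(self, backend, ok):
        """Store the latest probe result for one backend."""
        self.healthy[backend] = ok

    def pick(self):
        """Return the next healthy backend, or fail if none is healthy."""
        candidates = [b for b in self.backends if self.healthy[b]]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return candidates[next(self._ticks) % len(candidates)]


# Hypothetical backend names, for illustration only.
director = Director(["appserver-a", "appserver-b"])
before = [director.pick() for _ in range(4)]

# Simulate a failed probe, e.g. while a backend is restarting.
director.record_probe("appserver-a", False)
after = [director.pick() for _ in range(3)]

print(before)
print(after)
```

Once the probe for one backend fails, every subsequent pick goes to the remaining healthy backend, which is the pool and depool behavior being asked about.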
[16:34:52] <_joe_> (it's the test host for moving to www-data) [16:34:59] bblack: I'm thinking if setting up an lvs simulator would have any advantages over doing this [16:35:12] one would be that the delta between prod / labs would be lesser, perhaps [16:35:46] well the delta in the configuration of that template would be lesser, but the delta in runtime behavior would be much greater with an additional non-transparent layer of software involved. [16:35:52] greg-g: added https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0February.C2.A002 :) [16:36:17] bblack: hmm, that's true. [16:36:54] I suspect you won't have to change much in practice, though. just figure out how to make it define those LVS backends like 10.2.2.2 as an array of multiple backend IPs for that director, and define the variable differently for labs (which is already the case I think) [16:37:41] bblack: yeah, role/cache.pp, line 160 [16:38:08] tonythomas: Ok, I'm just waiting for Nikerabbit to confirm his patch worked. Then I'll do yours. [16:38:19] anomie: okey :) [16:38:55] the lists in manifests/role/cache.pp are for the internal varnish<->varnish stuff for 2-layer in prod [16:39:20] I think what you're looking for is the stuff in modules/lvs/manifests/configuration.pp, where we have all those empty definitions for labs app-layer backends [16:39:39] L177 [16:39:46] bblack: hmm, I see mediawiki /parsoid backend IPs specified in role/cache too [16:40:04] I'll also admit to having never worked with either LVS or Varnish before, so am talking with a very peripheral understanding of how they work. [16:40:46] I was thinking more of the apaches layer and such [16:41:04] e.g. 
'apaches' => { 'eqiad' => "10.2.2.1", }, [16:41:21] one moment still [16:41:25] hmm [16:41:27] 'appservers' => { [16:41:27] 'eqiad' => [ [16:41:27] '10.68.17.96', # deployment-mediawiki01 [16:41:30] ^ in modules/lvs/manifests/configuration.pp , is how the final varnish backend layer knows to reach the LVS for appservers [16:41:42] aaah [16:41:56] so that's an lvs service IP that is load balanced to all apache nodes? [16:42:08] right, the 10.2.2 -looking ones [16:42:27] what makes this extra confusing is it's configured right alongside the other layer of LVS that does the very front edge with public IPs [16:43:11] oh, now I see [16:43:32] what makes this extra-extra-confusing is that for the production stuff in manifests/role/cache.pp related to your quote above [16:43:35] anomie: works [16:43:39] Nikerabbit: Thanks [16:43:44] we define the IPs in modules/lvs/manifests/configuration.pp and use the variables in cache.pp [16:43:53] for labs, we have the IPs hardcoded in cache.pp as you noted [16:44:02] $backends = { [16:44:02] 'production' => { [16:44:02] 'appservers' => $lvs::configuration::lvs_service_ips['production']['apaches'], [16:44:17] tonythomas: Needs rebase [16:44:24] it would probably be saner to move those IPs back to prod and then get rid of the labs/prod diff in cache.pp, IMHO [16:44:27] (03CR) 10Anomie: [C: 04-1] "Needs rebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186242 (owner: 1001tonythomas) [16:44:34] anomie: oh. checking [16:44:43] s/back to prod/back to configuration.pp/ [16:45:10] bblack: yup. 
I think hiera-ification of the entire stuff would also be super helpful [16:45:23] well sure but that's gonna be a huge and dangerous undertaking :) [16:45:35] :D [16:45:36] yeah [16:45:37] veyr [16:45:41] there's a lot of strangely-shared data variables in that LVS/varnish config unfortunately [16:45:43] * yuvipanda acquires pony as well [16:46:11] bblack: so in prod, there's LVS -> Varnish -> Varnish -> LVS -> Apaches [16:46:14] so, since it's already configured for 2x backends over in cache.pp, then isn't this already working as you need? [16:46:31] and in labs, it is Varnish -> Varnish -> Apaches? [16:46:43] yuvipanda: it's more like LVS -> (edge layer) -> LVS -> (app layer) in prod [16:46:59] (edge layer) -> nginx / varnish combo mostly? [16:47:11] anomie: Oh, sorry, I forgot about that! [16:47:22] where edge layer is like: nginx? varnish-fe varnish-be? varnish-eqiad-be-if-cache-dc? [16:47:23] Still time, or are you closing the shop? [16:47:33] and (app layer) is all kinds of things aside from apaches [16:47:36] marktraceur: Well, once I finish tonythomas's config change then you can have it. [16:47:39] usually behind LVS [16:47:42] Coolio, thanks [16:48:52] for the case of a bits request directly to eqiad over http, the edge layer is just 1xVarnish and done. For an https request for text hitting esams, it's nginx -> varnish-fe-esams -> varnish-be-esams -> varnish-be-eqiad [16:50:21] (03PS3) 1001tonythomas: Added BounceHandler extension to group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186242 [16:50:43] bblack: is there an overall network diagram of our entire prod infrastructure? :) [16:51:25] I've asked that same question many times! :) [16:51:31] bblack: :D [16:51:32] 3hardware-requests, Labs, operations: Dedicated hardware for wikitech web server - https://phabricator.wikimedia.org/T88294#1008477 (10RobH) [16:51:46] anomie: https://gerrit.wikimedia.org/r/#/c/186242/ looks good ? 
[16:52:15] !log kill opendj on virt1000, it shouldn't have been running there in the first place [16:52:18] Logged the message, Master [16:52:22] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186242 (owner: 1001tonythomas) [16:52:29] (03Merged) 10jenkins-bot: Added BounceHandler extension to group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186242 (owner: 1001tonythomas) [16:52:35] there are too many layers you could draw that map at. A truly complete map of everything happening at every layer of abstraction would look like some sort of 5-dimensional trace of all the star paths in the milky way [16:52:51] bblack: heh [16:52:56] anomie: \o/. Expected time for arrival in group0 ? [16:53:08] tonythomas: Just a second while I check something [16:53:17] okey :) [16:54:11] yuvipanda: I will make a map of prod traffic flow through the front edge though, it should be done [16:54:24] bblack: yeah, that would be nice :) [16:54:39] bblack: I've never dealt with varnish or lvs before, and starting with *our* varnish/lvs config seems a bit daunting :) [16:54:49] !log anomie Synchronized wmf-config: SWAT: Added BounceHandler extension to group0 wikis [[gerrit:186242]] (duration: 00m 07s) [16:54:52] Logged the message, Master [16:54:56] tonythomas: ^ Test please [16:55:07] okey! 
one min [16:57:22] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: puppet fail [16:58:30] anomie: its up in http://www.mediawiki.org/wiki/Special:Version [16:58:35] marktraceur: All yours [16:58:39] looks good ;) [16:58:40] anomie: Oh, OK [16:58:41] (03PS3) 10Yuvipanda: dumps: Strengthen ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/188015 (https://phabricator.wikimedia.org/T74072) [16:58:47] !next [16:59:00] jouncebot: next [16:59:04] Damn it [16:59:04] marktraceur: jouncebot seems to be non-functional this morning [16:59:05] (03CR) 10Yuvipanda: [C: 032] dumps: Strengthen ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/188015 (https://phabricator.wikimedia.org/T74072) (owner: 10Yuvipanda) [16:59:14] Guess so [16:59:28] Wikidata in two hours, so I have time [17:01:00] Patches +2'd, waiting for Jenkins... [17:03:11] <_joe_> marktraceur: are you swatting? [17:03:22] I'm swatting my patches [17:03:33] Because I failed to show up when anomie pinged me [17:03:36] _joe_: Why? [17:04:03] <_joe_> marktraceur: I am deploying a puppet change that should be a noop but could theoretically affect tin [17:04:15] Oh, 'kay [17:04:21] You want to just tell me when you're done? [17:04:26] We aren't in a rush [17:04:34] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1008501 (10fgiunchedi) note also that gdash won't be migrated at the moment, I've run into some (I think) ruby 1.8 -> 1.9 and rubygems which I don't want to get blocked by: ``` root@graphite1001:/var/log/upstart# tail -15 /... [17:06:00] 3hardware-requests, Labs, operations: Dedicated hardware for wikitech web server - https://phabricator.wikimedia.org/T88294#1008506 (10RobH) IRC discussion update: Joe picked out host silver from spares page for this task after we chatted about requirements. I'll setup this system with a public IP address, as... 
[17:08:22] Ugh [17:08:44] <_joe_> marktraceur: just go on [17:09:04] Have to wait for Jenkins again [17:09:08] 3ops-codfw, operations, hardware-requests: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#1008509 (10RobH) [17:09:10] <_joe_> ok np [17:09:17] He came back and said zend and hhvm failed on both patches [17:09:20] I am unconvinced [17:09:32] marktraceur: Flow stuff? [17:09:40] maybe [17:09:50] Had this earlier today as well... decided to ignore it and the worl didn't explode(tm) [17:09:52] Yeah [17:10:00] I'll let it happen again and go [17:10:11] "failed contacting Parsoid" [17:10:14] That can't be good [17:10:20] 3ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1008511 (10Joe) [17:10:32] How is it supposed to contact parsoid in a test? [17:10:51] Who knows [17:11:00] news of test isolation hasn't reached all parts of the kingdom [17:11:02] I'm sure it made sense at the time [17:11:35] (03CR) 10Nemo bis: "\o/ now let's start sending enotifs to ancient mediawiki.org users :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186242 (owner: 1001tonythomas) [17:11:44] <_joe_> srsly guys [17:11:52] <_joe_> tests should use mocks [17:11:55] (03PS1) 10RobH: setting silver's basic install params [puppet] - 10https://gerrit.wikimedia.org/r/188062 [17:12:02] you mock my pain! [17:12:14] <_joe_> ori: I heard sara golemon at FOSDEM [17:12:25] ("life *is* pain, highness. anyone claiming otherwise is selling something.") [17:12:34] <_joe_> she mentioned us ;) [17:12:35] _joe_: did you hear larry wall? are we migrating to perl 6? [17:12:39] <_joe_> yes [17:12:44] to which? [17:12:49] <_joe_> to both [17:12:54] <_joe_> :P [17:13:09] <_joe_> perl 6 is the most batshit crazy thing one could imagine [17:13:33] ori: does this ruby 1.9 + rubygems snafu and gdash make any sense to you? https://phabricator.wikimedia.org/T85909#1008501 [17:13:40] are they rewriting bugzilla? 
[17:13:49] I haven’t read about it — what is crazier about perl 6 than previous versions? [17:14:04] <_joe_> andrewbogott: perl 6 is a meta-language [17:14:06] i'll take a look [17:14:16] <_joe_> you can create your own syntax and "language" on the fly [17:14:31] oh! [17:14:33] _joe_: btw, brett has been very quiet lately so i decided to do a git-log on hhvm master to see what he has been up to [17:14:40] <_joe_> awesome for sure, but think of the children [17:14:45] 'kay, fuck it, Jenkins is being an ass. [17:14:45] he got most tests passing on the LLVM backend [17:14:52] Well, if the goal is to write code that no one else will ever, ever be able to read, that seems like the ultimate implementation [17:14:55] which is apparently a thing now [17:15:32] <_joe_> well llvm is the new shiny thing right? [17:16:12] "your mercy for graceful operations on workers is 60 seconds " heh [17:16:15] godog: sounds like uwsgi madness [17:17:11] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:17:16] ori: heh I poked at it a bit too, apparently in ruby 1.9 and rubygems the $: << or whatever it is trick for load_path doesn't seem to be working [17:18:20] (03PS4) 10Giuseppe Lavagetto: base::resolving: get rid of the global domain_search variable [puppet] - 10https://gerrit.wikimedia.org/r/185912 [17:18:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "Ran into the puppet compiler, seems sane." 
[puppet] - 10https://gerrit.wikimedia.org/r/185912 (owner: 10Giuseppe Lavagetto) [17:20:16] !log marktraceur Synchronized php-1.25wmf14/extensions/UploadWizard/resources/mw.FormDataTransport.js: [SWAT] [wmf14] Fix UploadWizard for ogg files (duration: 00m 06s) [17:20:22] Logged the message, Master [17:20:40] !log marktraceur Synchronized php-1.25wmf15/extensions/UploadWizard/resources/mw.FormDataTransport.js: [SWAT] [wmf15] Fix UploadWizard for ogg files (duration: 00m 07s) [17:20:44] Logged the message, Master [17:21:36] (03CR) 10Nemo bis: "Err, did I forget to include the class in a maintenance host?" [puppet] - 10https://gerrit.wikimedia.org/r/178170 (https://phabricator.wikimedia.org/T68867) (owner: 10Nemo bis) [17:22:50] I'm done now! [17:23:10] _joe_: Not sure if you were waiting for me [17:23:30] <_joe_> marktraceur: nope [17:24:32] (03PS1) 10Aude: Enable usage tracking on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188063 [17:24:36] (03PS1) 10RobH: setting silver public ip address [dns] - 10https://gerrit.wikimedia.org/r/188064 [17:24:36] KK [17:25:09] (03CR) 10RobH: [C: 032] setting silver's basic install params [puppet] - 10https://gerrit.wikimedia.org/r/188062 (owner: 10RobH) [17:25:51] (03CR) 10RobH: [C: 032] setting silver public ip address [dns] - 10https://gerrit.wikimedia.org/r/188064 (owner: 10RobH) [17:26:15] <_joe_> and another global variable in puppet bites the dust. [17:26:42] <_joe_> farewell $domain_search, you will not be missed [17:26:49] oh? 
[17:27:04] <_joe_> robh: my change I just merged [17:27:08] <_joe_> got rid of it [17:27:10] yea looking at it now =] [17:27:12] (03PS1) 10Nemo bis: Actually run misc::maintenance::update_article_count [puppet] - 10https://gerrit.wikimedia.org/r/188066 (https://phabricator.wikimedia.org/T68867) [17:27:20] +2 for trivial followup pls https://gerrit.wikimedia.org/r/188066 [17:27:59] (03PS2) 10Ori.livneh: Actually run misc::maintenance::update_article_count [puppet] - 10https://gerrit.wikimedia.org/r/188066 (https://phabricator.wikimedia.org/T68867) (owner: 10Nemo bis) [17:28:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Actually run misc::maintenance::update_article_count [puppet] - 10https://gerrit.wikimedia.org/r/188066 (https://phabricator.wikimedia.org/T68867) (owner: 10Nemo bis) [17:28:29] Thanks [17:29:01] there is also a change with admin::groups? [17:29:10] <_joe_> mutante: yes, why? [17:29:13] <_joe_> any problems? [17:29:40] just because the message says it's about base::resolving and then i see admin::groups being added and wonder how it's related [17:30:05] <_joe_> I have also cleaned that from some hosts, yes [17:30:16] godog: do you need me to look for a fix? [17:31:14] trying to understand how to use that next time there's an access request..looking [17:31:26] ori: if you can timebox it I could use your ruby experience, it isn't an hard blocker but it'll be to migrate to trusty/jessie anyway [17:31:31] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:03] (03PS4) 10Yuvipanda: dumps: Strengthen ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/188015 (https://phabricator.wikimedia.org/T74072) [17:33:04] godog: can i disable puppet on graphite1001 and live-hack /etc/gdash/config.ru for a minute? 
[17:33:12] ori: sure go for it [17:34:33] 3Wikimedia-Logstash, hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#1008549 (10bd808) I think the IRC and IRL discussions on this last week came down with @mark being more in favor of putting though a procurement tick... [17:38:28] (03PS1) 10Ori.livneh: Make gdash's uWSGI config.ru Ruby 1.9-compatible [puppet] - 10https://gerrit.wikimedia.org/r/188069 [17:38:32] godog: ^ [17:38:35] should do the trick [17:41:17] ori: did it start on graphite1001? [17:41:36] i didn't test [17:42:31] 3Wikimedia-Logstash, hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#1008558 (10mark) Alright. @RobH: can you look at what it would take to procure 3 additional nodes, similar to the recent ElasticSearch orders, but wi... [17:43:16] ori: heh, I asked because I suspected that wouldn't work (it doesn't) anyways thanks for taking a look :| [17:44:02] PROBLEM - HTTP on ms1001 is CRITICAL: Connection refused [17:44:27] <_joe_> what's ms1001 doing btw? [17:44:44] godog: how do you test? uwsgictl? [17:44:46] <_joe_> oh it's 6:44 PM, I suppose US people could take a look [17:45:21] ori: service uwsgi restart [17:45:56] godog: * Restarting app server(s) uwsgi [ OK ] [17:46:37] ori: look at the process list, gdash isn't there [17:46:56] hmm [17:48:53] ok guys rewrite gdash in nodejs ;D [17:48:55] <_joe_> DING DING OPSENS [17:49:07] <_joe_> icinga is complaining about ms1001 [17:49:13] <_joe_> will someone take a look? [17:49:37] <_joe_> someone who did not fly back today, and who is not @ 7 PM maybe? 
[17:50:01] PROBLEM - HTTP on dataset1001 is CRITICAL: Connection refused [17:50:12] _joe_: looking [17:50:21] <_joe_> thanks [17:50:35] <_joe_> mh dataset1001 as well, nice [17:50:44] _joe_: I'm looking tho not sure what for yet [17:50:48] Restarting nginx: nginx: [emerg] unknown directive "<%=" in /etc/nginx/sites-enabled/dumps:13 [17:51:00] did someone merge a change to the template that left an artifact? [17:51:08] yuvi [17:51:14] the ssl stuff I think [17:51:38] paging yuvipanda, dr yuvipanda [17:51:38] yuvipanda: ping ^ [17:51:49] might be able to revert his change [17:52:07] bah [17:52:12] yea, what chase said already [17:52:26] (03PS1) 10Yuvipanda: Revert "dumps: Strengthen ssl settings" [puppet] - 10https://gerrit.wikimedia.org/r/188076 [17:52:51] (03CR) 10Rush: [C: 031] "Restarting nginx: nginx: [emerg] unknown directive "<%=" in /etc/nginx/sites-enabled/dumps:13" [puppet] - 10https://gerrit.wikimedia.org/r/188076 (owner: 10Yuvipanda) [17:52:53] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "dumps: Strengthen ssl settings" [puppet] - 10https://gerrit.wikimedia.org/r/188076 (owner: 10Yuvipanda) [17:52:56] thanks yuvipanda [17:53:58] ran puppet. applied revert [17:54:05] sorry guys. [17:54:22] RECOVERY - HTTP on ms1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 0.008 second response time [17:54:22] sorry we doubled up mutante, I was pulling up icinga and didn't see you were already on it [17:54:45] i think puppet didnt start it right [17:54:51] i started it with the init script [17:54:53] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1008579 (10Joe) I have manually converted one host to use www-data, and took it out of rotation. A smoke test shows it working correctly. To make the conversion happen, I created a series of puppet patches... [17:55:24] (03PS1) 10Andrew Bogott: Remove ajax_proxy_url setting from nova config. 
[puppet] - 10https://gerrit.wikimedia.org/r/188078 [17:56:04] I got pulled into a meeting right after I merged that, which I shouldn't have let happen. [17:56:05] oh well [17:56:08] yuvipanda: root cause: that .conf is not an erb, right [17:56:20] mutante: yup. I didn't see that it was files/ not templates/ [17:57:17] * yuvipanda feels like an idiot now [17:57:43] (03CR) 10Phuedx: [C: 031] Enable JS console recruitment on mobile. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) (owner: 10Jdlrobson) [18:00:16] !log restarted gitblit [18:00:19] Logged the message, Master [18:02:35] (03CR) 10Yuvipanda: [C: 031] "This has no effect now anyway, since idmapping is off on most hosts." [puppet] - 10https://gerrit.wikimedia.org/r/187686 (owner: 10Giuseppe Lavagetto) [18:04:07] (03CR) 10Yuvipanda: [C: 031] "LGTM, haven't tested (or checked for other occurrences of the apache user)" [puppet] - 10https://gerrit.wikimedia.org/r/187259 (owner: 10Giuseppe Lavagetto) [18:04:09] it's still broken on dataset1001 [18:04:19] runs puppet and stuff [18:04:57] !log started nginx on ms1001, dataset1001 [18:05:02] Logged the message, Master [18:05:10] (03CR) 10Florianschmidtwelzow: [C: 031] Enable JS console recruitment on mobile. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187823 (https://phabricator.wikimedia.org/T85815) (owner: 10Jdlrobson) [18:05:15] (03CR) 10Yuvipanda: [C: 031] maintenance: allow choosing the web user [puppet] - 10https://gerrit.wikimedia.org/r/187687 (owner: 10Giuseppe Lavagetto) [18:05:26] mutante: thanks for cleaning up! [18:05:32] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 0.020 second response time [18:05:47] can someone restart gitblit? 
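Note on the 17:56 root cause: the nginx error came from ERB tags ("<%=") reaching the server verbatim, because the config was shipped from files/ instead of being rendered from templates/. A check along the following lines might catch that before a merge. This is only a minimal sketch: the function name find_erb_tags, the sample config, and the @ssl_ciphers variable are made up for illustration and are not part of puppet or of the actual change.

```python
import re

# Matches the opening of an ERB tag: <%, <%=, <%- or <%#
ERB_TAG = re.compile(r"<%[=#-]?")


def find_erb_tags(text):
    """Return (line_number, stripped_line) pairs for lines that contain an ERB tag."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if ERB_TAG.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical static nginx config that still contains template syntax.
sample = "\n".join([
    "server {",
    "    listen 443;",
    "    ssl_ciphers <%= @ssl_ciphers %>;",
    "}",
])

for number, line in find_erb_tags(sample):
    print(f"line {number}: unrendered ERB tag: {line}")
```

Run against everything under a module's files/ directory, a non-empty result would suggest the file belongs in templates/ instead.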
[18:05:51] i just did [18:05:54] ah, ok [18:06:05] 3hardware-requests, Labs, operations: eqiad: Dedicated hardware for wikitech web server - silver allocated - https://phabricator.wikimedia.org/T88294#1008636 (10RobH) [18:06:06] still broken SqlUsageTrackerSchemaUpdater.php [18:06:07] ah [18:06:09] https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FWikibase [18:06:36] why oh why does jenkins depend on it for wikibase jobs :( [18:06:59] 3hardware-requests, Labs, operations: eqiad: Dedicated hardware for wikitech web server - silver allocated - https://phabricator.wikimedia.org/T88294#1008469 (10RobH) 5Open>3Resolved I've deployed silver in row B with a public IP address. It has been installed with basic raid1.cfg (so raid1 with a /srv xfs)... [18:08:04] 3hardware-requests, Labs, operations: eqiad: Dedicated hardware for wikitech web server - silver allocated - https://phabricator.wikimedia.org/T88294#1008651 (10RobH) [18:08:24] back :) [18:08:28] (03CR) 10Yuvipanda: [C: 031] "Ok for now, I suppose the default will change to www-data at some point." [puppet] - 10https://gerrit.wikimedia.org/r/187688 (owner: 10Giuseppe Lavagetto) [18:08:47] and what's wrong with mw1207? [18:08:52] _joe_: I can setup a mw instance with the patches applied + hiera setting them to www-data if you'd like [18:08:55] hhvm running but icinga [18:09:01] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59709 bytes in 0.145 second response time [18:09:47] (03PS1) 10Yuvipanda: Revert "Revert "dumps: Strengthen ssl settings"" [puppet] - 10https://gerrit.wikimedia.org/r/188082 [18:11:06] yuvipanda: https://gerrit.wikimedia.org/r/#/c/187121/ [18:11:33] (because it's also dumps, also, let's add ferm) [18:13:34] ? 
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=mw1207&nostatusheader [18:13:56] (03Abandoned) 10Phuedx: Configure JS console recruitment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187815 (https://phabricator.wikimedia.org/T85815) (owner: 10Phuedx) [18:14:02] (03PS2) 10Yuvipanda: Revert "Revert "dumps: Strengthen ssl settings"" [puppet] - 10https://gerrit.wikimedia.org/r/188082 [18:14:03] 3Wikimedia-General-or-Unknown, WMF-Legal, operations: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#1008668 (10chasemp) This kind of stalled out here. Am I right in thinking that this would only apply to code that does not have a more specific license and that that where the... [18:15:02] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.063 second response time [18:15:03] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 71429 bytes in 0.319 second response time [18:15:52] !log restarted hhvm on mw1207 [18:15:57] Logged the message, Master [18:17:08] <_joe_> yuvipanda: in beta you mean? [18:17:15] <_joe_> let's take about that tomorrow [18:17:23] <_joe_> *talk [18:17:35] _joe_: context? [18:17:39] * yuvipanda is doing too many things atm [18:17:42] _joe_: www-data you mean? [18:17:48] _joe_: yeah, sure. let's talk about that tomorrow [18:17:49] <_joe_> yeah [18:18:02] <_joe_> sorry I was upgrading my laptop [18:18:16] 3Scrum-of-Scrums, operations, RESTBase, hardware-requests: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1008688 (10RobH) [18:18:22] I'm somewhat sleepy, and might operate on an IST timezone for the next few days [18:18:57] <_joe_> IST? good idea! [18:19:16] <_joe_> we'd still overlap mostly [18:19:23] <_joe_> and I think it's sane for you [18:19:27] _joe_: my girlfriend is visiting for a month starting tomorrow. 
IST is going to be fairly enforced :) [18:19:36] <_joe_> eheh cool [18:19:45] _joe_: I'm also going on vacation from end of next week. I should email ops@ [18:19:47] err [18:19:48] end of this week [18:19:49] yuvipanda: where are you now? [18:19:58] chasemp: I'm in Bangalore. [18:20:04] 3Scrum-of-Scrums, operations, RESTBase, hardware-requests: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#824247 (10RobH) Latest update from VAR: //The servers have arrived at VAR. The Samsung SSD and drive carriers are supposed to arrive tomorrow 2/3. Kitting of the drives into the... [18:20:05] doing a round trip around south India starting thursday [18:20:31] (03PS5) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 [18:20:33] <_joe_> yuvipanda: for how long? [18:20:41] yuvipanda: enjoy the vacation when it comes :p [18:20:50] _joe_: the trip? until March 5th, when she leaves. [18:21:03] _joe_: I've two weeks off, and I'll be in places with less internet accessibility then [18:21:06] <_joe_> no the vacation from work [18:21:17] <_joe_> if you're off, don't work!!! [18:21:19] _joe_: oh, two weeks [18:21:26] _joe_: yeah, am considering leaving my laptop and keyboard behind [18:21:35] <_joe_> good idea [18:21:52] <_joe_> you will regret it, but it's actually a good idea :P [18:22:06] _joe_: but might not happen, since I won't be at the same place when my vacation ends... [18:22:10] so might have to carry it anyway [18:22:29] _joe_: heh. I'll basically be unreachable during those two weeks. [18:22:39] probably exploring ruins in Hampi [18:22:39] <_joe_> can't you mail it to some deposit box at your arrival place? [18:22:40] ok then ... 
[18:22:40] +2 for vacations that do not include any way to check work email or submit patches [18:22:55] <_joe_> bd808: I agree with the former [18:22:56] RECOVERY - HHVM busy threads on mw1207 is OK: OK: Less than 30.00% above the threshold [76.8] [18:22:57] bd808: haha [18:23:05] RECOVERY - HHVM queue size on mw1207 is OK: OK: Less than 30.00% above the threshold [10.0] [18:23:14] _joe_: I don't know where my arrival place would be. We have no concrete plans, just vague ideas. That's how we've always travelled. [18:23:33] _joe_: and I'm not trusting my laptop and keyboard to India's postal / Courier setups :) [18:23:45] <_joe_> eheh [18:23:47] <_joe_> ok [18:24:09] yuvipanda: are they bad with handling or just not trustworthy? [18:24:27] JohnFLewis: both? I will be worried about it 1. arriving, 2. arriving just as I left it [18:25:01] Ah [18:25:11] but yes, I'll mostly take bd808's advice to heart :) [18:25:21] and just not check IRC or email or phab. [18:25:27] So you'll worry about what state it arrives, if any :p [18:25:36] JohnFLewis: basically, yeah [18:25:55] anyone wanna review https://gerrit.wikimedia.org/r/#/c/188082/? I fucked it up the last time [18:26:14] chasemp: mutante ^ [18:27:20] (03PS4) 10Yuvipanda: add IPv6 interface to dataset1001 (eth2) [puppet] - 10https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) (owner: 10Dzahn) [18:27:55] (03CR) 10Yuvipanda: [C: 031] add IPv6 interface to dataset1001 (eth2) [puppet] - 10https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) (owner: 10Dzahn) [18:28:04] (03PS2) 10Ottomata: Install MaxMind DBs on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/187901 (owner: 10QChris) [18:28:20] mutante: you should probably merge + babysit the ipv6 change, since I'm not sure how exactly to test for it [18:28:25] but it looks good to me! 
[18:28:58] (03CR) 10Ottomata: [C: 032] Install MaxMind DBs on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/187901 (owner: 10QChris) [18:29:06] (03PS1) 10Andrew Bogott: Add site.pp entry for silver. [puppet] - 10https://gerrit.wikimedia.org/r/188087 [18:30:04] (03CR) 10Rush: [C: 04-1] "Should be in templates dir?" [puppet] - 10https://gerrit.wikimedia.org/r/188082 (owner: 10Yuvipanda) [18:30:49] chasemp: ... [18:30:56] (03PS3) 10Yuvipanda: Revert "Revert "dumps: Strengthen ssl settings"" [puppet] - 10https://gerrit.wikimedia.org/r/188082 [18:31:06] chasemp: I updated, but yeah, I should probably not write any more code tonight [18:31:22] (03CR) 10John F. Lewis: [C: 031] "Now that's a template :)" [puppet] - 10https://gerrit.wikimedia.org/r/188082 (owner: 10Yuvipanda) [18:32:18] yuvipanda: you've done it at last :D [18:32:37] (03CR) 10Andrew Bogott: [C: 032] Remove ajax_proxy_url setting from nova config. [puppet] - 10https://gerrit.wikimedia.org/r/188078 (owner: 10Andrew Bogott) [18:33:44] woo, gerrit is sloowwwwww this mornin [18:37:06] (03CR) 10Andrew Bogott: [C: 032] Add site.pp entry for silver. [puppet] - 10https://gerrit.wikimedia.org/r/188087 (owner: 10Andrew Bogott) [18:38:00] (03PS4) 10Rush: Revert "Revert "dumps: Strengthen ssl settings"" [puppet] - 10https://gerrit.wikimedia.org/r/188082 (owner: 10Yuvipanda) [18:38:18] (03CR) 10Rush: [C: 031] "Haven't verified the actual ssl changes but the puppet code seems good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/188082 (owner: 10Yuvipanda) [18:39:23] (03CR) 10Yuvipanda: [C: 032] Revert "Revert "dumps: Strengthen ssl settings"" [puppet] - 10https://gerrit.wikimedia.org/r/188082 (owner: 10Yuvipanda) [18:44:05] PROBLEM - DPKG on silver is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:44:43] awww, silver was the first box I ever killed [18:46:06] RECOVERY - DPKG on silver is OK: All packages OK [18:46:56] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Puppet has 1 failures [18:48:16] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 2 failures [18:49:58] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1008872 (10greg) Ideally before, yes. [18:49:59] 3operations: deploy services on rbf2001-2002 - https://phabricator.wikimedia.org/T88309#1008873 (10RobH) 3NEW [18:50:39] 3ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1008882 (10RobH) [18:50:40] 3operations: deploy services on rbf2001-2002 - https://phabricator.wikimedia.org/T88309#1008873 (10RobH) [18:51:17] (03PS1) 10RobH: setting rbf2002-2002 dns [dns] - 10https://gerrit.wikimedia.org/r/188092 [18:51:18] 3MediaWiki-Core-Team, operations, Deployment-Systems, Release-Engineering: Update servers in scap rsync proxy pool - https://phabricator.wikimedia.org/T1342#1008884 (10bd808) 5Open>3Resolved [18:51:36] RECOVERY - gdash.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 9447 bytes in 0.041 second response time [18:54:15] godog: still there? [18:55:17] ori: sure [18:55:25] so, i think i fixed it, patch incoming [18:56:10] ori: sweet! 
glad that piqued your interest [18:59:38] (03PS2) 10Ori.livneh: Make gdash's uWSGI config.ru Ruby 1.9-compatible [puppet] - 10https://gerrit.wikimedia.org/r/188069 [18:59:57] 3operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1008913 (10yuvipanda) dumps.wikimedia.org reports an A now. what exactly is ms1001.wikimedia.org anyway? its ssl certificate calls it dumps.wikimedia.org [19:01:29] (03CR) 10Ori.livneh: "In addition to this fix, I also needed to make the library accessible to www-data. I did this in an unsubtle fashion, by running 'chmod -R" [puppet] - 10https://gerrit.wikimedia.org/r/188069 (owner: 10Ori.livneh) [19:01:39] godog: ^ [19:02:16] ori: sweet, I'll take a look after the ops meeting, thanks! [19:02:26] np [19:03:55] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:05:26] (03CR) 10Ori.livneh: "(Also, this needs to be tested on Precise -- I only confirmed that it worked on graphite1001)" [puppet] - 10https://gerrit.wikimedia.org/r/188069 (owner: 10Ori.livneh) [19:08:59] (03CR) 10Aude: [C: 032] Enable usage tracking on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188063 (owner: 10Aude) [19:10:14] (03Merged) 10jenkins-bot: Enable usage tracking on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188063 (owner: 10Aude) [19:11:10] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable usage tracking on Wikidata (duration: 00m 07s) [19:11:16] Logged the message, Master [19:11:30] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10JKrauska) Ubuntu 14.04 would be fine or Current Debian if that's the best practice today. The Sandbox LAN would be a nice option. (there's no reason this machine should need to talk to other node... 
[19:12:18] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10JKrauska) Can someone also identify the current disk layout so we can briefly talk about RAID? It would be nice to isolate the OS drives/partition from the DATA drives/partition. [19:15:02] 3Scrum-of-Scrums, operations, RESTBase, hardware-requests: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1008957 (10GWicke) @robh, thanks for the update! It looks like we are still on track for mid-February deploy, but it's getting tighter with about a week left for racking & node bring... [19:24:00] (03PS3) 10BBlack: nginx ssl tuning: CPU binding, -accept_mutex [puppet] - 10https://gerrit.wikimedia.org/r/188056 [19:25:25] (03CR) 10BBlack: [C: 032] nginx ssl tuning: CPU binding, -accept_mutex [puppet] - 10https://gerrit.wikimedia.org/r/188056 (owner: 10BBlack) [19:25:56] (03CR) 10Dzahn: [C: 031] Add Dev namespace on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187278 (https://phabricator.wikimedia.org/T369) (owner: 10Spage) [19:27:34] (03CR) 10Gage: "Unless or until Mark nominates someone else, you can consider me the Ops owner of Logstash." [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [19:30:42] anomie, bd808, yuvipanda: so are we ready to merge this? i'm confused by the cherry picking. https://gerrit.wikimedia.org/r/#/c/173336/ [19:30:50] jgage: we are! [19:30:55] cool ok [19:31:13] (03PS15) 10Gage: Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [19:31:23] jgage: The cherry picking was to test that it actually worked right (which it did) and to allow testing of the extension that uses the data it provides. 
[19:31:36] gotcha, ok [19:32:12] (03CR) 10Gage: [C: 032] Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [19:37:35] 3RESTBase, operations: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1009082 (10GWicke) @akosiaris, @fgiunchedi: It would be great if we could tackle this before the planned deploy mid-February. [19:38:08] 3RESTBase, operations: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1009085 (10GWicke) [19:38:12] 3Scrum-of-Scrums, operations, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1009084 (10GWicke) [19:51:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [19:53:08] that's just prod, nothing to worry [19:53:22] we don't pay attention to that [19:57:06] :p [19:58:50] <_joe_> ori: it was a spike, I checked btw [19:58:56] <_joe_> even if in a meeting [19:59:08] <_joe_> you know, some of us do care. Not everybody in fact [20:00:13] <_joe_> and such spikes are usually some connectivity flicking, not an appserver failure [20:03:14] 3operations: plan workflow for blocked on ops patches - https://phabricator.wikimedia.org/T88315#1009248 (10RobH) 3NEW a:3yuvipanda [20:05:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:09:05] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) It seems there is already a sandbox vlan setup in row A codfw, so we can proceed with this. Joel: I'll have to have papaul download and install an ISO image off the web. We have not yet pu... 
[20:09:17] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) a:5RobH>3JKrauska [20:26:14] (03PS1) 10Andrew Bogott: Modest changes in the wikitech config for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/188111 [20:26:37] (03PS1) 10BBlack: Autodetect RSS patterns in interface-rps [puppet] - 10https://gerrit.wikimedia.org/r/188112 [20:26:55] (03CR) 10jenkins-bot: [V: 04-1] Modest changes in the wikitech config for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/188111 (owner: 10Andrew Bogott) [20:27:20] (03CR) 10jenkins-bot: [V: 04-1] Autodetect RSS patterns in interface-rps [puppet] - 10https://gerrit.wikimedia.org/r/188112 (owner: 10BBlack) [20:28:05] (03PS2) 10BBlack: Autodetect RSS patterns in interface-rps [puppet] - 10https://gerrit.wikimedia.org/r/188112 [20:28:43] (03CR) 10jenkins-bot: [V: 04-1] Autodetect RSS patterns in interface-rps [puppet] - 10https://gerrit.wikimedia.org/r/188112 (owner: 10BBlack) [20:28:51] (03PS3) 10BBlack: Autodetect RSS patterns in interface-rps [puppet] - 10https://gerrit.wikimedia.org/r/188112 [20:28:53] (03PS2) 10Andrew Bogott: Modest changes in the wikitech config for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/188111 [20:30:13] (03CR) 10Andrew Bogott: [C: 032] Modest changes in the wikitech config for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/188111 (owner: 10Andrew Bogott) [20:30:29] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10JKrauska) FWIW: I like to use Unetbootin for creating USB bootable install 'cds'. http://unetbootin.sourceforge.net/ I would like a 300G OS root / partition, 8G SWAP, and the rest in /DATA. (all... [20:31:03] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10JKrauska) a:5JKrauska>3RobH [20:31:53] <_joe_> bblack: did I ever tell you how nice is that rps script? 
It's pretty awesome. We should write some tech blog article about these optimizations [20:32:48] <_joe_> I think a lot of people would benefit from the kind of understanding of those problems you have. Me included :) [20:33:03] :) [20:33:45] I'm going to try spreading the RPS love to the new jessie cache boxes too, I think it will play well with nginx processes pinned to CPUs are in: https://gerrit.wikimedia.org/r/#/c/188056/3/templates/nginx/nginx.conf.erb [20:37:25] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) Joel: You want to software raid10 the 4 disks? That will only give you 6TB of RAW disk when complete. [20:41:05] (03PS4) 10BBlack: Autodetect RSS patterns in interface-rps [puppet] - 10https://gerrit.wikimedia.org/r/188112 [20:43:56] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:45:10] (03CR) 10Hashar: "Thanks Brandon!" [puppet] - 10https://gerrit.wikimedia.org/r/187331 (owner: 10BBlack) [20:45:38] 3operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1009375 (10Dzahn) >>! In T74072#1008913, @yuvipanda wrote: > what exactly is ms1001.wikimedia.org anyway? ms deprecated - media storage [1] ms1001 is a nfs server of dumps and... [20:45:48] (03PS1) 10RobH: setting server procyon dns [dns] - 10https://gerrit.wikimedia.org/r/188119 [20:46:24] (03CR) 10RobH: [C: 032] setting rbf2002-2002 dns [dns] - 10https://gerrit.wikimedia.org/r/188092 (owner: 10RobH) [20:46:28] (03CR) 10Hashar: "Thanks! 
I never bothered to did it since I lack access on the prod private repo :-}" [puppet] - 10https://gerrit.wikimedia.org/r/186515 (https://phabricator.wikimedia.org/T84731) (owner: 10Dzahn) [20:46:44] (03CR) 10RobH: [C: 032] setting server procyon dns [dns] - 10https://gerrit.wikimedia.org/r/188119 (owner: 10RobH) [20:49:05] 3operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) a:5RobH>3Papaul Papaul, Please download and create an Ubuntu 14.04 server install image. This system will be used by the corporate IT department, and thus will be in our sandbox vlan.... [20:49:26] 3ops-codfw, operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (10RobH) p:5High>3Normal [20:52:58] Status: Up | Wikimedia Platform operations, serious stuff | Log: http://bit.ly/wikisal | Channel logs: http://ur1.ca/edq22 | MediaWiki error counts: http://ur1.ca/i4fdw | On Ops duty: A somewhat over-extended andrewbogott | On Product duty: James_F [20:53:01] oops [21:07:39] 3Phabricator, operations: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#1009432 (10chasemp) [21:08:49] 3Phabricator, operations: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#995092 (10chasemp) @RobH, yes? Anyone have objections to adding @emailbot to #operations so it can relay comments to private issues? 
[21:16:43] !log deployed parsoid version e3c9ae99 [21:16:46] Logged the message, Master [21:18:09] (03PS2) 10Gergő Tisza: Add -labs settings for Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181358 (https://phabricator.wikimedia.org/T85049) [21:18:16] PROBLEM - HTTP on silver is CRITICAL: Connection refused [21:24:16] 3Parsoid, Parsoid-Team, operations: Multiple Parsoid crashes due to "Parse Error" on Russian Wikinews (WNRU) - https://phabricator.wikimedia.org/T88100#1009469 (10Arlolra) 5Open>3Resolved [21:29:46] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures [21:31:14] (03PS1) 10Dzahn: add network variables for dumps rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/188188 [21:34:36] (03PS2) 10Dzahn: add network variables for dumps rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/188188 [21:40:24] (03PS3) 10Dzahn: add network variables for dumps rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/188188 [21:45:03] (03PS1) 10Dzahn: add ferm service for rsyncd to dataset role [puppet] - 10https://gerrit.wikimedia.org/r/188204 [21:53:10] (03PS2) 10Dzahn: add ferm service for rsyncd to dumps role [puppet] - 10https://gerrit.wikimedia.org/r/188204 [21:54:54] 3operations, ops-codfw, hardware-requests: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#1009642 (10RobH) Tracking Dell quote on https://rt.wikimedia.org/Ticket/Display.html?id=9172 [22:05:46] (03PS1) 10Dzahn: fix puppet compiler warnings [puppet] - 10https://gerrit.wikimedia.org/r/188206 [22:10:06] (03PS2) 10Dzahn: fix another 28 puppet compiler warnings [puppet] - 10https://gerrit.wikimedia.org/r/188206 [22:10:13] <_joe_> compiler? 
[22:10:17] <_joe_> linter maybe [22:10:25] <_joe_> (yeah can't sleep) [22:10:43] (03PS3) 10Dzahn: fix another 28 puppet linter warnings [puppet] - 10https://gerrit.wikimedia.org/r/188206 [22:11:05] <_joe_> :) [22:12:12] (03PS1) 10Ori.livneh: Add XAnalytics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188211 [22:16:35] we have ./role/codfw/swift/proxy.yaml . does that belong in hiera? [22:16:51] also, do we want ./role/codfw/ for other stuff? [22:18:41] !log ori Synchronized php-1.25wmf15/extensions/XAnalytics: (no message) (duration: 00m 05s) [22:18:46] !log ori Synchronized php-1.25wmf14/extensions/XAnalytics: (no message) (duration: 00m 05s) [22:18:47] Logged the message, Master [22:18:47] (03CR) 10Ori.livneh: [C: 032] Add XAnalytics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188211 (owner: 10Ori.livneh) [22:18:52] Logged the message, Master [22:18:52] JohnFLewis: https://phabricator.wikimedia.org/T88041 [22:18:58] 3ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1009756 (10RobH) rbf2001 is installed and ready for service implementation rbf2002 is having install issues detecting disks, and I need to further troubleshoot the installation. [22:19:10] <_joe_> mutante: hold on for now, I need to merge a patch tomorrow [22:19:19] _joe_: ok [22:19:36] <_joe_> that's probably broken atm [22:20:29] mutante: Siko's email? (Because I'm too lazy to search it :p) [22:20:56] JohnFLewis: there were 2. 
this is about creating a new list [22:21:07] JohnFLewis: the other one was about stats and is done [22:21:29] A completed ticket relating to mailman :o [22:21:49] i guess we can add matanya, since it's about lists AND grantmaking at the same time , heh [22:22:54] JohnFLewis: well, ./list_members | wc -l :) [22:23:45] also, the sheer number of people on a list, why does it matter if it's public [22:23:55] it's not like the list of addresses [22:26:03] (03CR) 10Ori.livneh: [V: 032] Add XAnalytics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188211 (owner: 10Ori.livneh) [22:26:37] (03CR) 10RobH: [C: 032] "3 business day wait has passed without objection, access granted." [puppet] - 10https://gerrit.wikimedia.org/r/187271 (https://phabricator.wikimedia.org/T87816) (owner: 10Mforns) [22:26:50] (03PS1) 10Ori.livneh: Load XAnalytics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188216 [22:27:03] (03CR) 10Ori.livneh: [C: 032 V: 032] Load XAnalytics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188216 (owner: 10Ori.livneh) [22:27:14] 3Ops-Access-Requests: Requesting access to EventLogging cluster for mforns - https://phabricator.wikimedia.org/T87816#1009793 (10RobH) 5Open>3Resolved a:3RobH This has had three business days for objection since it was submitted, and no objections were raised. As such, the linked patchset has been pushed... 
[22:27:25] 3operations, Ops-Access-Requests: Requesting access to EventLogging cluster for mforns - https://phabricator.wikimedia.org/T87816#1009796 (10RobH) [22:31:46] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#1009801 (10RobH) [22:31:47] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1009802 (10RobH) [22:31:50] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#994821 (10RobH) [22:31:51] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (10RobH) [22:32:38] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (10RobH) [22:32:39] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#994821 (10RobH) [22:32:42] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1009809 (10greg) @zeljkofilipin ping :) [22:33:16] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (10RobH) [22:33:36] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#1009823 (10RobH) 5Open>3Resolved As the three days has long since passed, I'm merging https://gerrit.wikimedia.org/r/#/c/183062 live... 
[22:33:59] (03PS3) 10RobH: Add Dan Duvall to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/183062 (https://phabricator.wikimedia.org/T85936) (owner: 10John F. Lewis) [22:34:27] ori, seen https://github.com/facebook/hhvm/issues/4744 ? [22:34:30] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#1009827 (10RobH) 5Resolved>3Open [22:34:43] 3Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (10RobH) [22:34:46] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#994821 (10RobH) [22:35:02] MaxSem: where do we sleep()? [22:35:09] (03PS1) 10Reedy: Undeploy Solarium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188217 [22:35:19] ori, see below, it breaks all syscalls [22:36:12] MaxSem: can you do me a huge favor and file a phab task for this? [22:36:14] i'm doing 4 things atm [22:36:21] (03CR) 10RobH: [C: 032] Add Dan Duvall to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/183062 (https://phabricator.wikimedia.org/T85936) (owner: 10John F. Lewis) [22:36:22] ok [22:36:34] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1009836 (10RobH) [22:38:11] 3Ops-Access-Requests: Access to stat1003 (statistics-users) for Ananth Ramakrishnan - https://phabricator.wikimedia.org/T85828#1009857 (10RobH) Well, this was created on January 3rd, and folks now have 3 working days to raise objection. As Toby has approved as manager, and no one has objected, I'll be merging t... [22:38:32] (03CR) 10Nemo bis: [C: 031] "There are no solr servers left, AFAIK; and no intention to add any." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/188217 (owner: 10Reedy) [22:38:37] (03PS4) 10Dzahn: fix another 28 puppet linter warnings [puppet] - 10https://gerrit.wikimedia.org/r/188206 (https://phabricator.wikimedia.org/T87132) [22:39:58] (03PS1) 10RobH: adding ananthrk to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/188220 [22:41:20] (03CR) 10RobH: [C: 032] adding ananthrk to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/188220 (owner: 10RobH) [22:42:51] 3Ops-Access-Requests: Access to stat1003 (statistics-users) for Ananth Ramakrishnan - https://phabricator.wikimedia.org/T85828#1009886 (10RobH) 5Open>3Resolved Ananth, change https://gerrit.wikimedia.org/r/#/c/188220/ has been pushed live to stat1003, you should be good to access it. [22:43:04] 3Ops-Access-Requests: Provide Damon Sicore with access to WMF-NDA tickets - https://phabricator.wikimedia.org/T88350#1009888 (10mark) 3NEW [22:46:00] 3operations: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1009905 (10RobH) a:3Cmjohnson most of the time this can be fixed by a hardware power removal. As such, I'm assigning the onsite project tag to this, as well as assigning to Chris to look at next time he is onsite. (I imagine he e... [22:46:10] 3operations, ops-eqiad: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1009911 (10RobH) [22:47:35] 3operations, ops-eqiad: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#1009913 (10RobH) I'll keep this ticket assigned to me, but put in sub-tasks for the clearing of data and such. I'm not sure if we want to keep these at all, but they certainly need the disks wiped. [22:48:47] 3operations, ops-eqiad: wipe lsearch machines - https://phabricator.wikimedia.org/T88352#1009917 (10RobH) 3NEW a:3Cmjohnson [22:51:05] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[22:52:08] 3Ops-Access-Requests: Provide Damon Sicore with access to WMF-NDA tickets - https://phabricator.wikimedia.org/T88350#1009944 (10mark) [22:53:54] 3Ops-Access-Requests: Provide Damon Sicore with access to WMF-NDA tickets - https://phabricator.wikimedia.org/T88350#1009947 (10Dzahn) Added @damons to the Phabricator WMF-NDA group. [22:55:13] 3Ops-Access-Requests: Provide Damon Sicore with access to WMF-NDA tickets - https://phabricator.wikimedia.org/T88350#1009951 (10RobH) 5Open>3Resolved a:3RobH In phab they have to be added to the project: https://phabricator.wikimedia.org/project/members/61/ I've added him (user damons) [22:55:30] (03PS1) 10Ori.livneh: vbench: log total time as well [puppet] - 10https://gerrit.wikimedia.org/r/188223 [22:55:59] (03PS2) 10Ori.livneh: vbench: log total time as well [puppet] - 10https://gerrit.wikimedia.org/r/188223 [22:56:09] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: log total time as well [puppet] - 10https://gerrit.wikimedia.org/r/188223 (owner: 10Ori.livneh) [22:58:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:58:46] 3operations, ops-eqiad: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#1009955 (10Dzahn) a:5Dzahn>3None [23:04:07] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: Puppet has 1 failures [23:05:26] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures [23:07:16] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:09:35] RECOVERY - HTTP on silver is OK: HTTP OK: HTTP/1.1 302 Found - 418 bytes in 0.023 second response time [23:10:55] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:12:07] 3Ops-Access-Requests: Add James F. 
to pager duty - https://phabricator.wikimedia.org/T88153#1010001 (10RobH) Well, right now there is a parsoid group that Roan is already in, so I am simply adding you. I assume you want SMS to your cell (# pulled from office wiki). I need to know your provider so I can have th... [23:15:53] (03PS1) 10RobH: Adding JamesF to parsoid paging [puppet] - 10https://gerrit.wikimedia.org/r/188226 [23:16:31] bleh spaces are evil. [23:16:40] (03PS2) 10RobH: Adding JamesF to parsoid paging [puppet] - 10https://gerrit.wikimedia.org/r/188226 [23:17:26] 3Ops-Access-Requests: Add James F. to pager duty - https://phabricator.wikimedia.org/T88153#1010013 (10RobH) irc update: got the mobile info from jamesf in channel, updated and pushed live for parsoid paging. Now, I suppose we need to determine if we need to have a special paging group for citoid. [23:17:47] (03CR) 10RobH: [C: 032] Adding JamesF to parsoid paging [puppet] - 10https://gerrit.wikimedia.org/r/188226 (owner: 10RobH) [23:19:32] robh: just create a special group for everything, and name it - literally - everything :p [23:19:44] (03PS1) 10Dzahn: add rbf2001/2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/188227 (https://phabricator.wikimedia.org/T86887) [23:19:51] isnt that the ops page group then? [23:21:02] robh: but it needs a more, interesting name. 
Like 'everything', 'ops' is too boring [23:21:20] (03PS2) 10Dzahn: add rbf2001/2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/188227 (https://phabricator.wikimedia.org/T86887) [23:22:31] 3operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010014 (10Catrope) [23:22:42] 3operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010014 (10Catrope) [23:22:54] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [23:23:50] (03PS3) 10Dzahn: add rbf2001/2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/188227 (https://phabricator.wikimedia.org/T86887) [23:26:11] (03PS1) 10RobH: remove roan from main paging contact group [puppet] - 10https://gerrit.wikimedia.org/r/188228 [23:26:59] (03CR) 10RobH: [C: 032] remove roan from main paging contact group [puppet] - 10https://gerrit.wikimedia.org/r/188228 (owner: 10RobH) [23:28:01] 3operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010065 (10Reedy) I guess this needs doing manually? https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/files/apache/sites/remnant.conf#L262-L263 ``` RewriteEngine O... [23:28:32] (03PS1) 10Andrew Bogott: Include a database on silver, for wikitech mediawiki. [puppet] - 10https://gerrit.wikimedia.org/r/188230 [23:32:02] (03PS2) 10Andrew Bogott: Include a database on silver, for wikitech mediawiki. [puppet] - 10https://gerrit.wikimedia.org/r/188230 [23:33:06] (03CR) 10Andrew Bogott: [C: 032] Include a database on silver, for wikitech mediawiki. [puppet] - 10https://gerrit.wikimedia.org/r/188230 (owner: 10Andrew Bogott) [23:36:44] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail [23:37:10] (03PS1) 10Andrew Bogott: Revert "Include a database on silver, for wikitech mediawiki." 
[puppet] - 10https://gerrit.wikimedia.org/r/188232 [23:37:36] (03PS1) 10RobH: adding citoid monitoring to icinga [puppet] - 10https://gerrit.wikimedia.org/r/188233 [23:38:00] 3operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010080 (10Chmarkine) Are radium and wikitech-static still in use? Is it needed to disable SSL3 on these two servers? [23:38:30] (03CR) 10RobH: [C: 031] "So since this generates pages, and changes icinga's monitoring list, I'd like more eyes on this than simply my own." [puppet] - 10https://gerrit.wikimedia.org/r/188233 (owner: 10RobH) [23:38:50] (03CR) 10Dzahn: "we already have that further down" [puppet] - 10https://gerrit.wikimedia.org/r/188233 (owner: 10RobH) [23:39:17] mutante: ooohhhhh im stupidddd [23:39:26] (03Abandoned) 10RobH: adding citoid monitoring to icinga [puppet] - 10https://gerrit.wikimedia.org/r/188233 (owner: 10RobH) [23:42:14] (03PS1) 10RobH: setting citoid paging group to admins & parsoid [puppet] - 10https://gerrit.wikimedia.org/r/188236 [23:42:15] well, good to see i did it right even though it was already mostly done. [23:43:03] mutante: care to give https://gerrit.wikimedia.org/r/#/c/188236/1 a review? =] [23:45:16] (03CR) 10Dzahn: [C: 031] "lgtm, just like for parsoid above on line 14 (even though it would be cleaner to have a citoid group)" [puppet] - 10https://gerrit.wikimedia.org/r/188236 (owner: 10RobH) [23:45:32] thank you =] [23:46:25] (03CR) 10RobH: [C: 032] "I'm going to go ahead and merge this now, and then manually fire puppet on neon to babysit." 
[puppet] - 10https://gerrit.wikimedia.org/r/188236 (owner: 10RobH) [23:48:27] (03PS4) 10Dzahn: add rbf2001/2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/188227 (https://phabricator.wikimedia.org/T86887) [23:50:38] (03PS5) 10Dzahn: add rbf2001/2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/188227 (https://phabricator.wikimedia.org/T86887) [23:52:34] 3Ops-Access-Requests: Add James F. to pager duty - https://phabricator.wikimedia.org/T88153#1010112 (10RobH) 5Open>3Resolved a:3RobH Ok, I added in the contact groups https://gerrit.wikimedia.org/r/#/c/188236/ so now roan and james get paged for parsoid and citoid ONLY and not other outages. (IRC update f... [23:52:38] (03CR) 10Dzahn: [C: 032] "adding without role for the moment being, to get puppet up and running, allow us to login with normal keys etc.." [puppet] - 10https://gerrit.wikimedia.org/r/188227 (https://phabricator.wikimedia.org/T86887) (owner: 10Dzahn) [23:59:26] !log signing puppet cert for rbf2001, PXE booting rbf2002 [23:59:32] Logged the message, Master