[00:00:17] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:00:29] Ops-Access-Requests, operations, Phabricator, Release-Engineering, Patch-For-Review: Chad H. needs access to iridium (Phabricator host) to manage repos - https://phabricator.wikimedia.org/T92564#1115397 (Dzahn) gotcha, then to the meeting Etherpad it goes
[00:01:23] Is ops aware of "Mar 13 00:01:14 mw1244: #012Warning: Failed connecting to redis server at 10.64.0.163: Connection timed out"?
[00:01:43] also happening with 10.64.0.162
[00:07:07] !log krenair Synchronized php-1.25wmf21/extensions/TemplateData: https://gerrit.wikimedia.org/r/#/c/196439/ (duration: 00m 12s)
[00:07:08] James_F, ^
[00:07:11] Logged the message, Master
[00:07:14] operations: Delete stat1002:/a/squid/archive/edits-geocoded - https://phabricator.wikimedia.org/T92332#1115400 (ezachte) I did create these files for ad hoc analysis. I just deleted them. I can't delete the directory.
[00:08:57] seems good to me James_F
[00:09:00] operations: Delete stat1002:/a/squid/archive/sampled-geocoded - https://phabricator.wikimedia.org/T92334#1115404 (ezachte) I did create part of these files, for ad hoc analysis. Some are from stephan who no longer works for us. I think it is safe to assume no-one else uses them.
[00:09:07] Krenair: Yeah, go for it.
[00:10:00] Krenair: where did you pull the error from? redis box looks ok, and mw1244 can reach the first one at least
[00:10:07] not sure what's up there if anything
[00:10:18] (CR) Dzahn: nova: lint compute.pp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/195535 (owner: Matanya)
[00:10:53] chasemp, fluorine.eqiad.wmnet:/a/mw-log/hhvm.log
[00:10:53] I watch fatalmonitor there during deployments
[00:12:13] (CR) Dzahn: [C: -2] nova: lint compute.pp [puppet] - https://gerrit.wikimedia.org/r/195535 (owner: Matanya)
[00:12:58] bmansurov, hey
[00:13:03] hello
[00:13:09] operations: Delete stat1002:/a/squid/archive/mobile-geocoded - https://phabricator.wikimedia.org/T92333#1115406 (ezachte) I used these long ago as example for similar ad hoc tests. (sampled/edit geo) BTW I have a script to generate such files again when needed, e.g. for grep. Wikistats squid log processing...
[00:13:28] bmansurov, are you able to make a submodule update for https://gerrit.wikimedia.org/r/#/c/196448/ ?
[00:13:54] Krenair: it's my first time doing this, so I'm not quite sure what you mean
[00:14:00] ok, that's fine
[00:17:17] (CR) Dzahn: "how about i do this first: https://gerrit.wikimedia.org/r/#/c/196454/" [puppet] - https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[00:17:53] !log krenair Synchronized php-1.25wmf21/resources/src/mediawiki.ui/components/inputs.less: https://gerrit.wikimedia.org/r/#/c/196308/ (duration: 00m 07s)
[00:17:54] James_F, ^
[00:17:58] Logged the message, Master
[00:18:14] Krenair: Yup, https://www.mediawiki.org/wiki/Special:UserLogin?debug=true looks good.
[00:18:16] PROBLEM - Redis on rbf1001 is CRITICAL: Connection refused
[00:18:37] PROBLEM - Redis on rbf1002 is CRITICAL: Connection refused
[00:19:08] chasemp, ^
[00:20:34] !log starting redis on rbf1002
[00:20:36] RECOVERY - Redis on rbf1001 is OK: TCP OK - 0.009 second response time on port 6379
[00:20:38] Logged the message, Master
[00:20:57] RECOVERY - Redis on rbf1002 is OK: TCP OK - 0.006 second response time on port 6379
[00:21:05] i restarted the other one, but out now
[00:21:27] !log krenair Synchronized php-1.25wmf20/includes/api: https://gerrit.wikimedia.org/r/#/c/196317/ (duration: 00m 12s)
[00:21:32] Logged the message, Master
[00:21:40] mutante: thanks I think we were there at same time :)
[00:21:48] k
[00:22:14] chasemp: :) yep
[00:25:09] (PS1) EBernhardson: Update flow pages on te.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/196496
[00:25:13] (CR) jenkins-bot: [V: -1] Update flow pages on te.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/196496 (owner: EBernhardson)
[00:27:04] (PS2) EBernhardson: Update flow pages on te.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/196496
[00:28:32] * gwicke assumes ebernhardson is still deploying
[00:28:54] Krenair: could you ping me when you are done?
[00:28:59] yes
[00:29:03] thx!
[00:29:24] (PS1) Gage: IPsec: big off switch [puppet] - https://gerrit.wikimedia.org/r/196498
[00:29:47] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[00:29:47] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[00:30:21] Krenair: well, if you're not done can you ship my patch too :) just config change
[00:30:29] https://gerrit.wikimedia.org/r/196496
[00:30:37] should work this time around
[00:30:42] I still have 2 patches to sync and -30 minutes to do it <_<
[00:31:00] but ok
[00:31:54] !log krenair Synchronized php-1.25wmf21/includes/api: https://gerrit.wikimedia.org/r/#/c/196313/ (duration: 00m 08s)
[00:31:59] Logged the message, Master
[00:33:33] ok
[00:35:24] looks like the 5xx spike fixed itself
[00:36:53] (CR) Alex Monk: [C: 2] Update flow pages on te.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/196496 (owner: EBernhardson)
[00:37:03] (Merged) jenkins-bot: Update flow pages on te.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/196496 (owner: EBernhardson)
[00:37:44] operations, Continuous-Integration, Labs, OOjs, Wikimedia-Labs-Infrastructure: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1115471 (scfc) I don't think so because that was merged earlier. But on March 6th https...
[00:38:54] !log krenair Synchronized php-1.25wmf21/extensions/MobileFrontend/javascripts/modules/mediaViewer/ImageOverlay.js: https://gerrit.wikimedia.org/r/#/c/196497/ (duration: 00m 09s)
[00:38:57] bmansurov, ^
[00:39:50] Krenair: should I test here: https://test.m.wikipedia.org/ ?
[00:40:01] yes
[00:40:10] (Abandoned) GWicke: WIP: Merge Iaa3bbf07b6053e139dc [puppet] - https://gerrit.wikimedia.org/r/196128 (owner: GWicke)
[00:40:28] Krenair: working! thanks
[00:41:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:42:07] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:42:32] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/196496 (duration: 00m 09s)
[00:42:36] Logged the message, Master
[00:42:41] ebernhardson, ^
[00:42:54] Krenair: worked this time :) thanks a bunch
[00:43:19] gwicke, hey
[00:43:48] * gwicke wakes up
[00:44:01] you are all done?
[00:44:04] yes
[00:44:19] (CR) GWicke: [C: 2] Enable RESTBase updates on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/196472 (owner: GWicke)
[00:44:22] we had like 11 different patches. :/
[00:44:25] ok, going for it
[00:44:26] (Merged) jenkins-bot: Enable RESTBase updates on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/196472 (owner: GWicke)
[00:44:55] !log gwicke Synchronized wmf-config/InitialiseSettings.php: Enable RESTBase updates (duration: 00m 09s)
[00:45:01] Logged the message, Master
[00:45:04] and done.
[00:47:06] Krenair: just saw one notice in the VE extension about Undefined index: x-parsoid-performance
[00:47:21] ApiVisualEditor.php on line 100
[00:47:31] didn't we fix that already?
[00:47:45] it says it's in wmf21
[00:48:20] ah we did x-cache before
[00:48:52] Krenair: is there anything else I need to do? Or is my patch good to go?
[00:48:56] don't see it in wmf20
[00:49:07] bmansurov, your patch was done
[00:49:12] Krenair: thanks
[00:49:24] gwicke, it only happens when you get a cache hit
[00:49:58] restbase has actual cache hits but parsoid didn't?
[00:50:56] gwicke, but yeah can reproduce it
[00:51:03] Krenair: I think the main reason is that restbase doesn't preserve that header
[00:51:13] while Varnish did
[00:51:20] ah
[00:51:50] VE currently only uses RB on phase0 wikis
[00:52:19] should fix the notice before rolling out further
[00:52:49] yeah, am making a patch
[00:53:02] is that header still used?
[00:53:48] I wonder if all the statsd instrumentation in parsoid and restbase has made this less interesting
[00:54:08] gwicke, https://gerrit.wikimedia.org/r/196503
[00:55:23] lgtm, +2ed
[00:55:48] Krenair: worth checking if there is still anything using it though
[00:56:09] it'll become pretty useless soon as we ramp up restbase usage
[00:56:34] VE client uses it for... something
[00:57:02] Krenair: gwicke do you know, do we store non-ephemeral data in redis?
[00:57:06] like what happens if we lose redis data
[00:57:12] yeah it goes into performance.system.* metrics, gwicke
[00:57:21] because I think AOF (the persistence) is hanging so hard it is dropping clients
[00:57:28] and it's happening quite often
[00:57:28] chasemp: we do store jobs in there
[00:57:35] chasemp, user sessions
[00:57:48] what kind of jobs?
[00:57:49] log messages for logstash
[00:57:54] mediawiki job queue
[00:58:19] these two redis boxes are just constantly churning on Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis
[00:58:24] and it's hanging things / slowing them down
[00:58:32] but turning it off is not without possible issue
[00:58:32] I'm not too familiar with which redis cluster does what
[00:58:50] pretty sure we have separate ones for jobs
[00:58:52] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1115512 (Eevans) a:Eevans
[00:59:02] chasemp: which boxes is this on?
[00:59:17] rbf1001/rbf1002
[00:59:20] rbf100[12]
[01:01:45] mediawiki-config says that these are BloomCacheRedis
[01:02:01] what does that translate to for me? :)
[01:02:09] good question ;)
[01:02:17] bloom filters for something
[01:02:36] wgBloomFilterStores
[01:03:10] we are stashing one big persistent set or something?
[01:03:25] "Persistent bloom filter used to avoid expensive lookups"
[01:03:49] (thanks for looking at this) ok lookups of?
[01:04:31] whether to display a deletion log or not
[01:04:44] it's not super duper crucial
[01:04:59] ok so if this fails to connect (as it's often doing) we just do it the hard way?
[01:05:14] I believe so, yes
[01:05:18] can you point me to where this code is?
[01:05:31] https://phabricator.wikimedia.org/diffusion/MW/ ?
[01:06:59] includes/logging/LogEventsList.php
[01:07:06] https://github.com/wikimedia/mediawiki/blob/master/includes/logging/LogEntry.php#L537
[01:07:08] and includes/logging/LogEntry.php
[01:07:15] https://github.com/wikimedia/mediawiki/blob/master/includes/logging/LogEventsList.php#L551
[01:07:23] ;)
[01:07:28] ok thanks :D
[01:07:31] looks like this is fairly new
[01:07:31] pretty similar latency
[01:07:31] https://phabricator.wikimedia.org/rOPUP457d58535e9b3e49e0eb7a91c42b76316f84c44f
[01:07:53] or maybe that's just a reorg
[01:08:15] here it is https://phabricator.wikimedia.org/rOPUP69322ed601a0634ef694d922b7e17b5cadb086ca
[01:08:48] operations, RESTBase, RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1115527 (Eevans) NEW a:Eevans
[01:09:15] chasemp: looks harmless to me
[01:09:32] yeah I think no emergency but need to let them know their stuff is failing hard I imagine
[01:09:42] *nod*
[01:09:42] which is good but now my dinner is cold :)
[01:11:21] it's good to see this research chain in action, definitely a good idea to add notes in site.pp about what a set of nodes is responsible for
[01:15:44] agreed
[01:18:21] operations: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1115548 (chasemp) NEW a:aaron
[01:18:31] https://phabricator.wikimedia.org/T92591
[01:19:13] thanks chase
[01:20:23] yup now I'm off thanks gents and have a good night
[01:56:31] (CR) Jforrester: [C: 1] "Now good to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/193762 (owner: Jforrester)
[02:10:34] operations, Continuous-Integration, Labs, OOjs, Wikimedia-Labs-Infrastructure: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1115608 (coren) The only net effect the change can make is that //iff// the fqdn has exa...
[02:28:25] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 07m 08s)
[02:28:35] Logged the message, Master
[02:33:16] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-13 02:32:12+00:00
[02:33:21] Logged the message, Master
[02:43:48] bd808: ^ looks like it is working?
[02:48:30] greg-g: I always thought that I think. Need to check the logs on tin and fluorine to be sure
[02:48:40] * bd808 is not on a laptop with a prod key
[02:48:55] I'll check first thing in the morning
[02:53:25] sync-common: 78% (ok: 209; fail: 0; left: 58
[02:54:02] !log l10nupdate Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 07m 02s)
[02:54:10] Logged the message, Master
[02:58:51] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-13 02:57:48+00:00
[02:58:58] Logged the message, Master
[03:47:25] operations, Incident-20150205-SiteOutage, MediaWiki-Core-Team, Wikimedia-Logstash, Patch-For-Review: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1019000 (bd808)
[03:58:37] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning.
[03:59:57] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[04:00:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[04:02:07] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
[04:11:16] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:11:17] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:13:57] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:24:37] (PS1) BBlack: depool cp1048 for reinstall [puppet] - https://gerrit.wikimedia.org/r/196522
[04:24:54] (CR) BBlack: [C: 2 V: 2] depool cp1048 for reinstall [puppet] - https://gerrit.wikimedia.org/r/196522 (owner: BBlack)
[04:30:57] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[04:37:39] (CR) BBlack: "Personally, I tend to think we'd want the blocking version of the command, so that we know when the action is complete by when the script exi" [puppet] - https://gerrit.wikimedia.org/r/196498 (owner: Gage)
[04:41:00] (CR) BBlack: "(I could see the merit in having a nonblock option available though. Perhaps make it a commandline flag like -n?)" [puppet] - https://gerrit.wikimedia.org/r/196498 (owner: Gage)
[04:51:18] operations, Project-Creators: Create #site-incident tag and use it for incident reports - https://phabricator.wikimedia.org/T85889#1115740 (GWicke) stalled>Open We now have several individual incidents on phabricator, but no project / tag to unify them in. Any objections against creating this tag?
[05:00:14] (PS1) BBlack: repool cp1048 [puppet] - https://gerrit.wikimedia.org/r/196523
[05:00:27] (CR) BBlack: [C: 2 V: 2] repool cp1048 [puppet] - https://gerrit.wikimedia.org/r/196523 (owner: BBlack)
[05:10:48] (PS1) BBlack: depool cp3007 for reinstall [puppet] - https://gerrit.wikimedia.org/r/196525
[05:11:09] (CR) BBlack: [C: 2 V: 2] depool cp3007 for reinstall [puppet] - https://gerrit.wikimedia.org/r/196525 (owner: BBlack)
[05:19:20] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:27:39] (PS1) BBlack: remove cp300[679] old pub dns [dns] - https://gerrit.wikimedia.org/r/196526
[05:28:14] (CR) BBlack: [C: 2] remove cp300[679] old pub dns [dns] - https://gerrit.wikimedia.org/r/196526 (owner: BBlack)
[05:30:37] (PS1) BBlack: update hieradata for cache hostnames [puppet] - https://gerrit.wikimedia.org/r/196527
[05:31:29] (CR) BBlack: [C: 2 V: 2] update hieradata for cache hostnames [puppet] - https://gerrit.wikimedia.org/r/196527 (owner: BBlack)
[05:33:34] (PS2) Yuvipanda: dsh: delete empty groups [puppet] - https://gerrit.wikimedia.org/r/196454 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[05:34:00] (CR) Yuvipanda: [C: 2] ":D" [puppet] - https://gerrit.wikimedia.org/r/196454 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[05:34:09] (CR) Yuvipanda: [V: 2] ":D" [puppet] - https://gerrit.wikimedia.org/r/196454 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[05:34:38] (PS4) Yuvipanda: dsh: delete most groups [puppet] - https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn)
[05:34:58] operations, Interdatacenter-IPsec: Implement a big off switch - https://phabricator.wikimedia.org/T88536#1115755 (Gage)
[05:35:10] operations, Interdatacenter-IPsec: Implement a big off switch - https://phabricator.wikimedia.org/T88536#1014239 (Gage) Proposed solution: https://gerrit.wikimedia.org/r/#/c/196498/ This script may be run as a salt command. I haven't explored the Salt API. To clarify the task description: I do not predic...
[05:35:44] operations: Migrate host lists out of cache.pp to reference values in Hiera - https://phabricator.wikimedia.org/T92601#1115767 (Gage) NEW
[05:36:17] operations, Analytics-Cluster, Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1115779 (Gage) NEW
[05:36:20] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[05:36:37] operations, Interdatacenter-IPsec, Monitoring: Monitor IPsec status - https://phabricator.wikimedia.org/T92603#1115786 (Gage) NEW
[05:36:42] operations, Interdatacenter-IPsec: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1115796 (Gage) NEW
[05:37:02] operations: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#955555 (Gage) If implemented, this task will be completed after T92604. Once we are confident in traffic flows over IPsec we may wish to use firewall rules to explicitly disallow unencrypted traffic.
[05:37:40] bedtime :D
[05:43:50] nite :)
[05:44:26] operations: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#1115812 (BBlack) >>! In T85823#974563, @Gage wrote: > Given that Varnish nodes have only private IPs there is no explicit need for this. Proper configuration of security associations between nodes in cache colos and main colos wo...
[05:44:28] (CR) GWicke: Don't include a node in its own seeds (2 comments) [puppet/cassandra] - https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: GWicke)
[05:45:41] (PS1) BBlack: update varnish disk stuff for cp104[34] (misc) [puppet] - https://gerrit.wikimedia.org/r/196528
[05:46:16] (CR) BBlack: [C: 2 V: 2] update varnish disk stuff for cp104[34] (misc) [puppet] - https://gerrit.wikimedia.org/r/196528 (owner: BBlack)
[05:49:29] operations: Migrate host lists out of cache.pp to reference values in Hiera - https://phabricator.wikimedia.org/T92601#1115813 (BBlack) I support this goal. It's going to be tricky to pull off in the midst of other active changes there without a hitch, though. Getting puppet-compiler working on all caches a...
[05:50:57] (PS1) BBlack: repool cp3007 [puppet] - https://gerrit.wikimedia.org/r/196529
[05:51:10] (CR) BBlack: [C: 2 V: 2] repool cp3007 [puppet] - https://gerrit.wikimedia.org/r/196529 (owner: BBlack)
[05:56:52] (PS1) BBlack: depool cp104[35] (1/2 misc/parsoid) [puppet] - https://gerrit.wikimedia.org/r/196530
[05:57:07] (CR) BBlack: [C: 2 V: 2] depool cp104[35] (1/2 misc/parsoid) [puppet] - https://gerrit.wikimedia.org/r/196530 (owner: BBlack)
[05:59:59] operations, Parsoid, RESTBase, Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1115814 (GWicke) This doesn't look very accurate. 1) The duplicate generation is temporary until the Parsoid v1 API can be retired, which is approximately ri...
[06:01:25] operations, Parsoid, RESTBase, Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1115818 (GWicke) Open>Invalid a:GWicke I'm going ahead and closing this as invalid, as I think it's mostly based on a misunderstanding. Please reopen...
[06:07:21] (PS1) GWicke: Raise number of RESTBase job runners to four [puppet] - https://gerrit.wikimedia.org/r/196531
[06:08:49] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[06:08:53] ugh
[06:09:23] stupid puppet-merge :P
[06:09:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[06:27:23] (PS1) BBlack: repool cp104[35] [puppet] - https://gerrit.wikimedia.org/r/196535
[06:27:36] (CR) BBlack: [C: 2 V: 2] repool cp104[35] [puppet] - https://gerrit.wikimedia.org/r/196535 (owner: BBlack)
[06:28:03] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:43] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:03] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:49] operations, HTTPS, HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1115834 (BBlack) The new eqiad hardware is fully up and pooled for prod traffic. Still waiting on esams hardware to arrive circa Mar 17. esams deployment will be a little more...
[06:44:10] (PS1) Yuvipanda: puppetmaster: Refactor out the auto key signer [puppet] - https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606)
[06:46:33] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:44] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:47:34] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:10:24] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/3/2: down - Transit: ! NTT {#3475} (service ID 234630) [10Gbps]
[07:17:51] (PS2) Yuvipanda: puppetmaster: Refactor out the auto key signer [puppet] - https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606)
[07:22:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[07:28:33] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:34:43] PROBLEM - Host db2042 is DOWN: PING CRITICAL - Packet loss = 100%
[07:40:14] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Mar 13 07:39:11 UTC 2015 (duration 39m 10s)
[07:40:21] Logged the message, Master
[07:45:24] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:51:31] (PS2) Giuseppe Lavagetto: memcached: add host entry for mc2001 [puppet] - https://gerrit.wikimedia.org/r/196284
[07:51:46] (CR) Giuseppe Lavagetto: [C: 2 V: 2] memcached: add host entry for mc2001 [puppet] - https://gerrit.wikimedia.org/r/196284 (owner: Giuseppe Lavagetto)
[07:58:34] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning.
[08:02:04] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
[08:02:35] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:02:50] _joe_: ^ strontium again
[08:02:55] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:03:17] <_joe_> YuviPanda: ugh sorry my bad
[08:03:34] <_joe_> I forgot the "yes" there
[08:03:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:04:04] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[08:04:17] _joe_: you can type ‘y’ as well now :)
[08:04:24] the yes was bothering me, I made a patch a few weeks ago
[08:45:45] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:46:04] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK: OK: No anomaly detected
[08:54:37] (CR) Hashar: "Thanks for the review, it seems to me an '=' comparison is more appropriate than '-eq' in case the START_DAEMON variable is not set. See m" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/196198 (owner: Hashar)
[08:59:46] (PS3) 20after4: puppetmaster: Refactor out the auto key signer [puppet] - https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606) (owner: Yuvipanda)
[08:59:56] (CR) 20after4: [C: 1] puppetmaster: Refactor out the auto key signer [puppet] - https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606) (owner: Yuvipanda)
[09:00:14] (PS4) Yuvipanda: puppetmaster: Refactor out the auto key signer [puppet] - https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606)
[09:09:23] PROBLEM - configured eth on mc2002 is CRITICAL: Connection refused by host
[09:09:34] PROBLEM - Memcached on mc2002 is CRITICAL: Connection refused
[09:09:34] PROBLEM - Redis on mc2002 is CRITICAL: Connection refused
[09:09:34] PROBLEM - Disk space on mc2002 is CRITICAL: Connection refused by host
[09:09:43] PROBLEM - RAID on mc2002 is CRITICAL: Connection refused by host
[09:09:44] PROBLEM - salt-minion processes on mc2002 is CRITICAL: Connection refused by host
[09:09:54] PROBLEM - dhclient process on mc2002 is CRITICAL: Connection refused by host
[09:10:03] PROBLEM - DPKG on mc2002 is CRITICAL: Connection refused by host
[09:10:04] PROBLEM - puppet last run on mc2002 is CRITICAL: Connection refused by host
[09:10:17] <_joe_> wat?
[09:10:50] <_joe_> mmmh I just rebooted mc2017...
[09:18:47] <_joe_> and mc2001, for some interesting reason it seems mc2001 points to mc2002
[09:21:11] <_joe_> oh snap, that's on me :P
[09:22:17] <_joe_> brb
[09:23:23] PROBLEM - Host mc2002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:25:14] RECOVERY - configured eth on mc2002 is OK: NRPE: Unable to read output
[09:25:24] RECOVERY - Memcached on mc2002 is OK: TCP OK - 0.044 second response time on port 11211
[09:25:24] RECOVERY - Host mc2002 is UP: PING OK - Packet loss = 0%, RTA = 44.56 ms
[09:25:24] RECOVERY - Redis on mc2002 is OK: TCP OK - 0.047 second response time on port 6379
[09:25:24] RECOVERY - Disk space on mc2002 is OK: DISK OK
[09:25:33] RECOVERY - RAID on mc2002 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[09:25:34] RECOVERY - salt-minion processes on mc2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:25:53] RECOVERY - dhclient process on mc2002 is OK: PROCS OK: 0 processes with command name dhclient
[09:25:54] RECOVERY - DPKG on mc2002 is OK: All packages OK
[09:26:03] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures
[09:45:48] operations, Project-Creators: Create #site-incident tag and use it for incident reports - https://phabricator.wikimedia.org/T85889#1116046 (Aklapper) Yes, two objections. 1) I'd prefer someone to //first// push the actual *social* part (acceptance of proposed workflow!) of the process instead of creating...
[09:49:06] (PS1) Giuseppe Lavagetto: dhcp: correct mc2001 fixed-address [puppet] - https://gerrit.wikimedia.org/r/196546
[09:49:49] (CR) Giuseppe Lavagetto: [C: 2 V: 2] dhcp: correct mc2001 fixed-address [puppet] - https://gerrit.wikimedia.org/r/196546 (owner: Giuseppe Lavagetto)
[10:02:55] (CR) Faidon Liambotis: [C: -1] "Do we *really* need all those domains?" [dns] - https://gerrit.wikimedia.org/r/196473 (https://phabricator.wikimedia.org/T92438) (owner: Dzahn)
[10:03:57] (PS1) Giuseppe Lavagetto: mediawiki: disable checks, jobrunner in codfw [puppet] - https://gerrit.wikimedia.org/r/196548
[10:06:12] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: disable checks, jobrunner in codfw [puppet] - https://gerrit.wikimedia.org/r/196548 (owner: Giuseppe Lavagetto)
[10:17:54] RECOVERY - uWSGI web apps on graphite2001 is OK: OK: All defined uWSGI apps are runnning.
[10:21:14] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running.
[10:25:48] operations, Patch-For-Review, domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1116168 (Aklapper) @RobH: See https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines#Auto-linking_and_cross-referencing for automatic notifications in Phab by Gerritbot (needs "Bug: "...
[10:26:47] (PS1) Giuseppe Lavagetto: nutcracker: add mc2001 and mc2004 to the config [puppet] - https://gerrit.wikimedia.org/r/196555
[10:27:06] (PS3) Giuseppe Lavagetto: mediawiki: add appserver cluster IPs in codfw [dns] - https://gerrit.wikimedia.org/r/195887 (https://phabricator.wikimedia.org/T92377)
[10:27:28] (CR) Alexandros Kosiaris: zuul: init scripts now have START_DAEMON (1 comment) [puppet] - https://gerrit.wikimedia.org/r/196198 (owner: Hashar)
[10:30:11] (CR) Glaisher: "typo in file name: xzy to xyz" [dns] - https://gerrit.wikimedia.org/r/196312 (owner: RobH)
[10:33:20] akosiaris: sorry that is messy :-(
[10:33:28] nature's call
[10:36:46] (PS1) Faidon Liambotis: Kill misc/scripts/mwscriptwikiset-quiet, unused [puppet] - https://gerrit.wikimedia.org/r/196558
[10:36:48] (PS1) Faidon Liambotis: scap: set $TERM in mwscript before spawning php [puppet] - https://gerrit.wikimedia.org/r/196559
[10:37:42] (CR) Faidon Liambotis: [C: 2] Kill misc/scripts/mwscriptwikiset-quiet, unused [puppet] - https://gerrit.wikimedia.org/r/196558 (owner: Faidon Liambotis)
[10:38:11] (PS2) Faidon Liambotis: scap: set $TERM in mwscript before spawning php [puppet] - https://gerrit.wikimedia.org/r/196559
[10:38:29] (CR) Faidon Liambotis: [C: 2] scap: set $TERM in mwscript before spawning php [puppet] - https://gerrit.wikimedia.org/r/196559 (owner: Faidon Liambotis)
[10:38:55] (CR) Faidon Liambotis: [V: 2] scap: set $TERM in mwscript before spawning php [puppet] - https://gerrit.wikimedia.org/r/196559 (owner: Faidon Liambotis)
[10:39:44] Marking 3975 messages deleted...
[10:39:45] heh
[10:40:07] * YuviPanda makes paravoid manager again for a few weeks...
[10:40:09] that is a lot of message processed since you woke up
[10:40:24] (CR) Alexandros Kosiaris: [C: 2] selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195519 (owner: Matanya)
[10:40:29] what did I do wrong?
[10:40:31] hashar: any time for the parsoid patch again? I would just like you to respond to the comments I already made… :)
[10:41:33] hm that didn't fix it
[10:41:33] hmm
[10:41:56] oh I think I know
[10:42:14] PROBLEM - puppet last run on mw2005 is CRITICAL: Connection refused by host
[10:42:14] PROBLEM - RAID on mw2002 is CRITICAL: Connection refused by host
[10:42:14] PROBLEM - nutcracker port on mw2004 is CRITICAL: Connection refused by host
[10:42:14] PROBLEM - configured eth on mw2003 is CRITICAL: Connection refused by host
[10:42:18] it's the sudo
[10:42:33] PROBLEM - dhclient process on mw2003 is CRITICAL: Connection refused by host
[10:42:33] PROBLEM - salt-minion processes on mw2005 is CRITICAL: Connection refused by host
[10:42:33] PROBLEM - nutcracker process on mw2004 is CRITICAL: Connection refused by host
[10:42:43] PROBLEM - puppet last run on mw2004 is CRITICAL: Connection refused by host
[10:42:43] PROBLEM - nutcracker port on mw2003 is CRITICAL: Connection refused by host
[10:42:43] PROBLEM - configured eth on mw2002 is CRITICAL: Connection refused by host
[10:42:48] which doesn't preserve the environment
[10:42:53] and it was hoo's change that broke this
[10:43:02] PROBLEM - nutcracker process on mw2003 is CRITICAL: Connection refused by host
[10:43:02] PROBLEM - salt-minion processes on mw2004 is CRITICAL: Connection refused by host
[10:43:02] PROBLEM - dhclient process on mw2002 is CRITICAL: Connection refused by host
[10:43:03] PROBLEM - puppet last run on mw2003 is CRITICAL: Connection refused by host
[10:43:04] PROBLEM - nutcracker port on mw2002 is CRITICAL: Connection refused by host
[10:43:04] PROBLEM - DPKG on mw2005 is CRITICAL: Connection refused by host
[10:43:12] PROBLEM - salt-minion processes on mw2003 is CRITICAL: Connection refused by host
[10:43:13] PROBLEM - Disk space on mw2005 is CRITICAL: Connection refused by host
[10:43:13] PROBLEM - nutcracker process on mw2002 is CRITICAL: Connection refused by host
[10:43:32] PROBLEM - puppet last run on mw2002 is CRITICAL: Connection refused by host
[10:43:33] PROBLEM - DPKG on mw2004 is CRITICAL: Connection refused by host
[10:43:42] PROBLEM - salt-minion processes on mw2002 is CRITICAL: Connection refused by host
[10:43:43] PROBLEM - Disk space on mw2004 is CRITICAL: Connection refused by host
[10:43:43] PROBLEM - RAID on mw2005 is CRITICAL: Connection refused by host
[10:43:53] PROBLEM - DPKG on mw2003 is CRITICAL: Connection refused by host
[10:43:54] that you _joe_?
[10:44:12] PROBLEM - Disk space on mw2003 is CRITICAL: Connection refused by host
[10:44:12] PROBLEM - RAID on mw2004 is CRITICAL: Connection refused by host
[10:44:12] PROBLEM - configured eth on mw2005 is CRITICAL: Connection refused by host
[10:44:22] PROBLEM - DPKG on mw2002 is CRITICAL: Connection refused by host
[10:44:22] PROBLEM - dhclient process on mw2005 is CRITICAL: Connection refused by host
[10:44:23] PROBLEM - nutcracker port on mw2005 is CRITICAL: Connection refused by host
[10:44:24] PROBLEM - RAID on mw2003 is CRITICAL: Connection refused by host
[10:44:24] PROBLEM - configured eth on mw2004 is CRITICAL: Connection refused by host
[10:44:24] PROBLEM - Disk space on mw2002 is CRITICAL: Connection refused by host
[10:44:42] PROBLEM - nutcracker process on mw2005 is CRITICAL: Connection refused by host
[10:44:42] PROBLEM - dhclient process on mw2004 is CRITICAL: Connection refused by host
[10:49:18] operations: Remove knsq16-30 and prepare OE13 for new servers - https://phabricator.wikimedia.org/T92519#1116256 (Aklapper)
[10:49:33] operations: Remove all Toolserver equipment - https://phabricator.wikimedia.org/T92518#1116257 (Aklapper)
[10:50:19] <_joe_> paravoid: yes
[10:50:28] <_joe_> sorry
[10:52:26] YuviPanda: I am unlikely to follow up today sorry ;(
[10:54:41] hashar: ok.
it’s been two weeks though :( [10:54:49] hashar: you should ask greg-g to get more people on CI :) [10:55:12] RECOVERY - nutcracker port on mw2003 is OK: TCP OK - 0.000 second response time on port 11212 [10:55:13] RECOVERY - DPKG on mw2003 is OK: All packages OK [10:55:23] RECOVERY - nutcracker process on mw2003 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:55:23] RECOVERY - salt-minion processes on mw2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:55:23] RECOVERY - Disk space on mw2003 is OK: DISK OK [10:55:23] RECOVERY - configured eth on mw2005 is OK: NRPE: Unable to read output [10:55:23] RECOVERY - RAID on mw2004 is OK: OK: no RAID installed [10:55:33] RECOVERY - DPKG on mw2005 is OK: All packages OK [10:55:33] RECOVERY - nutcracker port on mw2002 is OK: TCP OK - 0.000 second response time on port 11212 [10:55:33] RECOVERY - DPKG on mw2002 is OK: All packages OK [10:55:33] RECOVERY - dhclient process on mw2005 is OK: PROCS OK: 0 processes with command name dhclient [10:55:42] RECOVERY - Disk space on mw2005 is OK: DISK OK [10:55:42] RECOVERY - nutcracker process on mw2002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:55:42] RECOVERY - salt-minion processes on mw2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:55:43] RECOVERY - nutcracker port on mw2005 is OK: TCP OK - 0.000 second response time on port 11212 [10:55:43] RECOVERY - Disk space on mw2002 is OK: DISK OK [10:55:43] RECOVERY - configured eth on mw2004 is OK: NRPE: Unable to read output [10:55:43] RECOVERY - RAID on mw2003 is OK: OK: no RAID installed [10:55:53] RECOVERY - nutcracker port on mw2004 is OK: TCP OK - 0.000 second response time on port 11212 [10:55:53] RECOVERY - configured eth on mw2003 is OK: NRPE: Unable to read output [10:55:53] RECOVERY - RAID on mw2002 is OK: OK: no RAID installed [10:55:53] RECOVERY - DPKG on mw2004 is OK: 
All packages OK [10:55:54] RECOVERY - nutcracker process on mw2005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:55:54] RECOVERY - dhclient process on mw2004 is OK: PROCS OK: 0 processes with command name dhclient [10:56:03] RECOVERY - salt-minion processes on mw2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:56:03] RECOVERY - dhclient process on mw2003 is OK: PROCS OK: 0 processes with command name dhclient [10:56:04] RECOVERY - nutcracker process on mw2004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:56:04] RECOVERY - salt-minion processes on mw2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:56:04] RECOVERY - RAID on mw2005 is OK: OK: no RAID installed [10:56:12] RECOVERY - Disk space on mw2004 is OK: DISK OK [10:56:13] RECOVERY - configured eth on mw2002 is OK: NRPE: Unable to read output [10:56:32] RECOVERY - dhclient process on mw2002 is OK: PROCS OK: 0 processes with command name dhclient [10:58:04] 10Ops-Access-Requests, 6operations, 10Citoid, 6Services, 5Patch-For-Review: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#1116264 (10mobrovac) @robh I'd also need the merge right for the `citoid/deploy` repo. Currently, I've got +2, but can't merge patches there. D... [11:00:59] (03PS5) 10Yuvipanda: parsoid: Remove parsoid beta role [puppet] - 10https://gerrit.wikimedia.org/r/193082 (https://phabricator.wikimedia.org/T86633) [11:01:16] could possibly any of you nice ops people reveal a mystery to me ? 
:) [11:01:27] so, i've got my extension repo https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/RestBaseUpdateJobs [11:01:40] and yesterday wanted to deploy that [11:01:55] but couldn't push changes to the new branch on origin [11:02:09] apparently i don't have the right to do so [11:02:27] (even though i asked for the creation of the repo in the first place) [11:02:29] <_joe_> push? [11:02:33] yep [11:02:42] <_joe_> mobrovac: push is not enabled on gerrit by default [11:02:51] <_joe_> did you try to use git review? [11:03:00] ah i see [11:03:01] nop [11:03:08] i just followed https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1c:_extension_update [11:03:14] which said to push [11:03:24] <_joe_> uhm no idea about that page [11:03:28] :) [11:04:31] ok so which of the submit types there would allow me to continue using review but also push when i want to deploy? [11:04:46] right now it's set on "merge if necessary" [11:05:14] <_joe_> mobrovac: I think those instructions refer to what you should do on tin [11:05:18] <_joe_> I think [11:05:32] <_joe_> so what you do on your own repo is just the usual gerrit flow [11:05:37] (03PS1) 10Faidon Liambotis: scap: silly fix for cf3672d [puppet] - 10https://gerrit.wikimedia.org/r/196562 [11:05:59] ah right [11:06:14] uf ok, it was pretty late last night when i was doing that :) [11:06:24] _joe_: thnx for pointing that out [11:06:43] <_joe_> oh np [11:06:53] (03CR) 10Hashar: "Indentation / comment are addressed in next patchset. I also fixed the code indentation for the status commands." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196198 (owner: 10Hashar) [11:07:10] (03PS2) 10Hashar: zuul: init scripts now have START_DAEMON [puppet] - 10https://gerrit.wikimedia.org/r/196198 [11:07:43] YuviPanda: that is really a matter of priority / focus :D [11:08:16] hashar: in that case, you should perhaps remove your -1. Giving a -1 and then not responding for weeks holds up my work as well. 
[11:08:47] YuviPanda: sure, I just have to process half a thousands of Gerrit notifications I received since last week [11:09:02] been mostly only working on packaging Zuul as a deb package [11:09:09] will give it a shot right now [11:09:15] thanks [11:09:19] https://gerrit.wikimedia.org/r/#/c/193082/ is the patch. [11:09:39] I’ll fight it around to make sure it all works by the end (that’s what happened with scap yesterday). [11:09:51] got a bunch of reviews to do https://gerrit.wikimedia.org/r/#/q/is:open+reviewer:hashar,n,z :D [11:11:52] RECOVERY - puppet last run on mw2005 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [11:14:54] RECOVERY - puppet last run on mw2003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:16:14] is there somewhere a layout of the SCA cluster? [11:16:23] RECOVERY - puppet last run on mw2002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:16:33] https://wikitech.wikimedia.org/wiki/Category:Eqiad_cluster doesn't seem to list it at all [11:17:41] (heh, it's friday and i'm full of questions) :P [11:18:35] (03PS2) 10Faidon Liambotis: scap: silly fix for cf3672d [puppet] - 10https://gerrit.wikimedia.org/r/196562 [11:18:35] mobrovac: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520A%2520eqiad&tab=m&vn=&hide-hf=false [11:19:07] cool, akosiaris thnx [11:19:28] (03CR) 10Faidon Liambotis: [C: 032] scap: silly fix for cf3672d [puppet] - 10https://gerrit.wikimedia.org/r/196562 (owner: 10Faidon Liambotis) [11:22:01] (03PS1) 10Faidon Liambotis: Add pfw1-codfw to rancid, torrus [puppet] - 10https://gerrit.wikimedia.org/r/196563 [11:22:18] (03CR) 10Alexandros Kosiaris: [C: 032] zuul: init scripts now have START_DAEMON [puppet] - 10https://gerrit.wikimedia.org/r/196198 (owner: 10Hashar) [11:22:24] (03CR) 10Faidon Liambotis: [C: 032] Add pfw1-codfw to rancid, torrus [puppet] - 
10https://gerrit.wikimedia.org/r/196563 (owner: 10Faidon Liambotis) [11:23:43] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:28:33] 6operations, 6Labs: Network port saturated for labstore1001 - https://phabricator.wikimedia.org/T92614#1116297 (10yuvipanda) 3NEW [11:28:40] paravoid: ^ i filed a bug. am investigating. [11:29:11] paravoid: where was the saturation warning from? [11:29:16] gaaaah, icinga has a puppet failure warning. [11:29:27] * YuviPanda sees what that is [11:30:22] 6operations, 6Labs: Puppet failure on labstore1001 - https://phabricator.wikimedia.org/T92615#1116306 (10yuvipanda) 3NEW [11:32:53] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [11:34:23] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [11:35:05] paravoid: ^ you forgot to puppet-merge [11:42:57] PROBLEM - nutcracker port on mw2009 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:43:25] <_joe_> !log installing the new libicu48 package on the canary appservers [11:43:34] Logged the message, Master [11:44:07] RECOVERY - nutcracker port on mw2009 is OK: TCP OK - 0.000 second response time on port 11212 [11:51:24] (03CR) 10Hashar: "Some more comments in the inline diff. 
Rebases do not reset the -1 score so the change was still hidden from ReviewQueue query :/" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/193082 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [11:51:43] YuviPanda: I followed up in the inline diff https://gerrit.wikimedia.org/r/#/c/193082/3/manifests/role/parsoid.pp,unified [11:52:48] 6operations, 7network: cr1/cr2-codfw QSFP+ errors every second for qsfp-0/0/0 - https://phabricator.wikimedia.org/T92616#1116360 (10faidon) 3NEW a:3faidon [11:53:33] hashar: thank you :) [11:54:04] hashar: re: one role vs multiple roles, I far prefer composition and death to realm branches ;) I know that the wikitech interface is clunky, we’re working on a better interface for tha.t [11:54:20] I prefer to keep the jenkins_access role out of role::parsoid and set it up to elsewhere [11:54:38] I’ll go through and amend / test later. [11:56:11] (03PS1) 10Hashar: zuul: restore trailing ':' command in init scripts [puppet] - 10https://gerrit.wikimedia.org/r/196570 [12:00:47] 6operations, 3wikis-in-codfw: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1116387 (10Joe) [12:00:49] 6operations, 10ops-codfw, 3wikis-in-codfw: Console on mc2001 is unresponsive - https://phabricator.wikimedia.org/T90559#1116386 (10Joe) 5Open>3Resolved [12:01:18] 6operations, 3wikis-in-codfw: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#978973 (10Joe) [12:01:19] 6operations, 10ops-codfw, 3wikis-in-codfw: mc2004 console is unreadable remotely - https://phabricator.wikimedia.org/T90883#1116388 (10Joe) 5Open>3Resolved [12:01:37] 6operations, 3wikis-in-codfw: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#978973 (10Joe) [12:01:38] YuviPanda: yeah horizon [12:01:38] 6operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1116397 (10Joe) [12:05:58] (03PS1) 10Hashar: zuul: daemon is zuul-server [puppet] - 
10https://gerrit.wikimedia.org/r/196572 [12:07:05] (03CR) 10Hashar: "In previous change, I though the last two lines were empty and killed them. Havent noticed the very first line is a lonely ':'." [puppet] - 10https://gerrit.wikimedia.org/r/196570 (owner: 10Hashar) [12:08:41] (03CR) 10Hashar: "That is a mistake I made in a previous commit:" [puppet] - 10https://gerrit.wikimedia.org/r/196572 (owner: 10Hashar) [12:09:12] YuviPanda: and the parsoid settings.js file comes in two flavor in the deploy repo. One version for prod, the other for beta [12:09:13] :( [12:09:33] hashar: yeah, so I was wondering how it determines which file to use. [12:11:35] YuviPanda: it uses an arg in upstart afaik [12:11:42] to select the correct one [12:11:55] mobrovac: right, so to rephrase, I’m trying to figure out how the correct file gets put there. [12:12:04] in the beta one, I didn’t actually see the prod file at all [12:12:04] hehe [12:12:12] (03CR) 10Alexandros Kosiaris: [C: 032] zuul: daemon is zuul-server [puppet] - 10https://gerrit.wikimedia.org/r/196572 (owner: 10Hashar) [12:12:27] akosiaris: sorry, should have tested it better :( [12:13:53] 6operations, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1116419 (10Jgreen) OH CERTS!! This explains the mw redirect scheme too, I hereby withdraw all of my previous comments on this. jg [12:16:07] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:42] ... 
[12:17:45] !log restarted apache on virt1000, looks like the puppetmaster died [12:17:58] YuviPanda: it seems it's got its own localsettings in the repo itself, one for prod, one for betalabs - https://github.com/wikimedia/mediawiki-services-parsoid-deploy/tree/master/conf/wmf [12:18:23] but yeah, parsoid stuff seems a bit too messy [12:18:40] mobrovac: yup, cleanup patch https://gerrit.wikimedia.org/r/#/c/193082/ [12:19:38] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.791 second response time [12:20:05] mobrovac: so when I tested it, it had only localsettings.js, and not the betalabs config. but mayybe I’m misremembering. I’ll try again [12:22:26] PROBLEM - configured eth on mw2014 is CRITICAL: Connection refused by host [12:22:26] PROBLEM - configured eth on mw2011 is CRITICAL: Connection refused by host [12:22:26] PROBLEM - configured eth on mw2013 is CRITICAL: Connection refused by host [12:22:26] PROBLEM - configured eth on mw2012 is CRITICAL: Connection refused by host [12:22:30] hm, strange because both files are in the repo, and you patch removes the file removal stuff [12:22:38] PROBLEM - dhclient process on mw2013 is CRITICAL: Connection refused by host [12:22:38] PROBLEM - dhclient process on mw2011 is CRITICAL: Connection refused by host [12:22:38] PROBLEM - dhclient process on mw2014 is CRITICAL: Connection refused by host [12:22:38] PROBLEM - dhclient process on mw2012 is CRITICAL: Connection refused by host [12:22:56] PROBLEM - nutcracker port on mw2011 is CRITICAL: Connection refused by host [12:22:56] PROBLEM - nutcracker port on mw2013 is CRITICAL: Connection refused by host [12:22:56] PROBLEM - nutcracker port on mw2012 is CRITICAL: Connection refused by host [12:22:56] PROBLEM - nutcracker port on mw2014 is CRITICAL: Connection refused by host [12:23:06] PROBLEM - nutcracker process on mw2013 is CRITICAL: Connection refused by host [12:23:07] PROBLEM - nutcracker process on mw2014 is 
CRITICAL: Connection refused by host [12:23:07] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:07] 6operations, 6Labs: Network port saturated for labstore1001 - https://phabricator.wikimedia.org/T92614#1116420 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Turns out it was a tool that had started up a thousand or so jobs that all hit NFS, saturating everything. I've killed all the jobs, and will notify th... [12:23:43] mobrovac: yup. although [12:23:51] mobrovac: I’ll create another host and test it in a while... [12:24:05] 6operations, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1116424 (10faidon) I'll copy my comment from the Gerrit changeset: "Do we *really* need all those domains? I sincerely doubt that anything but .wikipedia.org and .wikimedia.org are being used. I... [12:24:07] (03CR) 10Mobrovac: parsoid: Remove parsoid beta role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/193082 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [12:24:35] YuviPanda: gr8, let me know if i can help [12:24:46] mobrovac: your review helped! I’ll get rid of those now. [12:27:36] mobrovac: feel free to modify that patch to simplify as well. I don’t fully know how parsoid runs, the refactor was just to try to remove code duplication [12:28:26] this is a great first step actually [12:28:37] mobrovac: yup :) [12:28:45] YuviPanda: it'd be also really really cool to get some of its settings into hiera [12:28:51] mobrovac: +1 [12:28:52] namely, at least the port [12:29:16] using the config directly from the deploy repo seems like ... a hack, to say the least [12:29:16] mobrovac: so what we’ve been mostly doing is to make it a parameter, set its default to a sane value in the code, and then we can override in hiera if needed [12:29:53] mobrovac: I think partly it’s also a matter of who has +2 on where. 
config repo means its parsoid’s responsibility, kindof... [12:29:59] anyway, I agree on moving things to hiera [12:30:02] than having two different files [12:30:13] because those will always drift [12:30:16] DRIIIIIFFFFT [12:30:17] away [12:30:20] !log restarting Jenkins to remove some Beta jobs deadlock. Updated a few plugins as well. [12:30:33] YuviPanda: so, using that logic, if i write another module which wants to use the parsoid port, how i do get it if it's not set to a different value in hiera? [12:31:07] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.494 second response time [12:31:12] akosiaris: and I had another zuul/puppet patch https://gerrit.wikimedia.org/r/#/c/196570/ :D [12:31:28] mobrovac: I think you can just do a lookup and it just works? [12:31:34] we do that somewhere, let me find it [12:32:49] mobrovac: mediawiki::users has ‘web’, defaults to apache, but is actually overriden to be www-data everywhere. lots of places just use $::mediawiki::users::web and it works [12:33:00] obviously, only if mediawiki::users is also included... 
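The parameter pattern described just above (a default set in the module code, an optional Hiera override, other code reading `$::mediawiki::users::web`) can be sketched roughly as follows. This is a toy Python model, not Puppet's actual lookup code: `CLASS_DEFAULTS`, `HIERA_DATA` and `resolve` are hypothetical names, and the values are only the ones mentioned in the conversation (`apache` as the default, `www-data` as the override). Raising an error for a class that is not included is an assumption made for the sketch; real Puppet behaviour in that case depends on its settings.

```python
# Hypothetical model of a class parameter with a Hiera override.
# None of these names are Puppet APIs; they only illustrate the lookup order.

# Defaults as they would be written in the module code.
CLASS_DEFAULTS = {
    "mediawiki::users": {"web": "apache"},
}

# Override data, keyed as "class::param".
HIERA_DATA = {
    "mediawiki::users::web": "www-data",
}


def resolve(class_name, param, hiera=None, included=()):
    """Return the effective value of class_name::param.

    Assumption for this sketch: a class that is not included cannot be
    resolved at all, so a LookupError is raised.
    """
    if class_name not in included:
        raise LookupError(f"{class_name} is not included")
    key = f"{class_name}::{param}"
    # An override wins when present; otherwise fall back to the code default.
    if hiera and key in hiera:
        return hiera[key]
    return CLASS_DEFAULTS[class_name][param]


if __name__ == "__main__":
    included = ("mediawiki::users",)
    print(resolve("mediawiki::users", "web", HIERA_DATA, included))
    print(resolve("mediawiki::users", "web", {}, included))
```

With the override present the first call yields the overridden user, and with empty override data the second call falls back to the default from the code. The open question from the conversation, reading the value from an unrelated module, corresponds to calling `resolve` without the class in `included`.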
[12:33:06] RECOVERY - dhclient process on mw2011 is OK: PROCS OK: 0 processes with command name dhclient [12:33:06] RECOVERY - dhclient process on mw2012 is OK: PROCS OK: 0 processes with command name dhclient [12:33:16] so that doesn’t answer your ‘what if I want to use it from another module’ question [12:33:18] RECOVERY - nutcracker port on mw2014 is OK: TCP OK - 0.000 second response time on port 11212 [12:33:18] RECOVERY - nutcracker port on mw2011 is OK: TCP OK - 0.000 second response time on port 11212 [12:33:18] RECOVERY - nutcracker port on mw2012 is OK: TCP OK - 0.000 second response time on port 11212 [12:33:21] I don’t think we’ve run into that yet [12:33:27] RECOVERY - nutcracker process on mw2013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:33:27] RECOVERY - nutcracker process on mw2014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:33:56] RECOVERY - configured eth on mw2012 is OK: NRPE: Unable to read output [12:33:56] RECOVERY - configured eth on mw2011 is OK: NRPE: Unable to read output [12:33:56] RECOVERY - configured eth on mw2014 is OK: NRPE: Unable to read output [12:33:57] RECOVERY - configured eth on mw2013 is OK: NRPE: Unable to read output [12:34:14] mobrovac: although, assuming you want to know it in another module so you can make requests to it, it sounds like a service discovery thing that we should do perhaps outside of puppet... [12:34:16] RECOVERY - dhclient process on mw2013 is OK: PROCS OK: 0 processes with command name dhclient [12:34:16] RECOVERY - dhclient process on mw2014 is OK: PROCS OK: 0 processes with command name dhclient [12:34:27] RECOVERY - nutcracker port on mw2013 is OK: TCP OK - 0.000 second response time on port 11212 [12:34:38] YuviPanda: very good point [12:34:40] ideally, yes [12:34:59] mobrovac: I think there’s some movement towards that (etcd / zk) that might happen over the next few months... 
[12:35:18] yey :) [12:35:22] mobrovac: I personally want to get rid of modules/dsh/files and use etcd/zk for that, for example. [12:35:53] although, re etcd, at $JOB-1 we tried to use it, but found it too unstable [12:35:57] mobrovac: but yeah, so for our current uses, you just refer to the param in puppet code, and it works. and for the use case you mentioned, I don’t think puppet is the right place to fix it in. [12:36:16] mobrovac: yeah, that’s why we’ll probably end up with zk, but might check etcd too. At this point it’s nothing more than random conversations... [12:36:22] of things to do, of which we have plenty of… :) [12:36:27] :) [12:36:52] mobrovac: I’ll happily help move along patches that do cleanup / hiera-ization [12:36:56] here's something that looks promising https://www.consul.io/ [12:37:07] yup, that’s also something people are talking about... [12:37:08] haven't played with it though, just food for thought :P [12:37:30] good! re hiera-ization [12:39:07] consul's got multi-dc ops, plus it uses gossip, so no SPOF problems [12:39:16] which seems like a big win [12:39:27] (unlike zk, if i remember correctly) [12:40:55] right [12:40:58] but one step at a time :D [12:41:08] need to get this patch merged [12:55:27] hi YuviPanda [12:55:39] hey raj [12:56:23] I created an account but don't recall the password. In fact, I'm not even sure I used an email address when signing up (may not have been required then) [12:56:49] akosiaris: we plan to deploy soon-ish a service developed by the mobileapps guys, so we'd need an ip and stuff in SCA [12:56:57] I tried a few email addresses and I didn't receive a password reset [12:56:59] akosiaris: under which tag in phab should i create the request? [12:57:22] also, I'm not even sure if the email address I did use still exists (since I possibly used my cable company's email) [12:57:54] raj: ah. sadly we can’t really do much about that, however. 
[12:58:21] can you at least tell me if it has an email address associated and if so what it is so I can send the password reset to the correct email? [12:58:33] you can even see that it hasn't been logged into since around the time of creation [12:58:55] mobrovac: soonish == ? [12:58:57] don’t think we can do that either, sorry. [12:59:17] mobrovac: good question about the tag... well operations for sure, I suppose it will have it's own tag as well [13:00:03] YuviPanda, there must be something? [13:00:20] nope. sorry. I guess you can create a new account. [13:00:40] YuviPanda, can you at least tell me if there's an email associated at all? At least so I can stop trying if there isn't? [13:00:48] akosiaris: soonish == not before the end of next week for sure, but i'd like to start the process early enough [13:01:03] raj: don’t think we can do that either. privacy, etc. [13:01:22] mobrovac: that team operates on sprints of 2 weeks each, so we need to be prepared in advance [13:01:24] but I'm not even asking for the email address, just if one is associated [13:01:50] raj: well, you can open a ticket at phabricator.wikimedia.org and see if anyone else will, but I highly doubt it. [13:01:53] (03PS2) 10coren: replica-addusers: moar documentation [puppet] - 10https://gerrit.wikimedia.org/r/196300 [13:02:11] YuviPanda: That one ^^ [13:02:45] ///modules rather than //modules [13:02:57] Yeay stupid puppet magical pseudo-uris. [13:03:01] (03CR) 10Yuvipanda: [C: 031] replica-addusers: moar documentation [puppet] - 10https://gerrit.wikimedia.org/r/196300 (owner: 10coren) [13:03:02] mobrovac: I am a bit concerned about the time. I think we might have to shoot for April, I doubt we can do it in March [13:03:12] PROBLEM - configured eth on mw2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:12] PROBLEM - configured eth on mw2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:03:12] PROBLEM - configured eth on mw2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:12] PROBLEM - configured eth on mw2015 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:15] (03CR) 10coren: [C: 032] replica-addusers: moar documentation [puppet] - 10https://gerrit.wikimedia.org/r/196300 (owner: 10coren) [13:03:31] PROBLEM - dhclient process on mw2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:33] PROBLEM - dhclient process on mw2015 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:33] PROBLEM - dhclient process on mw2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:33] PROBLEM - dhclient process on mw2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:42] PROBLEM - nutcracker port on mw2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:42] PROBLEM - nutcracker port on mw2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:42] PROBLEM - nutcracker port on mw2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:42] PROBLEM - nutcracker port on mw2015 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:51] paravoid: You have an unpushed changed to router.db. Okay if I merge it in? [13:03:52] PROBLEM - nutcracker process on mw2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:52] PROBLEM - nutcracker process on mw2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:04:02] PROBLEM - puppet last run on mw2016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:04:02] PROBLEM - puppet last run on mw2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:04:10] paravoid: +pfw1-codfw.wikimedia.org:juniper:up: [13:04:22] RECOVERY - configured eth on mw2015 is OK: NRPE: Unable to read output [13:04:22] RECOVERY - configured eth on mw2016 is OK: NRPE: Unable to read output [13:04:22] RECOVERY - configured eth on mw2018 is OK: NRPE: Unable to read output [13:04:22] RECOVERY - configured eth on mw2017 is OK: NRPE: Unable to read output [13:04:32] RECOVERY - dhclient process on mw2018 is OK: PROCS OK: 0 processes with command name dhclient [13:04:41] RECOVERY - dhclient process on mw2015 is OK: PROCS OK: 0 processes with command name dhclient [13:04:41] RECOVERY - dhclient process on mw2017 is OK: PROCS OK: 0 processes with command name dhclient [13:04:41] RECOVERY - dhclient process on mw2016 is OK: PROCS OK: 0 processes with command name dhclient [13:04:52] RECOVERY - nutcracker port on mw2016 is OK: TCP OK - 0.000 second response time on port 11212 [13:04:52] RECOVERY - nutcracker port on mw2018 is OK: TCP OK - 0.000 second response time on port 11212 [13:04:52] RECOVERY - nutcracker port on mw2017 is OK: TCP OK - 0.000 second response time on port 11212 [13:04:52] RECOVERY - nutcracker port on mw2015 is OK: TCP OK - 0.000 second response time on port 11212 [13:05:01] RECOVERY - nutcracker process on mw2018 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:05:01] RECOVERY - nutcracker process on mw2016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:05:43] YuviPanda, I don't see where to open a ticket ther [13:05:44] e [13:05:56] Coren: oh damn; yes, please do! [13:05:57] raj: you need to create an account to do it. [13:05:59] Coren: thanks [13:06:05] akosiaris: ah? mind illuminating me on that? [13:06:11] raj: so that’s a bit of a chicken and egg problem, eh. [13:06:17] indeed [13:06:18] paravoid: Merged. [13:06:21] raj: either way, I still reccomend just creating a new account and going wit hit. 
[13:06:42] YuviPanda, I understand, but the other one was my name, so I'd really like to get it back [13:06:42] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:07:02] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:07:12] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:07:14] raj: yeah, I understand :) nothing much we can do to help though. sorry. [13:09:41] RECOVERY - puppet last run on labstore1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:12:15] damn internet conn [13:13:12] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: Puppet has 6 failures [13:13:56] mobrovac: lack of time more or less. Trying to finish goals, so next Q ? [13:14:21] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 6 failures [13:14:22] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 6 failures [13:14:22] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet has 6 failures [13:15:06] akosiaris: but, i'd put it on the existing sca nodes, with configs and stuff which more or less resemble the *oids currently there [13:15:16] akosiaris: no fancy stuff like zotero and the likes [13:15:32] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:15:32] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:15:38] why does it need a new IP? I guess it just needs a few patches merged, no? 
[13:16:42] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:42] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:51] mobrovac: not arguing with the plan, it's lack of time until end of March I am concerned about. Maybe move along with low priority ? [13:18:20] akosiaris: ah i see. ok, so i'll put the ticket up and then we can try to move it bit by bit (mind you, there currently doesn't even exist a deploy repo for it), sounds good ? [13:18:33] sure [13:18:38] gr8 :) [13:19:20] mobrovac: btw, in terms of ‘what you can do to help’, hiera-izing the parsoid config would be appreciated, for one :D [13:19:37] 6operations, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1116481 (10Jgreen) I don't think we need all the domains. Many of the ones Daniel mentioned are just an artifact of our DNS templating anyway. My understanding is we want everything to end up at o... [13:20:42] YuviPanda: sure! can't do it right now, but will have it on my radar [13:20:50] mobrovac: cool [13:20:51] <^d> YuviPanda: Speaking of hiera...we need to poke elastic role again :( [13:21:01] ^d: yeaaaah. [13:21:02] we do. [13:21:14] ^d: twentyafterfour got redis up and running, I think. [13:21:25] <^d> I saw the phab e-mails, yeah [13:21:28] ^d: I’m wondering if we should first get tin / mediawiki instances running, though. [13:22:46] ^d: also, palladium is precise in prod and trusty in staging... [13:23:13] <^d> Yeahhhh [13:29:44] 6operations, 6Labs: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1116529 (10yuvipanda) 3NEW [13:32:10] (03CR) 10Jgreen: [C: 031] "removing -1 now that I understand the SSL cert issue, see phabricator ticket for discussion" [dns] - 10https://gerrit.wikimedia.org/r/196007 (https://phabricator.wikimedia.org/T92438) (owner: 10John F. 
Lewis) [13:32:32] 6operations, 6Labs: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1116545 (10coren) Strictly speaking, it's commented out in site.pp atm because it is known to not work correctly because of a problem with the switch; the actual runtime configuration is not removed... [13:49:57] 6operations, 6Labs: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1116571 (10yuvipanda) Are you sure? eth0 and eth1 have no IP addresses on the machine, only bond0 does. [13:50:41] !log restart keystone on virt1000, login errors reported [14:02:36] (03CR) 10Ottomata: [C: 04-1] "Is it really that painful?" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [14:08:14] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event [8 pts] - https://phabricator.wikimedia.org/T91918#1116583 (10kevinator) Implementation notes from our tasking meeting: Client - adding m... 
[14:11:30] I broke the zuul init script earlier and akosiaris forgot to +2 a trivial dependent change https://gerrit.wikimedia.org/r/#/c/196570/ :) [14:11:36] could use a hand there :D [14:12:05] (03PS2) 10Ottomata: zuul: restore trailing ':' command in init scripts [puppet] - 10https://gerrit.wikimedia.org/r/196570 (owner: 10Hashar) [14:12:07] i gotcha [14:12:28] (03CR) 10Ottomata: [C: 032 V: 032] zuul: restore trailing ':' command in init scripts [puppet] - 10https://gerrit.wikimedia.org/r/196570 (owner: 10Hashar) [14:12:54] ottomata: and the follow up https://gerrit.wikimedia.org/r/#/c/196572/ :D [14:13:39] (03PS2) 10Hashar: zuul: daemon is zuul-server [puppet] - 10https://gerrit.wikimedia.org/r/196572 [14:14:12] (03CR) 10Ottomata: [C: 032 V: 032] zuul: daemon is zuul-server [puppet] - 10https://gerrit.wikimedia.org/r/196572 (owner: 10Hashar) [14:14:25] I will run puppet where relevant [14:14:32] k merged. [14:14:46] thanks! [14:14:57] ottomata: and your guide https://wikitech.wikimedia.org/wiki/Git-buildpackage has been very helpful to me this week :) [14:15:35] ha, for python? [14:15:40] that guide is not mine.! :) [14:15:44] well generally for cowbuilder [14:15:55] damn that is by mark [14:16:14] wait now, you did edit it back in March 2013 https://wikitech.wikimedia.org/w/index.php?title=Git-buildpackage&diff=64147&oldid=51009 [14:19:27] 6operations: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1116613 (10Jgreen) the only thing at /a/log/fundraising is the nfs mount from the netapp, which is still in use for log collection. the banner log collection pipeline has to be fully migrated before we delete/umount this. 
[14:19:45] yeah, hashar i've added stuff for python mostly [14:20:28] ottomata: I will probably write yet another step-by-step guide :D [14:20:32] more generic though [14:20:51] 6operations: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1116614 (10Jgreen) [14:24:04] that would be awesome [14:29:21] Ahm, should I be getting X-Analytics headers in responses in my browser? [14:29:34] I would expect such internal headers to be filtered out at the edge [14:38:35] hashar: I did not forget it, I haven't actually reviewed that yet :D [14:39:38] and now that I do, I am wondering why not just exit 0 [14:39:57] akosiaris: on my debian, a lot of init scripts end with : [14:40:12] sigh.. thank god for systemd then [14:40:22] :D [14:42:11] on mine btw only 26 do [14:42:22] out of 88 [14:45:06] akosiaris: you can postpone zuul migration to systemd till we migrate it to jessie I guess [14:45:10] it is on Precise right now [14:45:15] it is still on Precise [14:45:38] 6operations, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1116644 (10faidon) You're right about the DNS templating system, but only partially. We do define shop subdomains in 10 different distinct domains (projects) and have about 20 redirects set up. I'... [14:46:17] 6operations, 6Commons, 6Multimedia, 7HHVM, and 3 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1116645 (10Steinsplitter) [14:52:29] 10Ops-Access-Requests, 6operations, 6Phabricator, 6Release-Engineering, 5Patch-For-Review: Chad H. needs access to iridium (Phabricator host) to manage repos - https://phabricator.wikimedia.org/T92564#1116651 (10demon) >>! In T92564#1115364, @RobH wrote: > @demon: please sign https://phabricator.wikimedi...
[14:56:02] PROBLEM - Host mw2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:22] PROBLEM - Host mw2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:22] PROBLEM - Host mw2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:02] PROBLEM - Host mw2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:41] RECOVERY - Host mw2001 is UP: PING OK - Packet loss = 0%, RTA = 42.95 ms [14:59:03] RECOVERY - Host mw2002 is UP: PING OK - Packet loss = 0%, RTA = 43.35 ms [15:01:42] RECOVERY - Host mw2004 is UP: PING OK - Packet loss = 0%, RTA = 43.23 ms [15:03:52] (03CR) 10Subramanya Sastry: "I am being conservative, but maybe change to 2, monitor for half a day and then update to 4?" [puppet] - 10https://gerrit.wikimedia.org/r/196531 (owner: 10GWicke) [15:06:02] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1116669 (10mobrovac) 3NEW [15:06:12] akosiaris: ^^ [15:06:32] akosiaris: let me know if i need to add/clarify it [15:07:14] ugh, more unplanned services [15:08:33] yeah [15:08:57] i think we ought to talk about these things some time soon [15:09:04] which things? [15:09:07] 6operations, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1116678 (10Jgreen) Agreed. So we need to do the following, in something like this order: 1) decide the store's "canonical hostname" between www.store.wikipedia.org vs store.wikipedia.org (Victoria... [15:09:29] service-template-node but the name is service-mobileapp-node ? [15:09:45] copy/paste I suppose ? [15:10:47] (03CR) 10Faidon Liambotis: [C: 04-1] "The rebase is very wrong -- it rebases code that's long gone (e.g. Gluster!). 
The idea is still valid though and I'd love for someone to p" [puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [15:11:37] akosiaris: no, these are two different things service-mobileapp-node is based on service-template-node [15:11:41] yey naming [15:12:04] paravoid: deployment workflow of (future/existent) services [15:12:35] how about the decision/design process of spawning new services? [15:12:41] this feels a little backwards :) [15:12:46] 6operations, 10ops-codfw, 3wikis-in-codfw: Configure mw2001-2134 correctly - https://phabricator.wikimedia.org/T91238#1116679 (10Papaul) @Joe Redirection after boot for mw2001-mw2134. have you already changed the settings on some on those servers? i login into 4 this morning and the settings are already cha... [15:13:36] paravoid: yep, for sure, it's a "=>" relation :) [15:13:52] esp. since there are no clear owners or dedicated resources to maintaining those services [15:14:10] my impression from the budget discussion is that the services team will not scale up to maintain N new services [15:14:16] in staffing, that is [15:14:26] there's obviously a lot of unknowns and different views, so we need to find some common ground for starters [15:14:49] ops may but not only for small numbers of N :) [15:15:12] hence the "common ground" part ;) [15:15:19] :) [15:15:43] i really think we need to find something that works for both sides [15:15:53] and I doubt the mobile apps team has incorporated any service maintainership/responsibility/redundancy in their staffing plans [15:16:01] the current status quo seems only to be harming human relationships [15:16:59] s/current/relatively new poorly executed/ :P [15:17:13] that too [15:17:16] and by that I don't mean technically [15:19:56] 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, 5Patch-For-Review, 7Upstream: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1116695 (10hashar) I have further tweaked the 
package intended for precise-wikimedia. Patchset 9 of https://gerrit.wikimedia.... [15:20:13] 6operations, 6Labs, 7Monitoring: Setup alarms for labstore* to check for network saturatioin - https://phabricator.wikimedia.org/T92629#1116700 (10yuvipanda) 3NEW [15:20:44] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1116708 (10hashar) I need to get the package reviewed by ops. [15:25:37] YuviPanda: do you know if there is a ticket for the ops/puppet clean up effort? [15:25:47] mobrovac: which part? [15:25:51] <^d> Is prod redis still sick or something? [15:26:09] YuviPanda: configs etc, something i can reference in the ticket for parsoid config move [15:26:10] mobrovac: :) [15:26:19] mobrovac: ah, just file one and stick it in. [15:26:31] mobrovac: actually, for parsoid, there’s one [15:26:37] ah ? [15:26:41] !log reinstalling cp1044 [15:26:47] Logged the message, Master [15:26:49] <^d> Tons of "Failed connecting to redis server at 10.64.0.162: Connection timed out" [15:27:33] mobrovac: https://phabricator.wikimedia.org/T86633 [15:27:43] is a bit of an broad ticket [15:27:52] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: puppet fail [15:28:27] cool, will ref that then, thnx YuviPanda [15:28:56] ^d: I made a ticket for it [15:29:00] last night [15:29:11] https://phabricator.wikimedia.org/T92591 [15:29:13] <^d> Ugh, this has been going on all night? [15:29:17] speaking of which ori [15:29:20] longer [15:29:34] hm, did we lose wikibugs? [15:29:39] not sure how long but at least since yesterday morning [15:29:54] mobrovac: I updated T92627 [15:30:08] 6operations: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1116717 (10demon) p:5High>3Unbreak! [15:30:17] paravoid: ^ looks ok. 
[15:32:18] legoktm: whether to display a deletion log or not it's not super duper crucial” [15:32:22] ^d: ^ [15:32:35] thus I didn't worry too much but yeah ori or aaron need to look at that today ideally [15:32:57] <^d> ...redis being down isn't crucial? [15:33:45] <^d> Maybe it's not crucial from the redis side, but it seems pretty damning on MW. [15:33:47] <^d> If we keep timing out [15:34:25] 6operations, 3Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1116727 (10hashar) I have poked ops list about it. [15:35:07] No argument, but since it had been happening for quite awhile (unnoticed) and with the consensus of devs being that it could wait until morning I documented and made an issue etc [15:41:28] (03CR) 10GWicke: "We can also just go to four and keep an eye on it. We know the total load (as it's already being processed) already, the only difference i" [puppet] - 10https://gerrit.wikimedia.org/r/196531 (owner: 10GWicke) [15:45:25] (03PS5) 10Yuvipanda: puppetmaster: Refactor out the auto key signer [puppet] - 10https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606) [15:45:28] andrewbogott: ^ [15:46:23] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: Refactor out the auto key signer [puppet] - 10https://gerrit.wikimedia.org/r/196537 (https://phabricator.wikimedia.org/T92606) (owner: 10Yuvipanda) [15:46:28] woot [15:46:43] <^d> chasemp: Fair enough, but I'd rather just drop the "ideally" so we don't have to leave it over the weekend :) [15:46:57] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:47:07] ^d: yes agreed I was planning on making sure it got attention this morning [15:47:15] pretty much perfect synergy you just asked about it [15:48:34] YuviPanda: ok, merged on virt1000, I’ll try creating a new instance [15:48:44] andrewbogott: \o/ <3 thanks [15:58:28] PROBLEM - HTTP error 
ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:58:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [16:00:01] (03PS1) 10Yuvipanda: puppetmaster: Make the puppetmaster autoupdate variable hierable [puppet] - 10https://gerrit.wikimedia.org/r/196585 [16:00:03] thcipriani: ^ [16:00:18] (03PS2) 10Yuvipanda: puppetmaster: Make the puppetmaster autoupdate variable hierable [puppet] - 10https://gerrit.wikimedia.org/r/196585 [16:00:23] * thcipriani looks [16:03:37] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Make the puppetmaster autoupdate variable hierable [puppet] - 10https://gerrit.wikimedia.org/r/196585 (owner: 10Yuvipanda) [16:05:54] (03PS1) 10Yuvipanda: puppetmaster: Fix typo in previous commit [puppet] - 10https://gerrit.wikimedia.org/r/196588 [16:06:04] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Fix typo in previous commit [puppet] - 10https://gerrit.wikimedia.org/r/196588 (owner: 10Yuvipanda) [16:06:06] (03PS2) 10Yuvipanda: puppetmaster: Fix typo in previous commit [puppet] - 10https://gerrit.wikimedia.org/r/196588 [16:06:14] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Fix typo in previous commit [puppet] - 10https://gerrit.wikimedia.org/r/196588 (owner: 10Yuvipanda) [16:07:41] (03PS1) 10BBlack: depool cp4015 for reboot/bios [puppet] - 10https://gerrit.wikimedia.org/r/196589 [16:07:59] (03CR) 10BBlack: [C: 032 V: 032] depool cp4015 for reboot/bios [puppet] - 10https://gerrit.wikimedia.org/r/196589 (owner: 10BBlack) [16:11:48] 6operations, 10Parsoid, 6Services: Move Parsoid config into ops/puppet - https://phabricator.wikimedia.org/T92636#1116822 (10mobrovac) 3NEW [16:13:35] thcipriani: I updated the page to include autoupdate and autosign instructions [16:14:22] (03CR) 10Subramanya Sastry: [C: 031] "Works for me if you can monitor the load." 
[puppet] - 10https://gerrit.wikimedia.org/r/196531 (owner: 10GWicke) [16:14:40] YuviPanda: nice. [16:14:59] thcipriani: however, you can’t make local uncommitted changes at all if you enable autoupdate, since there’s a reset --hard in there [16:15:11] (03PS1) 10Rush: phab setup to allow for granular shell permissions [puppet] - 10https://gerrit.wikimedia.org/r/196590 [16:16:03] YuviPanda: that should be fine for how I've been working anyway, mostly just making patches. [16:16:18] making patches, taking names. [16:16:55] (03PS2) 10Rush: phab setup to allow for granular shell permissions [puppet] - 10https://gerrit.wikimedia.org/r/196590 [16:17:55] (03PS3) 10Rush: phab setup to allow for granular shell permissions [puppet] - 10https://gerrit.wikimedia.org/r/196590 [16:19:55] (03PS4) 10Rush: phab setup to allow for granular shell permissions [puppet] - 10https://gerrit.wikimedia.org/r/196590 [16:21:13] (03CR) 10Rush: [C: 032] phab setup to allow for granular shell permissions [puppet] - 10https://gerrit.wikimedia.org/r/196590 (owner: 10Rush) [16:21:25] (03PS1) 10BBlack: repool cp4015 [puppet] - 10https://gerrit.wikimedia.org/r/196592 [16:21:38] (03PS2) 10BBlack: repool cp4015 [puppet] - 10https://gerrit.wikimedia.org/r/196592 [16:21:50] (03CR) 10BBlack: [C: 032 V: 032] repool cp4015 [puppet] - 10https://gerrit.wikimedia.org/r/196592 (owner: 10BBlack) [16:25:49] !log rebooting cp4019 [16:25:56] Logged the message, Master [16:27:55] (03PS1) 10Rush: phab setup for granular shell permissions v2 [puppet] - 10https://gerrit.wikimedia.org/r/196594 [16:28:05] (03CR) 10jenkins-bot: [V: 04-1] phab setup for granular shell permissions v2 [puppet] - 10https://gerrit.wikimedia.org/r/196594 (owner: 10Rush) [16:28:07] (03PS2) 10Rush: phab setup for granular shell permissions v2 [puppet] - 10https://gerrit.wikimedia.org/r/196594 [16:32:29] (03CR) 10Rush: [C: 032] phab setup for granular shell permissions v2 [puppet] - 10https://gerrit.wikimedia.org/r/196594 (owner: 10Rush) [16:34:49] 
PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:35:19] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:37:18] https://gdash.wikimedia.org/dashboards/reqerror/ ? [16:37:26] what's up with that increase 500 spike pattern? [16:38:39] PROBLEM - NTP on cp4015 is CRITICAL: NTP CRITICAL: Offset unknown [16:39:50] <_joe_> chasemp: were you looking at rbf1001 ? [16:40:31] _joe_: last night yes https://phabricator.wikimedia.org/T92591 [16:40:38] need to get ori or aaron's attention on that^ [16:40:39] <_joe_> chasemp: still an issue [16:40:46] yes [16:40:48] <_joe_> why do we need them? [16:41:05] well..it's their service and the most straight forward thing to do is turn off AOF [16:41:16] but not sure why they have it enabled etc [16:41:33] what data flows through the rbf redis that needs persistence? [16:41:41] no one seems to know [16:41:50] awesome :P [16:41:53] <_joe_> bblack: rbf is for redis bloom filters [16:42:08] <_joe_> chasemp: phone aaron in case, he knows for sure [16:42:17] 6operations, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1116865 (10greg) This is spammy spammy spammy: for rbf1001 we've had 44200 error logs in the past 15 minutes rbf1002 (I presume, it's 10.64.0.163) is less but still ~6... [16:42:22] <_joe_> if having an UBN! ticket is not enough I mean [16:44:17] I expected one of them to pop up by now but I imagine 24 hours of this is enough to call somebody yes [16:44:51] what's the other phab project this applies to other than ops? 
[16:45:13] good question, since nobody seems to know what data is in the thing anyways [16:45:20] :/ [16:45:29] #redis doesn't exist [16:46:04] legoktm: "whether to display a deletion log or not it's not super duper crucial” [16:46:29] I'm trying to think up a reason we'd need a big central storage for persistent bloom filters, but I'm not creative enough [16:46:33] aaron's not on the ticket [16:46:35] but I don't know what the rebuild cost is if I just turn off AOF and redis restarts... [16:46:47] it's assigned to him greg-g? [16:46:49] 6operations, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1116871 (10greg) [16:46:57] oh! /me just looked at CC [16:47:17] greg-g: do you have a # for aaron? [16:47:30] looking [16:49:58] 6operations, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1116885 (10chasemp) gave aaron a call and left a vm [16:53:23] bblack: cp3008 in esams is special? the only one in .wikimedia.org , all others in .wmnet [16:54:05] mutante: it's special because it's the last one not reinstalled in esams yet. It used to be about 50/50 wikimedia.org/wmnet. I've been moving all to the private vlan as they're reinstalled. [16:54:30] bblack: ah, got it. 
cool [16:54:49] it will get reinstalled/renamed sometime in the next hour or so [16:54:59] doing the last upload eqiad now [16:55:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 619 [16:55:37] (03PS1) 10Dzahn: depool cp1049 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196596 [16:55:56] (03PS1) 10BBlack: uncomment cp1044 [puppet] - 10https://gerrit.wikimedia.org/r/196597 [16:56:08] (03CR) 10BBlack: [C: 032 V: 032] uncomment cp1044 [puppet] - 10https://gerrit.wikimedia.org/r/196597 (owner: 10BBlack) [16:56:27] mutante: awesome :) [16:56:47] ah, misc :) [16:57:04] yeah I've hit both misc so far, and 1/2 parsoidcache [16:57:22] may leave the other parsoidcache for now, as gwicke seems concerned about losing cache contents there for perf [16:57:27] (03PS2) 10Dzahn: depool cp1049 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196596 [16:58:16] (03CR) 10Dzahn: [C: 032] depool cp1049 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196596 (owner: 10Dzahn) [16:59:03] once cp1049 + cp3008 are done, we're basically task-complete on this. [16:59:17] there are still 3 other servers not properly reinstalled, one at each dc, but they're all down for HW issues [16:59:34] ok.very nice. yay! [17:00:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 94248 Threads: 2 Questions: 691483 Slow queries: 579 Opens: 2176 Flush tables: 2 Open tables: 64 Queries per second avg: 7.336 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:00:28] we should tell #debian how much more Debian we use now [17:02:57] mutante: have we blogged about the change yeT? 
[17:03:12] 6operations, 10ops-esams: Remove all Toolserver equipment - https://phabricator.wikimedia.org/T92518#1116903 (10RobH) p:5Triage>3High a:3mark [17:03:17] seems it could perhaps be told wider than #debian ;) [17:04:09] ebernhardson: no, we have not afaik [17:04:44] 6operations, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1116906 (10chasemp) called ori who knew understood the workload and he said he was minutes away from the office and will get into it thanks @ori [17:04:46] 6operations, 10ops-esams: Remove knsq16-30 and prepare OE13 for new servers - https://phabricator.wikimedia.org/T92519#1116908 (10RobH) p:5Triage>3High a:3mark [17:05:29] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1116912 (10RobH) p:5Triage>3High a:3mark [17:06:22] (03PS2) 10Andrew Bogott: Block in firstboot until NFS mounts are available. [puppet] - 10https://gerrit.wikimedia.org/r/196233 [17:08:00] 6operations, 10Analytics-EventLogging: Delete vanadium:/srv/eventlogging - https://phabricator.wikimedia.org/T75084#1116914 (10RobH) a:3Ottomata @ottomata: is this task something you would handle? I'd think so, since the parent is also assigned to you. If not, please feel free to remove yourself. [17:09:03] (03PS3) 10Andrew Bogott: Block in firstboot until NFS mounts are available. [puppet] - 10https://gerrit.wikimedia.org/r/196233 [17:09:53] 6operations, 10ops-codfw: rack/wire/initial setup of db2043-db2070 - https://phabricator.wikimedia.org/T89368#1116922 (10RobH) p:5Normal>3High @mark: Please see notes https://phabricator.wikimedia.org/T89368#1034823 and advise on where you'd like us to rack the remainder of this order. [17:10:14] (03CR) 10coren: [C: 031] "Polling in a loop is kinda cruddy, but it beats the race condition caused by the async nature of how volumes are exported. 
:-(" [puppet] - 10https://gerrit.wikimedia.org/r/196233 (owner: 10Andrew Bogott) [17:22:24] (03PS1) 10Dzahn: repool cp1049 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196601 [17:34:18] 6operations, 10Analytics-EventLogging: Delete vanadium:/srv/eventlogging - https://phabricator.wikimedia.org/T75084#1117037 (10Ottomata) Ja can do. [17:35:39] 6operations, 10Analytics-EventLogging: Delete vanadium:/srv/eventlogging - https://phabricator.wikimedia.org/T75084#1117041 (10Ottomata) 5Open>3Resolved [17:36:35] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:36:45] (03PS1) 10Dzahn: drop shop & store entries from some projects [dns] - 10https://gerrit.wikimedia.org/r/196605 [17:37:03] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK: OK: No anomaly detected [17:37:35] 6operations, 10Incident-20150205-SiteOutage, 6MediaWiki-Core-Team, 10MediaWiki-Debug-Logging, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1117047 (10Legoktm) [17:38:03] (03PS2) 10Dzahn: drop shop & store entries from some projects [dns] - 10https://gerrit.wikimedia.org/r/196605 (https://phabricator.wikimedia.org/T92438) [17:39:53] 6operations, 6Commons, 6Multimedia, 7HHVM, and 3 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1117064 (10bd808) [17:40:20] 6operations, 7Graphite: revisit what percentiles are calculated by txstatsd - https://phabricator.wikimedia.org/T88662#1117068 (10Nemo_bis) This is really a pity... On a dashboard like https://gdash.wikimedia.org/dashboards/frontend/ , it's now basically impossible to detect any pattern, because both the mean... 
[17:41:21] !log cp1049 (upload) - depooled in pybal [17:41:24] (03PS1) 10Rush: phabricator admin group with permissions [puppet] - 10https://gerrit.wikimedia.org/r/196606 [17:41:28] Logged the message, Master [17:42:14] (03CR) 10jenkins-bot: [V: 04-1] phabricator admin group with permissions [puppet] - 10https://gerrit.wikimedia.org/r/196606 (owner: 10Rush) [17:42:29] 6operations, 6Commons, 6Multimedia, 7HHVM, and 3 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1117072 (10bd808) The HHVM upstream bug tracked in {T91468} should block this rollout. See analysis in T89918#1052318 for full gory details, but the TL;DR is that HHVM's... [17:45:30] mutante: about? [17:46:07] chasemp: yea [17:46:19] is https://gerrit.wikimedia.org/r/#/c/196606/ that a real failure or linting is busted? [17:46:21] any idea? [17:47:21] hmm. does it not like the * character in description? [17:47:23] looking [17:47:47] ah,i see it, real error chasemp [17:48:00] you have 2 closing ] [17:48:20] in the privileges section , 314 and 315 [17:48:55] ahh [17:48:57] _joe_: you around? [17:49:10] <_joe_> gwicke: in a meeting [17:49:17] kk [17:49:45] mutante: thanks [17:49:52] (03PS2) 10Rush: phabricator admin group with permissions [puppet] - 10https://gerrit.wikimedia.org/r/196606 [17:49:52] chasemp: welcome [17:52:19] 6operations, 10ops-codfw, 3wikis-in-codfw: Configure mw2001-2134 correctly - https://phabricator.wikimedia.org/T91238#1117148 (10Papaul) Redirection after boot was already enabled on mw-2019-mw2049. After rebooting those servers the installation process started.The installation is complete for mw2019-mw2049. [17:52:23] <_joe_> gwicke: but tell me what you need [17:52:38] <_joe_> papaul: thanks a ton man! 
[17:53:01] you welcome [17:53:59] (03CR) 10Dzahn: [C: 031] "like, the better alternative to phab-roots" [puppet] - 10https://gerrit.wikimedia.org/r/196606 (owner: 10Rush) [17:54:57] _joe_: wanted to discuss https://gerrit.wikimedia.org/r/#/c/196531/ [17:55:58] <_joe_> gwicke: +1 :) [17:56:15] (03CR) 10Dzahn: "should most likely be replaced by https://gerrit.wikimedia.org/r/#/c/196606/1" [puppet] - 10https://gerrit.wikimedia.org/r/196425 (https://phabricator.wikimedia.org/T92564) (owner: 10Dzahn) [17:56:44] PROBLEM - Disk space on ms-be2007 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error [17:56:45] _joe_: +1/2 ? [17:57:09] <_joe_> gwicke: knowing we're moving VE to restbase fast is reassuring - we can turn the parsoid jobs off once everything goes through restbase? [17:57:19] yes [17:57:26] and it's already processing everything twice right now [17:57:50] didn't even show up among all the other stuff like 50% of the parsoid cache being wiped last night [17:58:35] <_joe_> gwicke: k, it's +1 because it's 7 pm here :) [17:59:26] _joe_: any objections against merging & monitoring it today? [17:59:36] we have quite a few hours left over here [17:59:58] <_joe_> gwicke: none, in fact, I will do it [18:00:11] 6operations, 10Parsoid, 6Services: Move Parsoid config into ops/puppet - https://phabricator.wikimedia.org/T92636#1117196 (10faidon) That doesn't sound great — puppet's access cannot be opened up to all deployers, which means there's going to be an implicit dependency between a Parsoid deployer and a root, t... [18:00:17] (03CR) 10Giuseppe Lavagetto: [C: 032] "I think this is completely ok after discussions with Gabriel and Marko." [puppet] - 10https://gerrit.wikimedia.org/r/196531 (owner: 10GWicke) [18:00:27] _joe_: ok, thx! 
[18:00:42] will keep an eye on it over the weekend as well [18:00:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/1: down - Transit: ! XO (WA/OGXX/563343) {#2009} [10Gbps DF]BR [18:01:28] RECOVERY - Disk space on ms-be2007 is OK: DISK OK [18:03:25] 6operations, 10ops-esams, 7HTTPS, 3HTTPS-by-default: esams power capacity issues - https://phabricator.wikimedia.org/T90000#1117210 (10mark) I racked 10 of the new servers today, and (temp) cabled them up (just power) for testing. At POST, they consume up to 1700W so far. [18:05:27] (03PS3) 10Rush: phabricator admin group with permissions [puppet] - 10https://gerrit.wikimedia.org/r/196606 [18:06:41] (03CR) 10Rush: [C: 032] phabricator admin group with permissions [puppet] - 10https://gerrit.wikimedia.org/r/196606 (owner: 10Rush) [18:08:48] (03CR) 10Dzahn: [C: 032] repool cp1049 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196601 (owner: 10Dzahn) [18:08:50] (03PS1) 10Rush: phab allow chad h to admin sane things [puppet] - 10https://gerrit.wikimedia.org/r/196613 [18:09:24] 10Ops-Access-Requests, 6operations, 6Phabricator, 6Release-Engineering, 5Patch-For-Review: Chad H. needs access to iridium (Phabricator host) to manage repos - https://phabricator.wikimedia.org/T92564#1117224 (10chasemp) https://gerrit.wikimedia.org/r/#/c/196613/ [18:13:16] !log cp1049 - repooled in pybal - all eqiad upload caches now jessie [18:13:23] Logged the message, Master [18:13:45] 6operations, 10Parsoid, 6Services: Move Parsoid config into ops/puppet - https://phabricator.wikimedia.org/T92636#1117236 (10ssastry) That was my concern as well. I also mentioned that I had hashar move the betalabs config out of puppet into parsoid deploy repo so we could tweak it without an ops dependency.... 
[18:14:53] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1117240 (10BBlack) (apparently disks already arrived in esams!) esams cluster re-arrangement details: Text cluster: ------------------ loses decommed amssq* (all of current cap... [18:19:09] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1117269 (10BBlack) note: we could potentially reduce some of the interdependent moves above by first adding the new ethernet to cp3015-cp3018, and swapping the roles of cp3015-30... [18:19:27] (03CR) 10Dzahn: [C: 031] phab allow chad h to admin sane things [puppet] - 10https://gerrit.wikimedia.org/r/196613 (owner: 10Rush) [18:20:14] (03CR) 10Dzahn: "and see https://gerrit.wikimedia.org/r/#/c/196613/" [puppet] - 10https://gerrit.wikimedia.org/r/196425 (https://phabricator.wikimedia.org/T92564) (owner: 10Dzahn) [18:20:20] (03Abandoned) 10Dzahn: create a phab-roots admin group, add demon [puppet] - 10https://gerrit.wikimedia.org/r/196425 (https://phabricator.wikimedia.org/T92564) (owner: 10Dzahn) [18:25:21] 6operations, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1117287 (10Dzahn) >>! In T92438#1116424, @faidon wrote: > Let's just drop everything but 2-3 domains, I'd say. It's really not worth the cruft.. This is dropping it from wiktionary, wikinews, wik... [18:27:19] (03PS12) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [18:35:54] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [18:38:40] so we are running into quota limits on the staging project... [18:39:01] I keep forgetting autismcat is aaron :) [18:39:05] twentyafterfour: which quota? number of instances ? [18:39:11] maxed out cores and nearly exceeded instance limit. 
[18:39:13] hey AaronSchulz what are the consequences for turning off AOF for rbf*? [18:39:38] mutante: we are going to need a lot more headroom since we are duplicating everything in the deployment-prep project [18:40:25] twentyafterfour: ah, hmm, we should involve andrewbogott then, when it's not the common "just one more instance" [18:40:50] heh, yeah, the quota of 10 for staging is.... way low :) [18:41:17] twentyafterfour: I can increase the quota… are you going to add all your instances right away or just a few today? [18:41:22] chasemp: it means restart/power problems will cause a long rebuild of the cache...also replication without durability in redis is weird [18:41:36] andrewbogott: just a few today I'd say [18:41:43] ok, stay tuned… [18:41:45] sure, I understand what AOF is; I mean, can you survive it [18:41:49] (03PS13) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [18:41:52] because right now it's not surviving it being //on// [18:42:11] it's dropping connections / timing out a large percentage of clients [18:42:17] due to redis blocking [18:42:17] maybe rdb snapshots every day would be better [18:43:57] (03PS1) 10BBlack: depool cp3008 -> reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196617 [18:44:10] (03CR) 10BBlack: [C: 032 V: 032] depool cp3008 -> reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196617 (owner: 10BBlack) [18:44:36] twentyafterfour: ok, that should give you room for a bit [18:45:01] andrewbogott: thanks. deployment-prep has 44 instances and 114 cores currently ... we will probably end up with a few less instances than before when all is done. [18:45:16] !log reinstalling cp3008 [18:45:22] Logged the message, Master [18:46:25] robh: is anything holding up https://rt.wikimedia.org/Ticket/Display.html?id=9249? Are we waiting on mark’s signoff?
[18:46:51] I pinged him about it earlier, it is assigned to him [18:47:02] but that purchase amount will require both his and damon's approvals [18:47:12] I am not sure of the status of it with him though [18:47:46] andrewbogott: but be aware that mark is onsite at esams doing onsite work [18:47:54] so he likely won't get to it until tonight at the soonest [18:48:11] robh: that’s fine, just making sure it hasn’t been forgotten :) [18:48:21] (and even then i'd personally doubt it since onsite work is tiring and shit, he should sleep and drink beer later ;) [18:48:31] it hasn't been, no worries =] [18:48:51] (03PS1) 10BBlack: update hieradata cp3008 domainname [puppet] - 10https://gerrit.wikimedia.org/r/196620 [18:49:00] AaronSchulz: it doesn't look like the redis roles are parameterized for what you want, if that is what you want [18:49:05] (03CR) 10BBlack: [C: 032 V: 032] update hieradata cp3008 domainname [puppet] - 10https://gerrit.wikimedia.org/r/196620 (owner: 10BBlack) [18:49:16] but the failures are constant, so I am going to turn off AOF and freeze these for you and ori [18:49:32] to converse so it isn't failing through the weekend, assuming that's the entire failure mode (I hope) [18:49:42] !log rbf2001 - powercycling, PXE boot [18:49:48] Logged the message, Master [18:49:54] unless you are against that [18:50:38] (03PS1) 10Andrew Bogott: Moved the dns::recursor class into a module [puppet] - 10https://gerrit.wikimedia.org/r/196621 [18:50:57] (03PS1) 10BBlack: remove old cp3008 public subnet DNS [dns] - 10https://gerrit.wikimedia.org/r/196622 [18:51:01] chasemp: you can turn it off sure [18:51:08] * AaronSchulz is wading through puppet atm [18:51:20] shouldn't be hard to do daily rdb snapshots [18:51:43] (03CR) 10BBlack: [C: 032] remove old cp3008 public subnet DNS [dns] - 10https://gerrit.wikimedia.org/r/196622 (owner: 10BBlack) [18:55:44] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [18:59:45] (03PS1) 10Dzahn: let
rbf200x hosts be Ubuntu for now [puppet] - 10https://gerrit.wikimedia.org/r/196624 (https://phabricator.wikimedia.org/T86897) [19:01:09] (03CR) 10Dzahn: "just this for now to fix rbf2001? rbf2002 seems fine already and we can still debug the issue on rdf2*" [puppet] - 10https://gerrit.wikimedia.org/r/196624 (https://phabricator.wikimedia.org/T86897) (owner: 10Dzahn) [19:08:16] AaronSchulz: that seems to have drastically helped in one case and not much in the other [19:08:38] is it loadbalancing between them or why would rbf1001 be higher load / failure? [19:08:51] either way still problems [19:10:36] root@rbf1002:~# lsof -i | grep 6379 | wc -l [19:10:36] 286 [19:10:43] root@rbf1001:~# lsof -i | grep 6379 | wc -l [19:10:43] 59 [19:12:43] they got the same pid? :) [19:13:29] oh that's a port number duh [19:13:45] port :) [19:13:48] root@rbf1002:~# lsof -i | grep ESTABLISHED | grep 6379 | wc -l [19:13:48] 309 [19:13:49] reads go to the "slaves", which is just the one [19:13:57] root@rbf1001:~# lsof -i | grep ESTABLISHED | grep 6379 | wc -l [19:13:57] 44 [19:14:39] if the load on rdb1001 is lower now, there is no reason both can't be used [19:15:35] actually it already uses both [19:15:49] odd [19:16:37] definitely lopsided https://phabricator.wikimedia.org/T92591#1117540 [19:19:53] (03PS1) 10BBlack: repool cp1008, update node comments [puppet] - 10https://gerrit.wikimedia.org/r/196625 (https://phabricator.wikimedia.org/T86648) [19:22:12] chasemp: I thought it was rbf1002 with more queries now, from that comment it's rbf1001, which is less surprising since it gets writes+reads instead of just reads [19:22:33] (03CR) 10BBlack: [C: 032] repool cp1008, update node comments [puppet] - 10https://gerrit.wikimedia.org/r/196625 (https://phabricator.wikimedia.org/T86648) (owner: 10BBlack) [19:25:08] wikibugs poofed?
[19:26:55] (03PS1) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [19:27:05] (03CR) 10jenkins-bot: [V: 04-1] [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 (owner: 10Yuvipanda) [19:27:14] (03PS2) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [19:28:09] (03CR) 10jenkins-bot: [V: 04-1] [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 (owner: 10Yuvipanda) [19:29:54] seems like the enwiki log bloom filter is stuck in 2005 [19:30:03] (03PS3) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [19:31:30] AaronSchulz: why have AOF on the read only node at all? [19:32:32] if the idea is to have fast reads from a less encumbered host [19:32:40] Jeff_Green: could you glance at https://phabricator.wikimedia.org/T90679#1102468 to confirm FR is not using contacts.wm but another civicrm [19:33:01] mutante: ha, this keeps coming up [19:33:03] Jeff_Green: also,i have a new patch to drop a lot of the store/shop entries [19:33:11] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [19:33:16] just because the software name is the same :-P [19:33:29] Jeff_Green: yea, people confuse civis, i am trying to kill one so it will not happen anymore:) [19:33:38] afaik fundraising has never had anything to do with that civi install, certainly not since I've been here [19:34:15] i am just collecting a bunch of "we don't use that" and hopefully don't find a user [19:34:25] (03PS4) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [19:34:28] have you found any user at all yet? 
[19:34:52] no,the people who used to are now either at Wiki Edu Foundation or use Asana [19:35:10] but they did ask "my understanding that the Fundraising team may still use Civi":) [19:35:10] right, that's what I thought [19:35:16] sigh [19:35:29] i can only say it so many times before I croak [19:35:57] see 2 comments above . thanks [19:36:01] my suggestion is to lock the db and give it a week and see if anyone screams [19:36:19] here's the "less shops" thing https://gerrit.wikimedia.org/r/#/c/196605/ [19:36:23] or part of it [19:36:24] AaronSchulz: the lsof -i output will show connections vs https://phabricator.wikimedia.org/T92591#1117540 which shows failures. which may or may not correlate [19:37:14] Jeff_Green: or take down the webinterface but keep the db to give asana users a dump [19:37:22] sure [19:38:12] (03PS14) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [19:39:26] (03PS2) 10Andrew Bogott: Moved the dns::recursor class into a module [puppet] - 10https://gerrit.wikimedia.org/r/196621 [19:40:21] (03PS15) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [19:41:38] Jeff_Green: should we ask Victoria if she wants only wikimedia and wikipedia or any others like mediawiki, wiktionary, foundation [19:42:19] yeah [19:42:38] i don't know why those were ever set up [19:47:15] hm… what’s going on with the puppet compiler here? http://puppet-compiler.wmflabs.org/632/change/196621/html/chromium.wikimedia.org.html [19:47:20] yuvi, any idea? Or mutante? [19:48:00] Does the compiler have an out of date private repo, or private shadow repo? 
[19:52:20] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:52:37] (03PS16) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [19:56:05] (03PS5) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [19:56:09] (03PS1) 1020after4: Remove the conditional declaration for labs, should use hiera. [puppet] - 10https://gerrit.wikimedia.org/r/196631 [19:57:45] (03PS1) 10Aaron Schulz: Disabled bloom filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196633 [19:58:00] andrewbogott: which part of it? that it doesn't appear to work for the production version is a limitation of the compiler afaik [19:58:00] (03CR) 1020after4: [C: 031] [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 (owner: 10Yuvipanda) [19:58:22] (03CR) 10Yuvipanda: [C: 04-1] Remove the conditional declaration for labs, should use hiera. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196631 (owner: 1020after4) [19:58:31] the warnings about "Variable access via 'ip4_address' is deprecated. " is true [19:58:33] (03CR) 10Aaron Schulz: [C: 032] Disabled bloom filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196633 (owner: 10Aaron Schulz) [19:58:38] (03Merged) 10jenkins-bot: Disabled bloom filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196633 (owner: 10Aaron Schulz) [19:58:41] (03CR) 10BryanDavis: [C: 031] "I haven't run it but the change looks sane to me. 
Easiest way to test definitively is to cherry-pick to deployment-bastion:/srv/deployment" [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm) [19:59:30] (03PS17) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [19:59:32] !log aaron Synchronized wmf-config/mc.php: Disabled bloom filter (duration: 00m 08s) [19:59:38] Logged the message, Master [19:59:40] mutante: ah, so I should regard that as a successfull 0-diff run, and ignore the ‘one failure’? [19:59:45] (03PS2) 1020after4: Remove the conditional declaration for labs, should use hiera. [puppet] - 10https://gerrit.wikimedia.org/r/196631 [19:59:51] http://puppet-compiler.wmflabs.org/632/change/196621/compiled/puppet_catalogs_3_production/chromium.wikimedia.org.warnings [20:00:40] greg-g: Can I deploy a CentralAuth update and MassMessage fix so we can start sending out SULF notifications? [20:00:44] (03CR) 1020after4: "better?" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196631 (owner: 1020after4) [20:01:18] (03PS3) 10Yuvipanda: Remove the conditional declaration for labs, should use hiera. [puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:01:21] greg-g: https://gerrit.wikimedia.org/r/195448 is CentralAuth, https://gerrit.wikimedia.org/r/196615 is MassMessage [20:01:53] (03PS4) 10Yuvipanda: Remove the conditional declaration for labs, should use hiera. [puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:02:20] (03CR) 10Yuvipanda: [C: 031] "@twentyafterfour: before I merge, can you set the params for deployment-prep so that doesn't break when this gets merged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:03:52] (03CR) 1020after4: "I don't think it's going to break deployment prep. in fact, deployment-prep should already be broken if I understand things correctly - pr" [puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:04:48] (03PS18) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [20:05:08] twentyafterfour: right, but we’re changing other defaults as well (like maxmemory) [20:05:12] and aof write size [20:05:14] *rewrite [20:05:20] anyway, you’re right, we shouldn’t worry. [20:05:26] (03PS5) 10Yuvipanda: Remove the conditional declaration for labs, should use hiera. [puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:05:32] we can re-image beta at some point [20:05:40] (03CR) 10Yuvipanda: [C: 032] Remove the conditional declaration for labs, should use hiera. [puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:05:43] AaronSchulz: if rbf1001 is meant to be the master it doesn't seem to have any slaves? [20:05:49] YuviPanda: those values were the same for both branches [20:05:53] oh [20:05:55] fair enough [20:06:10] twentyafterfour: actually no [20:06:17] twentyafterfour: maxmemory in prod is 0.82 * total [20:06:20] twentyafterfour: in labs it was just 500 [20:06:23] all the time [20:06:27] ah yes right [20:06:52] twentyafterfour: aaah, and dir. dir isn’t set in prod. [20:07:19] twentyafterfour: it won’t break beta either, but we should set it... [20:07:41] twentyafterfour: can you amend the patch to pass $dir through? [20:08:06] (03CR) 10Yuvipanda: "$dir needs to pass through." 
[puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:08:27] YuviPanda: dir is passed [20:08:33] and I set it for deployment-prep via hiera [20:08:56] twentyafterfour: ouch. you’re right. [20:09:17] but I see that the default is good anyway, /srv/redis [20:09:22] (03CR) 10Yuvipanda: [C: 032] "I'm an idiot." [puppet] - 10https://gerrit.wikimedia.org/r/196631 (https://phabricator.wikimedia.org/T91547) (owner: 1020after4) [20:09:37] twentyafterfour: yeah. [20:09:40] twentyafterfour: done [20:09:52] woot [20:10:32] so all this time deployment-redis02 was basically pointless [20:11:00] hahah [20:11:07] twentyafterfour: wait till you learn about deployment-videoscaler01 :P [20:11:13] chasemp: is rbf1002 misconfigured? [20:11:18] annnnd still no dice on staging-rdb2 [20:11:39] twentyafterfour: did you update staging-palladium? [20:11:42] to pick up this change? [20:11:45] AaronSchulz: I didn't change anything except AOF but not sure yet [20:11:46] that would make some sense, the code would get stuck in an update loop on 1/2 the filter checks [20:11:51] guessing you turned off queries? [20:11:55] YuviPanda: it has to manually update? [20:11:56] because i see nothing now [20:12:04] yes, no queries now [20:12:06] twentyafterfour: I don’t think thcipriani|afk set it to autoupdate, no. 
[20:12:07] (03PS19) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [20:12:19] twentyafterfour: and even with autoupdate, it updates only once every 20mins or something like that [20:12:29] I was referring above to the replication that is supposed to be there [20:12:53] understood [20:12:58] just making sure you turned things off on purpose [20:12:59] :) [20:13:01] though I may just change the code to not need that [20:13:08] (03PS1) 10Andrew Bogott: Moved the labs ldap dns manifest into a module [puppet] - 10https://gerrit.wikimedia.org/r/196638 [20:13:13] chasemp: so is the replication broken or not? [20:13:25] # Replication [20:13:25] role:master [20:13:25] connected_slaves:0 [20:13:39] rbf1001 [20:13:57] * AaronSchulz needs "breakfast" [20:14:16] (03CR) 10jenkins-bot: [V: 04-1] Moved the labs ldap dns manifest into a module [puppet] - 10https://gerrit.wikimedia.org/r/196638 (owner: 10Andrew Bogott) [20:14:16] it must not have ever been working? [20:14:19] is that possible? [20:16:30] more demonstrative: rbf1002 [20:16:32] role:master [20:16:32] connected_slaves:0 [20:16:35] the "slave" [20:17:28] (03PS1) 10Chad: WIP: Hiera-ize most of the Elasticsearch config. Untested. [puppet] - 10https://gerrit.wikimedia.org/r/196640 [20:18:24] AaronSchulz: ok I manually configured the slave status [20:18:32] let's try turning it back on to see if that makes a difference? [20:21:48] (03PS2) 10Andrew Bogott: Moved the labs ldap dns manifest into a module [puppet] - 10https://gerrit.wikimedia.org/r/196638 [20:21:50] (03PS3) 10Andrew Bogott: Moved the dns::recursor class into a module [puppet] - 10https://gerrit.wikimedia.org/r/196621 [20:24:19] AaronSchulz: want to give this a try?
[20:24:59] (03PS20) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [20:26:47] (03PS6) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [20:29:05] andrewbogott: ^ is exciting. lets people write simple yaml based rules for which nodes get which roles. Merges that with whatever we get from wikitech / LDAP, and then uses that. Enabled only for self hosted puppetmasters, but this makes life super so much easier for staging / deployment-prep [20:29:10] since we don’t have to go futz with wikitech to recreate things [20:30:04] yep, seems useful. Where does the yaml file itself live? On the instance? [20:30:32] andrewbogott: ah, so atm in the ops/puppet repo itself. [20:30:47] oh, hm. ok [20:30:48] andrewbogott: and I’m not sure where the yaml file should live, actually. I prefer ops/puppet. [20:30:53] legoktm: doit [20:31:29] andrewbogott: we could also put the yaml file on wikitech. [20:36:44] (03PS2) 10Dzahn: let rbf200x hosts be Ubuntu for now [puppet] - 10https://gerrit.wikimedia.org/r/196624 (https://phabricator.wikimedia.org/T86897) [20:38:52] (03PS3) 10Dzahn: let rbf200x hosts be Ubuntu for now [puppet] - 10https://gerrit.wikimedia.org/r/196624 (https://phabricator.wikimedia.org/T86897) [20:39:20] (03CR) 10Dzahn: [C: 032] let rbf200x hosts be Ubuntu for now [puppet] - 10https://gerrit.wikimedia.org/r/196624 (https://phabricator.wikimedia.org/T86897) (owner: 10Dzahn) [20:43:32] (03PS2) 10Chad: WIP: Hiera-ize most of the Elasticsearch config. Untested. [puppet] - 10https://gerrit.wikimedia.org/r/196640 [20:47:03] (03PS3) 10Chad: WIP: Hiera-ize most of the Elasticsearch config. Untested. [puppet] - 10https://gerrit.wikimedia.org/r/196640 [20:53:13] (03Abandoned) 10Dzahn: update closing date for wikimania scholarships? 
[puppet] - 10https://gerrit.wikimedia.org/r/195836 (https://phabricator.wikimedia.org/T92358) (owner: 10Dzahn) [20:53:39] (03CR) 10BryanDavis: [C: 04-1] "vagrant@scap:/srv/mediawiki-staging$ scap --verbose" [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm) [20:54:45] chasemp: is the job queue replication working? [20:54:56] (03CR) 10BryanDavis: "Also sync-file doesn't use this code path, it use php -l directly. The easiest way to fix for that would be to add `utils.check_php_opening" [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm) [20:56:05] AaronSchulz: the slave relationship is now established [20:56:17] replication should work [20:56:34] fwiw I think replication broke with a hiera conversion https://phabricator.wikimedia.org/rOPUP457d58535e9b3e49e0eb7a91c42b76316f84c44f [20:58:45] (03PS1) 10Thcipriani: Parameterize mail::mx role [puppet] - 10https://gerrit.wikimedia.org/r/196658 [20:59:34] (03PS4) 10Chad: WIP: Hiera-ize most of the Elasticsearch config. Untested. [puppet] - 10https://gerrit.wikimedia.org/r/196640 [20:59:57] chasemp: stupid hiera q for you [21:00:04] if you know [21:00:19] !log legoktm Synchronized php-1.25wmf21/extensions/MassMessage/: https://gerrit.wikimedia.org/r/196649 (duration: 00m 09s) [21:00:19] odds are against it but you can ask [21:00:24] Logged the message, Master [21:00:31] hah, in order to set a variable, does it need to be a parameter of a class? [21:00:47] i want to set a variable in beta labs [21:01:18] !log rbf2001 - reinstalled, wmf-reimage [21:01:22] Logged the message, Master [21:01:49] or, if the class uses a global variable [21:01:51] and it is not yet set [21:01:55] will it be inferred from hiera? [21:02:02] ..."Hiera is used by puppet as the default lookup method for class parameters" [21:02:10] so I would imagine so [21:02:29] hm [21:02:38] do you about?
https://wikitech.wikimedia.org/wiki/Puppet_Hiera [21:02:45] last example there [21:03:21] !log legoktm Synchronized php-1.25wmf21/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/196649/ (duration: 00m 08s) [21:03:26] Logged the message, Master [21:03:33] ok thanks [21:03:34] reading more [21:04:40] !log legoktm Synchronized php-1.25wmf20/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/196654 (duration: 00m 08s) [21:04:47] Logged the message, Master [21:05:02] !log legoktm Synchronized php-1.25wmf20/extensions/MassMessage/: https://gerrit.wikimedia.org/r/196648 (duration: 00m 08s) [21:05:06] Logged the message, Master [21:05:23] greg-g: all done [21:05:30] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [21:07:10] i'm doing this for kafka, and i'm not fully ready to switch over to hiera for that, i'm just working on something that needs kafka in betalabs. so I want to set the cluster config via the labs hiera interface. Hm. maybe I can use this hiera() function? [21:07:11] (03PS5) 10Dzahn: dsh: delete most groups [puppet] - 10https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) [21:10:40] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [21:10:48] (03PS5) 10Chad: WIP: Hiera-ize most of the Elasticsearch config. Untested.
[puppet] - 10https://gerrit.wikimedia.org/r/196640 [21:12:37] !log rbf2001 - re-signed puppet, re-enable icinga [21:12:43] Logged the message, Master [21:12:55] (03PS4) 10Legoktm: check_php_syntax: Check for any content before opening (03CR) 10Legoktm: "PS4 fixes sync-file and doesn't error on empty files" [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm) [21:14:25] (03PS5) 10Legoktm: Check for any content before opening (03PS1) 10Ottomata: Allow kafka configuration in labs via hiera [puppet] - 10https://gerrit.wikimedia.org/r/196665 [21:17:03] YuviPanda: yt? [21:18:02] (03CR) 10Chad: ""Untested" isn't true: it works in beta (can't do staging yet because no trebuchet host yet). The duplication between staging/beta isn't i" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196640 (owner: 10Chad) [21:18:09] (03CR) 10jenkins-bot: [V: 04-1] Allow kafka configuration in labs via hiera [puppet] - 10https://gerrit.wikimedia.org/r/196665 (owner: 10Ottomata) [21:19:03] (03PS2) 10Ottomata: Allow kafka configuration in labs via hiera [puppet] - 10https://gerrit.wikimedia.org/r/196665 [21:19:56] ottomata: I think he just went to bed [21:19:57] (03PS6) 10Chad: WIP: Hiera-ize most of the Elasticsearch config, barely tested [puppet] - 10https://gerrit.wikimedia.org/r/196640 [21:20:02] (03CR) 10jenkins-bot: [V: 04-1] Allow kafka configuration in labs via hiera [puppet] - 10https://gerrit.wikimedia.org/r/196665 (owner: 10Ottomata) [21:20:04] (03PS3) 10Ottomata: Allow kafka configuration in labs via hiera [puppet] - 10https://gerrit.wikimedia.org/r/196665 [21:21:03] (03CR) 10jenkins-bot: [V: 04-1] Allow kafka configuration in labs via hiera [puppet] - 10https://gerrit.wikimedia.org/r/196665 (owner: 10Ottomata) [21:28:15] (03CR) 10Ottomata: "Not certain why puppet doesn't like this at the moment, I'll figure that out." 
[puppet] - 10https://gerrit.wikimedia.org/r/196665 (owner: 10Ottomata) [21:28:46] (03PS1) 10Dzahn: apply role::db::redis on rbf200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/196668 (https://phabricator.wikimedia.org/T86898) [21:37:19] (03CR) 10Dzahn: [C: 032] apply role::db::redis on rbf200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/196668 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [21:42:04] (03CR) 1020after4: [C: 031] Check for any content before opening (03Abandoned) 10John F. Lewis: apache: remove shop.wp.o funnel + shop.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/196008 (https://phabricator.wikimedia.org/T92438) (owner: 10John F. Lewis) [21:50:40] (03Abandoned) 10John F. Lewis: shop: change main shop domain [dns] - 10https://gerrit.wikimedia.org/r/196007 (https://phabricator.wikimedia.org/T92438) (owner: 10John F. Lewis) [21:51:22] (03CR) 10John F. Lewis: [C: 031] "per ticket comment by Victoria" [dns] - 10https://gerrit.wikimedia.org/r/196605 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn) [21:53:30] (03CR) 10Cscott: [C: 031] "Quotes around numerical values look really weird." [puppet] - 10https://gerrit.wikimedia.org/r/195860 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [21:54:15] (03CR) 10John F. Lewis: [C: 04-1] "requires definitions for bits, upload, login and mobile clauses as well (transferring IRC comments from yesterday)" [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:55:40] (03CR) 10John F. Lewis: [C: 04-1] "per Faidon" [dns] - 10https://gerrit.wikimedia.org/r/196473 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn) [21:57:04] (03CR) 10John F. Lewis: [C: 04-1] "Jumps the gun, stating definitions not even existing yet :)" [dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn) [21:58:49] (03CR) 10John F. 
Lewis: [C: 031] "Shouldn't be a blocker this should ideally be merged when possible in order to promote the wwwportal pages." [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [22:00:57] (03PS1) 10Aaron Schulz: Added jobqueue federated log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196699 [22:06:14] chasemp: we should probably wipe that data on rbf* for good measure [22:06:40] * AaronSchulz has no idea why it was dialed back to 2005, probably some other data loss from something else [22:11:19] chasemp: also I'm really leaning towards tweaking the code to make it multi-master, so I guess the replication may as well be removed anyway [22:11:45] * AaronSchulz hates spofs [22:12:29] could they ever be multi-master across datacenters? [22:14:59] in principle, they really just replicate data that is all in mysql [22:15:03] there is no original data in there [22:16:59] and the source data is log-structured [22:34:29] springle: if you are working, I have questions about how to set up mysql for pdns. [22:35:10] I know roughly how to do it if I install a local db server, but maybe that’s the wrong thing to do? [22:35:43] e.g. labs services all use a local mysql on virt1000 but I’ve come to regard that as dangerous and bad, so maybe I shouldn’t repeat the mistake [22:40:20] andrewbogott: i think i know one thing about that, yes, should be on a db server instead of local and he'll want you to use alias names, such as "m1-master" instead of server names [22:40:57] could likely be m1-master, that's where f.e. bugzilla is [22:41:09] vs. wiki cluster [22:41:18] That’s good, in the sense that it means Sean has to do the work instead of me. But bad, in the sense that I’m never awake when Sean is working.
(I guess it’s Saturday already) [22:41:27] mutante: it’s for labs, though, so might be some reason why not to comingle with other misc dbs [22:42:01] true, maybe one is especially for labs [22:42:20] i'd just make a ticket requesting it, then you don't have to worry about timezones either [22:43:17] also, he setup dbproxy hosts,in some cases it's going via them [22:44:09] * andrewbogott considers whether writing phab tickets is easier than writing code [22:45:42] just copy/paste from IRC [22:45:54] that should be enough already [22:47:02] (03CR) 10Aaron Schulz: [C: 032] Added jobqueue federated log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196699 (owner: 10Aaron Schulz) [22:47:06] (03Merged) 10jenkins-bot: Added jobqueue federated log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196699 (owner: 10Aaron Schulz) [22:47:44] !log aaron Synchronized wmf-config/InitialiseSettings.php: Added jobqueue federated log (duration: 00m 11s) [22:47:51] Logged the message, Master [22:48:35] 6operations, 6Labs: Replicate or back up glance image data on virt1000 - https://phabricator.wikimedia.org/T90628#1118307 (10Andrew) [22:49:26] (03CR) 10Alex Monk: "This was redundant... The user is now listed twice in this group." [puppet] - 10https://gerrit.wikimedia.org/r/189483 (https://phabricator.wikimedia.org/T88769) (owner: 10Alexandros Kosiaris) [22:52:03] (03CR) 10John F. 
Lewis: [C: 031] adding support to redirect wikimedia.xyz to wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/196321 (owner: 10RobH) [22:54:14] (03PS1) 10Dzahn: add rbf2001/2002 hosts yaml files in hiera [puppet] - 10https://gerrit.wikimedia.org/r/196704 (https://phabricator.wikimedia.org/T86898) [22:58:02] (03CR) 10Dzahn: "please see here instead https://gerrit.wikimedia.org/r/#/c/196605/" [dns] - 10https://gerrit.wikimedia.org/r/196473 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn) [22:58:10] (03Abandoned) 10Dzahn: change all shop/store to myshopify.com CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/196473 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn) [23:01:50] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [23:03:19] (03PS1) 10Dzahn: admins: duplicate milimetric in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/196707 [23:04:30] (03PS2) 10Dzahn: admins: duplicate milimetric in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/196707 [23:05:27] (03CR) 10John F. Lewis: [C: 031] admins: duplicate milimetric in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/196707 (owner: 10Dzahn) [23:06:46] (03CR) 10Dzahn: [C: 032] admins: duplicate milimetric in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/196707 (owner: 10Dzahn) [23:07:56] (03CR) 10Dzahn: "thanks for reporting. fixed here https://gerrit.wikimedia.org/r/#/c/196707/" [puppet] - 10https://gerrit.wikimedia.org/r/189483 (https://phabricator.wikimedia.org/T88769) (owner: 10Alexandros Kosiaris) [23:10:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1118335 (10Dzahn) milimetric was in the deployments group twice. 
(reported by Krenair) removed one of them here: https://gerrit.wikimedia.org/r/#/c/196707/ [23:19:01] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [23:27:09] 6operations, 10Wikimedia-Shop, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1118369 (10Dzahn) [23:29:29] 6operations, 10Wikimedia-Shop, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1118371 (10Dzahn) >>! In T92438#1118034, @vshchepakina wrote: > In regard to having Shopify reissue the cert to cover the new canonical hostname, I could give admin access to S... [23:35:52] 6operations, 10Wikimedia-Shop, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1118375 (10Dzahn) I looked at this a bit and they solve the cert problem by just having a single cert on their side with a bunch of "alt. names" on it, so shop.wikimedia.org is... [23:44:37] !log legoktm Synchronized php-1.25wmf21/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/196718/ (duration: 00m 09s) [23:44:43] Logged the message, Master [23:47:02] !log legoktm Synchronized php-1.25wmf20/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/196717/ (duration: 00m 08s) [23:47:07] Logged the message, Master