[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150716T0000). [00:01:50] RECOVERY - puppet last run on labsdb1004 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [00:06:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL host 208.80.154.197, sessions up: 73, down: 3, shutdown: 0BRPeering with AS1273 not established - CWBRPeering with AS8218 not established - NEO-ASNBRPeering with AS62651 not established - BR [00:09:31] RECOVERY - puppet last run on pollux is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:41:45] (03PS1) 10Dzahn: wikimania scholarships: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/224985 (https://phabricator.wikimedia.org/T105003) [00:43:46] (03PS2) 10Dzahn: wikimania scholarships: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/224985 (https://phabricator.wikimedia.org/T105003) [00:44:45] (03CR) 10Dzahn: [C: 032] wikimania scholarships: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/224985 (https://phabricator.wikimedia.org/T105003) (owner: 10Dzahn) [00:51:40] (03PS1) 10Gergő Tisza: Add salt configuration for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/224987 (https://phabricator.wikimedia.org/T84956) [00:54:57] (03PS2) 10Gergő Tisza: Add salt configuration for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/224987 (https://phabricator.wikimedia.org/T84956) [00:55:21] (03PS1) 10Dzahn: wikimania scholarships: remove namevirtualhost file [puppet] - 10https://gerrit.wikimedia.org/r/224988 (https://phabricator.wikimedia.org/T105920) [00:55:45] (03PS2) 10Dzahn: wikimania scholarships: remove namevirtualhost file [puppet] - 10https://gerrit.wikimedia.org/r/224988 (https://phabricator.wikimedia.org/T105920) [00:55:51] (03CR) 10Dzahn: [C: 032] wikimania scholarships: remove namevirtualhost file [puppet] - 
10https://gerrit.wikimedia.org/r/224988 (https://phabricator.wikimedia.org/T105920) (owner: 10Dzahn) [00:56:35] (03CR) 10Dzahn: "i said +2 , why would review still be "in progress"" [puppet] - 10https://gerrit.wikimedia.org/r/224988 (https://phabricator.wikimedia.org/T105920) (owner: 10Dzahn) [00:58:51] RECOVERY - puppet last run on krypton is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [01:02:00] (03PS1) 10Dzahn: remove en2.wikipedia.org from redirects [puppet] - 10https://gerrit.wikimedia.org/r/224989 (https://phabricator.wikimedia.org/T105981) [01:03:58] (03CR) 10John F. Lewis: [C: 031] "with dep gone" [puppet] - 10https://gerrit.wikimedia.org/r/224989 (https://phabricator.wikimedia.org/T105981) (owner: 10Dzahn) [01:05:35] (03PS2) 10Dzahn: remove en2.wikipedia.org from redirects [puppet] - 10https://gerrit.wikimedia.org/r/224989 (https://phabricator.wikimedia.org/T105981) [01:06:00] (03CR) 10Dzahn: [C: 032] remove en2.wikipedia.org from redirects [puppet] - 10https://gerrit.wikimedia.org/r/224989 (https://phabricator.wikimedia.org/T105981) (owner: 10Dzahn) [01:11:55] 6operations, 6Analytics-Backlog, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1455975 (10mmodell) Krinkle: I don't think anyone has been removing branches prematurely. Also, I am pro... [01:12:21] (03PS1) 10Dzahn: racktables: delete namevirtualhost file [puppet] - 10https://gerrit.wikimedia.org/r/224992 [01:13:48] (03PS1) 10Dzahn: delete files/apache/conf.d/namevirtualhost [puppet] - 10https://gerrit.wikimedia.org/r/224993 [01:16:40] (03CR) 10John F. Lewis: [C: 031] delete files/apache/conf.d/namevirtualhost [puppet] - 10https://gerrit.wikimedia.org/r/224993 (owner: 10Dzahn) [01:16:49] (03CR) 10John F. 
Lewis: [C: 031] racktables: delete namevirtualhost file [puppet] - 10https://gerrit.wikimedia.org/r/224992 (owner: 10Dzahn) [01:17:18] 6operations, 10Wikimedia-Wikimania-Scholarships, 5Patch-For-Review: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1455984 (10Dzahn) [01:17:19] 6operations, 5Patch-For-Review: fix the puppet role for the wikimania scholarship app - https://phabricator.wikimedia.org/T105920#1455982 (10Dzahn) 5Open>3Resolved a:3Dzahn [01:18:10] (03PS1) 10Dzahn: wikimania scholarships: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/224994 (https://phabricator.wikimedia.org/T105003) [01:18:12] (03PS2) 10Dzahn: racktables: delete namevirtualhost file [puppet] - 10https://gerrit.wikimedia.org/r/224992 [01:18:54] (03CR) 10John F. Lewis: [C: 031] wikimania scholarships: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/224994 (https://phabricator.wikimedia.org/T105003) (owner: 10Dzahn) [01:19:18] (03PS2) 10Dzahn: wikimania scholarships: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/224994 (https://phabricator.wikimedia.org/T105003) [01:19:27] (03CR) 10Dzahn: [C: 032] wikimania scholarships: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/224994 (https://phabricator.wikimedia.org/T105003) (owner: 10Dzahn) [01:21:45] (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/224994 (https://phabricator.wikimedia.org/T105003) (owner: 10Dzahn) [01:22:22] !log es1.6 upgrade: upgrade elastic1023 [01:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:35:15] can someone with puppet +2 look at https://gerrit.wikimedia.org/r/#/c/224987/ ? it's trivial and needed for testing [01:37:07] (03PS1) 10Dzahn: misc-web: add node krypton as a backend [puppet] - 10https://gerrit.wikimedia.org/r/224995 (https://phabricator.wikimedia.org/T104946) [01:37:32] (03CR) 10John F. 
Lewis: [C: 031] "trivial for hiera change" [puppet] - 10https://gerrit.wikimedia.org/r/224987 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [01:38:21] (03CR) 10Alex Monk: [C: 032] Provide static PNG logo for emlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214981 (https://phabricator.wikimedia.org/T100953) (owner: 10Odder) [01:38:53] (03Merged) 10jenkins-bot: Provide static PNG logo for emlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214981 (https://phabricator.wikimedia.org/T100953) (owner: 10Odder) [01:41:03] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/214981/ (duration: 00m 12s) [01:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:41:49] Krenair: Having fun? :-) [01:43:17] (03CR) 10BryanDavis: [C: 031] Add salt configuration for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/224987 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [01:43:46] bd808: where is that upstream var actually used? i was wondering because it doesnt appear in the deployment module / salt master [01:43:56] was just looking at that one [01:44:08] it is trebuchet config [01:44:21] so that patch just adds a new trebuchet target [01:44:54] (03CR) 10Alex Monk: "Is anyone going to do this? It's been rotting in the mediawiki-config queue for over 2 months now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [01:45:34] mutante: that's the stuff that used to be in role::deployment to setup trebuchet but is now in hiera [01:45:46] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1456014 (10Tgr) >>! In T102566#1451278, @Nemo_bis wrote: > The necessary steps also must be documented. At least one of https... 
[01:45:51] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [01:46:31] (03PS3) 10Dzahn: Add salt configuration for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/224987 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [01:47:16] bd808: JohnFLewis: i see it now, thanks [01:47:29] (03CR) 10Dzahn: [C: 032] Add salt configuration for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/224987 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [01:47:39] :) [01:47:45] (03Abandoned) 10Alex Monk: Add Dev namespace on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187278 (https://phabricator.wikimedia.org/T369) (owner: 10Spage) [01:48:06] James_F, something like that.. [01:48:22] * James_F nods [01:48:48] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1456021 (10Tgr) >>! In T102566#1456014, @Tgr wrote: > I don't see anything needing change on any of those. Probably because... 
[01:48:51] open mediawiki-config changes without negative code review now all fit on one page [01:49:07] 6operations, 10ops-eqiad, 10Wikimania-Hackathon-2015: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1456023 (10brion) 3NEW [01:49:42] thanks mutante [01:51:49] tgr: no problem [01:52:00] 6operations, 10ops-eqiad, 10Traffic, 10Wikimania-Hackathon-2015: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1456023 (10Krenair) [01:53:03] i was going to say that Hilton Wi-Fi always sucks, not just in Mexico [01:53:11] but the "upload is fast, just download is slow" is odd [01:53:31] i mean, "hotel" + "wifi" = sucks [02:03:31] !log LocalisationUpdate failed (1.26wmf13) at 2015-07-16 02:03:30+00:00 [02:03:31] !log LocalisationUpdate failed (1.26wmf14) at 2015-07-16 02:03:31+00:00 [02:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:06:24] (03PS2) 10Dzahn: misc-web: add node krypton as a backend [puppet] - 10https://gerrit.wikimedia.org/r/224995 (https://phabricator.wikimedia.org/T104946) [02:07:55] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 16 02:07:55 UTC 2015 (duration 7m 54s) [02:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:09:50] bd808: Apparently not [02:09:59] It seems to work when run manually [02:10:26] Ok [02:10:31] There's something definitely weird going on [02:10:40] There is 2 runs of localisation update running [02:10:51] ori may have done something to it [02:11:01] to get it to build both cdb and php [02:11:03] (03CR) 10Dzahn: [C: 032] misc-web: add node krypton as a backend [puppet] - 10https://gerrit.wikimedia.org/r/224995 (https://phabricator.wikimedia.org/T104946) (owner: 10Dzahn) [02:11:23] 
tailing the log on tin and it was currently updating the skin repos [02:11:41] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [02:20:13] (03PS1) 10Dzahn: scholarships: switch app to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/224996 (https://phabricator.wikimedia.org/T105003) [02:21:03] (03PS2) 10Dzahn: scholarships: switch app to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/224996 (https://phabricator.wikimedia.org/T105003) [02:25:01] (03PS3) 10Dzahn: scholarships: switch app to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/224996 (https://phabricator.wikimedia.org/T105003) [02:25:14] (03PS4) 10Dzahn: scholarships: switch app to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/224996 (https://phabricator.wikimedia.org/T105003) [02:26:59] (03CR) 10Dzahn: [C: 032] scholarships: switch app to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/224996 (https://phabricator.wikimedia.org/T105003) (owner: 10Dzahn) [02:33:32] 6operations, 5Patch-For-Review, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1456078 (10Dzahn) [02:33:33] 6operations, 10Wikimedia-Wikimania-Scholarships, 5Patch-For-Review: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1456075 (10Dzahn) 5Open>3Resolved a:3Dzahn done and switched to krypton. deleted on zirconium. 
[02:34:56] (03PS1) 10Dzahn: misc-web varnish: retab [puppet] - 10https://gerrit.wikimedia.org/r/224997 [02:35:01] (03CR) 10jenkins-bot: [V: 04-1] misc-web varnish: retab [puppet] - 10https://gerrit.wikimedia.org/r/224997 (owner: 10Dzahn) [02:35:10] (03PS2) 10Dzahn: misc-web varnish: retab [puppet] - 10https://gerrit.wikimedia.org/r/224997 [02:36:16] (03PS3) 10Dzahn: Fix link to WD Json dumps on other dumps html page [puppet] - 10https://gerrit.wikimedia.org/r/224768 (https://phabricator.wikimedia.org/T104307) (owner: 10Addshore) [02:37:39] (03CR) 10Dzahn: [C: 032] Fix link to WD Json dumps on other dumps html page [puppet] - 10https://gerrit.wikimedia.org/r/224768 (https://phabricator.wikimedia.org/T104307) (owner: 10Addshore) [02:39:34] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 10m 50s) [02:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:54] !log es1.6 upgrade: upgrade elastic1024 [02:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:03] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-16 02:46:03+00:00 [02:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:48:09] (03PS8) 10Gergő Tisza: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [03:13:19] !log l10nupdate Synchronized php-1.26wmf14/cache/l10n: (no message) (duration: 10m 23s) [03:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:19:37] !log LocalisationUpdate completed (1.26wmf14) at 2015-07-16 03:19:37+00:00 [03:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:20:35] 6operations, 10Wikimania-Hackathon-2015, 7network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1456136 (10Legoktm) 
[03:27:37] 6operations, 10Analytics, 10Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1456157 (10Dzahn) This sounds like a classic Analytics thing. At least way more than a "traffic" thing. [03:35:16] (03CR) 10Gergő Tisza: Basic role for Sentry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [03:35:39] (03PS9) 10Gergő Tisza: [WIP] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [03:46:27] 6operations, 10Wikimania-Hackathon-2015, 7network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1456193 (10brion) Back in the hacking space after 10pm I see up to 100-200 kbytes/sec on eqiad, better than from upstairs but still much... [03:54:36] !log es1.6 upgrade: upgrade elastic1025 [03:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:56:41] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (101189s 100000s) [04:13:01] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 17.24% of data above the critical threshold [100000000.0] [04:17:21] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 4.31901176443e-06 [04:26:21] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (10780 100000s) [04:38:36] !log krenair Synchronized php-1.26wmf13/extensions/WikimediaMaintenance/dumpInterwiki.php: https://gerrit.wikimedia.org/r/#/c/225006/ (duration: 00m 13s) [04:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:40:31] PROBLEM - Kafka Broker Messages In on 
analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 9.3210788414e-07 [04:48:11] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [05:04:55] 6operations, 10Traffic, 10Wikimania-Hackathon-2015, 7network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1456305 (10Nemo_bis) [05:06:43] <_joe_> Thanks nemo [05:07:16] <_joe_> I was about to do that [05:24:01] !log krenair Synchronized php-1.26wmf14/extensions/WikimediaMaintenance/dumpInterwiki.php: https://gerrit.wikimedia.org/r/#/c/225008/ (duration: 00m 13s) [05:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:30:11] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.32520104577e-08 [05:31:48] !log krenair Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 12s) [05:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:33:03] _joe_: paravoid told me to remove the network project! [05:33:12] _joe_: er, the traffic project. [05:34:00] <_joe_> legoktm: didn't nemo added the network project? [05:34:13] Legoktm edited projects, added network; removed Traffic, ops-eqiad.Via Web · [05:34:19] (that was actually paravoid using my laptop) [05:34:27] Nemo_bis added a project: Traffic.Via Web · [05:34:31] <_joe_> I was on the phone, phab is horrible on my phone [05:34:34] <_joe_> oh I see [05:34:34] :P [05:35:27] <_joe_> and you, don't let paravoid use your laptop [05:35:37] <_joe_> opsec! 
[05:35:59] <_joe_> :) [05:37:51] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.98350101066e-08 [05:53:11] !log es1.6 upgrade: upgrade elastic1026 [05:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:04:48] <_joe_> dcausse: wow you're almost done guys congrats [06:05:34] _joe_: yes, at last, it has been pretty slow this time according to nick [06:06:01] hopefully next rolling upgrades will be very fast [06:06:20] <_joe_> yeah I know the deal - now we can freeze indices [06:06:27] yes [06:06:35] <_joe_> one thing I wanted so badly in many distributed kv stores [06:06:43] <_joe_> it's like the most obvious of features [06:30:01] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 2 failures [06:31:00] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures [06:31:01] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures [06:31:41] PROBLEM - puppet last run on mw2081 is CRITICAL Puppet has 1 failures [06:31:51] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 2 failures [06:32:01] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 2 failures [06:32:52] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:41:54] 6operations, 6Commons: Commons thumbnail of Pluto photo is broken at 500px - https://phabricator.wikimedia.org/T105793#1456398 (10Joe) [06:43:53] 6operations, 6Commons: Commons thumbnail of Pluto photo is broken at 500px - https://phabricator.wikimedia.org/T105793#1456401 (10Joe) @MZMcBride ironically being on-duty I just looked at the tickets with the "operations" tag these last few days, so I didn't notice this. However I'll take a look at the logs.... 
[06:50:51] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.55275947807e-10 [06:55:41] RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:40] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:56:41] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw2081 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:58:41] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:05:47] so I don't quite understand the difference between conftool data and hiera data, why can't conftool use hiera instead of its own data files? [07:08:56] (not trying to be critical of it, just trying to understand conftool / etcd stuff so that I am informed when working on deployment stuffs) [07:10:46] <_joe_> twentyafterfour: we preferred those to be separated for now, just to avoid confusion for the moment. But the second we start using that data in puppet (and I hope we will, soon), you'll notice [07:11:30] <_joe_> twentyafterfour: also, strictly speaking, what is in conftool-data is more properly data that could be used by a puppet ENC rather than by hiera [07:12:27] _joe_: I see. [07:12:35] isn't hiera kinda like an ENC really? 
[07:14:40] <_joe_> twentyafterfour: nope, the ENC is supposed to tell you what is on a server, hiera to configure individual classes on it [07:14:50] <_joe_> but yeah, you can use an ENC instead of hiera [07:15:00] <_joe_> or make hiera work sort-of-like an ENC [07:15:08] <_joe_> viva puppetlabs! [07:18:54] (03CR) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [07:22:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 16 07:22:49 UTC 2015 (duration 22m 48s) [07:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:34] !log es1.6 upgrade: upgrade elastic1027 [07:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:33:02] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 9.28627195581e-12 [08:42:27] !log es1.6 upgrade: upgrade elastic1028 [08:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:17] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1456526 (10mmodell) See also {T102991} and ultimately {T89945} [09:19:42] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, good job!" 
[puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [09:38:31] !log es1.6 upgrade: upgrade elastic1029 [09:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:39] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename zh-yue -> yue - https://phabricator.wikimedia.org/T30441#1456550 (10Glaisher) See {T105999} [09:51:40] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 9.07518907332e-16 [10:01:27] (03PS1) 10Glaisher: Enable EducationProgram extension at French Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225019 (https://phabricator.wikimedia.org/T105853) [10:04:41] (03CR) 10Glaisher: Enable IP user page creation on fawiki's Draft ns (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223497 (https://phabricator.wikimedia.org/T105118) (owner: 10Ebrahim) [10:05:43] !log es1.6 upgrade: upgrade elastic1030 [10:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:14] !log citoid deploying 5aeb0fc [10:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:04] (03PS1) 10Glaisher: Enable Quiz extension at French Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225021 (https://phabricator.wikimedia.org/T103263) [10:15:51] PROBLEM - citoid on sca1002 is CRITICAL: Connection refused [10:16:40] PROBLEM - citoid on sca1001 is CRITICAL: Connection refused [10:17:00] PROBLEM - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is CRITICAL: Connection refused [10:17:44] <_joe_> whatsup? [10:18:00] <_joe_> mobrovac: seems like the release had something wrong? 
[10:18:40] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.50011866422e-16 [10:22:15] 6operations, 10Beta-Cluster: deployment-bastion fails puppet because of bacula - https://phabricator.wikimedia.org/T106003#1456624 (10hashar) 3NEW [10:22:45] (03CR) 10Hashar: "The class is applied on deployment-bastion which lacks backup/caesium etc causing puppet to fail (T106003)" [puppet] - 10https://gerrit.wikimedia.org/r/223448 (owner: 10Dzahn) [10:25:19] (03PS1) 10Hashar: Do not backup beta cluster deployment server [puppet] - 10https://gerrit.wikimedia.org/r/225023 (https://phabricator.wikimedia.org/T106003) [10:25:51] RECOVERY - citoid on sca1001 is OK: HTTP OK: HTTP/1.1 200 OK - 876 bytes in 0.022 second response time [10:25:54] !log citoid rolled back to ffbaf6d [10:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:20] RECOVERY - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1326 bytes in 0.011 second response time [10:27:10] RECOVERY - citoid on sca1002 is OK: HTTP OK: HTTP/1.1 200 OK - 876 bytes in 0.006 second response time [10:30:13] 6operations, 10Beta-Cluster, 5Patch-For-Review: deployment-bastion fails puppet because of bacula - https://phabricator.wikimedia.org/T106003#1456648 (10hashar) Did some cleanup based on `/var/log/apt/history.log` ``` Start-Date: 2015-07-07 22:35:01 Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--f... 
[10:30:29] (03CR) 10Hashar: "Cherry picked on beta cluster puppet master (deployment-salt)" [puppet] - 10https://gerrit.wikimedia.org/r/225023 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) [10:30:50] 6operations, 10Beta-Cluster, 5Patch-For-Review: deployment-bastion fails puppet because of bacula - https://phabricator.wikimedia.org/T106003#1456650 (10hashar) [10:33:19] 6operations, 10Beta-Cluster, 5Patch-For-Review: deployment-bastion fails puppet because some classes were moved from nodes to role class - https://phabricator.wikimedia.org/T106003#1456660 (10hashar) [10:34:03] (03CR) 10Hashar: "The class is applied on deployment-bastion which lacks backup/caesium etc causing puppet to fail (T106003)" [puppet] - 10https://gerrit.wikimedia.org/r/223464 (owner: 10Dzahn) [10:34:55] 6operations, 10Beta-Cluster, 5Patch-For-Review: deployment-bastion fails puppet because some classes were moved from nodes to role class - https://phabricator.wikimedia.org/T106003#1456624 (10hashar) [10:37:26] (03PS1) 10Hashar: Do not use releases::upload on beta cluster deployment server [puppet] - 10https://gerrit.wikimedia.org/r/225025 (https://phabricator.wikimedia.org/T106003) [10:38:20] 6operations, 10Beta-Cluster, 5Patch-For-Review: deployment-bastion fails puppet because some classes were moved from nodes to role class - https://phabricator.wikimedia.org/T106003#1456673 (10hashar) a:3hashar [10:38:21] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 4.02072663305e-17 [10:39:28] (03CR) 10Hashar: [C: 031] Do not backup beta cluster deployment server [puppet] - 10https://gerrit.wikimedia.org/r/225023 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) [10:39:40] (03CR) 10Hashar: [C: 031] "Cherry picked on beta cluster puppet master (deployment-salt)" [puppet] - 10https://gerrit.wikimedia.org/r/225025 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) 
[10:41:46] 6operations, 10Beta-Cluster, 5Patch-For-Review: deployment-bastion fails puppet because some classes were moved from nodes to role class - https://phabricator.wikimedia.org/T106003#1456674 (10hashar) p:5Triage>3Normal I have cherry picked both patches on beta cluster puppetmaster and puppet now passes on... [10:52:47] (03PS10) 10Giuseppe Lavagetto: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 (owner: 10Smalyshev) [10:56:34] !log es1.6 upgrade: upgrade elastic1031 [10:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:21] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 8.67733453635e-18 [11:06:19] !log citoid deploying ff90869 [11:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:20] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1456692 (10mobrovac) [11:19:26] (03PS13) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [11:19:37] <_joe_> mobrovac: ^^ [11:19:43] <_joe_> passes the spec tests too [11:19:45] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1456697 (10BBlack) Re: caching, I think in the ideal variant of these ideas, we're not arbitrarily bu... [11:20:22] _joe_: which spec tests? rake? [11:23:57] <_joe_> yes [11:24:06] <_joe_> rake that runs nosetests, but well :) [11:24:24] <_joe_> I still need to add the creation of a virtualenv [11:24:46] <_joe_> but I think it's GTG, at this point [11:24:56] <_joe_> in the meanwhile, lunch! 
[11:24:58] kk [11:25:03] _joe_: buon appetito [11:32:27] !log restarted gmond on elastic1024 [11:32:30] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.09861218558e-18 [11:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:45:24] 6operations, 10Graphoid, 6Services, 5Patch-For-Review: Confine Graphoid with firejail - https://phabricator.wikimedia.org/T103095#1456745 (10mobrovac) [12:03:10] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.41429733791e-19 [12:10:51] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 8.43636267553e-20 [12:11:05] (03CR) 10Hashar: [C: 031] Beta: Move wikidata.beta.wmflabs.org to static mappings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 (owner: 10Chad) [12:21:09] !log es1.6 upgrade: all done [12:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:30] <_joe_> dcausse: I thought you were at 24/31 [12:22:31] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.85440989384e-20 [12:23:25] _joe_: upgrade on the last ones went relatively fast [12:27:01] _joe_: ok got it, I just restarted gmond on elastic1024, stats on ganglia were wrong on this node [12:27:43] 6operations, 7Graphite, 7Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1456776 (10fgiunchedi) >>! In T104365#1450686, @Krinkle wrote: > As example I've done reqerror, and part of graphite-eqiad: > * https://grafana.wikimedia.org/#/dashboard/db/varnish-http-errors > * https://grafana... 
[12:29:09] 6operations, 10Traffic, 7HTTPS, 7Mobile, 5Patch-For-Review: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1456780 (10dr0ptp4kt) A glance at the User-Agent values suggests the domains are commonly being visited by devices supporting HTML instead... [12:30:12] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 2.29917704665e-20 [12:41:13] 6operations, 10Traffic, 7HTTPS, 7Mobile, 5Patch-For-Review: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1456791 (10dr0ptp4kt) Here are the mails to wikitech-l and mobile-l. https://lists.wikimedia.org/pipermail/wikitech-l/2015-July/082408.ht... [12:41:51] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.06810343643e-20 [12:56:35] (03CR) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [12:56:52] (03PS14) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [12:57:39] (03PS2) 10Hashar: Gerrit: remove ::old role [puppet] - 10https://gerrit.wikimedia.org/r/223161 (owner: 10Chad) [13:00:28] (03PS2) 10Hashar: Gerrit: Remove ::labs role [puppet] - 10https://gerrit.wikimedia.org/r/223171 (owner: 10Chad) [13:05:22] (03PS1) 10Glaisher: Remove several dead domains from redirects [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) [13:05:51] (03PS11) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [13:06:45] (03CR) 10Nikerabbit: [C: 
031] Beta: Only enable ContentTranslation on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224975 (https://phabricator.wikimedia.org/T91340) (owner: 10Alex Monk) [13:12:13] (03PS1) 10Glaisher: Redirect sep11.wikipedia.org to https wayback machine [puppet] - 10https://gerrit.wikimedia.org/r/225043 [13:13:53] (03PS2) 10Hashar: Gerrit: Remove $extra_groups from replicationdest, nothing uses it [puppet] - 10https://gerrit.wikimedia.org/r/223169 (owner: 10Chad) [13:17:31] 6operations, 10Analytics, 10Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1456863 (10MarkAHershberger) Dzahn writes: > This sounds like a classic Analytics thing. At least way more than a > "traffic" thing. I got Kevin Luduc here at Wikimania to get us the info... [13:21:27] any ops available for a few trivial Gerrit puppet patches? Chad cleaned up the current manifests, removing stuff that is no more used. Starts at https://gerrit.wikimedia.org/r/#/c/223161/ with a couple child changes :} [13:21:34] my dashboard thanks you in advance [13:23:39] <_joe_> hashar_: I'm here :) [13:24:00] <_joe_> just lemme dist-upgrade my desktop for a few mins and I'll assist [13:28:14] 6operations, 10Traffic, 7HTTPS, 7Mobile, 5Patch-For-Review: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1456920 (10dr0ptp4kt) As mentioned on threads: looks like the user pageviews for wap.wikipedia.org and mobile.wikipedia.org subdomains are... [13:29:42] <_joe_> hashar_: so which changes? 
[13:30:13] _joe_: https://gerrit.wikimedia.org/r/#/c/223161/ [13:30:27] https://gerrit.wikimedia.org/r/#/c/223169/ and https://gerrit.wikimedia.org/r/#/c/223171/ [13:30:37] all reviewed by QChris and me [13:30:48] that is mostly some leftover from past migrations / labs [13:36:05] <_joe_> hashar_: ok will look [13:37:23] (03CR) 10Giuseppe Lavagetto: [C: 032] Gerrit: remove ::old role [puppet] - 10https://gerrit.wikimedia.org/r/223161 (owner: 10Chad) [13:38:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Gerrit: Remove $extra_groups from replicationdest, nothing uses it [puppet] - 10https://gerrit.wikimedia.org/r/223169 (owner: 10Chad) [13:39:29] (03PS3) 10Giuseppe Lavagetto: Gerrit: Remove ::labs role [puppet] - 10https://gerrit.wikimedia.org/r/223171 (owner: 10Chad) [13:39:42] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Gerrit: Remove ::labs role [puppet] - 10https://gerrit.wikimedia.org/r/223171 (owner: 10Chad) [13:40:17] <_joe_> hashar_: you've been served. I hope the puppet merging service satisfied your needs [13:40:44] _joe_: 5/5 will come again :} [13:40:55] I have a couple other ones for beta [13:41:05] related to roles being moved from production nodes to role classes that are applied on beta [13:41:28] and causes puppet failures due to different context (especially we lack bacula backup and have no prod ssh private key) [13:41:38] https://gerrit.wikimedia.org/r/225023 https://gerrit.wikimedia.org/r/225025 :D [13:44:02] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.68380274207e-22 [13:46:07] <_joe_> hashar_: we'll serve you in a few minutes [13:48:35] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1457000 (10Aklapper) >>! In T99216#1304942, @MarkTraceur wrote: > Are there any remaining policy problems?
Or can I bother some opsen about t... [13:51:45] 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1457037 (10Aklapper) >>! In T87588#1322345, @Krenair wrote: > Anything else? @Reedy ? Or can we close this ticket finally? [13:53:32] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1457055 (10dr0ptp4kt) [13:58:01] (03PS15) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [14:00:59] i've still got much slower access to eqiad than to ulsfo or esams on the wikimania networks [14:01:31] (03CR) 10Mobrovac: [C: 031] "Great stuff!" [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [14:01:34] lol [14:01:44] the or esams part is hilarious [14:02:30] brion: tired to find paravoid? 
[14:03:12] haven't seen him since yesterday afternoon [14:03:43] guess it's still a little early there [14:03:58] which was when, ironically, i first noticed the network slowness but didn't have a chance to investigate, thought it was just the phone i demoed something on ;) [14:05:11] 6operations, 10Traffic, 10Wikimania-Hackathon-2015, 7network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1457099 (10brion) CC'ing faidon for on-site network investigation if possible :D [14:06:05] Reedy: vs esams; it's not the latency it's the bandwidth that's a problem :D [14:06:48] (03PS16) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [14:08:17] hmm, the route *back* here from eqiad seems to go through gtt.net [14:08:24] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: auto-monitoring of local endpoints [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [14:09:29] not the first time I've heard stuff like that [14:09:40] (03PS6) 10Giuseppe Lavagetto: restbase: spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831) [14:10:46] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] restbase: spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/224586 (https://phabricator.wikimedia.org/T94831) (owner: 10Giuseppe Lavagetto) [14:11:28] 6operations, 10Traffic, 10Wikimania-Hackathon-2015, 7network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1457107 (10brion) Note the route back from eqiad to the hotel ISP goes through GTT, not through Telia/Cogent as the upstream... 
[14:11:31] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 2.69205954851e-23 [14:13:18] (03PS1) 10Giuseppe Lavagetto: service::monitoring: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225048 [14:13:35] (03CR) 10Giuseppe Lavagetto: [C: 032] service::monitoring: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225048 (owner: 10Giuseppe Lavagetto) [14:13:49] (03CR) 10Giuseppe Lavagetto: [V: 032] service::monitoring: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225048 (owner: 10Giuseppe Lavagetto) [14:16:20] PROBLEM - puppet last run on restbase1001 is CRITICAL Puppet has 1 failures [14:16:31] <_joe_> ^^ that's me, nothing to worry about [14:16:32] (03PS1) 10Giuseppe Lavagetto: service::monitoring: fix another typo [puppet] - 10https://gerrit.wikimedia.org/r/225050 [14:16:45] <_joe_> I added random letters to my typing apparently [14:17:14] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] service::monitoring: fix another typo [puppet] - 10https://gerrit.wikimedia.org/r/225050 (owner: 10Giuseppe Lavagetto) [14:20:01] RECOVERY - puppet last run on restbase1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:46] (03PS1) 10Giuseppe Lavagetto: service::checker: add shebang [puppet] - 10https://gerrit.wikimedia.org/r/225053 [14:21:17] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] service::checker: add shebang [puppet] - 10https://gerrit.wikimedia.org/r/225053 (owner: 10Giuseppe Lavagetto) [14:21:36] (03CR) 10Reedy: "Mostly different domains" [puppet] - 10https://gerrit.wikimedia.org/r/173492 (owner: 10Reedy) [14:25:20] 6operations, 10Traffic: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1457183 (10Aklapper) Anybody planning to investigate this? 
[14:26:48] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1457185 (10Joe) [14:26:50] 6operations, 5Patch-For-Review, 7Service-Architecture: Create a nagios check script that can monitor multiple endpoints based on what the service exposes - https://phabricator.wikimedia.org/T94831#1457184 (10Joe) 5Open>3Resolved [14:26:57] <_joe_> finally [14:27:11] <_joe_> it's been 3 months since me and marko discussed this :( [14:29:50] (03PS4) 10Lokal Profil: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) [14:30:51] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6429.06672814 [14:32:51] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.hw48UMWN/mnt/tmp/ccache is not accessible: Permission denied [14:33:00] (03CR) 10Lokal Profil: "Pushed a new version implementing feedback from Hoo man. See inline comments for the config changes" [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [14:35:18] <_joe_> https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=restbase1001&service=Restbase+endpoints+health :)) [14:36:06] (03PS2) 10Alexandros Kosiaris: CX: Add missing eo-en pair [puppet] - 10https://gerrit.wikimedia.org/r/224968 (owner: 10KartikMistry) [14:36:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Add missing eo-en pair [puppet] - 10https://gerrit.wikimedia.org/r/224968 (owner: 10KartikMistry) [14:36:50] (03CR) 10Lokal Profil: "Ahm. 
notes for patch 3 seem to be stuck as drafts but here are the promised config comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [14:38:52] (03PS1) 10Giuseppe Lavagetto: citoid: enable advanced monitoring [puppet] - 10https://gerrit.wikimedia.org/r/225056 [14:39:21] <_joe_> akosiaris: looks like we should add an additional request to your script to scaffold a new service [14:39:38] _joe_: ? [14:39:50] the monitoring part ? [14:39:55] <_joe_> akosiaris: I added a new functionality to service::node [14:39:57] <_joe_> yes [14:40:19] <_joe_> I mean it's low-prio [14:40:37] <_joe_> and in due time, all services will have spec-based monitoring by default [14:40:58] <_joe_> btw, it would be nice to add something like this to mediawiki itself :) [14:41:09] I suppose it should be easy [14:41:15] * akosiaris famous last words [14:41:26] well, it will be done in service::node, no ? [14:42:02] <_joe_> yes [14:43:25] YuviPanda: (if awake) I’d appreciate a look at https://gerrit.wikimedia.org/r/#/c/224465/ — seems simple but it may overlook oddness with custom puppetmasters in labs. [14:48:42] (03PS5) 10Andrew Bogott: Use hiera for the puppetmaster name, everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/224465 [14:48:44] (03PS10) 10Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [14:48:46] (03PS3) 10Andrew Bogott: Purge labs_puppet_master_secondary. [puppet] - 10https://gerrit.wikimedia.org/r/224660 [14:51:13] so is nobody else able to poke at the network routing issue? [14:53:12] PROBLEM - puppet last run on bromine is CRITICAL Puppet has 1 failures [14:54:00] <_joe_> brion: mark is on vacation, so your best bet is paravoid I guess [14:54:07] (03CR) 10Hashar: [C: 031] Use hiera for the puppetmaster name, everywhere.
[puppet] - 10https://gerrit.wikimedia.org/r/224465 (owner: 10Andrew Bogott) [14:54:09] ok [14:54:33] <_joe_> brion: I wouldn't know where to start [14:54:45] :) [14:55:16] <_joe_> brion: meaning I don't even have access to our networking gear :P [14:58:03] (03CR) 10Hashar: [C: 031] contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: 10Dduvall) [15:00:05] manybubbles anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150716T1500). [15:00:05] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] Hola. [15:01:22] James_F: Hola, I can SWAT this morning :) [15:01:42] thcipriani: Kk. :-) [15:05:55] akosiaris: Hey [15:06:42] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1457243 (10Tnegrin) Hi Adam -- can you please look at the UAs for the queries that come to these domains? The question was asked on mobile... [15:07:13] brion: hey [15:07:15] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1457244 (10Prtksxna) I've forced push to the [[https://gerrit.wikimedia.org/r/#/q/project:wikimedia/TransparencyReport,n,z|public reposito... [15:07:19] akosiaris: https://phabricator.wikimedia.org/T89640#1457244 [15:07:22] paravoid: yo :) [15:07:28] brion: should be fixed, please confirm :) [15:07:34] andrewbogott: seems ok [15:07:49] YuviPanda: ok! 
Thanks [15:07:54] akosiaris: would love your help here, there is a press conference going right now and the report hasnt updated on the site [15:07:58] are submodules still being auto-bumped, doesn't seem like they are considering this Math extension hasn't been bumped yet :\ [15:08:10] 721KB/s hey that's a step up \o/ [15:08:13] I ran into an issue with wmf14 earlier [15:08:17] I don't think the submodules are being bumped there [15:08:19] oh no wait that was ulsfo ;) [15:08:33] twentyafterfour, ^ [15:08:37] \o/ and same from eqiad now [15:08:43] paravoid: awesome thanks :D [15:08:45] Krenair: kk, James_F I'm making a submodule bump patch for that math extension [15:08:47] Actually anyone, https://phabricator.wikimedia.org/T89640#1457244, legal is expecting to announce this soon. [15:08:51] what ended up being tweaked? [15:08:59] thcipriani: Isn't it automatic? [15:09:04] Oh. [15:09:07] Interesting. [15:09:17] I filed a task for twentyafterfour to look into [15:09:23] brion: I'll update the task :) [15:09:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] "So my -1 is indeed because of owner/group/mode attributes missing. Please do not rely on defaults, because such a thing does not exist. Du" [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [15:09:28] (with details) [15:09:47] awesome :D [15:10:22] moizsyed: prtksxna looking [15:10:27] akosiaris: thanks! [15:10:50] akosiaris: Thanks! [15:12:33] (03CR) 10Mobrovac: [C: 031] "go go go" [puppet] - 10https://gerrit.wikimedia.org/r/225056 (owner: 10Giuseppe Lavagetto) [15:14:26] moizsyed: prtksxna: wanna check ? [15:14:30] RECOVERY - puppet last run on bromine is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:14:51] * prtksxna refreshes [15:15:03] prtksxna: so, how come there was a merge conflict on the host ? Have you guys rewritten git history or something ? 
[15:15:15] akosiaris: it's not updated [15:15:19] akosiaris: Force pushed, so yes [15:16:01] (03PS2) 10Giuseppe Lavagetto: citoid: enable advanced monitoring [puppet] - 10https://gerrit.wikimedia.org/r/225056 [15:16:03] prtksxna: so, it's on 2183df8442222ca2cdfaf1b03f9d5076b98fb817 "Add data for privacy page" [15:16:08] akosiaris: Also, I just realized we haven't built the middleman app :| [15:16:10] Yesp [15:16:15] akosiaris: That is right [15:16:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] citoid: enable advanced monitoring [puppet] - 10https://gerrit.wikimedia.org/r/225056 (owner: 10Giuseppe Lavagetto) [15:16:33] 6operations, 10Traffic, 10Wikimania-Hackathon-2015, 7network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1457294 (10faidon) 5Open>3Resolved a:3faidon The Wikimania network is behind AS 14178 (MEGACABLE). For 14178, we're se... [15:16:33] akosiaris: I'll add one more to that in just a second [15:17:00] so, rewriting git history, while possible, does tend to create these kinds of problems [15:17:06] avoid doing it [15:17:10] * prtksxna is an idiot [15:17:19] ? [15:17:39] akosiaris: it's ok, let prtksxna be an idiot :) [15:18:02] :P [15:18:07] akosiaris: Yep, sorry, we won't do it next time [15:19:16] `git push -f` should require a confirmation prompt or something. [15:19:41] "Are you sure you wanna make your friends confused? y/N?" [15:19:49] akosiaris: is the HEAD in the right place [15:20:03] akosiaris: I'm cloning the repo on my end and then will build and send a patch for merge [15:20:05] moizsyed: Ours isn't :P [15:20:35] moizsyed: Let's check once we've merged your patch [15:22:08] (03CR) 10KartikMistry: [C: 031] Beta: Only enable ContentTranslation on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224975 (https://phabricator.wikimedia.org/T91340) (owner: 10Alex Monk) [15:23:03] akosiaris: I just merged the build patch.
[15:23:07] akosiaris: We should see a change in the website now [15:23:27] akosiaris: Sorry for the confusion, we completely forgot that we had to build (using middleman) [15:23:55] it was prtksxna's fault [15:23:58] seriously [15:23:59] :p [15:24:10] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [15:24:26] hashar: FYI nodepool is cronspamming with sudo failures, I've forwarded you the email [15:24:41] err, sudospamming [15:24:42] (03PS6) 10Andrew Bogott: Use hiera for the puppetmaster name, everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/224465 [15:25:07] prtksxna: moizsyed now it's on 850717837666239e81404e151f3d7fed59ec1285 "Build patch" [15:25:24] akosiaris moizsyed The website has updates too! [15:25:26] º╲˚\╭ᴖ_ᴖ╮/˚╱º Y A Y ! [15:25:30] YAY! [15:25:33] cool! [15:25:34] thank you everyone! [15:25:36] Thanks akosiaris! [15:25:41] akosiaris: \o/ [15:25:45] So, no history rewritting in the future please ;-) [15:25:54] but otherwise YAY! [15:26:04] (03CR) 10Andrew Bogott: [C: 032] Use hiera for the puppetmaster name, everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/224465 (owner: 10Andrew Bogott) [15:26:25] <_joe_> andrewbogott: isn't that like googling google? [15:26:55] (03PS1) 10Andrew Bogott: Revert "Use hiera for the puppetmaster name, everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/225065 [15:27:01] <_joe_> andrewbogott: https://www.youtube.com/watch?v=v2FMqtC1x9Y [15:27:14] <_joe_> I was joking btw :) [15:27:18] Krenair: It is totally a matter of whether I create the branch on my laptop or on tin, because tin is running a really old version of git that doesn't support that feature apparently [15:27:20] (03PS1) 10Rush: phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 [15:27:30] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1457349 (10Prtksxna) >>! 
In T89640#1457244, @Prtksxna wrote: > I've forced push to the [[https://gerrit.wikimedia.org/r/#/q/project:wikime... [15:27:35] _joe_: and yet, it seems to have broken something :( [15:27:37] !log thcipriani Synchronized php-1.26wmf14/extensions/Math/MathMathML.php: SWAT: Fix: Undefined variable passed hook [[gerrit:225058]] (duration: 00m 12s) [15:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:41] PROBLEM - puppet last run on sca1002 is CRITICAL puppet fail [15:27:45] ^ James_F check please [15:27:55] <_joe_> andrewbogott: this ^^ could be my patch [15:27:56] thcipriani: Looking. [15:28:04] (03PS2) 10Rush: phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 [15:28:15] oh, wait, it didn’t break anything, I’m just running puppet agent without sudo. [15:28:20] Every week I have to make that mistake [15:28:35] thcipriani: LGTM. [15:28:44] James_F: cool thanks [15:28:49] (03CR) 10jenkins-bot: [V: 04-1] phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 (owner: 10Rush) [15:29:19] <_joe_> andrewbogott: lol [15:29:20] <_joe_> :) [15:29:55] (03PS1) 10Yuvipanda: puppetception: Simplify module a little bit [puppet] - 10https://gerrit.wikimedia.org/r/225067 [15:30:33] (03Abandoned) 10Andrew Bogott: Revert "Use hiera for the puppetmaster name, everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/225065 (owner: 10Andrew Bogott) [15:30:48] (03PS4) 10Andrew Bogott: Purge labs_puppet_master_secondary. [puppet] - 10https://gerrit.wikimedia.org/r/224660 [15:30:50] <_joe_> YuviPanda: puppetception? [15:30:54] <_joe_> wtf is that? [15:31:15] isn't it the puppet wrapper to run Chef? [15:31:57] (03PS3) 10Rush: phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 [15:32:32] (03CR) 10Andrew Bogott: [C: 032] Purge labs_puppet_master_secondary. 
[puppet] - 10https://gerrit.wikimedia.org/r/224660 (owner: 10Andrew Bogott) [15:32:36] hashar: thoughts on all that sudo'ing from nodepool? [15:32:59] _joe_: heh, masterless puppet experiment for labs [15:33:16] godog: yeah being handled / {done} [15:33:26] godog: andrew poked me on wikimedia-releng [15:33:53] hashar: cool! thanks :) yeah I didn't see the other ping [15:33:54] (03PS1) 10Giuseppe Lavagetto: service::node: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225069 [15:34:11] (03PS2) 10Giuseppe Lavagetto: service::node: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225069 [15:34:23] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] service::node: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225069 (owner: 10Giuseppe Lavagetto) [15:35:03] (03PS4) 10Rush: phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 [15:35:06] (03CR) 10Chad: "Maybe rebase on top of Ia4f775be and let both land?" [puppet] - 10https://gerrit.wikimedia.org/r/225066 (owner: 10Rush) [15:35:29] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: NRPE: Command check_endpoints_citoid not defined [15:35:30] (03PS3) 10Rush: Phabricator: Clean up vcs manifest with proper dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223353 (owner: 10Chad) [15:35:50] (03PS2) 10BryanDavis: [WIP] Sync /srv/mediawiki-staging to co-masters [tools/scap] - 10https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) [15:36:04] (03CR) 10Rush: [C: 032 V: 032] Phabricator: Clean up vcs manifest with proper dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223353 (owner: 10Chad) [15:37:01] chasemp: thx! [15:37:06] ostriches: problem :) [15:37:07] Error: Failed to apply catalog: Could not find dependency Package[Git] for File[/usr/local/bin/git-http-backend] at /etc/puppet/modules/phabricator/manifests/vcs.pp:14 [15:37:14] the actual package isn't called 'git' I think [15:37:16] git-core? [15:37:18] Herp derp.... 
[15:37:19] Yeah [15:37:20] git-core [15:37:24] really? [15:37:31] oh yeah [15:37:32] :| ok [15:37:39] Package is actually 'git' now [15:37:43] in previous incantantions of the package yeah, I think in jessie it is called git [15:37:46] But I think we still use 'git-core' in puppet for b/c [15:37:56] So it works on all distros we use [15:37:58] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: NRPE: Command check_endpoints_citoid not defined [15:38:19] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:28] chasemp: Fix incoming, sec. [15:39:11] (03PS1) 10Chad: phabricator::vcs: Fix package name to git-core [puppet] - 10https://gerrit.wikimedia.org/r/225070 [15:40:24] (03CR) 10Rush: [C: 032] phabricator::vcs: Fix package name to git-core [puppet] - 10https://gerrit.wikimedia.org/r/225070 (owner: 10Chad) [15:40:36] chasemp: do you want to hack for a bit at some point to get something setup and running via puppetception? [15:40:42] (03PS2) 10Yuvipanda: puppetception: Simplify module a little bit [puppet] - 10https://gerrit.wikimedia.org/r/225067 [15:40:50] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetception: Simplify module a little bit [puppet] - 10https://gerrit.wikimedia.org/r/225067 (owner: 10Yuvipanda) [15:40:58] YuviPanda: I haven't really looked at it at all [15:41:03] ok! [15:41:05] if I can wrap up a few things I'll read through [15:41:18] ok [15:41:24] it's less than 30 lines of code [15:42:14] ostriches: seems good now but do you know why the security extension on iridium is out of date from puppet? [15:42:27] I do not, I haven't/don't touch that. 
[15:43:19] ok thanks [15:46:11] (03PS5) 10Chad: phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 (owner: 10Rush) [15:46:14] chasemp: Rebased on top for you ^ [15:46:28] kk [15:47:07] (03PS6) 10Rush: phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 [15:48:32] (03PS11) 10Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [15:48:34] (03PS1) 10Andrew Bogott: Labs puppetmaster still needs to set itself as hiera(labs_puppet_master) [puppet] - 10https://gerrit.wikimedia.org/r/225071 [15:48:36] (03CR) 10Rush: [C: 032] phab: create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/225066 (owner: 10Rush) [15:49:38] (03PS2) 10Andrew Bogott: Labs puppetmaster still needs to set itself as hiera(labs_puppet_master) [puppet] - 10https://gerrit.wikimedia.org/r/225071 [15:50:39] (03CR) 10Andrew Bogott: [C: 032] Labs puppetmaster still needs to set itself as hiera(labs_puppet_master) [puppet] - 10https://gerrit.wikimedia.org/r/225071 (owner: 10Andrew Bogott) [15:51:39] (03CR) 10Alex Monk: [C: 032] Beta: Only enable ContentTranslation on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224975 (https://phabricator.wikimedia.org/T91340) (owner: 10Alex Monk) [15:52:09] (03Merged) 10jenkins-bot: Beta: Only enable ContentTranslation on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224975 (https://phabricator.wikimedia.org/T91340) (owner: 10Alex Monk) [15:54:09] !log krenair Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/224975/ (duration: 00m 12s) [15:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:03] (03PS12) 10Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [15:55:05] (03PS1) 10Andrew Bogott: One last fix for 
https://gerrit.wikimedia.org/r/#/c/224465 [puppet] - 10https://gerrit.wikimedia.org/r/225074 [15:55:29] (03PS2) 10Andrew Bogott: One last fix for https://gerrit.wikimedia.org/r/#/c/224465 [puppet] - 10https://gerrit.wikimedia.org/r/225074 [15:55:35] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1457459 (10RobH) The memory for this has been ordered and is being shipped to codfw. [15:55:58] (03PS1) 10Hashar: nodepool: typo in conf template [puppet] - 10https://gerrit.wikimedia.org/r/225076 [15:56:22] godog: andrewbogott: the cron spam would be fixed with https://gerrit.wikimedia.org/r/225076 [15:56:27] that is a lame typo in the configuration file :-/ [15:56:31] (03CR) 10Chad: [C: 032] Beta: Move wikidata.beta.wmflabs.org to static mappings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 (owner: 10Chad) [15:56:38] RECOVERY - puppet last run on sca1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:41] (03CR) 10Andrew Bogott: [C: 032] One last fix for https://gerrit.wikimedia.org/r/#/c/224465 [puppet] - 10https://gerrit.wikimedia.org/r/225074 (owner: 10Andrew Bogott) [15:56:56] (03Merged) 10jenkins-bot: Beta: Move wikidata.beta.wmflabs.org to static mappings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 (owner: 10Chad) [15:57:35] (03PS2) 10Giuseppe Lavagetto: nodepool: typo in conf template [puppet] - 10https://gerrit.wikimedia.org/r/225076 (owner: 10Hashar) [15:57:48] !log demon Synchronized multiversion/MWMultiVersion.php: prod no-op, beta change (duration: 00m 13s) [15:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:53] (03CR) 10Giuseppe Lavagetto: [C: 032] nodepool: typo in conf template [puppet] - 10https://gerrit.wikimedia.org/r/225076 (owner: 10Hashar) [15:58:11] moar rebase [15:58:36] (03PS3) 10BryanDavis: [WIP] Sync /srv/mediawiki-staging to 
co-masters [tools/scap] - 10https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) [15:58:43] ostriches: are you / twentyafterfour think that we run the git ssh on 22 and move opsy ssh to ...2022 (or something?) [15:59:21] think => thinking [16:00:45] What about phabricator.wm.o:22 -> git and iridium.wm.o:22 shell? [16:00:58] That was kinda the plan with Gerrit, we just never got around to finishing it. [16:01:19] can you do hostname base ssh stuff? phab does not have a public ip [16:01:22] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1457516 (10JanZerebecki) Git uses content addressing for files, so instead of using the last commit t... [16:01:36] chasemp: Ah, there was the difference. ytterbium has a public *and* a private IP. [16:01:42] and I talked this out with bblack a bit and the idea is to set up an lvs service and keep it private and behind misc-web [16:01:42] (ytterbium == gerrit) [16:02:37] chasemp: What about :22 and :2222? Easier to remember than :2022 imho [16:02:46] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1457519 (10Chmarkine) Could you look at the referrers as well? Do most of the requests come from search engines? [16:02:48] no objection :) [16:05:40] chasemp: WFM then. If the desire is to keep iridium private and have all things go through misc-web, then the public/private option won't work....which is just fine. 
Long as everyone's on the same page :) [16:08:28] !log kept nodepool stopped on labnodepool1001.eqiad.wmnet because it spams the cron log [16:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:07] (PS1) Dzahn: iegreview: add role to node krypton [puppet] - https://gerrit.wikimedia.org/r/225079 (https://phabricator.wikimedia.org/T105007) [16:10:45] chasemp: 2222 is fine [16:11:49] but also, couldn't the git ssh be on an alternate port and then the lvs service remaps the port number for us? [16:12:45] So the user sees 22, but lvs rewrites to 1234 or w/e? [16:13:57] in past discussions that's what faidon suggested. with an iptables rule [16:14:39] I'm remembering this discussion, vaguely. [16:14:55] ostriches: right [16:15:10] similar but not this: https://gerrit.wikimedia.org/r/#/c/185340/ [16:15:23] that seems like the best compromise for consistency of operations ssh access but also ease of git access [16:15:49] see his solution in the comments [16:15:58] "Your ferm rule is wrong too + you don't need IP forwarding enabled. What Gerrit needs isn't cross-box NAT (and FORWARD rules), just a simple REDIRECT and regular INPUT rules will suffice." [16:16:09] ferm::rule { 'gerrit-ssh': table => 'nat', chain => 'PREROUTING', rule => 'daddr @resolve(gerrit.wikimedia.org) proto tcp dport 22 REDIRECT to-ports 29418', [16:16:49] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1457597 (Joe) NEW [16:17:39] "Finally, considering our Phabricator plans, is there any point to do port 22 for Gerrit right now?"
:) [16:17:48] ACKNOWLEDGEMENT - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) Giuseppe Lavagetto This is a bug in citoid in production, see T106044 [16:17:49] ACKNOWLEDGEMENT - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) Giuseppe Lavagetto This is a bug in citoid in production, see T106044 [16:18:34] mutante: No, there's no. [16:18:49] Doing anything with gerrit other than keeping the lights on is pretty much a waste of time at this point. [16:18:52] (PS3) Filippo Giunchedi: install-server: fix logstash partman recipe [puppet] - https://gerrit.wikimedia.org/r/224757 (https://phabricator.wikimedia.org/T104035) [16:18:55] Ops-Access-Requests, operations: Provide hoo (Marius Hoch) with Hive access - https://phabricator.wikimedia.org/T106045#1457617 (Tnegrin) NEW a:Ottomata [16:18:57] (CR) Filippo Giunchedi: [C: 2 V: 2] install-server: fix logstash partman recipe [puppet] - https://gerrit.wikimedia.org/r/224757 (https://phabricator.wikimedia.org/T104035) (owner: Filippo Giunchedi) [16:18:58] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1457625 (Joe) [16:19:33] operations, Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1457647 (fgiunchedi) [16:19:35] operations, Discovery, Elasticsearch, Patch-For-Review: logstash partman recipe huge root partition - https://phabricator.wikimedia.org/T104035#1457644 (fgiunchedi) Open>Resolved a:fgiunchedi merged, resolving [16:21:01] (CR) BryanDavis: [WIP] Sync /srv/mediawiki-staging to co-masters (2 comments) [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis) [16:21:04] (PS6) Filippo Giunchedi: Add es-tool upgrade-fast and
stopping paranoia [puppet] - https://gerrit.wikimedia.org/r/224548 (owner: Manybubbles) [16:21:11] (CR) Filippo Giunchedi: [C: 2 V: 2] Add es-tool upgrade-fast and stopping paranoia [puppet] - https://gerrit.wikimedia.org/r/224548 (owner: Manybubbles) [16:23:25] (CR) Manybubbles: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/224548 (owner: Manybubbles) [16:25:33] ostriches: that depends on how long we're stuck with gerrit. it doesn't seem to be going away any time soon, since zuul requires it [16:26:59] how is code review in phab going then? [16:27:15] operations, Traffic, Wikimania-Hackathon-2015, network: Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network - https://phabricator.wikimedia.org/T105984#1457697 (Multichill) Looking glass might be useful: * http://lg.ring.nlnog.net/prefix_detail/lg01/ipv4?q=201.149.6.36 <- h... [16:27:27] twentyafterfour: I'm not exactly sure "because zuul..." is much justification for it :p [16:27:32] * ostriches dislikes zuul, a lot [16:28:11] ostriches: I'm no fan of it really but tell that to hashar [16:28:28] I have. Usually over beer though :p [16:28:39] zuul is a piece of crap [16:28:53] zuul is a piece of crap [16:28:57] quoting for truth [16:29:02] mutante: phab code review is totally ready to go [16:29:16] but zuul is blocking phab ci [16:29:46] I mean, I could make phab ci work but I was under the impression that zuul was pretty much a blocker [16:30:07] I don't think we need to approach it from the perspective of having to keep Zuul if there's better ways [16:30:14] Which, there are. [16:30:15] there are [16:30:27] there is harbormaster [16:30:49] we wouldn't even need jenkins really (other than all the huge investment in jenkins jobs that wouldn't be easy to convert) [16:31:23] but phab -> harbormaster -> jenkins would work fine [16:38:21] (CR) Alex Monk: "Doesn't look like there was much discussion."
[mediawiki-config] - https://gerrit.wikimedia.org/r/225021 (https://phabricator.wikimedia.org/T103263) (owner: Glaisher) [16:38:41] (CR) Alex Monk: "Do we really want to allow another install of this extension?" [mediawiki-config] - https://gerrit.wikimedia.org/r/225019 (https://phabricator.wikimedia.org/T105853) (owner: Glaisher) [16:39:21] (CR) Glaisher: "It's not a huge wiki and we usually do it if there are no objections after some time." [mediawiki-config] - https://gerrit.wikimedia.org/r/225021 (https://phabricator.wikimedia.org/T103263) (owner: Glaisher) [16:39:23] operations, Deployment-Systems, Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1457738 (mmodell) [16:39:52] (CR) Glaisher: "I don't understand. Any reason not to?" [mediawiki-config] - https://gerrit.wikimedia.org/r/225019 (https://phabricator.wikimedia.org/T105853) (owner: Glaisher) [16:40:06] operations, Reading-Admin, Traffic, HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1457739 (dr0ptp4kt) @tnegrin, @chmarkine: I'll pull together some User-Agent and Referer stuff. [16:44:57] (CR) BryanDavis: "This *almost* works.
When testing in beta cluster I have found that there are a number of files in /srv/mediawiki-staging (mostly symlinks" [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis) [16:48:31] (CR) Jhobs: [C: 1] Remove wap and mobile subdomains [dns] - https://gerrit.wikimedia.org/r/223972 (https://phabricator.wikimedia.org/T104942) (owner: BBlack) [16:49:32] operations, Deployment-Systems, Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1457782 (Krinkle) Using Git information for this is imho infeasible, and needlessly complex. Whethe... [16:52:22] (PS13) Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - https://gerrit.wikimedia.org/r/33066 (owner: Faidon Liambotis) [16:54:19] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [16:54:20] (PS1) Jgreen: change lutetium.wm.o FQDN to service-oriented frdev-eqiad.wm.o [dns] - https://gerrit.wikimedia.org/r/225090 [16:55:40] (CR) Andrew Bogott: [C: 2] Split labs-specific bits of base into labs::base [puppet] - https://gerrit.wikimedia.org/r/33066 (owner: Faidon Liambotis) [16:55:46] (CR) Jgreen: [C: 2 V: 1] change lutetium.wm.o FQDN to service-oriented frdev-eqiad.wm.o [dns] - https://gerrit.wikimedia.org/r/225090 (owner: Jgreen) [16:55:56] operations, network: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#1457822 (80686) NEW [16:56:29] !log authdns update to rename lutetium.wm.o [16:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:39] (PS1) Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 [16:56:42] operations, network: set up a looking glass for WMF ASes -
https://phabricator.wikimedia.org/T106056#1457836 (80686) [16:58:23] (CR) jenkins-bot: [V: -1] [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 (owner: Madhuvishy) [16:58:36] (CR) Andrew Bogott: "I don't object as long as I'm not in charge of keeping the translations in sync :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup) [16:59:59] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [17:10:59] operations, MediaWiki-extensions-CentralAuth: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1457937 (Glaisher) [17:13:21] (CR) CSteipp: [C: 1] "Sounds sane. It would be nice if ConfirmEdit nicely logged when the limit was hit in passCaptchaLimited, so we could track when it gets us" [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: Nemo bis) [17:21:08] (CR) Florianschmidtwelzow: "There is an error message created with wfDebug, isn't this logged anywhere? What logging would you wish, maybe we can add it :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: Nemo bis) [17:28:43] operations, Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1458000 (akosiaris) I 've managed to backport 1.7.2+dfsg-4 from sid to trusty instead. I 've deleted the ugly upstream package and replaced it with the backport. For n... [17:38:45] (CR) Nemo bis: "captcha.log should be rather comprehensive, someone could check whether it's useful enough."
[mediawiki-config] - https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: Nemo bis) [17:46:51] operations, Deployment-Systems, Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1458063 (BBlack) >>! In T99096#1457782, @Krinkle wrote: > Using Git information for this is imho in... [17:52:36] (PS2) BBlack: Redirect sep11.wikipedia.org to https wayback machine [puppet] - https://gerrit.wikimedia.org/r/225043 (owner: Glaisher) [17:53:04] (CR) BBlack: [C: 2 V: 2] Redirect sep11.wikipedia.org to https wayback machine [puppet] - https://gerrit.wikimedia.org/r/225043 (owner: Glaisher) [18:00:05] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150716T1800). Please do the needful. [18:10:28] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:11:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:15:06] (CR) Alexandros Kosiaris: [C: -1] "First round of comments. Seems like a promising thing."
(6 comments) [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: BryanDavis) [18:15:42] operations, Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1458142 (akosiaris) Open>Resolved [18:17:43] operations, Discovery, Maps, Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1458152 (akosiaris) https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/role/maps.pp and various changes in modules/... [18:20:54] any reason I shouldn't deploy wmf14 to group2? [18:21:56] (PS1) BBlack: tlsproxy: double session cache size to measure impact [puppet] - https://gerrit.wikimedia.org/r/225109 [18:22:42] (CR) BBlack: [C: 2 V: 2] tlsproxy: double session cache size to measure impact [puppet] - https://gerrit.wikimedia.org/r/225109 (owner: BBlack) [18:23:29] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [18:23:47] (PS1) 20after4: all wikis to 1.26wmf14 [mediawiki-config] - https://gerrit.wikimedia.org/r/225110 [18:24:08] (CR) 20after4: [C: 2] all wikis to 1.26wmf14 [mediawiki-config] - https://gerrit.wikimedia.org/r/225110 (owner: 20after4) [18:24:14] (Merged) jenkins-bot: all wikis to 1.26wmf14 [mediawiki-config] - https://gerrit.wikimedia.org/r/225110 (owner: 20after4) [18:24:39] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[18:25:21] (PS1) Dzahn: ganglia_new: add ULSFO [puppet] - https://gerrit.wikimedia.org/r/225111 [18:26:08] (PS2) Dzahn: ganglia_new: add ULSFO aggregator setting in hiera [puppet] - https://gerrit.wikimedia.org/r/225111 [18:26:27] (PS1) Yuvipanda: ores: Change wsgi file name / path to match new name [puppet] - https://gerrit.wikimedia.org/r/225112 [18:26:54] (PS1) Alexandros Kosiaris: Remove role::postgres::maps from labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/225113 [18:29:21] (PS2) Yuvipanda: ores: Change wsgi file name / path to match new name [puppet] - https://gerrit.wikimedia.org/r/225112 [18:29:30] (PS3) Dzahn: ganglia_new: add aggregator setting for ULSFO [puppet] - https://gerrit.wikimedia.org/r/225111 (https://phabricator.wikimedia.org/T93776) [18:29:32] (CR) Yuvipanda: [C: 2 V: 2] ores: Change wsgi file name / path to match new name [puppet] - https://gerrit.wikimedia.org/r/225112 (owner: Yuvipanda) [18:31:56] (PS4) Dzahn: ganglia_new: add aggregator setting for ULSFO [puppet] - https://gerrit.wikimedia.org/r/225111 (https://phabricator.wikimedia.org/T93776) [18:34:18] operations, Discovery, Maps, Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1458265 (MaxSem) Each of two services should have their own uids with their own permissions: * kartotherian: SELECT on all tables * operations, Discovery, Maps, Services, Discovery-Maps-Sprint: Puppetize Kartotherian for maps deployment - https://phabricator.wikimedia.org/T105074#1458288 (akosiaris) Normally a new service would require filling a task in https://phabricator.wikimedia.org/project/profile/1305/ . Since we are ta...
[18:40:52] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf14 [18:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:40] (PS2) Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 [18:42:31] LoadBalancer.php is spamming logs badly: [18:42:33] https://phabricator.wikimedia.org/T106072 [18:42:34] (CR) jenkins-bot: [V: -1] [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 (owner: Madhuvishy) [18:43:04] twentyafterfour: Again? I thought that went away [18:43:38] ostriches: apparently not. it got worse after I just pushed wmf14 [18:43:48] Merged your dupe in [18:45:40] it's happening a lot [18:46:01] seems to be going down just a little over time [18:46:11] does deploy kill apc cache? [18:47:24] twentyafterfour: A new version probably evicts a lot of opcode cache entries in favor of the new code... [18:47:44] (PS3) Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 [18:48:08] well I mean, that LoadMonitor cache, I'm not sure how it's stored, is it local to the app server, or is it in a global memcache? [18:48:29] (CR) jenkins-bot: [V: -1] [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 (owner: Madhuvishy) [18:48:55] because the invalid argument supplied to foreach error got a lot more prominent right after pushing the branch even though that code didn't change between the two branches [18:49:34] twentyafterfour: Both! [18:49:47] $srvCache is in memory, $mainCache is memcached.
[18:49:56] I think the code doesn't account for all of the possible not-in-cache scenarios somehow [18:50:04] it's gotta be returning null [18:50:40] probably LoadMonitor.php line 133 [18:52:16] It looks like it handles $value being false in both scenarios [18:52:23] $staleValue = $value ?: false; [18:52:24] And then [18:52:36] $staleValue = $value ?: $staleValue; [18:54:43] The only way I see that foreach() error happening is when $serverIndexes isn't an array [18:55:39] twentyafterfour: I know we tweaked the memcached key recently. [18:56:02] But I'm still not seeing how that'd cause the bug, unless getLagTimes() is given a bogus param. [18:56:28] (PS1) Alexandros Kosiaris: ganglia::web: enable opcache [puppet] - https://gerrit.wikimedia.org/r/225120 [18:56:30] (PS1) Alexandros Kosiaris: ganglia::web: Enable scalability mode for gmetad [puppet] - https://gerrit.wikimedia.org/r/225121 [18:56:32] (PS1) Alexandros Kosiaris: Forbid robots from crawling ganglia [puppet] - https://gerrit.wikimedia.org/r/225122 [18:57:38] (CR) Alexandros Kosiaris: [C: 2] ganglia::web: enable opcache [puppet] - https://gerrit.wikimedia.org/r/225120 (owner: Alexandros Kosiaris) [18:57:39] ostriches: it's returning $value when it should return $staleValue [18:57:46] see https://phabricator.wikimedia.org/rMW9cf6637a751d018bc2bec26ea0f1a9de936c2c9c [18:58:06] (PS2) Alexandros Kosiaris: Remove role::postgres::maps from labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/225113 [18:58:12] (CR) Alexandros Kosiaris: [C: 2 V: 2] Remove role::postgres::maps from labsdb1004 [puppet] - https://gerrit.wikimedia.org/r/225113 (owner: Alexandros Kosiaris) [18:58:36] Herp derp [18:59:21] (PS2) Alexandros Kosiaris: ganglia::web: enable opcache [puppet] - https://gerrit.wikimedia.org/r/225120 [19:03:25] (PS4) Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 [19:04:18] (CR) jenkins-bot: [V: -1]
[WIP] wikilabels: Introducing module [puppet] - https://gerrit.wikimedia.org/r/225092 (owner: Madhuvishy) [19:04:37] (PS1) Yuvipanda: ores: Switch staging to master branch [puppet] - https://gerrit.wikimedia.org/r/225125 [19:05:20] (PS2) Alexandros Kosiaris: ganglia::web: Enable scalability mode for gmetad [puppet] - https://gerrit.wikimedia.org/r/225121 [19:05:26] (CR) Alexandros Kosiaris: [C: 2 V: 2] ganglia::web: Enable scalability mode for gmetad [puppet] - https://gerrit.wikimedia.org/r/225121 (owner: Alexandros Kosiaris) [19:05:46] (PS2) Alexandros Kosiaris: Forbid robots from crawling ganglia [puppet] - https://gerrit.wikimedia.org/r/225122 [19:06:13] (PS2) Yuvipanda: ores: Switch staging to master branch [puppet] - https://gerrit.wikimedia.org/r/225125 [19:06:21] (CR) Yuvipanda: [C: 2 V: 2] ores: Switch staging to master branch [puppet] - https://gerrit.wikimedia.org/r/225125 (owner: Yuvipanda) [19:07:36] (PS3) Alexandros Kosiaris: Forbid robots from crawling ganglia [puppet] - https://gerrit.wikimedia.org/r/225122 [19:07:42] (CR) Alexandros Kosiaris: [C: 2 V: 2] Forbid robots from crawling ganglia [puppet] - https://gerrit.wikimedia.org/r/225122 (owner: Alexandros Kosiaris) [19:10:21] operations, Labs: upgrade salt to 2015.5 - https://phabricator.wikimedia.org/T106074#1458390 (Krenair) [19:13:49] operations, Labs, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1458395 (Andrew) So, here's what I'm seeing: - 3.19 doesn't crash with suspend/resume. That's good! - Suspend/resume doesn't work reliably... instances seem to lose some amount... [19:14:39] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1458399 (ori) @joe Any update on this?
[19:15:44] (PS3) Dzahn: racktables: delete namevirtualhost file [puppet] - https://gerrit.wikimedia.org/r/224992 [19:15:54] (PS2) Dzahn: delete files/apache/conf.d/namevirtualhost [puppet] - https://gerrit.wikimedia.org/r/224993 [19:16:15] I'm gonna deploy https://gerrit.wikimedia.org/r/#/c/225123/ [19:16:32] (PS2) Dzahn: iegreview: add role to node krypton [puppet] - https://gerrit.wikimedia.org/r/225079 (https://phabricator.wikimedia.org/T105007) [19:20:20] !log twentyafterfour Synchronized php-1.26wmf14/includes/db/LoadMonitor.php: Deploying Hotfix for T105373 (duration: 00m 13s) [19:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:44] operations, ops-codfw, Patch-For-Review: Rename osm-cp2001, osm-cp2002, osm-cp2003, osm-cp2004 - https://phabricator.wikimedia.org/T104869#1458409 (akosiaris) Resolved>Open And unfortunately, some things changed and we go for another rename. So: * maps-test-db2001 => maps-test2001 * maps-test-d...
[19:21:14] (PS1) Eevans: eliminate redundant threshold alerts [puppet] - https://gerrit.wikimedia.org/r/225184 [19:23:59] (PS1) Alexandros Kosiaris: Assign new names to maps-test machines [dns] - https://gerrit.wikimedia.org/r/225187 (https://phabricator.wikimedia.org/T104869) [19:25:20] (CR) Dzahn: [C: 2] iegreview: add role to node krypton [puppet] - https://gerrit.wikimedia.org/r/225079 (https://phabricator.wikimedia.org/T105007) (owner: Dzahn) [19:27:01] (CR) Alexandros Kosiaris: [C: 2] Assign new names to maps-test machines [dns] - https://gerrit.wikimedia.org/r/225187 (https://phabricator.wikimedia.org/T104869) (owner: Alexandros Kosiaris) [19:28:11] (PS1) Yuvipanda: ores: Add indonesian spell check package for ores [puppet] - https://gerrit.wikimedia.org/r/225190 [19:28:20] (PS2) Yuvipanda: ores: Add indonesian spell check package for ores [puppet] - https://gerrit.wikimedia.org/r/225190 [19:28:28] (CR) Yuvipanda: [C: 2 V: 2] ores: Add indonesian spell check package for ores [puppet] - https://gerrit.wikimedia.org/r/225190 (owner: Yuvipanda) [19:29:04] (PS4) Dzahn: racktables: delete namevirtualhost file [puppet] - https://gerrit.wikimedia.org/r/224992 [19:30:36] (CR) GWicke: [C: 1] eliminate redundant threshold alerts [puppet] - https://gerrit.wikimedia.org/r/225184 (owner: Eevans) [19:30:51] operations, ops-codfw, Patch-For-Review: Rename osm-cp2001, osm-cp2002, osm-cp2003, osm-cp2004 - https://phabricator.wikimedia.org/T104869#1458455 (akosiaris) mgmt DNS and switch port description done.
racktables and DC labels left [19:31:14] ottomata: https://github.com/wikimedia/mediawiki-php-luasandbox/blob/master/ext_luasandbox.php , https://github.com/wikimedia/mediawiki-php-luasandbox/blob/master/debian/rules , https://github.com/wikimedia/mediawiki-php-luasandbox/blob/master/config.cmake [19:32:35] (PS1) Dzahn: iegreview: ensure deploy dir path exists [puppet] - https://gerrit.wikimedia.org/r/225198 [19:34:30] (PS14) BryanDavis: Add role::mediawiki_vagrant_lxc [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [19:34:50] (CR) BryanDavis: Add role::mediawiki_vagrant_lxc (6 comments) [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: BryanDavis) [19:36:10] (PS2) Dzahn: iegreview: ensure deploy dir path exists [puppet] - https://gerrit.wikimedia.org/r/225198 [19:36:19] so apparently there's no way for a user to fix it themselves if they put the wrong email address in phab when registering. a user on enwiki just did that. who can take care of it for them? [19:36:32] (PS3) Dzahn: iegreview: ensure deploy dir path exists [puppet] - https://gerrit.wikimedia.org/r/225198 [19:36:56] jackmcbarn: afaik all we can do is delete the user (based on seeing tickets like that in the past) [19:37:10] (PS1) Alexandros Kosiaris: Introduce maps-test200{1,2,3,4} [dns] - https://gerrit.wikimedia.org/r/225201 (https://phabricator.wikimedia.org/T105394) [19:37:20] mutante: should i direct the user here to ask for that?
[19:37:46] jackmcbarn: ideally to #wikimedia-devtools [19:37:51] okay [19:38:44] (CR) Dzahn: [C: 2] iegreview: ensure deploy dir path exists [puppet] - https://gerrit.wikimedia.org/r/225198 (owner: Dzahn) [19:39:26] (PS5) Dzahn: racktables: delete namevirtualhost file [puppet] - https://gerrit.wikimedia.org/r/224992 [19:39:42] !log imported aspell-id from ubuntu to jessie-wikimedia - needed by ores, simple package that I am not sure why it is not in jessie [19:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:01] arrr. good old "duplicate declaration" when combining multiple roles.. [19:41:34] (CR) Yuvipanda: "IS THERE COMMUNITY CONSENSUS? CAN WE HAVE AN RFC?" [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup) [19:43:15] operations, ops-codfw, Patch-For-Review: Rename osm-cp2001, osm-cp2002, osm-cp2003, osm-cp2004 - https://phabricator.wikimedia.org/T104869#1458509 (Papaul) Open>Resolved rack-tables update complete , labels changed complete. [19:43:49] PROBLEM - puppet last run on krypton is CRITICAL puppet fail [19:46:27] (PS1) Alexandros Kosiaris: Introduce maps-test200{1,2,3,4} [puppet] - https://gerrit.wikimedia.org/r/225204 [19:56:25] operations, Analytics, Discovery, MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1458570 (GWicke) [20:00:11] YuviPanda: it's odd that ONLY the -id package is missing, i asked the bot "judd" and #debian and this other guy has been looking for it and even mailed debian-user-indonesian@lists.debian.org , but not clear why :) [20:00:21] me neither yeah [20:01:12] popularity contest says that exactly 1(!sic) user had it installed though [20:01:27] we have 7 hosts with it now! [20:01:30] :) [20:01:41] do we enable reporting the popcon info?
[20:01:47] not sure [20:04:43] (CR) Ladsgroup: "I asked you or Andrew in Lyon if an RFC needed and you said no." [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup) [20:07:29] (PS1) Ori.livneh: Revert "hhvm: make cache files management explicit in puppet" [puppet] - https://gerrit.wikimedia.org/r/225209 [20:08:50] (PS1) Ori.livneh: Add a script for rolling restart of HHVM servers [puppet] - https://gerrit.wikimedia.org/r/225211 [20:12:30] (PS2) Ori.livneh: Revert "hhvm: make cache files management explicit in puppet" [puppet] - https://gerrit.wikimedia.org/r/225209 [20:13:48] (PS1) Dzahn: use ensure_resource to avoid duplicate declarations [puppet] - https://gerrit.wikimedia.org/r/225213 [20:13:57] (CR) Ori.livneh: [C: 2] Revert "hhvm: make cache files management explicit in puppet" [puppet] - https://gerrit.wikimedia.org/r/225209 (owner: Ori.livneh) [20:15:44] (PS2) Dzahn: use ensure_resource to avoid duplicate declarations [puppet] - https://gerrit.wikimedia.org/r/225213 [20:15:53] (CR) Dzahn: [C: 2] use ensure_resource to avoid duplicate declarations [puppet] - https://gerrit.wikimedia.org/r/225213 (owner: Dzahn) [20:15:55] (CR) Ori.livneh: "(Confirmed no-op)" [puppet] - https://gerrit.wikimedia.org/r/225209 (owner: Ori.livneh) [20:17:44] icinga-wm: speak up [20:17:49] RECOVERY - puppet last run on krypton is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:18:07] (CR) Dzahn: "13:19 < icinga-wm> RECOVERY - puppet last run on krypton is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures" [puppet] - https://gerrit.wikimedia.org/r/225213 (owner: Dzahn) [20:18:47] (PS6) Dzahn: racktables: delete namevirtualhost file [puppet] - https://gerrit.wikimedia.org/r/224992 [20:19:16] (CR) Dzahn: [C: 2] racktables: delete namevirtualhost file [puppet] -
https://gerrit.wikimedia.org/r/224992 (owner: Dzahn) [20:20:45] (CR) Ori.livneh: [C: 2] Add a script for rolling restart of HHVM servers [puppet] - https://gerrit.wikimedia.org/r/225211 (owner: Ori.livneh) [20:20:55] (PS2) Ori.livneh: Add a script for rolling restart of HHVM servers [puppet] - https://gerrit.wikimedia.org/r/225211 [20:22:08] (PS3) Dzahn: delete files/apache/conf.d/namevirtualhost [puppet] - https://gerrit.wikimedia.org/r/224993 [20:25:16] (PS1) coren: Replicas: include a restricted watchlist view [software] - https://gerrit.wikimedia.org/r/225218 (https://phabricator.wikimedia.org/T59617) [20:27:20] (CR) Dzahn: [C: 2] "the last module to use this was racktables, and also removed it from there now" [puppet] - https://gerrit.wikimedia.org/r/224993 (owner: Dzahn) [20:30:30] operations, Wikimedia-IEG-grant-review, Patch-For-Review: move iegreview to a VM - https://phabricator.wikimedia.org/T105007#1458697 (Dzahn) a:Dzahn [20:32:13] (PS1) Dzahn: iegreview: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/225221 (https://phabricator.wikimedia.org/T105007) [20:34:09] operations: Update Elasticsearch to 1.6.1 - https://phabricator.wikimedia.org/T106090#1458719 (MoritzMuehlenhoff) [20:36:07] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1458728 (Manybubbles) p:Triage>Normal [20:36:31] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1458734 (Manybubbles) I was thinking we could time this with the rolling restart the shuts down dynamic scripting.
[20:38:03] (PS2) KartikMistry: WIP: Do not use registry for Beta [puppet] - https://gerrit.wikimedia.org/r/224955 [20:42:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 16.67% of data above the critical threshold [500.0] [20:43:23] (PS1) Dzahn: misc-web varnish: switch iegreview to krypton [puppet] - https://gerrit.wikimedia.org/r/225226 (https://phabricator.wikimedia.org/T105007) [20:44:44] (PS3) KartikMistry: Do not use registry for Beta [puppet] - https://gerrit.wikimedia.org/r/224955 [20:45:27] (PS1) Dzahn: iegreview: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/225229 (https://phabricator.wikimedia.org/T105007) [20:48:30] (PS4) Alexandros Kosiaris: cxserver: Allow for empty registry [puppet] - https://gerrit.wikimedia.org/r/224955 (owner: KartikMistry) [20:48:36] (PS5) Alexandros Kosiaris: cxserver: Allow for empty registry [puppet] - https://gerrit.wikimedia.org/r/224955 (owner: KartikMistry) [20:49:30] (PS6) Alexandros Kosiaris: cxserver: Allow for empty registry [puppet] - https://gerrit.wikimedia.org/r/224955 (owner: KartikMistry) [20:50:08] !log iegreview tool - short maintenance downtime [20:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:17] (CR) Dzahn: [C: 2] iegreview: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/225229 (https://phabricator.wikimedia.org/T105007) (owner: Dzahn) [20:50:48] (CR) Alexandros Kosiaris: [C: 2] cxserver: Allow for empty registry [puppet] - https://gerrit.wikimedia.org/r/224955 (owner: KartikMistry) [20:51:25] (PS2) Dzahn: misc-web varnish: switch iegreview to krypton [puppet] - https://gerrit.wikimedia.org/r/225226 (https://phabricator.wikimedia.org/T105007) [20:51:34] (CR) Dzahn: [C: 2] misc-web varnish: switch iegreview to krypton [puppet] - https://gerrit.wikimedia.org/r/225226 (https://phabricator.wikimedia.org/T105007) (owner:
10Dzahn) [20:53:42] (03PS2) 10Dzahn: iegreview: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/225229 (https://phabricator.wikimedia.org/T105007) [20:54:01] mutante: why are there two VMs (bromine and krypton) instead of one? :) [20:54:08] (03PS2) 10Dzahn: iegreview: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225221 (https://phabricator.wikimedia.org/T105007) [20:55:35] (03PS3) 10Dzahn: iegreview: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225221 (https://phabricator.wikimedia.org/T105007) [20:56:34] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1458789 (10Yurik) Correction: for the "tilerator" (generation service), it will also need CREATE KEYSPACE and CREATE TABLE rights. [20:57:05] (03CR) 10Dzahn: [C: 032] iegreview: update Apache config for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225221 (https://phabricator.wikimedia.org/T105007) (owner: 10Dzahn) [20:57:34] SPF|Cloud: one is for PHP apps, one for static HTML sites [20:58:13] Yes but why don't you just put them on one server/VM like zirconium? [20:58:28] Redundancy, performance, ..? [20:59:37] 6operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1458799 (10Gage) TLDR: the rough estimate is about 32Mbit/sec from jobrunners to elasticsearch nodes. Traffic is bursty so I advise planning for a 50-60Mbit ceiling. Details: Joe mentione... [21:00:16] service separation and a compromise between all-in-one and one instance per service [21:00:28] can we discuss it any time except in the moment i'm switching ? 
[21:00:56] Sure [21:01:06] (03PS4) 10Thcipriani: Add service deploy via scap [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 [21:02:14] PHP Fatal error: Call to undefined function Wikimedia\\IEGReview\\curl_init() in /srv/deployment/iegreview/iegreview/src/ParsoidClient.php [21:02:19] grrr [21:02:56] Right php... [21:03:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:03:35] more like "how did it work before unless it was manual stuff" [21:04:29] (03PS1) 10Zfilipin: Fixed Style/TrailingWhitespace RuboCop offense [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) [21:05:05] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [21:05:13] In these kind of situations I'm happy when I still have the piece of software on the old server [21:05:18] we dont even have php/conf.d anymore on jessie [21:05:36] Really? [21:07:38] SPF|Cloud: it's now split up in apache and cli. 
/etc/php5/apache2/conf.d [21:08:16] I don't work that often with Ubuntu so that's normal for me [21:08:22] 6operations: Update libzmq3/pyzmq - https://phabricator.wikimedia.org/T106093#1458818 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [21:08:55] php5-curl wasn't puppetized either way [21:09:53] manual installs tssk tssk tssk [21:12:42] (03PS1) 10Dzahn: iegreview: require package php5-curl [puppet] - 10https://gerrit.wikimedia.org/r/225241 (https://phabricator.wikimedia.org/T105007) [21:13:16] Looks good [21:13:32] (03PS2) 10Dzahn: iegreview: require package php5-curl [puppet] - 10https://gerrit.wikimedia.org/r/225241 (https://phabricator.wikimedia.org/T105007) [21:14:01] (03CR) 10Dzahn: [C: 032] iegreview: require package php5-curl [puppet] - 10https://gerrit.wikimedia.org/r/225241 (https://phabricator.wikimedia.org/T105007) (owner: 10Dzahn) [21:14:18] SPF|Cloud: just that i dont have an actual login for it ..hmm [21:14:24] looks [21:14:53] I'd personally install php5-curl by default when installing php5 (like in this case) though [21:15:17] i'll just install what it actually needs [21:15:37] I would find that annoying I guess, but np [21:15:54] can't you ask someone for a login? 
[21:15:59] it's also a test to see if these roles are properly puppetized [21:16:10] apply them on a fresh instance and see what fails [21:16:37] it's probably easier to make myself one [21:16:40] When migrating services in prod it's bad when stuff fails :) [21:16:58] yes, that's why we should never manually install packages [21:17:04] yep [21:17:30] And preferably (if possible at all, which might not be the case here) test on a labs instance first [21:23:46] hey folks [21:23:56] (03PS1) 10Ori.livneh: Enable ESI for testwiki [puppet] - 10https://gerrit.wikimedia.org/r/225243 [21:24:06] bblack, paravoid: ^ [21:24:07] if any ops is around, nutcracker might need to be kicked on mw1128 and mw1134 [21:24:30] they have a huge share of the log errors with more than 300k memcached errors for the last 15 minutes -:-( [21:24:33] hah [21:24:34] 14:14 operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - T88730 (ori) @joe Any update on this? [21:24:35] ori: hahahaha [21:24:42] ohhh [21:24:54] i'll do it [21:25:00] bblack: read the commit before you run away :P [21:25:07] commit message [21:25:29] ori: thanks !! you are so sweet [21:25:42] ori: any commit message that has to be that long is doomed! :P [21:26:12] anyway logstash point at them https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors [21:26:17] !log bounced nutcracker on mw1128 and mw1134 [21:26:21] mw1139 seems to have the same issue :-( [21:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:08] (03PS5) 10Lokal Profil: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) [21:27:11] !log bounced nutcracker on mw1139 as well. hashar noticed flood of errors from these hosts on https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors . lack of monitoring / alerts is troubling. 
[21:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:25] ori fix confirmed! [21:27:32] well at least i can't see any more errors in logstash [21:28:04] ori: even a 5-minute cache lifetime means it's only being re-fetched from PHP once per 5 minutes. we do a shitload of requests over 5 minutes. [21:28:31] if it's a question of many clients seeing a perf spike on the 5-minute boundary, we could look at varnish grace/saint -mode stuff to ensure we keep serving stale content while we reload the slow resource into cache. [21:28:40] ori, hashar: watching for those nutcracker deaths was the inspiration for https://phabricator.wikimedia.org/T100735 [21:29:13] ideally [21:29:18] logstash would whine about it here [21:29:38] tgr did some work using Sentry but it is not resourced / priority :( [21:29:41] logstash won't. it is a stream processor with no memory [21:29:42] bd808: are you taking it on? if you're too busy, it seems reasonable to me to assign that one to ops. [21:30:33] ori: I'm working towards it. We need to get logstash100[123] reimaged (which I think will start next week) and then upgrade to newest logstash [21:30:51] cool [21:30:55] assigning to you then :P [21:31:00] heh. ok [21:32:14] ori: actually I just looked, and we already do grace-mode. So we shouldn't have clients stacking up waiting on the refresh of the big RL URL every time it expires. [21:33:22] but in the overall, I'm just loathe to do any unecessary increases in our dependencies on subtle things about varnish, which ESI might fall under. [21:33:49] it really is the biggest rendering performance bottleneck we have [21:33:53] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1458885 (10Krinkle) >>! In T99096#1458063, @BBlack wrote: >>>! In T99096#1457782, @Krinkle wrote: >>... 
[21:34:06] ori: where does this show up from a client's POV? [21:34:30] it is cacheable, right? are we seeing any significant fraction of clients actually held up waiting on PHP to regenerate the cached object? [21:34:50] (if so, maybe we just need to tune varnish/VCL a little for this use-case, or there's a bug in that stuff in our VCL) [21:35:19] no, the VCL for this is working properly, cache hit-rate is near total [21:35:32] so what's the issue? [21:36:19] it blocks the page load on an external resource that has to be refetched via a network request every five minutes [21:37:18] if the URL was versioned, we could set far-future expires headers, and allow the native browser cache to hang on to it essentially forever [21:37:42] I think I'm getting lost in some ambiguities here [21:38:07] the blocker on page load is that once per 5 minutes, the client has to fetch the object from varnish's cache, right? [21:38:22] ok [21:38:36] yes [21:38:41] so, couldn't we version the URL even without ESI? [21:38:50] how? [21:39:00] I don't know what the reasons are that we can't [21:39:14] because page HTML is cached for 30 days [21:39:29] oh, THAT [21:39:39] ok, it all makes sense now, mostly [21:39:53] and the HTML has to contain a reference to the startup module [21:39:54] also the reason we have to keep 5 old MW branches everywhere [21:40:12] bblack: We already do this for javascript resources since the url is dynamically constructed using the version from the startup module. But for stylesheets, it's raw in the html. [21:40:16] couldn't we split that anyways? [21:40:27] an alternative that paravoid and i just floated was to pass the version as a cookie and then construct the URL using javascript [21:40:38] yeah something like that [21:40:49] can we cache for less time?
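A rough sketch of the versioned-URL idea raised above: a URL that carries a version can be cached by the browser for a very long time, while an unversioned one needs a short expiry so deployments roll out. The `v` query parameter, the example paths and the one-year lifetime are assumptions for illustration, not the actual MediaWiki or Varnish configuration.

```python
from urllib.parse import parse_qs, urlsplit

FIVE_MINUTES = 5 * 60            # short expiry for unversioned URLs
ONE_YEAR = 365 * 24 * 60 * 60    # "far-future" lifetime (assumed value)


def cache_control_for(url: str) -> str:
    """Pick a Cache-Control value based on whether the URL carries a version."""
    query = parse_qs(urlsplit(url).query)
    if "v" in query:
        # The content behind a versioned URL never changes; a new deploy
        # produces a new URL instead, so the browser can keep this one.
        return f"public, max-age={ONE_YEAR}"
    # Without a version the same URL must pick up new deployments quickly.
    return f"public, max-age={FIVE_MINUTES}"


print(cache_control_for("/w/load.php?modules=startup"))
print(cache_control_for("/w/load.php?modules=startup&v=123"))
```

The catch discussed in the channel still applies: the page HTML that references the URL is itself cached for 30 days, so the versioned reference has to get into the page some other way.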
[21:40:54] advantages: it's easy to do in varnish and doesn't involve anything radioactive [21:41:05] hashar: probably not "less" enough to solve this problem, no [21:41:25] disadvantages: speed hit -- the browser's lookahead parser won't see the URL until the javascript code executes, so it'll be requested later [21:41:40] and: no clean way for varnish to know what value it should set for that cookie [21:41:46] and it's ugly too :) [21:41:52] I almost hate myself for proposing it [21:42:06] What's up with grafana [21:42:09] is logstash down? [21:42:26] getting a 403 wikimedia error [21:42:36] even if we did ESI, ESI would be in varnish not in the browser. the browser would still need to load the big object that's referenced in the 30-day cached HTML. [21:42:47] maybe I don't completely get this yet, I don't know [21:43:02] I mean, I don't get how ESI makes the situation better [21:43:05] bblack: [21:43:09] That url is cached for 5 minutes [21:43:18] So that we can deploy changes quickly and have them roll out [21:43:20] Krinkle: works for me. https://logstash.wikimedia.org/#/dashboard/elasticsearch/default [21:43:28] bblack: cache busting via a different ESI-able URL, essentially [21:43:28] if that url contains ?v=123 [21:43:29] Krinkle: right, so don't you have to version that URL regardless of ESI? [21:43:37] we can have the browser cache it locally in browser cache indefinitely [21:43:41] no, just set a short expiry [21:43:42] like we already do with all our other static resources [21:43:44] which we do [21:43:52] so there's no roundtrip, not even a 304 then [21:44:01] what Krinkle said [21:44:02] mutante: is iegreview now working? [21:44:13] bd808: https://grafana.wikimedia.org/#/dashboard/db/grafana [21:44:15] 503 here [21:44:41] SPF|Cloud: are you piping a single-point-of-failure into the Cloud? 
:) [21:44:43] grafana !== logstash [21:44:44] I guess what I mean is, you could still do ?v=123 without ESI, by combining this stuff per-version at the PHP or deployment level right? [21:44:53] why do we need to combine the pieces in varnish? [21:44:54] ori: no :) [21:44:56] bd808: It uses elasticsearch backend of logstash to store dashboards [21:45:06] 16:39 because page HTML is cached for 30 days [21:45:06] 16:39 oh, THAT [21:45:09] it's a Southparkfan in the Cloud [21:45:23] ok sorry, I keep thinking you're intending to use ESI on the RL stuff itself [21:45:33] you're intending to use ESI on the HTML page content [21:45:37] (03PS2) 10Alexandros Kosiaris: Introduce maps-test200{1,2,3,4} [puppet] - 10https://gerrit.wikimedia.org/r/225204 [21:45:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce maps-test200{1,2,3,4} [puppet] - 10https://gerrit.wikimedia.org/r/225204 (owner: 10Alexandros Kosiaris) [21:45:45] bblack: don't feel bad, it took me a while to get it too and our conversation was more high bandwidth :) [21:45:45] yes, on the URL reference to the startup module that is contained in the page HTML content [21:46:02] (03PS5) 10Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - 10https://gerrit.wikimedia.org/r/225092 [21:46:22] basically, all load.php urls have versions in them already as they're generated client-side in javascript (basically load.php + modulename + startup.version) [21:46:32] but there are two exceptions: startup module, and stylesheet [21:46:41] (03PS2) 10Alexandros Kosiaris: Introduce maps-test200{1,2,3,4} [dns] - 10https://gerrit.wikimedia.org/r/225201 (https://phabricator.wikimedia.org/T105394) [21:46:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce maps-test200{1,2,3,4} [dns] - 10https://gerrit.wikimedia.org/r/225201 (https://phabricator.wikimedia.org/T105394) (owner: 10Alexandros Kosiaris) [21:46:59] (03CR) 10Yuvipanda: "Also add class docs for each class?" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/225092 (owner: 10Madhuvishy) [21:47:01] Krinkle: I'm seeing a 503 for https://grafana.wikimedia.org/#/dashboard/db/grafana as well but for the initial app load from wherever grafana lives [21:47:48] Yeah [21:47:56] meaning it's not getting the js/html that will eventually want to talk to the logstash elasticsearch cluster [21:49:11] so I guess as opposed to not using ESI, all this means is we can keep the 30d TTL and still have the CSS link updated in 5 minutes without re-fetching every page from hhvm->varnish. [21:49:25] (03CR) 10Gilles: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/225243 (owner: 10Ori.livneh) [21:49:34] right, and the clients don't cache the HTML either. we tell them not to so that we can do local invalidation [21:50:03] bblack: It basically changes the situation from cacheable html with a non-cacheable css resource, to cacheable html that includes a cacheable url value (that we can purge separately). [21:50:03] so what ESI is saving us here, is varnish not having to set page HTML lifetimes down to 5 minutes and constantly spam-refresh from hhvm [21:50:05] and compose them [21:50:15] and when the browser sees that versioned css url, it will have it in its cache [21:50:20] hopefully not purge, just set a short lifetime on it [21:50:22] so there's no roundtrips other than the page view html [21:50:38] bblack: Yeah, standard 304 handling. 
We can make that include 5-min expiry [21:50:49] with 304 if unchanged to the varnish layer [21:50:59] I assume ESI supports those HTTP protocols [21:51:44] you mean cache-control set up so that it checks the backend every 5 minutes, and gets a 304 in the usual case unless we've just deployed a change to the CSS URL [21:51:45] essentially > [21:51:48] simplified [21:51:56] bblack: yeah [21:52:13] it all makes sense [21:52:18] :) [21:52:28] Trevor, Roan and myself thought this up in 2009 [21:52:31] the reason I'm reluctant to go down the ESI road all has to do with varnish-loathing really [21:52:36] There's an RT ticket for it assigned to mark [21:52:36] :P [21:52:51] well yeah but ESI was known-broken for a long time [21:52:54] https://phabricator.wikimedia.org/T78963 [21:52:56] Yeah [21:53:02] But we haven't forgotten :P [21:53:14] It's time to re-evaluate our compromise [21:53:38] SPF|Cloud: yes, it does. mysql grant works (maybe it shouldnt have), hacked myself a new login user, logged in, looks good [21:53:40] since our current performance metrics identify this as a bottleneck, one of few. [21:53:42] I'd want to at least be sure that whatever we try to do with ESI, we know is portable to other options too. I don't know how standard ESI functionality is, and I'd hate to have one more thing locking us into varnish-specifics. [21:53:46] Okay, nice [21:54:19] e.g. that whatever we're using would work with https://docs.trafficserver.apache.org/en/latest/reference/plugins/esi.en.html too [21:54:27] bblack: Yeah, for stock mediawiki installs without static caching, the solution is simple: the application server just embeds the right url in the first place. [21:55:07] This is similar to what big websites that serve most content from application servers do as well (e.g. github, twitter, facebook) from what I can see they largely cache assets but not the page views. 
To allow inclusion of those resources and user-specific stuff in the initial hit [21:55:25] trafficserver is the most likely thing we'll evaluate as a varnish replacement sometime in the next several months [21:55:26] then we'd implement this inclusion system in MediaWIki File cache as proof of concept. [21:55:31] that transition will be hard enough as it is heh [21:55:49] and in prod we'd use varnish esi instead of mediawiki filecache with dynamic replacements. [21:56:01] (03PS1) 10Alexandros Kosiaris: Assign roles to maps-test200{1,2,3,4} [puppet] - 10https://gerrit.wikimedia.org/r/225248 [21:56:51] 6operations, 5Patch-For-Review, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1458955 (10Dzahn) [21:56:53] 6operations, 10Wikimedia-IEG-grant-review, 5Patch-For-Review: move iegreview to a VM - https://phabricator.wikimedia.org/T105007#1458953 (10Dzahn) 5Open>3Resolved done and moved. made myself a login user and reset the admin password on mysql level to get in and check. looks alright [21:59:32] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1458970 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/225111 [22:01:13] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1458981 (10dr0ptp4kt) @Tnegrin, @Chmarkine: and now, the data. Queries spanning multiple days seem to sometimes flake out when not restri... [22:03:22] (03PS1) 10Dzahn: grafana: add ferm service for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/225249 [22:04:36] ori / Krinkle: can we talk about this again in a week or so? 
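For reference, the ESI idea discussed above can be sketched roughly like this: the page HTML stays cached for a long time and contains only an include tag, while the edge substitutes a short-lived fragment holding the versioned stylesheet URL each time the page is served. This is a toy simulation and not Varnish's implementation; the `/esi/styles` path, the `load.php` URLs and the `fragments` dict are hypothetical, and only the `<esi:include src="..."/>` tag shape comes from the ESI language itself.

```python
import re

# Matches the self-closing include tag defined by the ESI language.
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def compose(cached_html, fetch_fragment):
    """Replace each <esi:include> tag with the fragment body for its src."""
    return ESI_INCLUDE.sub(lambda m: fetch_fragment(m.group(1)), cached_html)


# Hypothetical long-lived page body (the 30-day cached HTML in the discussion).
PAGE = '<head><esi:include src="/esi/styles"/></head><body>article</body>'

# Hypothetical short-lived fragment store (the 5-minute object in the discussion).
fragments = {
    "/esi/styles": '<link rel="stylesheet" href="/w/load.php?modules=site&v=123">',
}

before = compose(PAGE, fragments.__getitem__)

# A deployment changes only the fragment; the cached page body is untouched.
fragments["/esi/styles"] = '<link rel="stylesheet" href="/w/load.php?modules=site&v=124">'
after = compose(PAGE, fragments.__getitem__)

print(before)
print(after)
```

The browser only ever sees the composed output, so it gets a versioned stylesheet URL it can cache for a long time without the page HTML lifetime having to drop to 5 minutes.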
[22:04:38] (03PS2) 10Dzahn: grafana: add ferm service for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/225249 [22:04:44] you can test on beta or a local VM before that regardless [22:05:02] but with everyone at wikimania and traveling after, and I'm on vacation starting in a few hours, etc... [22:05:13] (03PS3) 10Dzahn: grafana: add ferm service for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/225249 [22:05:20] (03CR) 10Dzahn: [C: 032] grafana: add ferm service for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/225249 (owner: 10Dzahn) [22:05:27] I have no idea without looking whether having that do_esi even inside a conditional changes something for the main text varnish's behavior/compiled-VCL tbh. [22:05:56] it's just not a good time for prod experimentation, and testwiki goes through prod [22:10:10] (03CR) 10BBlack: [C: 04-1] "We talked about this on irc. It's possible we can move on this, but let's hold at least a week before even a testwiki enablement on the p" [puppet] - 10https://gerrit.wikimedia.org/r/225243 (owner: 10Ori.livneh) [22:10:13] Krinkle: an alternative to ESI would be to compose pages using JS, in a service worker or an edge fall-back server [22:10:39] gives you more flexibility for generating content based on cookies, for example [22:11:18] gwicke: we talked about some ways to compose in JS, but that still adds some latency to when the browser can actually start fetching. [22:11:57] depends.. with service-workers the client already has the JS [22:12:03] it's like a proxy running in the browser [22:12:11] this really is a situation where ESI makes the most sense of the available solutions. 
I'm just scared of varnish+ESI on two different levels :) [22:12:26] (03PS6) 10Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - 10https://gerrit.wikimedia.org/r/225092 [22:12:41] on first load of an authenticated request (or cache miss), it would happen at the edge instead [22:12:50] (further varnish lock-in, and the risks of it still being unstable in practice and/or performance-intensive) [22:14:26] bblack: you were probably thinking about single page app style composition [22:14:46] well this is for CSS primarily regardless [22:14:48] where the browser loads a barebones page that references some JS that in turn fills out the page [22:15:22] gwicke: we were talking about e.g. setting a cookie value that static JS in the page uses to inject a dynamic CSS URL into the page. [22:16:02] X-RL-CSSVer: 1.26mf5 or whatever [22:17:08] that would be high-volume if implemented on top of Varnish cache hits [22:18:19] no, the idea is the cookie/header value is generated in varnish (we already do that sort of thing), and the client fetches a versioned CSS URL with a long cacheability lifetime for the client. [22:18:31] deployments would involve VCL updates. it's ugly [22:18:46] but anyways, ESI is a better idea than that. [22:19:13] matt's benchmarks indicated about 50% lower throughput with five includes [22:19:17] (03CR) 10Ori.livneh: "Bblack, ack -- thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/225243 (owner: 10Ori.livneh) [22:19:56] (sorry to have missed the tail end of the conversation, but it looked like Timo was on it and I had to go meet with Roan to flesh out our presentation for tomorrow) [22:20:02] yeah [22:20:07] we'd definitely be very careful to keep the number of fragments low [22:20:15] *need to be* [22:20:33] ori: I'm convinced that ESI, used carefully and sparing, is probably the Right Thing here, I'm just worried about the other varnishy concerns here about perf/stability/lock-in :) [22:21:04] let's make sure any solution we use for this at least is ESI-compatible with Apache Traffic Server's ESI support too, theoretically [22:21:20] (which is here: https://docs.trafficserver.apache.org/en/latest/reference/plugins/esi.en.html ) [22:21:21] yes, understood. it ups the total headache quotient (THQ?) for varnish, so it's not something to undertake lightly. [22:22:31] I still have high hopes that sometime during 2015 I can get back on the train of refactoring/merging all our varnish clusters and VCL down into much simpler code and only two clusters: upload, and everything-else (former text+mobile+bits). [22:23:04] and that once that work is complete and the VCL/varnish-clusters world is simpler, we can do evaluations on porting our functionality to ATS vs upgrading to varnish4. both have significant risks and such. [22:25:39] nginx supports SSI [22:26:58] (03CR) 10Dzahn: [C: 032] "yes, i moved that into the role from tin. the goal was just to enable it on mira (new deployment server in codfw) and i argued it should b" [puppet] - 10https://gerrit.wikimedia.org/r/225025 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) [22:27:51] gwicke: yeah nginx isn't really a very viable option for a varnish replacement though at this time. 
maybe the commercial variant is getting close, but that's relatively-new and, well, commercial :) [22:27:59] (03PS2) 10Dzahn: Do not backup beta cluster deployment server [puppet] - 10https://gerrit.wikimedia.org/r/225023 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) [22:28:21] we do a ton in the varnish frontends that's deeply complex and currently-necessary, esp wrt analytics. [22:28:52] definitely not a replacement; but, perhaps another content massaging option [22:29:06] (03PS1) 10Yuvipanda: puppetception: Run puppetception in a cron [puppet] - 10https://gerrit.wikimedia.org/r/225255 [22:29:08] (03PS1) 10Yuvipanda: ores: Add a separate worker role [puppet] - 10https://gerrit.wikimedia.org/r/225256 [22:29:10] (03CR) 10Dzahn: [C: 032] Do not backup beta cluster deployment server [puppet] - 10https://gerrit.wikimedia.org/r/225023 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) [22:29:22] the thing I dislike with both ESI and SSI is that it's pretty hard to work with for front-end developers [22:29:27] (03PS2) 10Dzahn: Do not use releases::upload on beta cluster deployment server [puppet] - 10https://gerrit.wikimedia.org/r/225025 (https://phabricator.wikimedia.org/T106003) (owner: 10Hashar) [22:29:43] and things that are hard to test tend to see less testing [22:29:45] gwicke: yeah. I considered actually using the open-source version's caching as a very small/tiny/light cache with a very short lifetime (e.g. 1-minute) since it's in front of varnish anyways. It sort of makes sense even now. 
[22:30:11] but then we'd lose analytics on the hits that nginx absorbed unless we start wading into the mess of all analytics + related transform/mangle up in the nginx layer :/ [22:30:29] (03PS2) 10Yuvipanda: puppetception: Run puppetception in a cron [puppet] - 10https://gerrit.wikimedia.org/r/225255 [22:30:45] (03PS3) 10Yuvipanda: puppetception: Run puppetception in a cron [puppet] - 10https://gerrit.wikimedia.org/r/225255 [22:31:38] bblack: yeah, analytics are tricky [22:32:15] a lot of the analytics stuff isn't just logging requests, either. it's doing a bunch of mangle/transform and spitting new headers back out on the response side [22:32:27] the worst is the geoip cookie case heh [22:32:27] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetception: Run puppetception in a cron [puppet] - 10https://gerrit.wikimedia.org/r/225255 (owner: 10Yuvipanda) [22:32:40] (03PS2) 10Yuvipanda: ores: Add a separate worker role [puppet] - 10https://gerrit.wikimedia.org/r/225256 [22:32:48] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Add a separate worker role [puppet] - 10https://gerrit.wikimedia.org/r/225256 (owner: 10Yuvipanda) [22:34:27] (03PS7) 10Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - 10https://gerrit.wikimedia.org/r/225092 [22:36:09] bblack: for APIs backed by storage just nginx caching could work great [22:38:10] from a latency perspective [22:38:18] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1459081 (10BBlack) Phabricator pro-tip: You can put long lists in quoted code blocks with a `lines=NN` at the top and they'll scroll. For... [22:39:32] gwicke: I have some plans around eliminating some of the pointless hops for uncacheable requests anyways. ipsec is still an annoying blocker, but we need to deal with that problem regardless. 
[22:39:56] still, there's no fundamental reason once we've decided in a varnish frontend daemon that a request looks uncacheable that we can't pass it directly to appservers from there. [22:40:37] yeah [22:40:45] we can *at least* skip the backend layer in the remote tier and make it always be 1-2 hops instead of 2-3. [22:41:00] (and skip the backend layer completely when it's a t1 datacenter) [22:41:20] until the ipsec problem is solved, though, we'd probably still hop from e.g. varnish-fe@esams -> varnish-be@eqiad just for ipsec [22:41:59] IIRC the mean latency those extra hops add is around 15ms [22:42:10] the bigger issue is p99 [22:42:15] yeah [22:42:26] plus it's just pointless/wasted processing at those layers [22:42:38] and makes statistics on cache behavior harder to understand, too [22:42:50] *nod* [22:44:38] (03PS8) 10Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - 10https://gerrit.wikimedia.org/r/225092 [22:45:27] (03CR) 10jenkins-bot: [V: 04-1] [WIP] wikilabels: Introducing module [puppet] - 10https://gerrit.wikimedia.org/r/225092 (owner: 10Madhuvishy) [22:46:51] some digging I was doing the other day on text-cache 2layer behaviors: https://phabricator.wikimedia.org/P969 [22:47:48] (all of which supports the idea that (a) we could be smarter about skipping the be-layer in many cases, and (b) our be-layer lacks some pass/hitpass logic it should have and treats them as misses, and (c) the be-layer may or may not be worth the trouble in true cache-hit-rate terms for the text case anyways.) [22:51:36] a lot of the backend misses are probably driven by API requests and authenticated browsing [22:52:58] both of which are fixable [22:54:55] yeah what I'd like to do on that whole front is first refactor and split our FE and BE VCL code better. BE should be dead-simple compared to FE, and only needs special exceptions for passes (or not, if we skip BE on passes...). 
[22:55:31] once we know that passes skip BE, or at least that they're aligned on pass/hitpass behavior, it should be a lot more glaringly obvious exactly how much true hit (as in, appserver load reduction) we're getting from having a BE layer at all. [22:56:11] yeah [22:56:39] (03PS12) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [22:57:05] at the request rates the backend layer sees we have a good amount of flexibility in what tech we are using there [22:57:11] right now, it still naively looks like the BE layer may be cutting cacheable hits to the appserver in half. but that might not really be true, and doesn't jive well with the lack of FE hitrate increase when experimenting with 2x and 4x sizes there. [22:57:27] (03CR) 10jenkins-bot: [V: 04-1] monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 (owner: 10Matanya) [22:57:52] (03PS1) 10Yuvipanda: postgresql: Auto determine pgversion for user creation [puppet] - 10https://gerrit.wikimedia.org/r/225262 [22:57:52] unless there really is a very long tail of objects which see decent request rates, don't change much over 30days, but are too long-tail to fit well in the FE cache at all. [22:57:53] akosiaris: ^ [23:00:05] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150716T2300). Please do the needful. [23:00:26] (03CR) 10Alexandros Kosiaris: [C: 031] postgresql: Auto determine pgversion for user creation [puppet] - 10https://gerrit.wikimedia.org/r/225262 (owner: 10Yuvipanda) [23:00:28] (although the dataset size in question makes me think that if that's even the case, it's probably solvable by other means: e.g. 
we've got cases where we're redundantly caching stuff over and over due to mishaps with non-canonical access or query args or whatever) [23:00:28] (03PS2) 10Yuvipanda: postgresql: Auto determine pgversion for user creation [puppet] - 10https://gerrit.wikimedia.org/r/225262 [23:00:34] (03CR) 10Yuvipanda: [C: 032 V: 032] postgresql: Auto determine pgversion for user creation [puppet] - 10https://gerrit.wikimedia.org/r/225262 (owner: 10Yuvipanda) [23:00:50] nothing on the swat list... [23:01:37] the thing I dislike with both ESI and SSI is that it's pretty hard to work with for front-end developers <-- https://www.varnish-software.com/static/book/Content_Composition.html#testing-esi-without-varnish [23:01:38] bblack: yeah, 13T is definitely larger than all current content in gzip blobs [23:01:58] gwicke: how big is that dataset size, relative to e.g. ~24G FE caches? [23:02:59] bblack: in Cassandra all current revisions compress to about 1T [23:03:08] ok [23:03:23] that's with deflate, and includes data-parsoid metadata [23:03:24] well that's still bigger than we'd ever put in FE ram cache the way we do it today [23:03:56] just the HTML is about 600G [23:04:16] ori: yeah, also https://github.com/MrSwitch/esi [23:04:21] we've talked about making the FE ram cache much larger by chashing from nginx->varnish, too, but there's concerns that would (a) massively increase cache<->cache machine network traffic and (b) not deal well with our rare very-hot-URL cases, like celebrity death traffic spikes. [23:04:47] ori: I am wondering though if ESI is the right approach, when we can use JS directly [23:05:43] (the idea there is that with requests round-robin to varnish-fe, our effective FE ram cache is the average of all nodes' cache sizes. 
chashing nginx->varnish would make it the sum, and then we really could get it up into the low-terabytes range)
[23:05:51] also, simulating the subtleties of how fragments are updated etc isn't so simple
[23:08:20] bblack: could be interesting if combined with a smallish in-memory cache in nginx
[23:08:29] for the really hot items
[23:09:45] yeah
[23:09:53] but again, analytics :P
[23:10:02] *nod*
[23:10:50] it would be basically moving from our current 2-layer model of (nginx+v-fe)->v-be to a 2-layer model of nginx->v-fe
[23:11:18] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459166 (mobrovac) FTR, a local request for the same resource returns: ``` $ curl -v 'localhost:1970/api?format=mediawiki&search=PMC9999999' * Hostname was NOT found in DNS cache * Tryi...
[23:11:49] the winning thing that keeps the 2layer model alive is that round-robin makes a lot of sense for absorbing URL-specific load/spikes on cacheable content, and chashing makes a lot of sense for getting large total cache dataset sizes. putting one in front of the other gets you the best of both.
[23:13:24] at the second layer request rates we could use a lot of other tech too, though
[23:13:34] yeah
[23:14:02] one of the downsides, though, is each cache layer uses relatively information-poor algorithms like LRU, and they can't see the impact that they have on each other to make better decisions.
[23:14:26] like storage or page composition servers running JS code
[23:15:22] ATS is at least a little better in that they have better ways to tier and cluster ATS servers together, ICP support, better-than-LRU algorithms, and internally-tiered caching for using ram+SSD effectively together in one cache
[23:16:20] better than LRU being: https://docs.trafficserver.apache.org/en/latest/arch/cache/ram-cache.en.html
[23:18:26] regardless of ATS, even with varnish today, you could make the argument that we could switch our 2-layer model to stay within one machine and not use chashing
[23:18:56] the boxes all have ~720GB of SSD, which is ~ the size of the HTML discussed earlier.
[23:19:39] there's a lot of dilution with chrome around it, vary etc
[23:20:03] yeah, but still, we'd likely get a lot of it
[23:20:38] the question is how much impact does it have on total hitrate going from a model where our cache storage tiers are 24G then 13T, vs 24G then 720G
[23:21:21] we really have no idea how much of that 13T is just wasted/useless from a hitrate perspective.
[23:22:46] I think it's pretty clear that increasing cache size further won't help much
[23:25:58] well for the frontends sure, but I really (under the current model) can't take the frontends much beyond ~128G or so anyways in most cases.
[23:26:30] there's a lot of question-mark in my mind still about how much further reduction we do see from the 13T 2nd-layer. I can't upsize the FE cache far enough to simulate that heh.
[23:27:21] (PS13) Alexandros Kosiaris: monitoring: detect saturation of nf_conntrack table [puppet] - https://gerrit.wikimedia.org/r/223560 (owner: Matanya)
[23:27:23] we could experiment on the idea of keeping the 2nd layer in one box without a ton of change though (the 720G total effective 2nd layer idea)
[23:27:58] by just switching from chash to roundrobin at one of the datacenters and seeing if outbound requests to the applayer stay notably higher (after the initial spike to change the contents of all the BE caches).
[23:28:16] we know the spike would be within reasonable limits (~2x reqrate), so it's not going to completely melt appservers either.
[23:28:29] yeah
[23:28:41] (PS14) Alexandros Kosiaris: monitoring: detect saturation of nf_conntrack table [puppet] - https://gerrit.wikimedia.org/r/223560 (owner: Matanya)
[23:28:48] (CR) Alexandros Kosiaris: [C: 2 V: 2] monitoring: detect saturation of nf_conntrack table [puppet] - https://gerrit.wikimedia.org/r/223560 (owner: Matanya)
[23:29:53] doing it at esams would have the bonus that we'd still have the chashed eqiad backends protecting the applayer, too. We'd have to observe the effect in terms of reqrate from esams-be -> eqiad-be.
[23:30:15] the idea that I'm currently intrigued about for page composition is https://phabricator.wikimedia.org/T106099
[23:31:27] yeah I donno, that's why outside the scope of where I'm focused on thinking at the moment :)
[23:32:16] if the server-side JS was still behind our outer cache infrastructure it might work. I'd be concerned about throwing nodejs at our raw traffic rates though.
[23:32:30] definitely second or even third layer
[23:33:23] s/why/way/ 2 lines up! :)
[23:33:46] anyways, I should go, I have things to sort out before I leave town tomorrow. I'll check back in here later tonight and tomorrow AM though.
[23:33:52] cya later :)
[23:33:58] kk, thanks for the chat!
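[editor's note] The round-robin vs chash trade-off discussed above (round-robin absorbs hot-URL spikes, chashing gets a larger effective cache) can be illustrated with a toy simulation. This is only a sketch under loud assumptions, not how varnish or nginx route requests: plain modulo on an integer URL id stands in for real consistent hashing, the node count, cache size and URL count are made up, and the workload (one hot URL taking about half the traffic, the rest uniform) is hypothetical.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def lookup(self, key):
        """Return True on a hit; on a miss, admit the key and evict the LRU entry if over capacity."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return False


def simulate(requests, n_nodes, capacity, policy):
    """Route requests to n_nodes LRU caches; return (hit rate, busiest node's share of requests)."""
    nodes = [LRUCache(capacity) for _ in range(n_nodes)]
    load = [0] * n_nodes
    hits = 0
    for i, url_id in enumerate(requests):
        if policy == "roundrobin":
            node = i % n_nodes       # spread requests evenly, ignoring the URL
        else:
            node = url_id % n_nodes  # ASSUMPTION: modulo stands in for consistent hashing
        load[node] += 1
        hits += nodes[node].lookup(url_id)
    return hits / len(requests), max(load) / len(requests)


def make_workload(n_requests, n_urls, seed=0):
    """Hypothetical workload: URL 0 is 'hot' (about half of all requests), the rest are uniform."""
    rng = random.Random(seed)
    return [0 if rng.random() < 0.5 else rng.randrange(n_urls) for _ in range(n_requests)]


if __name__ == "__main__":
    reqs = make_workload(20000, 40)
    for policy in ("roundrobin", "chash"):
        hit_rate, busiest = simulate(reqs, n_nodes=4, capacity=10, policy=policy)
        print(f"{policy}: hit rate {hit_rate:.3f}, busiest node share {busiest:.3f}")
```

With these made-up numbers, the hashed layout lets the four 10-object caches hold all 40 URLs between them, so almost every request hits, but the node that owns the hot URL takes well over half the traffic. Round-robin keeps every node at exactly a quarter of the load while each cache redundantly holds the same popular objects, which is the "average vs sum" effect described in the chat.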
[23:42:13] (PS11) Smalyshev: Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/223663
[23:45:30] (PS3) Alexandros Kosiaris: firewall: add ferm rule for kafka [puppet] - https://gerrit.wikimedia.org/r/223534 (owner: Matanya)
[23:45:36] (CR) Alexandros Kosiaris: [C: 2 V: 2] firewall: add ferm rule for kafka [puppet] - https://gerrit.wikimedia.org/r/223534 (owner: Matanya)
[23:49:57] operations, Graphoid, Services, Patch-For-Review: Confine Graphoid with firejail - https://phabricator.wikimedia.org/T103095#1459208 (Yurik)