[00:04:29] (PS1) BBlack: fix minor regex bug in HTTPS redirect conditional [puppet] - https://gerrit.wikimedia.org/r/223206
[00:04:31] (PS1) BBlack: HSTS: preload for all certs except wiki[pm]edia.org [puppet] - https://gerrit.wikimedia.org/r/223207
[00:04:55] (CR) BBlack: [C: 2 V: 2] fix minor regex bug in HTTPS redirect conditional [puppet] - https://gerrit.wikimedia.org/r/223206 (owner: BBlack)
[00:05:35] Krenair: has the table been created?
[00:05:41] no
[00:05:47] but we can do that
[00:06:33] (PS2) BBlack: HSTS: preload for all certs except wiki[pm]edia.org [puppet] - https://gerrit.wikimedia.org/r/223207 (https://phabricator.wikimedia.org/T104244)
[00:06:56] Krenair: should be good then
[00:07:41] mwscript extensions/WikimediMaintenance/createExtensionTables.php --wiki=eswiki WikiLove
[00:07:51] (CR) BBlack: [C: 2 V: 2] HSTS: preload for all certs except wiki[pm]edia.org [puppet] - https://gerrit.wikimedia.org/r/223207 (https://phabricator.wikimedia.org/T104244) (owner: BBlack)
[00:09:13] legoktm: yeah, attachment status is cached, did not think that one through
[00:09:25] can you review https://gerrit.wikimedia.org/r/223209 ?
[00:09:44] mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=eswiki WikiLove
[00:10:51] tgr: lgtm, should we be checking the return value of CentralAuthHooks::attemptAddUser()?
[00:12:42] legoktm: I can add that if you prefer
[00:12:53] I think that would be good yeah
[00:13:04] wanted to keep as simple as possible because vagrant master is borked and I can't test anything locally :(
[00:14:17] ok, can be done in a follow up then
[00:14:21] are you going to deploy it now?
[00:14:44] updated
[00:15:01] yes, I'll backport if you merge it
[00:15:17] +2'd again
[00:18:03] everything is on 1.26wmf12 now, right?
[00:18:10] still not used to the new cadence
[00:20:13] yes
[00:21:58] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1432590 (BBlack)
[00:26:54] operations, RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1432595 (GWicke) I did see 1005 OOM twice in quick succession earlier. It has been running fine since.
[00:31:03] are we autocreating submodule updates now?
[00:33:23] tgr, gerrit is, yep
[00:33:37] cool
[00:33:48] just have to do the backport, and then you can go straight onto tin and pull it
[00:34:14] probably should be announced though, I have been scratching my head for 20 mins why I cannot create it manually
[00:35:02] I did update the docs
[00:35:07] eventually
[00:35:08] Krenair: if you are still deploying, can you pull & sync extensions/CentralAuth/includes/CreateLocalAccountJob.php ?
[00:35:42] ok
[00:35:44] worth an email announcement, I mean
[00:36:37] at least for me it took a while to realize what's going on after being unable to create a commit
[00:36:56] might just be me being clumsy with git
[00:37:55] !log krenair Synchronized php-1.26wmf12/extensions/CentralAuth/includes/CreateLocalAccountJob.php: https://gerrit.wikimedia.org/r/#/c/223211/ (duration: 00m 13s)
[00:38:00] Logged the message, Master
[00:38:15] tgr, ^
[00:38:44] besides operations/mediawiki-config, where else would I look for any FundraisingTranslateWorkflow globals defined on meta?
[00:39:17] They should be there
[00:39:22] Or you can use eval.php and var_dump()
[00:41:00] thanks, Krenair!
[00:41:07] verified, working
[00:41:15] !log upgrade db1041 trusty
[00:41:20] Logged the message, Master
[01:03:08] PROBLEM - puppet last run on wtp1014 is CRITICAL Puppet has 1 failures
[01:11:17] (PS1) Springle: upgrade db1041 to trusty + mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/223216
[01:12:06] !log Re-pooled mw1152 at 20:46 UTC, did not log it then.
[01:12:11] Logged the message, Master
[01:16:11] (PS4) BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - https://gerrit.wikimedia.org/r/222839
[01:19:32] (CR) BBlack: [C: 2] Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - https://gerrit.wikimedia.org/r/222839 (owner: BBlack)
[01:19:48] RECOVERY - puppet last run on wtp1014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:34:47] (PS2) Springle: upgrade db1041 to trusty + mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/223216
[01:35:42] (CR) Springle: [C: 2] upgrade db1041 to trusty + mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/223216 (owner: Springle)
[01:41:00] (PS1) Dzahn: annualreport: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/223220
[01:42:23] (PS2) Dzahn: annualreport: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/223220 (https://phabricator.wikimedia.org/T104936)
[01:42:31] (PS3) Dzahn: annualreport: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/223220 (https://phabricator.wikimedia.org/T104936)
[01:42:41] (PS1) Dzahn: annualreport: add role on bromine [puppet] - https://gerrit.wikimedia.org/r/223221 (https://phabricator.wikimedia.org/T104936)
[01:46:26] operations, Traffic, Patch-For-Review: move service annualreport from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1432753 (Krenair)
[01:46:31] (PS1) Dzahn: misc-web varnish: switch annualreport to bromine [puppet] - https://gerrit.wikimedia.org/r/223222 (https://phabricator.wikimedia.org/T104936)
[01:47:27] mutante, why are those commits separate?
[01:48:23] Krenair: because i need to merge them separately
[01:48:55] just one reason: apache config changes from 2.2 to 2.4, so if i want to update it i want to first remove the role from old host, change it, then switch
[01:49:04] or i break old before the switch
[01:50:08] small and separate almost always turned out better
[01:57:11] (PS1) Dzahn: annualreport: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/223223 (https://phabricator.wikimedia.org/T104936)
[01:57:56] PROBLEM - puppet last run on cp4007 is CRITICAL Puppet has 1 failures
[02:01:46] RECOVERY - puppet last run on cp4007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:02:25] (PS1) Dzahn: transparency report: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/223226
[02:03:24] (PS2) Dzahn: transparency report: update Apache config for 2.4 [puppet] - https://gerrit.wikimedia.org/r/223226 (https://phabricator.wikimedia.org/T104937)
[02:04:47] operations, Traffic, Patch-For-Review: move transparency report from zirconium to bromine - https://phabricator.wikimedia.org/T104937#1432794 (Krenair)
[02:06:18] (PS1) Dzahn: misc-web varnish: switch transparency to bromine [puppet] - https://gerrit.wikimedia.org/r/223227 (https://phabricator.wikimedia.org/T104937)
[02:07:41] operations, Traffic, Patch-For-Review: move annual report from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1432799 (Dzahn)
[02:09:13] (PS1) Dzahn: transparency: remove role from zirconium [puppet] - https://gerrit.wikimedia.org/r/223228 (https://phabricator.wikimedia.org/T104937)
[02:10:40] (PS1) Dzahn: transparency: add role on bromine [puppet] - https://gerrit.wikimedia.org/r/223229 (https://phabricator.wikimedia.org/T104937)
[02:11:35] (PS1) Dzahn: Revert "Revert "logstash: switch to ganglia_new"" [puppet] - https://gerrit.wikimedia.org/r/223230
[02:11:41] (CR) jenkins-bot: [V: -1] Revert "Revert "logstash: switch to ganglia_new"" [puppet] - https://gerrit.wikimedia.org/r/223230 (owner: Dzahn)
[02:13:54] (PS2) Dzahn: Revert "Revert "logstash: switch to ganglia_new"" [puppet] - https://gerrit.wikimedia.org/r/223230
[02:14:22] (PS3) Dzahn: logstash: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/223230
[02:16:06] (PS4) Dzahn: logstash: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/223230
[02:16:48] (PS5) Dzahn: logstash: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/223230
[02:19:27] (PS6) Dzahn: logstash: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/223230 (https://phabricator.wikimedia.org/T93776)
[02:20:02] (PS1) Dzahn: ganglia: add aggregator for ulsfo on bast4001 [puppet] - https://gerrit.wikimedia.org/r/223231 (https://phabricator.wikimedia.org/T93776)
[02:21:14] Ops-Access-Requests, operations, Patch-For-Review: MediaWiki deployment shell access request - https://phabricator.wikimedia.org/T104546#1432830 (Dzahn)
[02:22:38] (CR) Dzahn: "@_joe: how about now for the "80% firewall" goal?" [puppet] - https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: Dzahn)
[02:24:36] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 06m 09s)
[02:24:42] Logged the message, Master
[02:24:55] (CR) Dzahn: "do any of these need any ports open besides http?" [puppet] - https://gerrit.wikimedia.org/r/194802 (owner: Dzahn)
[02:25:30] (CR) Dzahn: "first do ms1001, right @apergos?" [puppet] - https://gerrit.wikimedia.org/r/205903 (owner: Dzahn)
[02:25:49] (CR) Dzahn: "@apergos you said it's not in use, afair?" [puppet] - https://gerrit.wikimedia.org/r/205904 (owner: Dzahn)
[02:27:55] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-07 02:27:55+00:00
[02:28:00] Logged the message, Master
[02:28:35] (PS1) Dzahn: put base::firewall on neptunium (LDAP) [puppet] - https://gerrit.wikimedia.org/r/223232
[02:28:48] (PS1) BBlack: ciperhsuites: refactor DHE support [puppet] - https://gerrit.wikimedia.org/r/223233
[02:29:28] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100637s 100000s)
[02:29:56] (PS1) Dzahn: Revert "contint: Don't include base firewall by default" [puppet] - https://gerrit.wikimedia.org/r/223234
[02:33:04] operations: track / reach the firewall goal - https://phabricator.wikimedia.org/T104939#1432838 (Dzahn) NEW a:Dzahn
[02:35:25] operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1432847 (Dzahn)
[02:35:27] operations, Interdatacenter-IPsec: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#1432849 (Dzahn)
[02:35:28] operations: track / reach the firewall goal - https://phabricator.wikimedia.org/T104939#1432846 (Dzahn)
[02:35:59] (PS2) Dzahn: Revert "contint: Don't include base firewall by default" [puppet] - https://gerrit.wikimedia.org/r/223234 (https://phabricator.wikimedia.org/T104939)
[02:36:14] Commit Message
[02:36:32] (PS3) Dzahn: dumps: put base::firewall on dataset1001 [puppet] - https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939)
[02:36:46] (PS4) Dzahn: dumps: put base::firewall on ms1001 [puppet] - https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939)
[02:37:25] (PS2) Dzahn: put base::firewall on neptunium (LDAP) [puppet] - https://gerrit.wikimedia.org/r/223232 (https://phabricator.wikimedia.org/T104939)
[02:37:36] (PS4) Dzahn: add base::firewall on codfw redis nodes [puppet] - https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898)
[02:37:43] (CR) jenkins-bot: [V: -1] add base::firewall on codfw redis nodes [puppet] - https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: Dzahn)
[02:37:52] (PS2) Dzahn: put base::firewall on netmon1001 [puppet] - https://gerrit.wikimedia.org/r/194802 (https://phabricator.wikimedia.org/T104939)
[02:44:25] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1432887 (BBlack)
[02:44:27] operations, Traffic, Patch-For-Review: Sort out DHE for Forward Secrecy w/ older clients - https://phabricator.wikimedia.org/T104281#1432885 (BBlack) Open>Resolved a:BBlack
[02:48:34] operations: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1432891 (Aklapper) Was about to update https://wikitech.wikimedia.org/wiki/Zirconium but https://wikitech.wikimedia.org/wiki/Ganeti does not look like a page to add static-bz now?
[03:02:21] operations: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1432898 (Krenair) I think we'd want a page for Bromine instead, with a note that it runs in Ganeti and hosts static-bugzilla.
[03:19:11] (CR) BBlack: [C: 2] "Validated in compiler: no-op on the apaches, desired cipher re-ordering on the nginx's" [puppet] - https://gerrit.wikimedia.org/r/223233 (owner: BBlack)
[03:34:14] (PS1) 20after4: Increment deployment stats after sync-wikiversions [tools/scap] - https://gerrit.wikimedia.org/r/223236 (https://phabricator.wikimedia.org/T104635)
[03:34:37] (CR) jenkins-bot: [V: -1] Increment deployment stats after sync-wikiversions [tools/scap] - https://gerrit.wikimedia.org/r/223236 (https://phabricator.wikimedia.org/T104635) (owner: 20after4)
[03:35:34] (PS1) BBlack: switch tlsproxy back to "compat"; dhe is now implicit [puppet] - https://gerrit.wikimedia.org/r/223237
[03:35:36] (PS1) BBlack: remove deprecated "compat-dhe" (no users) [puppet] - https://gerrit.wikimedia.org/r/223238
[03:41:45] bblack: nice
[03:47:58] :)
[03:48:21] we have a little CPU to burn, should up the AEAD% a smidge :)
[03:51:03] (CR) BBlack: [C: 2] switch tlsproxy back to "compat"; dhe is now implicit [puppet] - https://gerrit.wikimedia.org/r/223237 (owner: BBlack)
[03:54:36] PROBLEM - Restbase root url on restbase1005 is CRITICAL - Socket timeout after 10 seconds
[03:55:47] (my prediction so far is ~52% -> ~59%, for this time-of-day)
[03:56:50] (CR) BBlack: [C: 2] remove deprecated "compat-dhe" (no users) [puppet] - https://gerrit.wikimedia.org/r/223238 (owner: BBlack)
[03:57:41] (PS9) BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654)
[04:04:07] (PS10) BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654)
[04:04:09] (PS1) BBlack: Remove unused r::c::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/223239
[04:04:24] (CR) BBlack: [C: 2 V: 2] Remove unused r::c::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/223239 (owner: BBlack)
[04:08:09] (CR) BBlack: [C: 2] "Compiler-validated, works!" [puppet] - https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) (owner: BBlack)
[04:12:24] nice, puppet agent found something the compiler doesn't check for! :)
[04:12:27] (Cron[update-ocsp-all] => Class[Tlsproxy::Ocsp_updater] => Tlsproxy::Ocsp_stapler[unified] => Exec[unified-create-ocsp] => Service[nginx] => Cron[update-ocsp-all])
[04:15:49] (PS1) BBlack: fix for ce421097: kill dep loop [puppet] - https://gerrit.wikimedia.org/r/223241
[04:16:32] (CR) BBlack: [C: 2] fix for ce421097: kill dep loop [puppet] - https://gerrit.wikimedia.org/r/223241 (owner: BBlack)
[04:18:33] yeah I guess there's a lot of things the compiler can't check once you get into script executions :/
[04:20:05] puppet needs a map()
[04:24:16] (PS1) BBlack: fix for ce421097: .crt suffix for $ocsp_args [puppet] - https://gerrit.wikimedia.org/r/223242
[04:24:57] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (6876 100000s)
[04:24:59] (CR) BBlack: [C: 2] fix for ce421097: .crt suffix for $ocsp_args [puppet] - https://gerrit.wikimedia.org/r/223242 (owner: BBlack)
[04:26:17] PROBLEM - puppet last run on cp2024 is CRITICAL Puppet has 1 failures
[04:26:28] bd808: "cxserver::logstash_host": deployment-logstash1.deployment-prep.eqiad.wmflabs - need to change to -logstash2?
[04:27:28] heh, I guess salt missed a host for agent disable :P
[04:27:34] go salt!
[04:30:49] (PS1) BBlack: fix for ce421097: quote echo in update cron [puppet] - https://gerrit.wikimedia.org/r/223243
[04:31:09] (CR) BBlack: [C: 2] fix for ce421097: quote echo in update cron [puppet] - https://gerrit.wikimedia.org/r/223243 (owner: BBlack)
[04:31:21] (CR) BBlack: [V: 2] fix for ce421097: quote echo in update cron [puppet] - https://gerrit.wikimedia.org/r/223243 (owner: BBlack)
[04:43:06] RECOVERY - puppet last run on cp2024 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[04:47:34] <_joe_> bblack: circulare dependencies like that don't get evaluated at compile-time
[04:49:28] 3 followup fixes -> I should stop for the day :)
[04:56:56] operations: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1432976 (Dzahn) >>! In T101734#1432898, @Krenair wrote: > I think we'd want a page for Bromine instead, with a note that it runs in Ganeti and hosts static-bugzilla. https://wikitech.wikimedia.org/wiki/Bromi...
[05:01:28] (CR) Ori.livneh: [C: -1] "Purge frequency and purge rate already default to 4096 and -1, respectively, and we set ExpireOnSets in hhvm.pp. See I87047dac97cb3bed0dff" [puppet] - https://gerrit.wikimedia.org/r/223000 (https://phabricator.wikimedia.org/T104769) (owner: Giuseppe Lavagetto)
[05:22:53] (PS1) Dzahn: tmh: add base::firewall [puppet] - https://gerrit.wikimedia.org/r/223244 (https://phabricator.wikimedia.org/T104939)
[05:23:06] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1432981 (BBlack)
[05:27:11] (PS1) Chmarkine: Remove www.email.donate.wikimedia.org from DNS [dns] - https://gerrit.wikimedia.org/r/223245 (https://phabricator.wikimedia.org/T102827)
[05:31:12] (PS1) Dzahn: protactinium: add base::firewall [puppet] - https://gerrit.wikimedia.org/r/223246
[05:34:49] operations, Traffic, HTTPS, Mobile: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1432996 (BBlack) NEW
[05:35:07] operations, Traffic, HTTPS, Mobile: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1433004 (BBlack)
[05:35:09] operations, Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1433003 (BBlack)
[05:35:32] operations, Traffic, HTTPS, Mobile: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1432996 (BBlack)
[05:35:34] operations, Traffic, HTTPS-by-default, Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1433005 (BBlack)
[05:44:17] (PS4) Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T104943)
[05:44:38] (PS5) Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T37540)
[05:45:56] operations, Patch-For-Review: track / reach the firewall goal - https://phabricator.wikimedia.org/T104939#1433026 (Dzahn)
[05:51:40] !log krinkle Synchronized php-1.26wmf12/extensions/WikiEditor/modules/jquery.wikiEditor.toolbar.js: I3e965dda1c4 (duration: 00m 12s)
[05:51:45] Logged the message, Master
[05:54:51] operations, Patch-For-Review: track / reach the firewall goal - https://phabricator.wikimedia.org/T104939#1433044 (Dzahn)
[05:56:14] (PS6) Krinkle: add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T37540) (owner: Dzahn)
[05:59:23] did anything change in SSL/TLS setup ?
[06:00:04] someone is reporting on WP:VP/T that his bot is suddenly getting DH key exchange errors.
[06:00:19] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#API_calls_just_starting_throwing_SSL.2FHTTPS_.28.3F.29_errors
[06:04:35] operations: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433059 (Dzahn) NEW
[06:06:54] operations, Traffic, Patch-For-Review: move transparency report from zirconium to bromine - https://phabricator.wikimedia.org/T104937#1433067 (Dzahn)
[06:06:55] operations, Traffic, Patch-For-Review: move annual report from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1433068 (Dzahn)
[06:06:57] operations: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1433069 (Dzahn)
[06:06:59] operations: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433066 (Dzahn)
[06:07:01] operations: move planet from zirconium to a ganeti VM - https://phabricator.wikimedia.org/T101730#1433070 (Dzahn)
[06:08:38] PROBLEM - puppet last run on install2001 is CRITICAL Puppet last ran 4 hours ago
[06:09:49] operations: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433073 (Dzahn) include role::wikimania_scholarships include role::bugzilla_static include role::transparency include role::grafana include role::iegreview include role::annu...
[06:11:00] operations: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433077 (Dzahn) p:Triage>Normal
[06:11:46] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail
[06:11:46] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 7 06:11:46 UTC 2015 (duration 11m 45s)
[06:11:51] Logged the message, Master
[06:16:15] operations, Icinga: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1433089 (Dzahn) NEW
[06:16:34] operations, Icinga: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1433096 (Dzahn) p:Triage>Normal
[06:16:49] operations, Icinga: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1433097 (Dzahn) a:Dzahn
[06:20:57] PROBLEM - puppet last run on mw2119 is CRITICAL puppet fail
[06:30:16] PROBLEM - puppet last run on db1051 is CRITICAL puppet fail
[06:30:17] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:31:07] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:32:58] operations, Traffic, Patch-For-Review: move annual report from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1433112 (akosiaris) >>! In T104936#1432744, @Dzahn wrote: > @akosiaris so i should put this on bromine as well? Yup. Seems like a perfect place for it
[06:34:26] PROBLEM - puppet last run on ms-fe1004 is CRITICAL Puppet has 1 failures
[06:35:48] PROBLEM - puppet last run on mw2145 is CRITICAL Puppet has 1 failures
[06:36:47] PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 1 failures
[06:36:47] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:36:48] PROBLEM - puppet last run on mw1150 is CRITICAL Puppet has 1 failures
[06:37:08] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 1 failures
[06:37:26] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:37:27] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures
[06:37:48] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures
[06:38:38] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures
[06:39:07] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures
[06:40:38] (Abandoned) Giuseppe Lavagetto: hhvm: enable apc items expiration on the canary appservers [puppet] - https://gerrit.wikimedia.org/r/223000 (https://phabricator.wikimedia.org/T104769) (owner: Giuseppe Lavagetto)
[06:41:07] RECOVERY - puppet last run on mw2119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:47] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:47] RECOVERY - puppet last run on db1051 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:49:17] RECOVERY - puppet last run on ms-fe1004 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:52:46] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:53:36] RECOVERY - puppet last run on mw1150 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:53:37] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:53:37] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:53:56] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:54:07] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:54:07] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:54:17] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:54:26] RECOVERY - puppet last run on mw2145 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:03] operations, HHVM, Patch-For-Review: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1433137 (Joe) So, the change was already applied precedingly. However, some testing showed me that when a key has expired it is shown in the apc dump as follows: ``` ...
[07:18:06] operations, HHVM, Patch-For-Review: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1433139 (Joe) As @MaxSem noticed, the key is set here: ``` https://github.com/wikimedia/mediawiki/blob/9771b003756bfe3825bf7427efca6393ed96597b/includes/resourceloader/ResourceL...
[07:19:25] (CR) DCausse: [C: 1] "Do we have our own repo for this plugin?" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/223202 (https://phabricator.wikimedia.org/T100500) (owner: EBernhardson)
[07:20:58] PROBLEM - puppet last run on eventlog2001 is CRITICAL puppet fail
[07:25:29] operations, MediaWiki-General-or-Unknown, HHVM, Patch-For-Review: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1433154 (Joe)
[07:37:47] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:37:59] <_joe_> win 31
[07:38:17] RECOVERY - Restbase root url on restbase1005 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.017 second response time
[07:49:19] (PS1) Filippo Giunchedi: install_server: switch to elasticsearch 1.6 [puppet] - https://gerrit.wikimedia.org/r/223251 (https://phabricator.wikimedia.org/T102008)
[07:57:35] operations, Graphite: Upgrade Graphite from 0.9.12 to 0.9.13 - https://phabricator.wikimedia.org/T104536#1433181 (fgiunchedi) indeed! thanks @Krinkle my plan is to upgrade it this week
[08:01:25] operations, Performance-Team: Investigate upgrading kernel on DB servers - https://phabricator.wikimedia.org/T104953#1433182 (Gilles) NEW
[08:03:57] operations, Traffic, HTTPS, Mobile: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1433193 (Chmarkine) How many requests to these domains there are in the log? `*.wap` was [[ https://phabricator.wikimedia.org/T18692 | deprecated ]] in early...
[08:08:25] operations, Performance-Team: Investigate upgrading kernel on DB servers - https://phabricator.wikimedia.org/T104953#1433199 (MoritzMuehlenhoff) We use Linux 3.19 as the kernel for jessie, so that would be a good oppurtunity to also move the machines from Ubuntu to Debian.
[08:18:16] yuvipanda: hashar has suggested you and me should have a pairing session about rubocop
[08:18:35] do you have 30-60 minutes this week?
[08:19:12] re https://phabricator.wikimedia.org/T102020
[08:20:16] PROBLEM - puppet last run on mw1059 is CRITICAL Puppet last ran 1 day ago
[08:23:17] PROBLEM - puppet last run on mw1115 is CRITICAL Puppet last ran 1 day ago
[08:23:57] RECOVERY - puppet last run on mw1059 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[08:26:09] operations, Performance-Team: Investigate upgrading kernel on DB servers - https://phabricator.wikimedia.org/T104953#1433206 (jcrespo) Open>declined a:jcrespo There will be an upgrade for sure, but we have very different deployment and issues than pinterest, and 3.13 already provides most of the f...
[08:27:32] !log set operations/puppet/cassandra git submodule repo as hidden
[08:27:37] Logged the message, Master
[08:30:56] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:56] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:11] <_joe_> what?
[08:31:23] <_joe_> is someone doing something with eeden?
[08:32:05] <_joe_> I can ssh to it
[08:43:44] (PS2) Filippo Giunchedi: cassandra: alternative metrics collector [puppet] - https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208)
[08:44:07] RECOVERY - puppet last run on mw1115 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:45:08] operations, RESTBase-Cassandra, Services, Patch-For-Review, RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1433233 (fgiunchedi) thanks @eevans! see related code review https://gerrit.wikimedia.org/r/#/c/223041/ I think we're missing only...
[08:46:41] operations, Service-Architecture: Create a nagios check script that can monitor multiple endpoints based on what the service exposes - https://phabricator.wikimedia.org/T94831#1433234 (Joe) p:Triage>High
[08:55:35] operations, Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1433240 (fgiunchedi) Open>Resolved I think this can be closed, reopen if appropriate
[08:55:47] RECOVERY - Host eeden is UPING OK - Packet loss = 0%, RTA = 88.33 ms
[08:56:04] operations, RESTBase-Cassandra: cassandra - enable Inter-node encryption - https://phabricator.wikimedia.org/T94132#1433243 (fgiunchedi) a:fgiunchedi
[08:56:27] RECOVERY - Host ns2-v4 is UPING OK - Packet loss = 0%, RTA = 88.56 ms
[08:57:52] operations, Continuous-Integration-Isolation, Patch-For-Review: Backport python-diskimage-builder 0.1.46 from testing to jessie-wikimedia - https://phabricator.wikimedia.org/T102880#1433252 (hashar) Open>Resolved I have upgraded the package on labnodepool1001 to `0.1.46-1+wmf1`. Not sure why we h...
[08:59:09] (Abandoned) Hashar: nodepool: add diskimage 'devuser' element [puppet] - https://gerrit.wikimedia.org/r/220446 (https://phabricator.wikimedia.org/T102880) (owner: Hashar)
[09:04:14] operations, Continuous-Integration-Isolation, Patch-For-Review: Backport python-diskimage-builder 0.1.46 from testing to jessie-wikimedia - https://phabricator.wikimedia.org/T102880#1433269 (MoritzMuehlenhoff) >>! In T102880#1433252, @hashar wrote: > I have upgraded the package on labnodepool1001 to `0...
[09:17:27] (03PS12) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) [09:18:28] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [09:18:34] (03CR) 10Hashar: "Added dependency on package uuid-runtime for python-diskimage-builder (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=791655)" [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [09:19:56] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:19] <_joe_> akosiaris: ^^ again [09:22:27] RECOVERY - Host eeden is UPING OK - Packet loss = 0%, RTA = 89.84 ms [09:23:07] RECOVERY - Host ns2-v4 is UPING OK - Packet loss = 0%, RTA = 88.11 ms [09:23:19] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.YXB1aLGN/mnt is not accessible: Permission denied [09:25:08] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [09:27:46] PROBLEM - puppet last run on eeden is CRITICAL puppet fail [09:29:57] RECOVERY - puppet last run on install2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:35] 6operations: Ferm rules for postgres roles / labsdb - https://phabricator.wikimedia.org/T104960#1433328 (10MoritzMuehlenhoff) 3NEW [09:35:38] 6operations: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1433344 (10MoritzMuehlenhoff) 3NEW [09:41:02] RECOVERY - puppet last run on eeden is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [09:43:13] 6operations: Ferm rules for swift - https://phabricator.wikimedia.org/T104965#1433380 (10MoritzMuehlenhoff) 3NEW [09:50:10] 6operations: Ferm rules for swift - https://phabricator.wikimedia.org/T104965#1433398 (10MoritzMuehlenhoff) [09:50:14] 6operations: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1433400 (10MoritzMuehlenhoff) [09:50:16] 6operations: Ferm rules for 
postgres roles / labsdb - https://phabricator.wikimedia.org/T104960#1433401 (MoritzMuehlenhoff)
[09:53:44] operations: Ferm rules for parsoid / wtp* hosts - https://phabricator.wikimedia.org/T104966#1433411 (MoritzMuehlenhoff) NEW
[09:54:09] zeljkof: hi! probably not this week, sorry. maybe after wikimania?
[09:54:11] operations, Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (hashar) NEW
[09:54:30] yuvipanda: no problem, we can pair there too
[09:54:40] it should take 30-60 minutes
[09:54:50] or after wikimania
[09:55:00] operations: Ferm rules for swift - https://phabricator.wikimedia.org/T104965#1433430 (fgiunchedi) overview of ports: for `::proxy` (the frontend): * 80/tcp for `swift-proxy` from everywhere * 11211/tcp for `memcache` from other swift frontends for `::storage` (the backend): * 6000/tcp for `swift-object-serve...
[09:55:31] zeljkof: oooh, cool, yes we can do that
[09:56:00] yuvipanda: great, see you there
[09:57:34] operations, Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433433 (hashar)
[10:10:48] operations: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1433445 (MoritzMuehlenhoff) NEW
[10:12:06] operations: Ferm rules for image scalers - https://phabricator.wikimedia.org/T104969#1433452 (MoritzMuehlenhoff) NEW
[10:12:36] operations: Ferm rules for video scalers - https://phabricator.wikimedia.org/T104970#1433459 (MoritzMuehlenhoff) NEW
[10:12:45] operations, Continuous-Integration-Isolation, Nodepool: Bump our Nodepool package to 1.0.0 - https://phabricator.wikimedia.org/T104971#1433466 (hashar) NEW
[10:13:13] operations: Ferm rules for job runners - https://phabricator.wikimedia.org/T104972#1433473 (MoritzMuehlenhoff) NEW
[10:13:42] operations: Ferm rules
for video scalers - https://phabricator.wikimedia.org/T104970#1433482 (MoritzMuehlenhoff)
[10:13:44] operations: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1433484 (MoritzMuehlenhoff)
[10:13:46] operations: Ferm rules for image scalers - https://phabricator.wikimedia.org/T104969#1433483 (MoritzMuehlenhoff)
[10:20:55] mobrovac: we've got a failure email from catchpoint about restbase, recovered now and lasted ~10m, I can't find anything obviously wrong, still looking
[10:22:46] godog: restarted rb a couple of mins ago
[10:23:38] godog: catchpoint said '1 x Server responded with a 40X or 50X response code. [50014]'
[10:23:38] mobrovac: ah, this is from the top of the hour btw
[10:24:42] godog: but indeed, there was a 10-min window where the number of reqs was lower than usual
[10:25:25] yeah it picked up a 500
[10:31:12] operations, Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1433508 (fgiunchedi) NEW
[10:31:29] operations, Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1433515 (fgiunchedi) p:Normal>Low
[10:35:02] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 700.247025342
[10:41:54] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused
[10:42:23] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[10:42:43] PROBLEM - puppet last run on cp3033 is CRITICAL puppet fail
[10:44:41] operations: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1433545 (MoritzMuehlenhoff) NEW
[10:49:18] !log restarted cassandra on restbase1005, mutations through the roof
[10:49:22] Logged the message,
Master
[10:49:53] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[10:51:13] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.007 second response time on port 9042
[10:55:43] godog: yeah, the process also died there :)
[10:59:33] RECOVERY - puppet last run on cp3033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:00:13] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.x3e19kl0/mnt/tmp/ccache is not accessible: Permission denied
[11:02:03] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[11:04:18] operations: Ferm rules for logstash - https://phabricator.wikimedia.org/T104964#1433575 (Aklapper)
[11:05:17] labnodepool1001 is me
[11:09:38] operations: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1433588 (MoritzMuehlenhoff)
[11:09:39] operations, Interdatacenter-IPsec: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#1433589 (MoritzMuehlenhoff)
[11:11:56] Who spoiled coffee? :)
[11:12:14] Request: GET http://en.wikipedia.org/wiki/Duncan_Sandys, from 10.20.0.109 via cp1067 cp1067 ([10.64.0.104]:3128), Varnish XID 3975042064
[11:13:12] operations: Track systems/roles for which intentionally no firewall rules are applied - https://phabricator.wikimedia.org/T104958#1433598 (Aklapper)
[11:13:32] hrm, what's going on?
[11:13:57] big 503 spike
[11:15:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 38.46% of data above the critical threshold [500.0]
[11:15:10] Ah.
[11:17:23] PROBLEM - HHVM queue size on mw1246 is CRITICAL 37.50% of data above the critical threshold [80.0]
[11:17:43] I see tons of dberrors for connecting to db2029.codfw.wmnet but this is most probably unrelated (it's all codfw)
[11:18:02] damn you logstash
[11:18:31] Error connecting to 10.64.32.25: Too many connections
[11:18:32] PROBLEM - HHVM queue size on mw1088 is CRITICAL 37.50% of data above the critical threshold [80.0]
[11:18:34] Error connecting to 10.64.48.15: Too many connections
[11:18:38] Error connecting to 10.64.16.28: Too many connections
[11:18:42] PROBLEM - HHVM busy threads on mw1252 is CRITICAL 37.50% of data above the critical threshold [115.2]
[11:18:52] PROBLEM - HHVM busy threads on mw1246 is CRITICAL 37.50% of data above the critical threshold [115.2]
[11:18:56] operations, Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433613 (Aklapper)
[11:19:02] PROBLEM - HHVM busy threads on mw1039 is CRITICAL 37.50% of data above the critical threshold [86.4]
[11:19:03] PROBLEM - HHVM busy threads on mw1236 is CRITICAL 42.86% of data above the critical threshold [115.2]
[11:19:22] RECOVERY - HHVM queue size on mw1246 is OK Less than 30.00% above the threshold [10.0]
[11:19:23] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused
[11:19:32] gj mw1246
[11:20:02] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[11:20:32] RECOVERY - HHVM queue size on mw1088 is OK Less than 30.00% above the threshold [10.0]
[11:20:33] RECOVERY - HHVM busy threads on mw1252 is OK Less than 30.00% above the threshold [76.8]
[11:20:52] RECOVERY - HHVM busy threads on mw1246 is OK Less than 30.00% above the threshold [76.8]
[11:21:03] RECOVERY - HHVM busy threads on mw1039 is OK Less than 30.00% above the threshold [57.6]
[11:21:03] RECOVERY - HHVM busy threads on
mw1236 is OK Less than 30.00% above the threshold [76.8]
[11:21:33] operations: Ferm rules for MX mail servers - https://phabricator.wikimedia.org/T104979#1433623 (MoritzMuehlenhoff) NEW
[11:22:03] operations, ops-codfw, Incident-20150617-LabsNFSOutage: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1433630 (Aklapper) p:Unbreak!>High (New controller in place hence decreasing priority of this task)
[11:22:03] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[11:22:24] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[11:22:28] jynus: hey, can you help figure out why we had a db1055 query spike?
[11:22:43] jynus: https://tendril.wikimedia.org/host/view/db1055.eqiad.wmnet/3306
[11:22:44] operations: Ferm rules for logstash - https://phabricator.wikimedia.org/T104964#1433632 (MoritzMuehlenhoff)
[11:22:52] also, what's going on with cassandra ffs?
[11:22:55] mobrovac?
[11:23:42] operations: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1433636 (MoritzMuehlenhoff) NEW
[11:24:30] operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1433644 (MoritzMuehlenhoff)
[11:24:31] operations: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1433643 (MoritzMuehlenhoff)
[11:24:47] paravoid, checking
[11:25:06] "too many connections"
[11:25:15] yeah that much I got :)
[11:25:36] checking cassandra, looks like it died from heap exhaustion
[11:25:47] godog: oh hey, I forgot you're in this TZ again :)
[11:26:00] the url is not usual
[11:26:11] which url?
[11:26:13] operations, Engineering-Community: date/budget proposal for 2015 Ops Offsite - https://phabricator.wikimedia.org/T89023#1433647 (Qgil)
[11:26:18] "/wiki/97162131"
[11:26:25] where do you see that?
[11:26:28] all those errors are for numeric urls
[11:26:35] paravoid: hehe yeah, came back yesterday :)
[11:26:38] on application side, we do not keep logs of queries
[11:26:42] it is impossible
[11:26:47] kibana, mostly
[11:26:57] !log restart cassandra on restbase1004, heap exhausted
[11:26:57] ishmael used to be good for this
[11:27:01] Logged the message, Master
[11:27:20] oh, I agree, but sean doesn't let me
[11:27:48] operations: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#1433655 (MoritzMuehlenhoff) NEW
[11:28:03] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[11:28:12] godog: we have a pb
[11:28:36] All host(s) tried for query failed. First host tried, 10.64.0.221: AuthenticationError: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM.
[11:28:44] * mobrovac checking rb1002
[11:29:19] the users are "cp1053.eqiad.wmnet"
[11:29:33] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042
[11:30:01] cp1052.eqiad.wmnet
[11:30:04] etc
[11:30:08] mobrovac: sigh, from rb?
[11:30:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[11:30:16] jynus: ?
[11:30:32] paravoid: https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError
[11:31:36] !log restbase restarted cassandra on rb1005
[11:31:40] Logged the message, Master
[11:32:07] did anything change in SSL/TLS setup yesterday ?
[11:32:10] jynus: yes, what?
[11:32:10] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#API_calls_just_starting_throwing_SSL.2FHTTPS_.28.3F.29_errors
[11:32:49] thedj: that would be https://phabricator.wikimedia.org/T104281
[11:33:01] thedj: they are probably using an ancient version of Java
[11:33:12] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[11:34:07] mobrovac: I saw high gc times for 1005 after my restart, was that the culprit?
[11:34:18] paravoid: i'll leave a note.
[11:34:32] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.007 second response time on port 9042
[11:34:39] thedj: thanks -- I'd do that but I'm currently still trying to triage the outage that happened a few minutes ago
[11:34:46] godog: could be, highly likely
[11:35:07] thedj: (but I can respond later)
[11:35:21] (and thanks for relaying, obviously :)
[11:35:33] paravoid: np.
[11:35:46] there was a high number of requests "/wiki/[0-9]", databases responded up to 5000 simultaneous requests, then requests got rejected
[11:36:16] (CR) JanZerebecki: [C: 1] added year into logging [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: Elee)
[11:36:22] "/wiki/[0-9]+"
[11:37:32] (CR) JanZerebecki: [C: -1] added year into logging (1 comment) [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: Elee)
[11:39:11] it did happen on all slaves
[11:39:19] on that shard
[11:39:38] hrm, those /wiki/[0-9] hits were coming from all over the place
[11:40:02] so, I would discard an issue with db or app servers
[11:40:14] I would say a DOS, self inflicted or external
[11:40:50] (the dbs usually will complain before the apps simply because there are fewer of them)
[11:41:30] I am trying to identify those on oxygen
[11:41:39] https://gdash.wikimedia.org/dashboards/reqsum/ looks fun
[11:41:41] (PS1) John F. Lewis: mail: ferm rules for mailman [puppet] - https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980)
[11:42:15] it's still not an insane amount of req/s, but it was probably wiki pageviews that are particularly expensive I guess
[11:42:17] moritzm: ^
[11:43:48] operations, ops-eqiad: Rack and Setup New LVS servers - https://phabricator.wikimedia.org/T104484#1433689 (Cmjohnson) Open>declined dual tickets created see t104458
[11:44:03] operations, ops-eqiad: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#1433692 (Cmjohnson) Brandon, I have the 6 new LVS servers on-site. I am curious on how you would like them racked. Our 2 current LVS's are in rows A and B. Do you want these to go Row C and D. I also...
[11:44:31] cmjohnson1: wrong ticket :)
[11:44:59] hah...thx
[11:45:04] JohnFLewis: thanks, I've added myself as reviewer, will have a look later on
[11:45:42] moritzm: okay, that should be all of them - though knowing the past, there could always be that sly one that exists for some weird reasons
[11:46:54] operations, ops-eqiad, Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1433697 (Cmjohnson) @BBlack, I have the 6 new LVS servers on-site. I am curious on how you would like them racked. Our 2 current LVS's are in rows A and B. Do you want these to go Row C and D...
[11:48:47] I cannot see those requests on oxygen, or I am looking at the wrong timestamp
[11:49:25] operations, ops-eqiad: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#1433699 (Cmjohnson) Until we have them all ready to be sent back to Cisco. I suggest we remove them from the racks and keep in storage until 100% of them can be returned or disposed.
[11:59:22] PROBLEM - Freshness of OCSP Stapling files on cp2024 is CRITICAL File /var/cache/ocsp/sni.m.wikisource.org.ocsp is more than 29100 secs old!
[11:59:46] hmmm
[12:00:04] I think that's because I never signed cp2024's salt key, so it missed salt last night
[12:01:03] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 13 failures
[12:04:59] operations: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#1433715 (MoritzMuehlenhoff)
[12:05:01] operations, Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1433716 (MoritzMuehlenhoff)
[12:05:03] RECOVERY - Freshness of OCSP Stapling files on cp2024 is OK
[12:05:03] operations: Ferm rules for MX mail servers - https://phabricator.wikimedia.org/T104979#1433717 (MoritzMuehlenhoff)
[12:12:53] Hi BeaverP2
[12:12:58] hi Krinkle
[12:13:18] Here I'm more likely to be corrected when I get it wrong.
[12:13:23] Than the other channel.
[12:13:29] nice
[12:13:30] so do the nodes of wikipedia talk to each other or is the database
[12:13:38] the logical communication layer
[12:13:47] by providing transaction isolation?
[12:13:52] BeaverP2: It depends.
[12:14:02] in what regard?
[12:14:03] BeaverP2: The main storage of wiki revisions and user information is in MySQL
[12:14:20] so different web servers don't talk to each other in that regard
[12:14:36] i see so nothing changed when compared to 2003 :-)
[12:14:41] however when it comes to caching, web servers send purges to the reverse caching proxies directly.
[12:14:53] ah like memcached?
[12:15:06] so like put and get regarding the memcached
[12:15:10] subsystem
[12:15:27] And things like wiki events for topic subscribers are emitted over UDP from the PHP process directly to the central event hub.
[12:15:45] and from there published through redis and websockets, as well as IRC.
[12:15:51] what is used to realize the hub?
[12:16:02] ah ok i see
[12:16:28] is the primary ui still HTML or does wikipedia have some sort of intermediate language?
[12:16:41] BeaverP2: There's also a large amount of work deferred from the main thread by offloading to job workers.
[12:16:42] i mean you have a REST api for instance right?
[12:17:04] so the web request may end, but then adds certain jobs to a queue that separate servers will run independently.
[12:17:35] This queue is also in redis I believe. Though the implementation is generic (for simple stock installs for third parties, it defaults to a database table for example)
[12:18:00] ok sounds like a good lightweight solution
[12:18:30] for example if you edit a template that is transcluded in a 1000 pages (e.g. the "Rolling Stones navigation" template on the bottom of all their album articles) there will be a 1000 jobs scheduled to recompile those articles.
[12:18:45] (PS1) Muehlenhoff: Enable packet filter for potassium [puppet] - https://gerrit.wikimedia.org/r/223282
[12:18:51] and it also happens on-demand when you view those articles, so the jobs have de-duplication built-in if by the time they run they are no longer needed.
[12:19:14] So you recompile the wikipages to html includes?
[12:19:23] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:23] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:28] more or less. They're not in html, but yeah.
[12:19:33] only if the page changes or some of the other resources used in the compilation
[12:19:39] in what kind of language are they?
[12:19:45] json or some form of AST?
[12:19:50] wikitext
[12:20:09] Although we do have a canonical HTML5/RDFa representation as well, which is preferred going forward.
[12:20:10] ah so a flat representation of the logical AST of the wiki page
[12:20:30] wikitext was the format to express wiki pages right?
[12:20:46] and during the wiki page compilation, we keep track of references so that we know what pages they include and which pages should result in invalidation.
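
The job scheduling and de-duplication described above can be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki's actual job queue: the class `DedupJobQueue`, the function `schedule_reparse`, and the `template_users` mapping (template name to the pages that transclude it) are all hypothetical names chosen for the example, and the queue lives in memory rather than in redis or a database table.

```python
from collections import OrderedDict


class DedupJobQueue:
    """In-memory FIFO queue that keeps at most one pending job per (type, page)."""

    def __init__(self):
        # Maps (job_type, page) -> job payload, in insertion order.
        self._pending = OrderedDict()

    def push(self, job_type, page):
        """Add a job unless an identical one is already pending."""
        key = (job_type, page)
        if key in self._pending:
            return False
        self._pending[key] = {"type": job_type, "page": page}
        return True

    def cancel(self, job_type, page):
        """Drop a pending job, e.g. after the page was re-rendered on demand."""
        return self._pending.pop((job_type, page), None) is not None

    def pop(self):
        """Return the oldest pending job, or None if the queue is empty."""
        if not self._pending:
            return None
        _, job = self._pending.popitem(last=False)
        return job

    def __len__(self):
        return len(self._pending)


def schedule_reparse(queue, template_users, template):
    """Queue one re-render job per page that transcludes `template`."""
    pages = template_users.get(template, ())
    return sum(queue.push("reparse", page) for page in pages)


# Hypothetical data: one template transcluded in 1000 album pages.
template_users = {"Template:Rolling Stones": [f"Album {i}" for i in range(1000)]}
queue = DedupJobQueue()
first = schedule_reparse(queue, template_users, "Template:Rolling Stones")
second = schedule_reparse(queue, template_users, "Template:Rolling Stones")
# A reader views "Album 0" before its job runs, so that job is no longer needed.
queue.cancel("reparse", "Album 0")
```

Editing the template twice in a row queues each page only once, and a page that was already re-rendered on demand is removed from the queue instead of being processed again.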
[12:20:46] like the famous ##headline##
[12:20:57] == Headline ==
[12:20:58] yeah
[12:21:02] ok sorry :)
[12:21:12] so its similar to what we plan
[12:21:27] the goal was to mimic the compilation of a program when it comes to a service
[12:21:30] the html representation (not the same as the html shown to readers entirely) also contains meta data that will allow partial re-compilation.
[12:21:34] e.g. only the segment that was included.
[12:21:43] the main document storage is like source files and a service compiles those and links those internally
[12:21:49] although for legacy reasons this can be complicated due to the dynamic nature of wikitext.
[12:22:10] ok sounds good
[12:22:34] like compile fragments and when the page is rendered to html those fragments are glued together like being linked
[12:22:46] based on some sort of template
[12:23:08] yeah. we'd selectively update the 1000 pages that include said template
[12:23:24] exactly what we plan
[12:23:26] we generally can't however insert the new version of the template into those pages directly
[12:23:34] because the template has scope.
[12:23:38] e.g. "this page"
[12:23:52] so we'll have to reparse that template a 1000x
[12:24:03] but not 1000 entire pages, just the portion of the template
[12:24:18] yeah but again that is subject to the "hardware is cheap, developers aren't" paradigm :)
[12:24:50] yeah, slower computation isn't too bad as long as it doesn't happen unless something changed. and happens asynchronously
[12:25:01] correctly
[12:25:23] although throwing more hardware at it doesn't always help
[12:25:23] and especially if users do not care if their current view is outdated by some changes
[12:25:38] the current/old wikitext compiler is written in PHP and single-threaded.
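
The partial re-compilation idea discussed above (re-render only the template portion of each page, once per page because the template depends on page scope) can be sketched like this. It is a toy model in Python, not how Parsoid or the PHP parser work: the fragment tuples, the `{{PAGENAME}}` substitution standing in for "this page" scope, and the functions `render_template` and `update_pages` are assumptions made for illustration.

```python
def render_template(template_source, page_title):
    """Render a template in the scope of one page.

    {{PAGENAME}} stands in for any page-dependent input, which is why the
    rendered output cannot simply be copied from one page to the next.
    """
    return template_source.replace("{{PAGENAME}}", page_title)


def update_pages(pages, template_name, new_source):
    """Re-render only the fragments produced by `template_name`.

    `pages` maps a page title to a list of fragments. A fragment is either
    ("static", html) or ("template", name, html). Static fragments are left
    untouched, so the rest of each page is not parsed again.
    """
    reparsed = 0
    for title, fragments in pages.items():
        for i, fragment in enumerate(fragments):
            if fragment[0] == "template" and fragment[1] == template_name:
                html = render_template(new_source, title)
                fragments[i] = ("template", template_name, html)
                reparsed += 1
    return reparsed


# Hypothetical pages: two use the navigation template, one does not.
pages = {
    "Aftermath": [("static", "<p>Album text</p>"), ("template", "Nav", "old")],
    "Let It Bleed": [("static", "<p>More text</p>"), ("template", "Nav", "old")],
    "Unrelated": [("static", "<p>No template here</p>")],
}
count = update_pages(pages, "Nav", "<nav>Albums (you are on {{PAGENAME}})</nav>")
```

The template is rendered once per page that includes it, while the static fragments and the unrelated page are never reparsed.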
[12:25:41] depends on the means of parallelism that can be applied
[12:25:42] there's only so much you can do there
[12:25:57] the rule why every shard must answer its requests independently of other shards
[12:26:27] without such a rule it wont scale linearly
[12:26:52] if parsing takes 2 seconds, doing a 100 in parallel means effectively only 0.02s is wasted, but that doesn't change that that one user still has to wait 2 seconds.
[12:27:10] Why do you parse at all?
[12:27:17] well, not for more views.
[12:27:41] i usually let asts travel around so no parsing is involved just deserialization of the structure
[12:27:56] so parsing once use it 1000x times
[12:27:59] but when you save a page, or view a page that has been purged due to a cascading change, then it needs to parse while you wait
[12:28:12] ah thats what you mean
[12:28:20] i thought about parsing the template 1000 times
[12:28:35] ah yeah, in theory we could re-use the AST in that case
[12:28:44] however there is no AST in the php parser
[12:28:49] :)
[12:28:49] it's an array of 15,000 regexes
[12:28:52] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 88.27 ms
[12:29:00] thats why i like Java
[12:29:04] and dislike php
[12:29:14] Meh, PHP isn't to blame for that.
[12:29:19] i know
[12:29:23] i use it too
[12:29:28] in some commercial projects
[12:29:32] The new parser (written in Node.js incidentally) does have an AST.
[12:29:46] nice to know
[12:30:00] oh you mean your wikimedia parser :)
[12:30:02] when you edit a page with VisualEditor, you're getting the data model from the Parsoid.js service
[12:30:12] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 89.90 ms
[12:30:18] oh yeah, not the PHP parser but the parser of wikitext written in PHP
[12:30:18] makes sense
[12:30:31] for PHP itself we actually use HHVM now
[12:30:43] PROBLEM - puppet last run on mw1205 is CRITICAL Puppet has 1 failures
[12:30:45] the facebook thingy?
[12:30:47] nice
[12:30:50] Yep that's the one
[12:31:07] Zend PHP does have opcode caching nowadays, but HHVM goes way beyond that
[12:31:30] we replaced php with tapestry which was doomsday and ended up building a java version of tapestry that is what we liked
[12:31:42] hhvm does JIT
[12:31:43] JIT, AST and all that other stuff I know little about
[12:31:45] as far as i know
[12:31:48] yeah
[12:32:08] outside Wikimedia, upgrading to PHP 5.6 with opcache was quite a big win compared to PHP 5.4
[12:32:15] for my own server
[12:32:17] how big is the wikipedia source code by the way?
[12:32:26] https://github.com/wikimedia/mediawiki
[12:32:29] That's the core.
[12:32:38] But there's several dozen plugins we develop and install on top of that though
[12:33:24] do you have any idea how much code this is?
[12:33:26] They're all listed here if you want: https://en.wikipedia.org/wiki/Special:Version
[12:33:29] i mean in megabytes
[12:33:30] It's a lot
[12:33:43] we plan for about 10mb
[12:33:44] Let me do a quick check
[12:33:49] would be nice
[12:34:00] we actually got 4mb in a github repo
[12:34:03] PROBLEM - puppet last run on eeden is CRITICAL puppet fail
[12:34:11] but its basically infrastructure
[12:34:34] we are not quite there where the system might be usable
[12:34:49] only some parts are functioning basically to do measures
[12:35:13] :D
[12:35:15] 480M
[12:35:21] That's without .git and cache
[12:35:27] and includes all localisation and plugins
[12:35:33] including front-end resources and images
[12:36:04] but thats not only the code right :)
[12:36:15] php, js, css, html, images, and localisation json
[12:36:29] can you check the size of php + js
[12:37:00] if the system goes where it should the rest is generated by others so I do not care :-)
[12:39:17] currently our repos are 15gb in size since there are 12GB of wikipedia data used for the internal compression test
[12:39:52] if everything goes well we have compression for natural text of 1:3 in storage and 1:6 in
memory
[12:40:13] since a tb of ddr4 is now 10k having a 1:6 reduction was a design goal
[12:40:37] 14M php 21M js
[12:41:12] raw source that is
[12:41:43] RECOVERY - puppet last run on eeden is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[12:44:08] wow thats a lot
[12:44:22] operations, OCG-General-or-Unknown, Services: Issues with OCG service in production - https://phabricator.wikimedia.org/T104708#1433769 (Elitre) I just got this on officewiki though?
[12:44:24] but i guess having a project that is about 15 years old...
[12:44:45] :) so its a fine project
[12:44:56] does it contain plugins on additional products?
[12:45:06] like the media wiki
[12:45:17] and special code for special subprojects?
[12:46:43] Yes. This measure is from all plugins included
[12:46:46] all plugins installed.
[12:46:59] We don't have all plugins enabled on all wikis however.
[12:47:50] it includes unit tests as well
[12:48:06] and comments of course
[12:49:04] so the complete source tree of the entire wikimedia sources
[12:49:11] ok good to know
[12:49:33] the estimate of 10MB of sources I use are for the initial version as well
[12:49:43] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:49:51] so I guess in 10 years the project we are working on might have also 30mb sources :-)
[12:50:02] but nice to know such a real life estimate
[12:50:15] is php and js able to provide a high reuse by the way?
[12:50:39] depends on your coding style and discipline of individuals involved.
[12:50:41] i found from own experience that php code tends to have a high count of code duplication
[12:50:47] it's 10 years matured, but also open-source
[12:51:05] we closely follow DRY its the primary rule
[12:51:25] you know the refactoring stuff from fowler and the extreme programming stuff is 15 years old too
[12:51:40] but you still find commercial software I rejected to work for
[12:51:59] i work as a contractor and i turned down contracts based on code bases
[12:52:07] but mostly when they miss unit tests at all
[12:52:15] yeah, we're getting more disciplined about maintainability.
[12:52:16] or have a low coverage
[12:52:34] its the prime reason over here with our regime
[12:52:45] we have about 10k tests I guess for the 4MB sources
[12:53:08] i specialized in testing and performance along with databases and distributed networks
[12:53:17] so i am very picky about this
[12:53:23] and today we as developers have the choice
[12:53:33] nice times to be a software dev
[12:53:56] choice over?
[12:53:57] by the way what is your responsibility within the wikipedia dev team?
[12:54:13] choice to work for or not to work for a given client project
[12:54:26] you can throw a stone anywhere and you will hit a project you can work for
[12:54:36] Ah I see.
[12:54:37] Yeah
[12:54:45] so the choice is with the contractor not with the contractee (correct?)
[12:54:53] or the client
[12:55:16] is wikipedia still using HDD by the way?
[12:55:20] we ruled it out
[12:55:25] in the first place
[12:55:28] SSD for most if not all, afaik
[12:55:45] storage system design with SSDs in mind is so much simpler
[12:55:59] yeah ssds are dead cheap
[12:56:01] at least where it matters. And in other places I imagine it'll become SSD with time as things get replaced.
[12:56:09] are you using enterprise ssds or consumer ssds?
[12:56:21] ah i see
[12:56:21] I didn't know there was a distinction.
[12:56:26] Not my area of expertise
[12:56:27] :D
[12:56:29] :)
[12:56:32] I write code, and I pretend to know stuff about ops.
[12:56:52] you can have a tb of consumer for 330 per tb but for enterprise you pay about 1000 per tb
[12:57:35] the enterprise ssds are usually cards having thrice the throughput but most especially they have way better access latencies and IOPS
[12:57:37] Krinkle: I thought most servers still used HDD?
[12:57:52] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:59:40] we use consumer ssds basically for mass storage (HDD replacement) and enterprise ssds for document store and log storage
[12:59:50] JohnFLewis: If you say so. I wouldn't know
[13:01:25] Looking at ganglia I see some of the servers have 1 or 2 TB drives. Some of the MySQL servers for page content have 400-500 GB drives and 70GB ram.
[13:04:16] oh
[13:04:26] what are ganglia servers?
[13:04:38] oh, Ganglia is a monitoring tool
[13:05:45] https://ganglia.wikimedia.org/latest/
[13:05:46] https://ganglia.wikimedia.org/latest/?p=2&c=API%20application%20servers%20codfw&h=mw2136.codfw.wmnet
[13:06:42] Hm.. yeah, looking in puppet there's very little mention of ssd
[13:06:49] but may not be called out as such though
[13:08:17] ok
[13:08:21] thanks for investigating
[13:09:04] k, gotta go – if you're curious about the storage layer or otherwise, I'm sure others may know
[13:10:20] no problem thanks for your time
[13:10:48] akosiaris: available in next 2 hours for cx puppet patch merge?
[13:17:04] kart_: I could be. When do you want to merge ?
[13:18:24] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[13:18:46] and gitblit died again.
restarting
[13:23:00] akosiaris: https://gerrit.wikimedia.org/r/#/c/223042/
[13:23:16] !log restarting gitblit on antimony
[13:23:21] Logged the message, Master
[13:23:23] akosiaris: probably last patch as we've done with deployment in all wikis after this.
[13:23:45] kart_: cool. LGTM. can I merge it right now ?
[13:24:03] PROBLEM - gitblit process on antimony is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:17] akosiaris: no, not now. You can merge after 1.20 hours.
[13:24:27] akosiaris: should go with mw-config patch
[13:24:53] akosiaris: that's why I asked for in 'next ~2 hours' :)
[13:24:58] kart_: well, I can say I am happy this is the last patch, because it's not nice coupling those things together
[13:25:13] akosiaris: I ack the fact :)
[13:25:18] s/I/We
[13:25:31] :-)
[13:25:53] RECOVERY - gitblit process on antimony is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar gitblit.jar
[13:27:52] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60580 bytes in 0.134 second response time
[13:27:53] (CR) JanZerebecki: [C: 1] Remove www.email.donate.wikimedia.org from DNS [dns] - https://gerrit.wikimedia.org/r/223245 (https://phabricator.wikimedia.org/T102827) (owner: Chmarkine)
[13:32:06] operations: Add Ferm rules for snapshot hosts - https://phabricator.wikimedia.org/T104991#1433842 (MoritzMuehlenhoff) NEW
[13:37:33] operations: Ferm rules for abacist - https://phabricator.wikimedia.org/T104992#1433865 (MoritzMuehlenhoff)
[13:40:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 36 data above and 9 below the confidence bounds
[13:49:55] (PS1) Cmjohnson: Adding dns entries for 4 new analytics machines [dns] - https://gerrit.wikimedia.org/r/223295
[13:51:50] operations: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1433926 (MoritzMuehlenhoff) NEW
[13:54:20] operations: Ferm rules for backup roles -
https://phabricator.wikimedia.org/T104996#1433945 (MoritzMuehlenhoff)
[13:54:22] operations: Ferm rules for abacist - https://phabricator.wikimedia.org/T104992#1433946 (MoritzMuehlenhoff)
[13:54:24] operations: Add Ferm rules for snapshot hosts - https://phabricator.wikimedia.org/T104991#1433947 (MoritzMuehlenhoff)
[13:55:38] operations, Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1433955 (hashar) That is a loopback mount created by disk image builder when it generates an image that will be uploaded to OpenStack labs. Apparently crea...
[13:55:50] operations, Continuous-Integration-Isolation, Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1433956 (hashar)
[13:56:44] Puppet, Continuous-Integration-Config, Patch-For-Review: Setup rubycop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1433961 (zeljkofilipin) Talked with @yuvipanda on IRC today, we will try to pair on rubocop during wikimania with him. This patch should be ready by t...
[13:56:57] operations, Continuous-Integration-Isolation, Patch-For-Review: Backport python-diskimage-builder 0.1.46 from testing to jessie-wikimedia - https://phabricator.wikimedia.org/T102880#1433963 (hashar) >>! In T102880#1433269, @MoritzMuehlenhoff wrote: > I've changed the version number since that build is s...
[14:08:07] operations, Traffic: implement better failure-scenario geoip mapping in gdnsd - https://phabricator.wikimedia.org/T94697#1434016 (BBlack) To write down some more-recent thoughts about this subject: - Divide the globe into a reasonable grid, let's say on 1/10th degree boundaries (3600x1800 map). Grid gra...
[14:10:20] (03CR) 10QChris: [C: 031] Gerrit: remove ::old role [puppet] - 10https://gerrit.wikimedia.org/r/223161 (owner: 10Chad) [14:14:35] (03PS2) 1020after4: Increment deployment stats after sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/223236 (https://phabricator.wikimedia.org/T104635) [14:14:45] (03CR) 10QChris: [C: 031] Gerrit: Remove $extra_groups from replicationdest, nothing uses it [puppet] - 10https://gerrit.wikimedia.org/r/223169 (owner: 10Chad) [14:18:35] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1434084 (10jcrespo) [14:18:58] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1434087 (10jcrespo) a:5Springle>3jcrespo [14:19:22] moritzm: you are a dream coming true - re:ferm rules :) [14:20:39] or a nightmare if we mess up any of the rules :-) [14:21:26] that too :D [14:21:35] (03CR) 10Matanya: [C: 031] Enable packet filter for potassium [puppet] - 10https://gerrit.wikimedia.org/r/223282 (owner: 10Muehlenhoff) [14:21:54] !log dropping optin_survey_old table from enwiki [14:21:59] Logged the message, Master [14:22:18] 6operations, 10Traffic: implement better failure-scenario geoip mapping in gdnsd - https://phabricator.wikimedia.org/T94697#1434100 (10BBlack) I should have added above: that all implies that the runtime per-request lookup process involves client-geoip->coordinates->gridmap-square->DC. That may not jive well... [14:22:23] ^I've made a quick backup, aside from the other 10 ones we already have [14:24:01] JohnFLewis: mailman only needs 25 for sending? no receiving ? and it is simple smtp, no smtps? 
[14:24:56] matanya: John is away now :) but he will be back soon [14:25:31] (03PS1) 10Anomie: Logstash: Cleanup exclusion of API continuation logging [puppet] - 10https://gerrit.wikimedia.org/r/223301 [14:25:38] moritzm: one more thing, if you want help with patches for all those services, putting on the tickets the ports they use would be useful [14:26:56] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1434108 (10Krenair) [14:27:40] yeah, that's up for a further step, first I wanted to systematically create tasks for missing services, investigating the ports in use is the next step [14:27:56] thanks for that [14:28:09] 6operations: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1434120 (10Dzahn) 3NEW [14:29:16] 6operations, 7Mail: Ferm rules for MX mail servers - https://phabricator.wikimedia.org/T104979#1434138 (10Krenair) [14:29:27] 6operations, 10Wikimedia-Stream: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#1434141 (10Krenair) [14:29:38] 6operations, 10OCG-General-or-Unknown: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1434143 (10Krenair) [14:31:08] 6operations: Ferm rules for video scalers - https://phabricator.wikimedia.org/T104970#1434155 (10Krenair) [14:31:36] 6operations: move policysite to a VM - https://phabricator.wikimedia.org/T105006#1434165 (10Dzahn) 3NEW [14:32:00] Why do we have tmh* hostnames in eqiad but mw* hostnames in codfw for videoscalers? [14:32:27] mw2007 and mw2152 vs. tmh1001 and tmh1002 [14:33:09] <_joe_> Krenair: yes, what's the issue? 
[14:33:15] 6operations: move iegreview to a VM - https://phabricator.wikimedia.org/T105007#1434181 (10Dzahn) 3NEW [14:33:47] <_joe_> Krenair: if it makes you feel better, we can add a CNAME [14:35:02] 6operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1434202 (10Dzahn) 3NEW [14:36:05] _joe_, it seems like the codfw hosts didn't follow the convention [14:36:16] <_joe_> which convention exactly? [14:36:46] and it leads to things like T104970 vs. T104941 - those two look like they're discussing almost the same thing [14:36:51] <_joe_> why are the imagescalers and jobrunners named mw* in eqiad, while the videoscalers aren't? [14:36:57] _joe_, https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [14:37:10] <_joe_> Krenair: yeah let's revisit it [14:37:35] 6operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1434224 (10Dzahn) [14:37:36] 6operations: move iegreview to a VM - https://phabricator.wikimedia.org/T105007#1434225 (10Dzahn) [14:37:36] <_joe_> Krenair: (given that we can add a cname if we want) [14:37:38] 6operations: move policysite to a VM - https://phabricator.wikimedia.org/T105006#1434226 (10Dzahn) [14:37:40] 6operations, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1434223 (10Dzahn) [14:38:36] <_joe_> Krenair: how on earth are those two tickets originated from the different naming in eqiad and codfw? 
[14:39:06] <_joe_> well, anyways, I have more important things to babysit atm [14:39:21] one covers all videoscalers, the other only covers the eqiad hosts [14:40:35] <_joe_> which I think happened because we duplicated work, not because of such a confusion [14:41:05] <_joe_> but, feel free to open a ticket about this, so I can just state there we'll rename the videoscalers in eqiad as well when we get to reinstall those [14:41:42] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1434232 (10mmodell) p:5Triage>3Normal [14:44:22] 6operations, 6Phabricator, 6Project-Creators: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#1434249 (10mmodell) p:5High>3Normal [14:44:39] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1434252 (10Steinsplitter) @VictorGrigas: anything new? :-) [14:45:43] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1434255 (10Krenair) [14:47:53] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1434266 (10Krenair) 3NEW [14:49:09] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool package to 1.0.0 - https://phabricator.wikimedia.org/T104971#1434279 (10hashar) http://backports.debian.org/Instructions/#index3h2 claims: > apt-get -t jessie-backports install "package" But we should probably go with apt preferenc... [14:50:21] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1434282 (10VictorGrigas) @Steinsplitter - Yes, actually I'll be photographing (and hopefully shooting video of) the servers on Dallas on the way back from Wikimania. I'll have that media online asap. 
[14:50:36] (03PS1) 10BBlack: ciphersuites: update strong desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/223305 [14:50:53] akosiaris: You can merge: https://gerrit.wikimedia.org/r/#/c/223042/ in few minutes (or now) [14:51:00] (03CR) 10BBlack: [C: 032] ciphersuites: update strong desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/223305 (owner: 10BBlack) [14:51:07] (03CR) 10BBlack: [V: 032] ciphersuites: update strong desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/223305 (owner: 10BBlack) [14:53:22] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1434294 (10Tau) I downloaded 2 php.ini files via FileZilla from directories /etc/php5/apache2/ and /etc/php5/cli/. Changed... [14:53:28] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1434295 (10BBlack) [14:54:10] akosiaris: or now :) [14:55:22] 6operations: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1434312 (10Krenair) [14:55:24] 6operations, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1434311 (10Krenair) [14:55:37] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1434319 (10Steinsplitter) a:3VictorGrigas [14:55:44] 6operations, 5Continuous-Integration-Isolation, 7Nodepool: Bump our Nodepool package to 0.1.0 or just before 0.1.1 - https://phabricator.wikimedia.org/T104971#1434320 (10hashar) [14:57:21] (03PS1) 10BBlack: varnish: enable dynamic directors in esams [puppet] - 10https://gerrit.wikimedia.org/r/223312 (https://phabricator.wikimedia.org/T97029) [14:58:04] jouncebot: next [14:58:04] In 0 hour(s) and 1 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150707T1500) [14:58:05] 
6operations: inconsistent naming of appservers - https://phabricator.wikimedia.org/T105012#1434335 (10Dzahn) 3NEW [14:58:50] 6operations: inconsistent naming of appservers - https://phabricator.wikimedia.org/T105012#1434343 (10Dzahn) [14:59:07] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1434348 (10Krenair) [14:59:28] (03CR) 10Jakob: Add Phragile module. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [14:59:36] 6operations: inconsistent naming of appservers - https://phabricator.wikimedia.org/T105011#1434353 (10Krenair) [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150707T1500). [15:00:08] I can SWAT this morning if no one else wants it [15:00:15] kart_: you ready? [15:01:12] thcipriani: I want to wait akosiaris merge 223042 [15:01:20] thcipriani: or anyone from Ops [15:01:39] thcipriani: any other patches for SWAT? You can go ahead with them first. [15:01:56] 6operations: inconsistent naming of appservers - https://phabricator.wikimedia.org/T105012#1434378 (10Krenair) [15:01:57] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1434379 (10Krenair) [15:02:16] (03PS3) 10Andrew Bogott: Wait for a minute for NFS exports before trying to mount requested volumes. [puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) [15:02:18] (03PS3) 10Andrew Bogott: Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) [15:02:36] godog: around? [15:02:36] kart_: ping detected, please leave a message! [15:03:01] (was that automated? 
:D) [15:03:28] yes :) he's asking you to leave more details in your pings :) [15:03:44] godog: https://gerrit.wikimedia.org/r/#/c/223042/ - need merge [15:04:47] thcipriani: 5 more minutes, please. [15:04:55] kart_: sure thing [15:05:32] kart_: can I merge? [15:05:41] akosiaris: yes please! [15:06:10] (03PS2) 10Alexandros Kosiaris: CX: Add 'en' as target wikis and MT support [puppet] - 10https://gerrit.wikimedia.org/r/223042 (https://phabricator.wikimedia.org/T94123) (owner: 10KartikMistry) [15:06:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Add 'en' as target wikis and MT support [puppet] - 10https://gerrit.wikimedia.org/r/223042 (https://phabricator.wikimedia.org/T94123) (owner: 10KartikMistry) [15:06:28] THIS IS REALLY HAPPENING?! [15:06:37] kart_?^ :) [15:06:44] thcipriani: you can go ahead now [15:06:44] (03CR) 10Dzahn: [C: 031] "yea, true, ferm rules have already been added in role::poolcounter which is used on suhail/subra in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/223282 (owner: 10Muehlenhoff) [15:07:01] aharoni: finally. this was selector patch. [15:07:07] kart_: okie doke [15:07:10] aharoni: real patch coming up [15:07:27] done [15:07:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222991 (https://phabricator.wikimedia.org/T94123) (owner: 10KartikMistry) [15:07:36] akosiaris: thanks! [15:07:45] (03CR) 10Dzahn: [C: 031] "not used? take a firewall" [puppet] - 10https://gerrit.wikimedia.org/r/223246 (owner: 10Dzahn) [15:07:46] running puppet on sca100{1,2} so they pick up the change faster [15:08:01] akosiaris: cool. thanks!! [15:08:08] (03Merged) 10jenkins-bot: CX: Enable ContentTranslation in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222991 (https://phabricator.wikimedia.org/T94123) (owner: 10KartikMistry) [15:08:34] (03PS4) 10Andrew Bogott: Wait for a minute for NFS exports before trying to mount requested volumes. 
[puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) [15:09:42] (03CR) 10Andrew Bogott: [C: 032] Wait for a minute for NFS exports before trying to mount requested volumes. [puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) (owner: 10Andrew Bogott) [15:10:20] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable ContentTranslation in enwiki [[gerrit:222991]] (duration: 00m 13s) [15:10:22] (03PS1) 10Jgreen: ptr records for frack/codfw public IPs [dns] - 10https://gerrit.wikimedia.org/r/223319 [15:10:24] Logged the message, Master [15:10:28] ^ kart_ check please [15:10:34] (03CR) 10Dzahn: [C: 031] "consistency is good - makes it easier to check progress too, there should be no difference" [puppet] - 10https://gerrit.wikimedia.org/r/222314 (owner: 10Muehlenhoff) [15:10:35] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1434418 (10jcrespo) I just made a backup of `optin_survey_old` at `iron:/home/jynus` and deleted it from enwiki. Nothing seems broken. Should I continue with the other 826 wikis that also have... [15:11:03] RECOVERY - Host labnet1002 is UPING OK - Packet loss = 0%, RTA = 0.89 ms [15:11:46] thcipriani: testing. [15:11:49] aharoni: ^^ [15:11:56] (03CR) 10Jgreen: [C: 032 V: 031] ptr records for frack/codfw public IPs [dns] - 10https://gerrit.wikimedia.org/r/223319 (owner: 10Jgreen) [15:12:22] (03PS2) 10Dzahn: protactinium: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223246 [15:12:27] !log ptr records for frack/codfw and authdns-update [15:12:31] Logged the message, Master [15:15:27] 6operations: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1434445 (10bd808) The site is dormant until the 2016 scholarship round starts so it should be pretty safe to move at any time. 
I think I remember that access to the database for the application user is limited to the IP... [15:19:36] thcipriani: hold on. still testing. [15:19:45] kk [15:22:45] matanya: hi [15:23:31] matanya: its not encrypted so 25 afaik is all it uses but as I said, earlier there always seems to be that one sly port no one knows about :) [15:23:47] (03CR) 10Giuseppe Lavagetto: [C: 031] varnish: enable dynamic directors in esams [puppet] - 10https://gerrit.wikimedia.org/r/223312 (https://phabricator.wikimedia.org/T97029) (owner: 10BBlack) [15:23:58] thcipriani: we're good. Just discovered bug that is not config :) [15:24:11] kart_: cool, thanks! [15:25:55] kart_: Saw your ping from last night. Where did I miss a deployment-logstash1 config setting? [15:26:18] bd808: it is in common.yaml [15:27:02] bd808: hieradata/labs/deployment-prep/common.yaml [15:27:08] kart_: ah -- https://gerrit.wikimedia.org/r/#/c/223184/1/hieradata/labs/deployment-prep/common.yaml,unified [15:28:07] that should be applied on the beta cluster already via cherry-pick [15:28:29] 6operations, 10Wikimedia-Wikimania-Scholarships: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1434526 (10bd808) [15:29:01] (03PS1) 10Andrew Bogott: Revert "Wait for a minute for NFS exports before trying to mount requested volumes." [puppet] - 10https://gerrit.wikimedia.org/r/223323 [15:30:25] (03PS2) 10RobH: EventLogging access for maxsem [puppet] - 10https://gerrit.wikimedia.org/r/222177 (https://phabricator.wikimedia.org/T104482) [15:30:32] bd808: okay! 
[15:30:34] :) [15:30:46] (03CR) 10RobH: [C: 032] EventLogging access for maxsem [puppet] - 10https://gerrit.wikimedia.org/r/222177 (https://phabricator.wikimedia.org/T104482) (owner: 10RobH) [15:32:50] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1434546 (10RobH) 5stalled>3Resolved @MaxSem: Your access has been expanded, and the changeset (linked above) has been merged. You'll need to give th... [15:32:58] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1434548 (10RobH) a:5RobH>3None [15:33:40] (03PS1) 10Dzahn: copper: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223325 (https://phabricator.wikimedia.org/T104939) [15:33:41] 6operations, 10OCG-General-or-Unknown, 6Services: Issues with OCG service in production - https://phabricator.wikimedia.org/T104708#1434552 (10Krenair) >>! In T104708#1433769, @Elitre wrote: > I just got this on officewiki though? Based on T73849 and T101052 I'm not sure it ever worked at private wikis. [15:33:56] !log manually editing table mediawiki.ipblocks to fully solve a former software bug [15:34:00] Logged the message, Master [15:34:23] it is ^T102949 but it is private [15:34:56] (03PS4) 10Andrew Bogott: Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) [15:34:58] (03PS1) 10Andrew Bogott: Install nfs-no-idmap before probing nfs for exports. [puppet] - 10https://gerrit.wikimedia.org/r/223326 [15:35:00] (03PS3) 10Alex Monk: Get rid of most of noc.wikimedia.org/conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222942 [15:35:58] ARgh [15:36:05] who edited my otrs window off the deployments page? 
[15:36:08] =[ [15:36:34] bah, was in wrong section too, oh well [15:36:38] (03CR) 10Andrew Bogott: [C: 032] Install nfs-no-idmap before probing nfs for exports. [puppet] - 10https://gerrit.wikimedia.org/r/223326 (owner: 10Andrew Bogott) [15:36:47] robh, lol [15:37:12] somehow was on the 1st... i dunno, must have put in last week by mistake [15:38:25] yeah, went in the wrong week and therefore got archived by james [15:38:32] yep, i pulled it and fixed it [15:38:49] was in my last second of 'oh shit i put 10am my time for the window right?' [15:38:59] i have an hour to prepare not 20 minutes right? ;D [15:42:12] (03PS1) 10Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) [15:43:22] <_joe_> mobrovac: I'm almost done, just a couple of methods to implement and I'd need a test server too [15:44:34] robh: poke me when you have a few seconds/minutes (likely minutes) [15:45:38] (03CR) 10BryanDavis: [C: 031] "Cherry-picked to beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/223301 (owner: 10Anomie) [15:46:59] _joe_: cool! the restbase side /should/ be deployed later today, so you could use the prod one [15:47:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "role::package::builder already defines that." [puppet] - 10https://gerrit.wikimedia.org/r/223325 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [15:49:40] (03PS1) 10Giuseppe Lavagetto: imagescalers: reimage mw1153 with HAT [puppet] - 10https://gerrit.wikimedia.org/r/223331 (https://phabricator.wikimedia.org/T84842) [15:50:00] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Define the details of the hardware we need to run WDQS - https://phabricator.wikimedia.org/T104879#1434674 (10Jdouglas) Awesome, thanks! I'll copy them into the task description. 
[15:50:11] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1434678 (10BBlack) These are to replace the 6x current LVS, which would be decom/reclaim after that. The old ones are lvs1001-3 in A4 and lvs1004-6 in B4. Forgive my rambling, but just writing... [15:50:32] <_joe_> ori: whenever you +1 https://gerrit.wikimedia.org/r/223331, I'll move on with reimaging [15:51:20] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Define the details of the hardware we need to run WDQS - https://phabricator.wikimedia.org/T104879#1434693 (10Jdouglas) [15:52:51] (03PS5) 10Andrew Bogott: Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) [15:52:53] (03PS1) 10Andrew Bogott: Don't set -e in block-for-export [puppet] - 10https://gerrit.wikimedia.org/r/223333 [15:53:19] 6operations, 10MediaWiki-General-or-Unknown, 7HHVM, 5Patch-For-Review: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1434706 (10Joe) I also created https://github.com/facebook/hhvm/issues/5601 to track what is in my opinion a defect in HHVM anyways. [15:53:46] <_joe_> Krinkle: around? I figured you might be the right person to ask about ResourceLoader [15:54:12] (03CR) 10Andrew Bogott: [C: 032] Don't set -e in block-for-export [puppet] - 10https://gerrit.wikimedia.org/r/223333 (owner: 10Andrew Bogott) [15:54:35] (03PS2) 10Alex Monk: imagescalers: reimage mw1153 with HAT [puppet] - 10https://gerrit.wikimedia.org/r/223331 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [15:54:42] <_joe_> Krinkle: if I am right, can you please look at https://phabricator.wikimedia.org/T104769? 
RL sets APC keys without a TTL, and this causes HHVM to crash after running for a few days [15:55:00] (03PS2) 10BBlack: varnish: enable dynamic directors in esams [puppet] - 10https://gerrit.wikimedia.org/r/223312 (https://phabricator.wikimedia.org/T97029) [15:55:09] (03CR) 10BBlack: [C: 032] varnish: enable dynamic directors in esams [puppet] - 10https://gerrit.wikimedia.org/r/223312 (https://phabricator.wikimedia.org/T97029) (owner: 10BBlack) [15:55:15] (03CR) 10BBlack: [V: 032] varnish: enable dynamic directors in esams [puppet] - 10https://gerrit.wikimedia.org/r/223312 (https://phabricator.wikimedia.org/T97029) (owner: 10BBlack) [15:58:51] _joe_: does hhvm not have an equivalent of https://secure.php.net/manual/en/apc.configuration.php#ini.apc.ttl ? [15:59:45] (03PS6) 10Andrew Bogott: Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) [15:59:48] (03PS1) 10Andrew Bogott: Check for exports every second rather than every 5. [puppet] - 10https://gerrit.wikimedia.org/r/223336 [16:00:06] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1434770 (10faidon) >>! In T104458#1434678, @BBlack wrote: > I'm not sure what we do about Row B yet. Is it possible we can put 1GbE SFPs in the existing Row B switches and just make those conne... [16:00:42] (03Abandoned) 10Andrew Bogott: Revert "Wait for a minute for NFS exports before trying to mount requested volumes." [puppet] - 10https://gerrit.wikimedia.org/r/223323 (owner: 10Andrew Bogott) [16:01:44] <_joe_> bd808: I could try the php settings, I didn't think it might be respected [16:01:47] <_joe_> lemme try [16:01:50] thcipriani: possible for you to deploy urgent patch? 
[16:02:08] kart_: sure [16:02:26] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1434775 (10BBlack) Yeah but if we wanted to do everything "right", we need 6 ports in Row B, not just 2. Are the same ports available on B6 as well? That would get us up to 4. I didn't see th... [16:02:50] (03CR) 10Andrew Bogott: [C: 032] Check for exports every second rather than every 5. [puppet] - 10https://gerrit.wikimedia.org/r/223336 (owner: 10Andrew Bogott) [16:03:43] (03CR) 10Andrew Bogott: [C: 032] Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) (owner: 10Andrew Bogott) [16:03:55] _joe_: maybe this? -- https://github.com/facebook/hhvm/blob/c0776edffc1e977a82ed666b33f17fd9c7f8f8c5/hphp/runtime/ext/apc/ext_apc.cpp#L136 [16:03:56] <_joe_> bd808: my question for Krinkle would remain: what is a sensible value of said ttl for ResourceLoader? [16:03:59] <_joe_> :) [16:04:19] <_joe_> bd808: nope, that does something else [16:04:21] "when the cache is full and it's the oldest thing left" [16:04:52] <_joe_> bd808: uhm, maybe? it's not documented, obviously [16:04:59] php5's apc is bounded and has LRU semantics for just in time eviction [16:05:40] <_joe_> bd808: I need to see if that has been ported to ini settings, 1 sec [16:05:56] _joe_: yeah TTLLimit looks to be a twisty maze. Not sure exactly when it is applied yet [16:06:16] thcipriani: Need to update ContentTranslation for https://gerrit.wikimedia.org/r/#/c/223330/ (wmf12) [16:06:29] <_joe_> bd808: and yes it must do what we want [16:06:35] <_joe_> it's just not documented [16:06:49] thcipriani: I just merged, should take a few minutes to update submodule [16:07:00] thcipriani: you can sync-file, I guess. [16:07:14] kart_: kk [16:07:16] legoktm: extension.json change need full scap? [16:07:36] thcipriani: do you know if we need scap for extension.json change? 
[16:07:41] kart_: full scap is only needed for l10n string changes [16:08:00] bd808: okay! [16:08:05] bd808: thanks! [16:08:12] np [16:08:14] thcipriani: you can go ahead. [16:08:26] kart_: kk, going [16:08:35] _joe_: it's sort of documented in this header -- https://github.com/facebook/hhvm/blob/4498e2b25bc17ace502b4c86c551dbab25f284bf/hphp/runtime/base/concurrent-shared-store.h#L159-L168 [16:08:45] "The requested ttl is limited by the ApcTTLLimit." [16:09:25] and the default is -1 [16:09:40] 6operations, 10OCG-General-or-Unknown, 6Services: Issues with OCG service in production - https://phabricator.wikimedia.org/T104708#1434824 (10cscott) Yes, it never worked on private wikis, since we don't do the cookie-forwarding thing needed to make that work. The old pre-OCG collection extension didn't ha... [16:10:56] (03PS3) 10Filippo Giunchedi: cassandra: alternative metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/223041 (https://phabricator.wikimedia.org/T104208) [16:11:22] !log thcipriani Synchronized php-1.26wmf12/extensions/ContentTranslation/extension.json: Remove default value for ContentTranslationCampaigns (duration: 00m 12s) [16:11:27] Logged the message, Master [16:11:28] ^ kart_ [16:11:42] thcipriani: thanks! [16:11:50] <_joe_> bd808: yes saw that, and it works [16:11:52] <_joe_> aha. [16:12:11] _joe_: interestingly it looks to me like if that setting is left at the default then TTLs are never used -- https://github.com/facebook/hhvm/blob/4498e2b25bc17ace502b4c86c551dbab25f284bf/hphp/runtime/base/concurrent-shared-store.cpp#L452-L459 [16:12:30] <_joe_> bd808: so now I'm left with the original problem - what value do we set there? [16:12:34] _joe_: checking [16:12:38] <_joe_> bd808: nope, you're reading it wrong :) [16:12:43] <_joe_> Krinkle: thanks a lot [16:13:00] _joe_: ah right "apcExtension::TTLLimit > 0" [16:13:04] <_joe_> Krinkle: I just need a TTL we can set in HHVM, not a patch [16:13:25] _joe_: Hm... not based on usage or LRU? 
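The TTL-limit rule being worked out in the exchange above can be sketched roughly as follows. This is a Python paraphrase of the conversation, not HHVM's code: `effective_ttl` and its parameter names are hypothetical, and the rule itself is an assumption pieced together from the quoted header comment ("The requested ttl is limited by the ApcTTLLimit."), the default of -1, and the "apcExtension::TTLLimit > 0" check. In particular, treating a missing TTL as "use the limit" is an assumption.

```python
def effective_ttl(requested_ttl: int, ttl_limit: int = -1) -> int:
    """Return the TTL in seconds that the store would apply.

    A result of 0 means "no expiry". Assumed rule: the limit only
    takes effect when it is positive; it then caps any larger
    requested TTL and also stands in for a missing (<= 0) one.
    """
    if ttl_limit > 0 and (requested_ttl <= 0 or requested_ttl > ttl_limit):
        return ttl_limit
    # Limit disabled (e.g. the default of -1): the request passes through.
    return max(requested_ttl, 0)
```

Under these assumptions, a key stored without a TTL never expires while the limit is left at -1, which matches the unbounded growth described in the discussion.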
[16:14:05] <_joe_> Krinkle: well, no, but for that I've opened https://github.com/facebook/hhvm/issues/5601 [16:14:21] Hm.. that's too bad. [16:14:33] Having it fall out unconditionally is kind of annoying [16:14:42] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:45] Hm.. as long as possible in that ase [16:14:46] case [16:14:47] <_joe_> Krinkle: yea it is, but it's a temporary patch I hope [16:14:48] one or two weeks [16:15:05] <_joe_> Krinkle: one week is more than 5 GBs of cache [16:15:06] Ideally not exactly 7 days due to deployment [16:15:12] so maybe 10 days [16:15:21] <_joe_> we'd need to shorten it definitely below 7 days [16:15:26] <_joe_> it's exhausting our memory [16:15:31] _joe_: Yes, there are many unpopular keys but also very popular ones [16:15:43] <_joe_> Krinkle: sigh [16:15:52] e.g. the minified version of your user javascript may only be requested as often as you access pages [16:16:04] but minified javascript of visualeditor on the other hand.. [16:16:04] <_joe_> ok [16:16:31] <_joe_> so in that case we should set a shorter TTL, like 1 day or so, and let the hot keys be recreated every day [16:16:42] <_joe_> I think we could live with 1 perf hit/day [16:16:55] <_joe_> while the never-used keys disappear from the hash table if not used [16:17:04] <_joe_> that might amount to a perf gain in the end [16:17:21] <_joe_> a hash table with 500K entries is not that efficient usually [16:17:42] _joe_: Sure, but that's still marginal compared to the time it takes to minify 300K of code [16:18:16] <_joe_> yes, it would happen 1/day, like it was happening a couple of months ago because HHVM 3.3 would crash every day [16:18:24] (1/day/appserver) [16:18:28] <_joe_> yes [16:18:32] which will happen in a user facing request when it's a cache miss. And due to wiki/language/skin fragmentation that's not one unlucky user a day but 800*200*2 at once. 
[16:18:40] and indeed per app server [16:18:42] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [16:19:03] _joe_: It used to be in memcached [16:19:16] we minify cache to APC for perf wins [16:19:20] moved [16:19:21] (03PS1) 10Filippo Giunchedi: add cassandra-metrics-collector snapshot jar [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/223337 [16:19:23] (03PS1) 10Filippo Giunchedi: fix .gitfat [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/223338 [16:19:37] <_joe_> Krinkle: ok, it has been moved when? [16:19:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add cassandra-metrics-collector snapshot jar [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/223337 (owner: 10Filippo Giunchedi) [16:19:44] <_joe_> a few months back right? [16:19:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fix .gitfat [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/223338 (owner: 10Filippo Giunchedi) [16:19:56] Until hhvm has LRU eviction for APC then it's not really so much a win [16:20:04] May 13th [16:20:04] https://github.com/wikimedia/mediawiki/commit/458e7cabbbafda81c35cf8270a8393f3fa3f29ad [16:20:23] <_joe_> ok on may 13th I'm pretty sure we were still on 3.3.1 [16:20:26] <_joe_> but lemme check [16:20:39] may've been deployed a few days later [16:21:12] <_joe_> Krinkle: so yeah either we roll back, or we set a ttl limit [16:21:26] <_joe_> or we reimplement the apc cache layer in HHVM to support LRU [16:21:27] _joe_: is this the main occupant of APC on app servers? [16:21:32] <_joe_> yes [16:21:34] enough that you won't need a ttl if we move RL cache out? [16:21:40] <_joe_> Krinkle: 99.99% of keys [16:21:43] <_joe_> or more [16:21:44] interesting [16:21:56] <_joe_> Krinkle: you have access to the cluster, right? 
[16:21:59] Yes [16:22:04] Where can I see this [16:22:06] <_joe_> so, mw1059 [16:22:10] <_joe_> there should be a dump [16:22:16] <_joe_> or you can generate it [16:22:24] k, I'm there [16:22:42] PROBLEM - puppet last run on mw1010 is CRITICAL Puppet has 1 failures [16:22:49] I see an apc dump in /tmp [16:22:53] <_joe_> (see https://phabricator.wikimedia.org/T104769 for details) [16:22:54] <_joe_> yes [16:23:05] <_joe_> the format is key #### serialized_php [16:23:06] (03CR) 10EBernhardson: "nothing official, just a fork[1] on github with one commit for the changes and a tag for the release." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/223202 (https://phabricator.wikimedia.org/T100500) (owner: 10EBernhardson) [16:23:11] <_joe_> if the key is valid [16:23:17] <_joe_> key #### [16:23:31] <_joe_> if the key is over its ttl but still not evicted [16:23:36] _joe_: Hm.. how is apc/hhvm/mw related to graphite? [16:23:51] <_joe_> Krinkle: say that again? [16:23:58] the status errors on top of https://phabricator.wikimedia.org/T104769 [16:24:03] PROBLEM - puppet last run on mw1142 is CRITICAL Puppet has 1 failures [16:24:05] complain about http 5xx on graphite [16:24:23] <_joe_> yes, graphite records errors we send to the users [16:24:42] (03CR) 10Dzahn: "hmm.. but once we said base::firewall should always be on nodes directly?" [puppet] - 10https://gerrit.wikimedia.org/r/223325 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [16:24:46] Oh, those are not http 5xx responses from graphite itself [16:24:53] PROBLEM - puppet last run on mw1139 is CRITICAL Puppet has 4 failures [16:24:54] <_joe_> ehe, no [16:25:08] because graphite-web itself also has (or used to) http500 a lot at peak times when people use dashboards [16:25:13] (only its frontend though) [16:25:21] ok [16:25:21] <_joe_> yes, not related to that at all [16:25:46] _joe_: ahm. is this dump file in some kind of format? 
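The dump layout described above (`key #### serialized_php` for a live key, a bare `key ####` for a key past its TTL but not yet evicted) could be read with a short script along these lines. This is a minimal sketch, not a tool used in the channel: `parse_dump` is a hypothetical name, the sample keys in the usage note are made up, and it assumes that any line without `####` is a continuation of a multi-line serialized value and can be skipped.

```python
import re
from typing import Iterable, Iterator, Tuple

# Everything up to the first run of four '#' characters is the key.
KEY_RE = re.compile(r"(.*?)#{4}")


def parse_dump(lines: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (key, has_value) for each line that carries a key marker.

    has_value is False for a bare "key ####" line, which per the
    description above means the key has expired but is still present.
    """
    for line in lines:
        match = KEY_RE.search(line)
        if match is None:
            continue  # assumed continuation of a multi-line value
        key = match.group(1).rstrip()
        remainder = line[match.end():].strip()
        yield key, bool(remainder)
```

Usage would look like `for key, live in parse_dump(open("/tmp/apc_dump")): ...`, for example to count live versus expired entries.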
[16:25:50] <_joe_> this is causing crashes and outages (if small ones) and needs manual restarts of the whole appserver cluster [16:25:57] <_joe_> Krinkle: see ^^ [16:26:04] OK [16:26:11] <_joe_> Krinkle: key #### [16:26:46] ah, php serialised has literal line breaks [16:26:56] <_joe_> yes [16:27:21] <_joe_> Krinkle: cat /tmp/apc_dump | perl -ne 'print "$1\n" if /(.*?)\#{4}/;' | less [16:27:30] <_joe_> this should just print keys [16:28:44] tail -n1000 apc_dump | ack-grep '^[a-z]+:' | cut -d':' -f1-4 [16:28:47] Thanks, that's better [16:30:10] _joe_: So, setting a ttl in RL itself won't work then yeah? [16:30:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:30:22] <_joe_> setting a ttl would work, yes [16:30:40] I thought it didn't support that and you were proposing to set a global one [16:30:43] <_joe_> if you set a ttl, the key will get evicted [16:30:55] <_joe_> nope, the problem is that if you don't set one [16:31:04] <_joe_> it will simply store the value forever [16:31:12] I see [16:31:46] <_joe_> unless we set a global ttl limit [16:31:53] <_joe_> which is admittedly meh [16:31:54] (03PS3) 10Dzahn: protactinium: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223246 [16:31:56] I propose we do both [16:32:02] it should certainly not be stored forever [16:32:15] <_joe_> both? [16:32:16] not unless there is a size limit or LRU or something [16:32:28] I'll come up with a ttl (probably 12h or 24h) [16:32:33] and a global one as well of like 7 days [16:32:33] <_joe_> ok! [16:33:00] (03CR) 10Dzahn: [C: 032] "not in use per site.pp - nothing un-standard in netstat" [puppet] - 10https://gerrit.wikimedia.org/r/223246 (owner: 10Dzahn) [16:33:00] I'm actually wondering if PHP5's APC really does LRU eviction.
The doc for apc.user_ttl seems to indicate that it doesn't -- https://secure.php.net/manual/en/apc.configuration.php#ini.apc.user-ttl [16:33:13] <_joe_> bd808: I was wondering the same [16:33:36] but PHP5 APC is certainly bounded and won't OOM the server [16:33:44] <_joe_> bd808: I suppose we never stored that much data into APC? [16:33:49] (03CR) 10Alexandros Kosiaris: "", while we move it to roles, with the end goal of moving it to standard"" [puppet] - 10https://gerrit.wikimedia.org/r/223325 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [16:33:57] <_joe_> bd808: yes, it will simply spit out the data [16:34:27] oh wow BagOStuff has $expiry in set() that can be both a timestamp or an age integer [16:34:29] that's evil :P [16:34:36] if it's big enough it'll assume a timestamp [16:34:48] (03Abandoned) 10Dzahn: copper: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223325 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [16:34:53] <_joe_> Krinkle: ahah really? 
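Editor's sketch of the dual-meaning expiry argument discussed just above: a small value is read as an age in seconds, a large one as an absolute Unix timestamp. This is only an illustration, not the MediaWiki code. The function name `to_absolute_expiry` and the ten-year cutoff are assumptions made for the example; the real logic lives in the BagOStuff source linked in the channel.

```python
import time

# Assumed cutoff for this sketch: anything below roughly ten years' worth
# of seconds is treated as a relative age, anything larger as an absolute
# Unix timestamp.
RELATIVE_CUTOFF = 10 * 365 * 86400


def to_absolute_expiry(expiry, now=None):
    """Return an absolute Unix timestamp, or 0 meaning "never expires"."""
    if now is None:
        now = int(time.time())
    if expiry == 0:
        # No TTL given: the value is kept until something else evicts it.
        return 0
    if expiry < RELATIVE_CUTOFF:
        # Small number: interpret as "seconds from now".
        return now + expiry
    # Large number: already an absolute timestamp.
    return expiry
```

With this reading, a caller passing 86400 gets an entry that expires in a day, while a caller passing 0 gets the "stored forever" behaviour that the conversation identifies as the problem.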
[16:35:11] https://github.com/wikimedia/mediawiki/blob/1224a70009542dc72d371978fd7ba7195206a852/includes/libs/objectcache/BagOStuff.php#L437-L447 [16:35:26] <_joe_> nicely evil indeed [16:35:53] RECOVERY - puppet last run on mw1010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:36:22] !log protactinium - manual iptables rules replaced by puppet/ferm rules [16:36:27] Logged the message, Master [16:36:44] <_joe_> ok, bbiab [16:37:15] <_joe_> Krinkle: if you opt for patching RL, please backport the patch to the branches we have in prod - this is kind of urgent [16:37:36] Yep [16:37:38] writing it now [16:37:43] will deploy in a few hours at the altest [16:37:44] latest [16:38:03] RECOVERY - puppet last run on mw1139 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:39:12] RECOVERY - puppet last run on mw1142 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:41:35] just me or is there a long delay when creating new tasks on phab [16:41:49] I think it's just you [16:41:56] you do keep posting duplicates [16:42:14] all other things appear to be fast.. hmm [16:42:25] mutante: curse of the office [16:43:19] it seems it's my browser version [16:43:26] but only since the upgrade [16:43:48] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1434978 (10RobH) P917 - paste of netstat output on sodium [16:44:59] or not.. 
same problem when trying to select project tags [16:46:02] 6operations: ferm rules for dumps (ms1001/datasets) - https://phabricator.wikimedia.org/T105040#1435021 (10Dzahn) [16:46:08] 6operations: Ferm rules for dumps (ms1001/datasets) - https://phabricator.wikimedia.org/T105040#1435022 (10Dzahn) [16:46:27] 6operations: Ferm rules for dumps (ms1001/datasets) - https://phabricator.wikimedia.org/T105040#1435026 (10Krenair) [16:47:00] damn [16:47:26] !log applied hotfix for phabricator bug: https://secure.phabricator.com/D13544 [16:47:27] Krenair: sorry for the duplicates.. to me it looks like it failed loading ..hmm [16:47:30] Logged the message, Master [16:47:58] wishes merge would actually merge [16:48:41] (03PS5) 10Dzahn: dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939) [16:49:02] _joe_: btw, to make matters worse, last week I change that key. [16:49:03] (03PS4) 10Dzahn: dumps: put base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939) [16:49:22] _joe_: so all the old ones will have been staying around as well [16:49:31] <_joe_> Krinkle: eheh right [16:49:41] it used to have a wikidb-prefix [16:49:45] changed to be global instead [16:49:45] 6operations, 5Patch-For-Review: Ferm rules for dumps (ms1001/datasets) - https://phabricator.wikimedia.org/T105040#1435046 (10Dzahn) a:3Dzahn [16:49:46] (03PS1) 10Jcrespo: Adding an updated versions of the redact.sh scrip [software/redactatron] - 10https://gerrit.wikimedia.org/r/223344 (https://phabricator.wikimedia.org/T104900) [16:49:47] to share more cache [16:49:51] should've reduced load though [16:49:53] in the end [16:50:01] but only if the old stuff drops off [16:50:21] <_joe_> and we (ops) were accusing Special:RecordImpression for the acceleration in memory usage [16:50:26] <_joe_> :P [16:50:48] _joe_: do you see any significant presence of 'resourceloader:filter' with 
prefixes other than 'global' ? [16:50:53] <_joe_> Krinkle: well almost all appservers have been restarted since, but the memory usage still goes up [16:50:57] Or did the servers crash since then? [16:51:03] Right [16:51:08] (03CR) 10Jcrespo: "redact_standard_output.sh just makes things work." [software/redactatron] - 10https://gerrit.wikimedia.org/r/223344 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [16:51:16] <_joe_> not on that appserver, maybe some didn't restart still and they still show those [16:51:19] <_joe_> lemme check [16:51:29] _joe_: That change would've reduced number of unique keys by 2-3 orders of magnitude [16:51:37] from 800x to 1x [16:51:51] but still a lot I guess [16:51:55] due to user-specific stuff [16:52:05] <_joe_> ok, the growth slope seems to have slowed down actually, but not enough :) [16:53:12] <_joe_> mw1026 might be your best bet [16:54:14] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1435082 (10fgiunchedi) restbase1004 also OOMd at 11.32 UTC, running jdk8 too [16:54:47] <_joe_> trying to dump apc there, it's gonna take some time :P [16:55:40] (03PS1) 10Alexandros Kosiaris: Update servermon configuration for 0.7 [puppet] - 10https://gerrit.wikimedia.org/r/223347 [16:56:46] <_joe_> Krinkle: dewiktionary:resourceloader:filter:minify-css:7:cf70a73b0182d268cd937c52e0c86075 there they are [16:57:33] I see [16:57:46] _joe_: Is there a mechanism to relatively easily drop all of those? [16:58:06] retroactively that is [16:58:21] _joe_: btw, will an hhvm restart clear APC? Or does it run in a separate process? [16:58:26] <_joe_> Krinkle: restarting HHVM is a good way :) [16:58:41] <_joe_> !log restarted HHVM on mw1026, near to OOM [16:58:46] Logged the message, Master [16:58:54] Right [16:59:35] _joe_: The default ttl, where do we configure that? 
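Editor's sketch of the key census being done by hand in this exchange: pull the keys out of a dump in the "key #### serialized_php" format and count them by their first colon-separated component (for example `global` versus `dewiktionary`). The function names are hypothetical. Because serialized values can contain literal line breaks, a continuation line that happens to contain "####" would be miscounted as a key; the perl one-liner quoted earlier shares that limitation.

```python
import re
from collections import Counter

# A key line is everything before the first run of four '#' characters.
KEY_LINE = re.compile(r"^(.*?)#{4}")


def dump_keys(lines):
    """Yield the key from every line that contains the '####' separator."""
    for line in lines:
        match = KEY_LINE.match(line)
        if match:
            yield match.group(1).strip()


def count_by_prefix(keys):
    """Count keys by the text before their first ':'."""
    counts = Counter()
    for key in keys:
        counts[key.split(":", 1)[0]] += 1
    return counts
```

On a real host the input would be the lines of the dump file, e.g. `count_by_prefix(dump_keys(open("/tmp/apc_dump", errors="replace")))`.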
[17:00:05] RobH: Respected human, time to deploy OTRS SSL Certificate Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150707T1700). Please do the needful. [17:00:17] huzzaaaah [17:00:23] !log starting otrs maint window [17:00:25] <_joe_> Krinkle: we don't atm, if we want to, the default place to do that would be puppet [17:00:27] Logged the message, Master [17:00:38] Keegan: ^ fyi =] [17:00:52] _joe_: yeah. Would be good I think. [17:01:29] Wheeeee [17:01:44] ok, beginning overly paranoid otrs cert update (meaning im restarting all services related to otrs on the box and puppet running successfully before i even apply my change) [17:02:47] ok, otrs works fine without my changes, successfully ran puppet and completely halted and restarted apache no issues. [17:02:59] (03PS2) 10RobH: updating ticket.wikimedia.org cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/221161 (https://phabricator.wikimedia.org/T91504) [17:03:13] (03CR) 10RobH: [C: 032] updating ticket.wikimedia.org cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/221161 (https://phabricator.wikimedia.org/T91504) (owner: 10RobH) [17:04:32] (03PS4) 10Alexandros Kosiaris: contint: install python3-tk [puppet] - 10https://gerrit.wikimedia.org/r/216969 (https://phabricator.wikimedia.org/T101697) (owner: 10Hashar) [17:04:34] ok, merged and now puppet is running on iodine again [17:04:42] should get the updated cert and regen the chain [17:05:21] Notice: /Stage[main]/Role::Otrs/Sslcert::Std_cert[ticket.wikimedia.org]/Sslcert::Certificate[ticket.wikimedia.org]/Sslcert::Chainedcert[ticket.wikimedia.org]/Exec[x509-bundle ticket.wikimedia.org]/returns: executed successfully [17:05:27] (03PS1) 10Chad: Phabricator: Remove php5 from manifest [puppet] - 10https://gerrit.wikimedia.org/r/223351 [17:05:49] (03PS5) 10Alexandros Kosiaris: contint: install python3-tk [puppet] - 10https://gerrit.wikimedia.org/r/216969 (https://phabricator.wikimedia.org/T101697) (owner: 10Hashar) 
[17:05:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] contint: install python3-tk [puppet] - 10https://gerrit.wikimedia.org/r/216969 (https://phabricator.wikimedia.org/T101697) (owner: 10Hashar) [17:06:15] .... when things go smoothly i wonder what i missed. [17:06:34] !log otrs is now using the new sha256 cert [17:06:38] Logged the message, Master [17:06:46] robh: Don't worry, I'll find whatever you broke [17:06:49] Keegan: OTRS should be functioning normally. I can see it presenting the new certificate and it all looks good to me. [17:06:53] (03PS2) 10Alexandros Kosiaris: Phabricator: Remove php5 from manifest [puppet] - 10https://gerrit.wikimedia.org/r/223351 (owner: 10Chad) [17:06:55] I was about to say, please check ;D [17:06:59] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Phabricator: Remove php5 from manifest [puppet] - 10https://gerrit.wikimedia.org/r/223351 (owner: 10Chad) [17:07:27] robh: why did you expect it to not go smoothly? [17:07:37] paravoid: because im a pessimist! [17:07:44] I mean, I wondered why we needed to schedule it in a maintenance window etc. [17:07:48] it's really simple stuff :) [17:07:58] but no one touches otrs [17:08:00] robh: I logged in and all the bits and pieces are there [17:08:06] i couldnt be sure there wasnt some odd config change that had never restarted on it [17:08:13] ie: like i had with mailman a month or so ago [17:08:21] i understand it. it broke like every other time [17:08:26] and if i broke otrs with no window, folks would be very, very upset [17:08:30] I mean, it's FOSS, plan for the weird as usual [17:08:39] <_joe_> Krinkle: on mw1026, the global: keys were 5:1 to the other keys [17:08:49] I tend to err towards always scheduling a window [17:09:01] example, im going to add a window on thursday for my planet cert update on misc-web [17:09:03] because im paranoid ;D [17:09:17] it should be a simple cert regen [17:09:19] _joe_: 5:1, as in 5 out of 6? 
[17:09:20] <_joe_> I guess it was restarted not long before your change [17:09:23] <_joe_> Krinkle: yes [17:09:36] k [17:09:53] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1435129 (10BBlack) > but I'm not sure if I like the discrepancy Me either. I think we'd have to consider it temporary until the eqiad network was fixed up. [17:10:02] <_joe_> 388302 global keys, wtf. [17:10:09] <_joe_> it's a lot :P [17:10:24] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1435130 (10GWicke) I think overall it looks like jdk8 might be helping a little bit, but it's not making a huge difference to OOMs and memory pressure situations. Those seem to be primarily driven by mutations backing... [17:10:26] <_joe_> then we try to cache all of this on the varnishes I guess [17:10:29] <_joe_> as well [17:10:40] !log OTRS update appears to be functioning normally. As such, ending maintenance window. [17:10:44] _joe_: actually, no. Only a fraction is in varnish [17:10:44] Logged the message, Master [17:11:02] <_joe_> Krinkle: the ones that are frequently used? [17:11:29] _joe_: the majority of minification responses will be responses that are embedded in page html, or that are embedded in unversioned load.php responses (e.g. modules=startup or the stylesheet) - which does go into varnish, but only the latest version [17:11:35] <_joe_> ok [17:11:41] <_joe_> thanks [17:11:51] so after a change, the old key is no longer used. 
[17:12:07] because we include a hash of the unminified content in the key [17:12:12] (03PS1) 10Chad: Phabricator: Clean up vcs manifest with proper dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223353 [17:12:16] this was to fix race conditions during deployment [17:12:24] and also to be rollback-safe [17:12:32] <_joe_> nice [17:12:50] <_joe_> but, given hhvm's limitations, deadly :P [17:12:55] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Clean up vcs manifest with proper dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223353 (owner: 10Chad) [17:13:25] <_joe_> ok need to chill of a bit for reals [17:14:58] 6operations, 10OTRS, 6Security, 7HTTPS, 5Patch-For-Review: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1435135 (10RobH) a:5RobH>3None I've completed the SHA256 update. I'm not entirely certain what remains, but seems to be: - No PFS -config DNSSEC for that domain t... [17:15:06] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1435137 (10RobH) [17:16:03] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1435144 (10RobH) [17:17:13] (03PS2) 10Chad: Phabricator: Clean up vcs manifest with proper dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223353 [17:17:48] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1435161 (10BBlack) DNSSEC and DANE are not things we currently do. [17:18:53] PROBLEM - puppet last run on iridium is CRITICAL puppet fail [17:19:32] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1435168 (10BBlack) And the server is currently configured to do PFS with reasonably-modern clients. It can't support PFS with some older/crappier clients until the machine is upgraded to J...
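Editor's sketch of why content-hashed cache keys pile up without a TTL, as described in the exchange above: every change to the input produces a new key, and the old key is never read again, so only expiry can reclaim it. Everything here is illustrative. The key layout copies the shape of the key quoted in the channel, the use of MD5 is an assumption, and `TtlCache` is a toy stand-in for APC rather than its actual behaviour.

```python
import hashlib


def filter_cache_key(filter_name, version, content, prefix="global"):
    """Build a key that embeds a hash of the unminified content."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    return f"{prefix}:resourceloader:filter:{filter_name}:{version}:{digest}"


class TtlCache:
    """Toy in-memory cache where every entry carries an expiry time."""

    def __init__(self):
        self._items = {}

    def set(self, key, value, ttl, now):
        self._items[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._items.get(key)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]

    def purge(self, now):
        """Drop entries past their expiry, including ones never read again."""
        expired = [k for k, (_, expires) in self._items.items() if now >= expires]
        for key in expired:
            del self._items[key]

    def __len__(self):
        return len(self._items)
```

Note that `get` alone never frees an orphaned key, since nothing asks for it again; that corresponds to the "over its ttl but still not evicted" state mentioned earlier, and is why the sketch has a separate `purge` step.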
[17:22:25] hey, #ops -- is there going to be a deploy freeze during wikimania? [17:22:44] https://wikitech.wikimedia.org/wiki/Deployments doesn't say anything about it, but ISTR that's the usual practice [17:23:01] robh, greg-g: ^ [17:23:56] excellent question, I have no idea! [17:23:57] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1435187 (10BBlack) Actually that's not an accurate statement at all. The Apache2.2 on precise can't really do PFS at all in our configs :/ Still, upgrade to Jessie! [17:24:49] cscott: all i know is last wikimania there wasn't [17:24:54] cough.. [17:25:10] but there should have been? [17:25:21] maybe i'm thinking of the all-hands deploy pause [17:25:48] but i do remember a bunch of panicked ops scurrying at wikimania, yeah [17:25:57] i don't know, it's also when a lot of work is being done and more people are in one room than ever [17:26:00] but wasn't that due to a newly-released exploit? [17:26:18] i also had mw config changes in mind [17:26:25] permission groups and stuff [17:26:42] (03PS2) 10John F. Lewis: mail: ferm rules for mailman [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) [17:26:55] robh / mutante ^ mergable? :) [17:27:11] i dont know, does it even work on lucid? [17:27:13] heh gerrit told me "Can Merge No" [17:27:30] mutante: oh now that'll be the security blocker of the century :) [17:27:47] (03PS3) 10John F. Lewis: mail: ferm rules for mailman [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) [17:28:54] typically we dont do a lot of deploys in wikimania [17:29:06] because we dont want to crash the site and have folks waste time at a conference having to fix the site [17:29:16] since we have labs now, it seems even more reason to halt deployments during that [17:29:31] (you can test your new conference coordinated items in labs!)
[17:29:40] but im not sure what the actual rule is. [17:29:46] just thats how its tended to be in the past. [17:32:17] (03CR) 10RobH: [C: 031] "I think this looks good." [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [17:32:31] JohnFLewis: i like it, but im also not comfortable enough with ferm rules to be the only person to approve ;D [17:32:36] im happy to merge and babysit on sodium though [17:32:52] mutante: ^ do the rules look sane? [17:32:56] only cuz we know if sodium rejects stuff at firewall, and we fix it, the mail then goes through [17:33:05] since we've broken mail routing on sodium before and had that happen, heh [17:33:11] base::firewall probably should wait until jessie though with the new box [17:33:14] i still would want a maint window ;D [17:33:37] the rules can be added as they'll do nothing until base::firewall is specifically added anyway [17:34:34] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1435217 (10fgiunchedi) ok, I think it makes sense to reduce the variables at play and run openjdk 7 everywhere [17:34:43] JohnFLewis: robh: what about the PDNS recursor on 53 [17:35:10] mutante: shouldn't that be handled by another module/role already though? [17:35:11] i mean, it wouldnt belong in the mail role? [17:35:11] * JohnFLewis looks [17:35:25] ahh, hrmm, yea [17:35:33] seems like its just location though [17:36:48] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1435219 (10GWicke) Since we are using G1GC I'd actually vote for using JDK8 everywhere, as that's considered less mature in JDK7. [17:36:50] how is 53 being used on sodium? :) [17:37:11] inetd on 10080 [17:37:19] covered by standard? [17:37:23] i think no [17:38:21] JohnFLewis: good question, it's not applied via a role?
[17:39:14] (03CR) 10Dzahn: "the rules look sane and could be applied since they dont do anything until base::firewall is applied.. BUT do the ferm classes work ok on " [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [17:39:57] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1435225 (10fgiunchedi) that might be true, is there any evidence to suggest jdk7 vs jdk8 instances are doing better? [17:40:04] 6operations, 10OTRS, 7HTTPS, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1435226 (10RobH) 5Open>3Resolved window successfully completed [17:41:38] (03PS1) 10Muehlenhoff: enable ferm for nembus [puppet] - 10https://gerrit.wikimedia.org/r/223354 [17:42:43] PROBLEM - puppet last run on mw1018 is CRITICAL Puppet has 1 failures [17:42:53] (03PS1) 10Muehlenhoff: enable ferm for neptunium [puppet] - 10https://gerrit.wikimedia.org/r/223355 [17:43:41] (03CR) 10John F. Lewis: "I can't see how 53 or 10080 are in use. Nothing is included which would use them in role::mail::lists and following the module path shows " [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [17:43:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [17:44:02] robh / mutante ^ I can't see reason they'd exist on sodium never mind used :? [17:44:43] PROBLEM - puppet last run on mw1035 is CRITICAL Puppet has 4 failures [17:45:43] PROBLEM - puppet last run on mw1036 is CRITICAL Puppet has 4 failures [17:46:02] PROBLEM - puppet last run on mw1096 is CRITICAL Puppet has 1 failures [17:48:22] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [17:52:40] 1.26wmf13 has not been branched yet? 
[17:53:43] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:54:23] PROBLEM - RAID on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:54:29] (03PS2) 10Andrew Bogott: enable ferm for nembus [puppet] - 10https://gerrit.wikimedia.org/r/223354 (owner: 10Muehlenhoff) [17:55:24] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK All defined EventLogging jobs are runnning. [17:55:53] PROBLEM - puppet last run on mw1111 is CRITICAL Puppet has 2 failures [17:55:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:56:03] RECOVERY - RAID on eventlog1001 is OK no disks configured for RAID [17:56:54] PROBLEM - puppet last run on mw2020 is CRITICAL puppet fail [17:57:36] (03CR) 10Andrew Bogott: [C: 032] enable ferm for nembus [puppet] - 10https://gerrit.wikimedia.org/r/223354 (owner: 10Muehlenhoff) [17:59:53] RECOVERY - puppet last run on mw1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:59:53] RECOVERY - puppet last run on mw1035 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:00:05] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150707T1800). Please do the needful. [18:01:03] RECOVERY - puppet last run on mw1036 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:01:23] RECOVERY - puppet last run on mw1096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:31] anything I should be aware of before I deploy the new branch today? 
[18:06:53] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [18:08:57] !log restarted apache2 on iridium (phab hotfix) [18:09:01] Logged the message, Master [18:15:12] RECOVERY - puppet last run on mw1111 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:33] (03PS1) 10Yuvipanda: uwsgi: Do not use custom startup script for jessie [puppet] - 10https://gerrit.wikimedia.org/r/223362 [18:16:14] (03PS2) 10Yuvipanda: uwsgi: Do not use custom startup script for jessie [puppet] - 10https://gerrit.wikimedia.org/r/223362 [18:16:16] (03CR) 10John F. Lewis: [C: 031] transparency: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223229 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [18:16:21] (03CR) 10Yuvipanda: [C: 032 V: 032] uwsgi: Do not use custom startup script for jessie [puppet] - 10https://gerrit.wikimedia.org/r/223362 (owner: 10Yuvipanda) [18:17:16] (03CR) 10John F. Lewis: [C: 031] misc-web varnish: switch transparency to bromine [puppet] - 10https://gerrit.wikimedia.org/r/223227 (https://phabricator.wikimedia.org/T104937) (owner: 10Dzahn) [18:17:31] (03CR) 10John F. Lewis: [C: 031] misc-web varnish: switch annualreport to bromine [puppet] - 10https://gerrit.wikimedia.org/r/223222 (https://phabricator.wikimedia.org/T104936) (owner: 10Dzahn) [18:17:46] (03CR) 10John F. Lewis: [C: 031] annualreport: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223221 (https://phabricator.wikimedia.org/T104936) (owner: 10Dzahn) [18:19:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [18:21:24] (03PS1) 10John F. Lewis: static bugzilla: add https check [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) [18:21:40] (03PS2) 10John F. 
Lewis: static bugzilla: add https check [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) [18:21:53] ori: ori: aaarggh, there's no systemd unit for uwsgi in jessie, it's still using that fucking stupid init scrip. [18:22:02] PROBLEM - puppet last run on mw1229 is CRITICAL Puppet has 1 failures [18:22:03] PROBLEM - puppet last run on mw2116 is CRITICAL Puppet has 2 failures [18:22:03] robh: ^ merge? :) [18:22:17] sorry, in a meeting right now =] [18:22:22] (03PS1) 10Alex Monk: sql command: use slave server unless '--write' provided as an option [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) [18:22:30] actively engaged, cannot divide attention, will be back shortly =] [18:22:38] robh: I know, add it to your after-meeting list :) [18:22:42] ahh, ok [18:22:42] csteipp, I wonder if it also might be a good idea to change the default database from enwiki to... testwiki? [18:23:03] or maybe tlhwiki just to mess with people [18:23:07] :p [18:23:10] :) [18:23:43] PROBLEM - puppet last run on mw2159 is CRITICAL Puppet has 1 failures [18:23:53] PROBLEM - puppet last run on mw2071 is CRITICAL Puppet has 3 failures [18:23:55] actually that wouldn't work with this script [18:23:59] If it's defaulting to a slave though, it really shouldn't matter that it's enwiki [18:24:04] yeah [18:24:24] PROBLEM - puppet last run on mw2052 is CRITICAL Puppet has 2 failures [18:24:24] PROBLEM - puppet last run on mw2007 is CRITICAL Puppet has 5 failures [18:25:03] PROBLEM - puppet last run on mw2141 is CRITICAL Puppet has 1 failures [18:29:18] (03CR) 10Hoo man: [C: 04-1] "If we do this, the command should also get a --help (that can also appear if no DB is passed, instead of defaulting to enwiki). 
Also sql " [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) (owner: 10Alex Monk) [18:29:55] That would be consistent with mysql calling style [18:31:31] hoo: so "sql [--help|--write dbname|dbname]", with nothing meaning assume --help? [18:32:42] yeah, something like that [18:32:43] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:35:23] RECOVERY - puppet last run on mw1229 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:35:33] RECOVERY - puppet last run on mw2116 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:12] RECOVERY - puppet last run on mw2159 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:37:13] RECOVERY - puppet last run on mw2071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:38:07] (03CR) 10Springle: [C: 031] "The "issues" are IMO more about centralauth being on s7 at all, instead of on x1. 
But that's a generic problem affecting more than just sa" [software/redactatron] - 10https://gerrit.wikimedia.org/r/223344 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [18:39:23] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:39:52] RECOVERY - puppet last run on mw2007 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:40:33] RECOVERY - puppet last run on mw2141 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [18:41:52] RECOVERY - puppet last run on mw2052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:42:57] (03PS2) 10Alex Monk: sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) [18:47:07] hoo, how's that? [18:49:35] (03PS1) 10Jforrester: Enable VisualEditor by default on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223371 (https://phabricator.wikimedia.org/T104961) [18:49:45] (03PS3) 10Alex Monk: sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) [18:50:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:52:39] Krenair: mh... do you know about getopt? 
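Editor's sketch of the calling convention proposed above, "sql [--help|--write dbname|dbname]", with no arguments treated like --help and reads going to a slave unless --write is given. The real `sql` command is a shell script in the puppet repository; this Python version only illustrates the decision logic. The function names and the "master"/"slave" return labels are assumptions, and no host lookup or database connection is modelled.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="sql",
        description="Open a SQL shell on a wiki database (illustrative only).",
    )
    parser.add_argument(
        "--write",
        action="store_true",
        help="connect to the master instead of a slave",
    )
    parser.add_argument("dbname", nargs="?", help="wiki database name")
    return parser


def choose_target(argv):
    """Return ("help", usage_text) or (role, dbname) for the given arguments."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.dbname is None:
        # No database given: show the help text instead of assuming a default.
        return ("help", parser.format_help())
    role = "master" if args.write else "slave"
    return (role, args.dbname)
```

One difference from the patch title: argparse accepts `--write` either before or after the database name, whereas the change under review expects it before.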
[18:52:56] not really [18:53:20] I don't usually do bash scripting [18:56:13] hoo: I used it for a while, but moved to more intuitive interfaces quickly after [18:56:19] getopt is just too compact [18:56:26] for no apparent reason [18:56:40] especially when using it inside PHP [18:57:17] if youre using it in bash, the PHP manual might actually help [18:57:20] http://php.net/manual/en/function.getopt.php [18:57:30] bbl [19:10:07] (03PS1) 1020after4: 1.26wmf13 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223377 [19:11:13] (03CR) 1020after4: [C: 032] 1.26wmf13 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223377 (owner: 1020after4) [19:11:21] (03Merged) 10jenkins-bot: 1.26wmf13 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223377 (owner: 1020after4) [19:14:34] (03PS1) 1020after4: remove stale symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223379 [19:15:43] !log installed PHP security updates on all trusty hosts [19:15:47] Logged the message, Master [19:18:42] (03CR) 1020after4: [C: 032] remove stale symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223379 (owner: 1020after4) [19:18:50] (03Merged) 10jenkins-bot: remove stale symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223379 (owner: 1020after4) [19:20:32] !log twentyafterfour Started scap: testwiki to php-1.26wmf13 and rebuild l10n cache [19:20:36] Logged the message, Master [19:30:33] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:52] RECOVERY - Host analytics1018 is UPING WARNING - Packet loss = 50%, RTA = 1.24 ms [19:32:23] (03CR) 10Krinkle: sql command: use slave server unless '--write' provided as an option before DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) (owner: 10Alex Monk) [19:33:26] (03PS1) 10Yuvipanda: uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 [19:33:31] (03CR) 
10jenkins-bot: [V: 04-1] uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [19:33:38] (03PS2) 10Yuvipanda: uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 [19:33:41] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for 4 new analytics machines [dns] - 10https://gerrit.wikimedia.org/r/223295 (owner: 10Cmjohnson) [19:34:56] (03PS4) 10Alex Monk: sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) [19:37:42] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 90.0 [19:38:42] (03PS3) 10Yuvipanda: uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 [19:38:46] (03CR) 10jenkins-bot: [V: 04-1] uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [19:38:54] (03PS5) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [19:39:01] (03PS4) 10Yuvipanda: uwsgi: Clean up uwsgi module [puppet] - 10https://gerrit.wikimedia.org/r/223383 [19:39:19] ori: I redid the uwsgi module a bit https://gerrit.wikimedia.org/r/#/c/223383/ do take a look when you can [19:39:35] (03PS2) 10Yuvipanda: logstash: Enable user & group authz modules for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/223172 (https://phabricator.wikimedia.org/T103804) (owner: 10BryanDavis) [19:39:42] (03CR) 10Yuvipanda: [C: 032 V: 032] logstash: Enable user & group authz modules for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/223172 (https://phabricator.wikimedia.org/T103804) (owner: 10BryanDavis) [19:39:47] (03PS2) 10Yuvipanda: beta: Replace deployment-logstash1 with deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/223184 
(https://phabricator.wikimedia.org/T101541) (owner: 10BryanDavis) [19:39:53] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Replace deployment-logstash1 with deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/223184 (https://phabricator.wikimedia.org/T101541) (owner: 10BryanDavis) [19:40:23] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 14.0 [19:42:22] (03PS2) 10Chad: Gitblit: Remove ssl cert stuff [puppet] - 10https://gerrit.wikimedia.org/r/223170 [19:42:29] mutante: Trivial cleanup stuff ^ [19:43:14] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [19:43:52] PROBLEM - puppet last run on mw1097 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:12] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [19:45:33] RECOVERY - puppet last run on mw1097 is OK Puppet is currently enabled, last run 12 minutes ago with 0 failures [19:45:52] (03CR) 10ArielGlenn: "so ms1001 is a live fallback, dup of all dataset1001 data plus ready to serve by puppet change. Can be broken for a little while but not" [puppet] - 10https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [19:46:17] (03CR) 10ArielGlenn: "yes, ms1001 as the testing ground first please." 
[puppet] - 10https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [19:46:20] looking into 1005 [19:47:22] !log restarted cassandra on restbase1005 [19:47:27] Logged the message, Master [19:48:03] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [19:48:53] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [19:49:52] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.001 second response time on port 9042 [19:50:42] PROBLEM - Kafka Broker Messages In on analytics1018 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [19:51:15] (03CR) 10Gage: [C: 031] "Port 53: as far as I can tell, the powerdns instance listening on this host is doing nothing. tcpdump and strace show it as totally idle, " [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [19:52:16] jgage: oh nice :) thx [19:52:54] for checking that odd pdns there [19:55:28] (03PS1) 10BBlack: Wipe bad Equifax-signed GeoTrust_Global_CA (ca-certificates has better installed) [puppet] - 10https://gerrit.wikimedia.org/r/223390 [19:55:39] (03PS1) 10BryanDavis: beta: include deployment-mediawiki03 in scap targets [puppet] - 10https://gerrit.wikimedia.org/r/223391 (https://phabricator.wikimedia.org/T72181) [19:56:57] (03CR) 10Ori.livneh: [C: 031] "Looks good. I agree that this is easier to understand. My multi-instance Upstart pattern never caught on and I never ported it to systemd." 
[puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [19:57:12] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [19:57:45] (03CR) 10BryanDavis: "Cherry-picked to beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/223391 (https://phabricator.wikimedia.org/T72181) (owner: 10BryanDavis) [19:57:49] (03CR) 10Ori.livneh: "(+Filippo to reviewers)" [puppet] - 10https://gerrit.wikimedia.org/r/223383 (owner: 10Yuvipanda) [19:58:36] ori: sweet :) [20:00:13] !log twentyafterfour Finished scap: testwiki to php-1.26wmf13 and rebuild l10n cache (duration: 39m 41s) [20:00:18] Logged the message, Master [20:00:29] (03CR) 10BBlack: [C: 032] Wipe bad Equifax-signed GeoTrust_Global_CA (ca-certificates has better installed) [puppet] - 10https://gerrit.wikimedia.org/r/223390 (owner: 10BBlack) [20:02:06] YuviPanda ori looks good! I'm almost out of the door but will take a closer look tomorrow [20:02:28] godog: sweet :) [20:02:58] godog: ori just the second systemd unit I'm writing so would love to get double checked etc. 
Will poke tomorrow [20:07:52] (03PS3) 10Dzahn: Gitblit: Remove ssl cert stuff [puppet] - 10https://gerrit.wikimedia.org/r/223170 (owner: 10Chad) [20:08:17] (03CR) 10Dzahn: [C: 032] Gitblit: Remove ssl cert stuff [puppet] - 10https://gerrit.wikimedia.org/r/223170 (owner: 10Chad) [20:12:04] RECOVERY - Host labnet1002 is UPING OK - Packet loss = 0%, RTA = 1.88 ms [20:12:46] http://korma.wmflabs.org/browser/irc.html [20:12:52] icinga-wm wins, who would have thought [20:14:22] (03CR) 10Dzahn: [C: 04-1] dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [20:19:24] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1435790 (10Yurik) 3NEW [20:20:32] jouncebot: next [20:20:32] In 2 hour(s) and 39 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150707T2300) [20:21:22] (03PS2) 10Dzahn: install mysql-client in role::deployment:server [puppet] - 10https://gerrit.wikimedia.org/r/222533 (https://phabricator.wikimedia.org/T95436) [20:22:48] (03PS3) 10Dzahn: install mysql-client in role::deployment:server [puppet] - 10https://gerrit.wikimedia.org/r/222533 (https://phabricator.wikimedia.org/T95436) [20:26:15] (03CR) 10Dzahn: [C: 032] install mysql-client in role::deployment:server [puppet] - 10https://gerrit.wikimedia.org/r/222533 (https://phabricator.wikimedia.org/T95436) (owner: 10Dzahn) [20:32:27] 6operations, 6Discovery, 10Maps, 6Services, 3Discovery-Maps-Sprint: Puppetize Kartotherian for maps deployment - https://phabricator.wikimedia.org/T105074#1435848 (10Yurik) 3NEW [20:33:26] (03PS11) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) [20:33:33] 6operations, 10Deployment-Systems, 5Patch-For-Review: 
install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1435862 (10Dzahn) on mira: Notice: /Stage[main]/Role::Deployment::Server/Package[mysql-client]/ensure: ensure changed 'purged' to 'present' on tin: Notice: /Stage... [20:34:56] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Assign varnish memory-only role to maps servers - https://phabricator.wikimedia.org/T105076#1435863 (10Yurik) 3NEW [20:35:03] (03CR) 10Dzahn: [C: 031] "thanks Gage, also +1 then" [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [20:35:48] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1435870 (10Yurik) [20:40:13] twentyafterfour: Did you intentionally only do testwiki? [20:43:04] whenever making changings on tin, please also apply the same thing on mira [20:43:08] changes [20:43:26] mutante: What kind of changes? [20:44:19] hoo, I think he means system changes like packages etc. [20:44:29] hoo: applying puppet roles, basically [20:44:42] i hope there isnt much manuall installation [20:45:09] example: include role::releases::upload [20:45:17] i would like them to include the exact same things [20:45:46] I think you should include a comment in tin's node block about that [20:46:02] agreed. 
will do [20:46:23] * hoo wonders what's up with the train [20:46:26] let me sort that stuff first [20:48:16] hoo: yes [20:49:23] ah ok [20:50:35] (03PS1) 1020after4: group0 to 1.26wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223441 [20:51:12] (03CR) 10Ori.livneh: [C: 031] Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [20:52:00] (03CR) 1020after4: [C: 032] group0 to 1.26wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223441 (owner: 1020after4) [20:52:06] (03Merged) 10jenkins-bot: group0 to 1.26wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223441 (owner: 1020after4) [20:52:21] ready to sync to group0 now [20:53:08] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf13 [20:53:12] Logged the message, Master [20:56:27] (03PS1) 10Dzahn: mira - deploy codfw - adjust roles to be like tin [puppet] - 10https://gerrit.wikimedia.org/r/223444 [20:56:38] mutante: why not make a class that both tin and mira can subclass? I know puppet inheritance is weird but I would think that would be more reliable than a comment saying please keep these two nodes the same... 
[20:57:24] twentyafterfour: partly, yes, some things should just move into role::deployment::server, maybe some are unrelated [20:57:49] (PS1) BBlack: sslcert: output both kinds of chain(ed) files [puppet] - https://gerrit.wikimedia.org/r/223445 [20:57:51] i dont want inheritance though [20:58:05] including less roles on the node , yes [21:03:20] (PS12) Dduvall: beta: varnish backend/director for isolated security audits [puppet] - https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) [21:03:45] operations, ops-codfw, hardware-requests, Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1435944 (RobH) Open>stalled Stalled awaiting mgmt approval for purchase on https://rt.wikimedia.org/Ticket/Display.html?id=9467 [21:04:13] (PS1) Dzahn: move 'include mysql' into role deployment::server [puppet] - https://gerrit.wikimedia.org/r/223447 [21:04:39] operations, RESTBase, hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1435951 (RobH) Open>Resolved I'm resolving the hardware request for order, as we are now working on implementation. [21:05:01] (PS2) BBlack: sslcert: output both kinds of chain(ed) files [puppet] - https://gerrit.wikimedia.org/r/223445 [21:07:03] operations, Wikimedia-Mailing-lists: Ban *@utdliving.com from sending any email to the mailman server - https://phabricator.wikimedia.org/T68318#1435959 (RobH) You cannot just set the spammer address to the auto-discard filters? My understanding is that would discard the messages, but require each partic...
[21:07:24] (PS1) Dzahn: deployment::server: move backup code into role [puppet] - https://gerrit.wikimedia.org/r/223448 [21:09:30] operations, Discovery, Maps, Services, Discovery-Maps-Sprint: Puppetize Kartotherian for maps deployment - https://phabricator.wikimedia.org/T105074#1435961 (Yurik) [21:10:34] operations, Wikimedia-Mailing-lists: Ban *@utdliving.com from sending any email to the mailman server - https://phabricator.wikimedia.org/T68318#1435963 (JohnLewis) @robh since the emails go to -owner@lists.wikimedia.org, unfortunately no. These skip all mailman logic and go from exim straight to mailman... [21:10:46] robh: ^ nice try but no dice : [21:10:50] *:/ [21:11:00] JohnFLewis: daaaamnnn [21:12:44] (PS1) Dzahn: releases::reprepro: move class into autoload layout [puppet] - https://gerrit.wikimedia.org/r/223450 [21:14:02] PROBLEM - puppet last run on mw1014 is CRITICAL Puppet has 1 failures [21:14:39] (PS2) Dzahn: releases::reprepro: move class into autoload layout [puppet] - https://gerrit.wikimedia.org/r/223450 [21:14:45] (PS13) Dduvall: beta: varnish backend/director for isolated security audits [puppet] - https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) [21:16:10] !log krinkle Synchronized php-1.26wmf13/includes/resourceloader/ResourceLoader.php: T104769 (duration: 00m 13s) [21:16:14] Logged the message, Master [21:17:30] !log krinkle Synchronized php-1.26wmf12/includes/resourceloader/ResourceLoader.php: T104769 (duration: 00m 12s) [21:17:32] _joe_: ^ [21:17:57] <_joe_> Krinkle: great! [21:18:02] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [21:18:18] <_joe_> let's see what effect this has on memory usage on the appservers :) [21:18:32] _joe_: I guess it'll take 24h to make a difference?
[21:18:40] Or do a rolling restart [21:18:47] not sure if we have something for that in place [21:18:57] I can do a rolling restart [21:19:08] is being a "labsdb::manager" part of being a deployment server in general? [21:19:18] I'm gonna be out in a minute though,. [21:19:27] ori: just fyi, revert if there's any issues [21:19:38] mutante, I doubt it, but ask the labs ops about that.. [21:19:40] nod [21:19:45] robh: though I did hear mailman3 has some nice improvements to the owner system so I investigating if that might include possible integration spam/filtering handling [21:19:46] <_joe_> I'd not do that ori [21:19:56] _joe_, why not? [21:19:58] <_joe_> the effect should be seeable in 24 hours anyway [21:20:04] JohnFLewis: oh, that would be awesome, let me know what you find out [21:20:16] _joe_: only in that in 24h we'll go back to the size we have now [21:20:20] the existing cache remains per the bug [21:20:21] <_joe_> nope [21:20:23] which is huge [21:20:25] right? [21:20:32] <_joe_> no you're right [21:20:42] <_joe_> yes, back to the current size or something more [21:20:48] <_joe_> but it should stabilize [21:20:54] the cognitive overhead of having to remember to scrutinize this in a few days, given the backlog of issues we have to deal with, makes me inclined to just do it now [21:20:56] robh: will do, we should probably eventually create/collate tasks that require mailman 3 like I know a few like search and so need it [21:20:58] (03CR) 10BBlack: [C: 032] "Checked script manually, checked puppet-level in compiler" [puppet] - 10https://gerrit.wikimedia.org/r/223445 (owner: 10BBlack) [21:21:10] o/ bb in an hour or two [21:21:11] quit [21:21:11] <_joe_> as for rolling restarts, I'd restart them in batches over a couple of days tbh [21:21:25] why? 
[21:21:46] <_joe_> so that if not all the memleaks are solved, we don't have them crashing at the same time when they OOM [21:21:47] i've done cluster-wide restarts before, it's fine [21:21:49] <_joe_> like last time [21:22:03] <_joe_> just for this reason, no other :) [21:22:23] i'll watch them [21:22:31] <_joe_> me too in fact [21:22:48] <_joe_> so it's irrelevant in fact. go on :) [21:23:27] Krenair: i'm asking the DBAs because it installs "mysql/skrillex" etc [21:23:39] for sanitizing DBs for labs afaik [21:23:47] !log Restarting HHVM across all appservers [21:24:09] we'll get a very brief 5xx spike [21:24:32] (03PS1) 10BBlack: switch ticket.wm.o to proper chain.crt file [puppet] - 10https://gerrit.wikimedia.org/r/223452 [21:25:16] (03CR) 10BBlack: [C: 032 V: 032] switch ticket.wm.o to proper chain.crt file [puppet] - 10https://gerrit.wikimedia.org/r/223452 (owner: 10BBlack) [21:30:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 16.67% of data above the critical threshold [500.0] [21:31:46] (03CR) 10John F. Lewis: [C: 031] mira - deploy codfw - adjust roles to be like tin [puppet] - 10https://gerrit.wikimedia.org/r/223444 (owner: 10Dzahn) [21:32:13] (03CR) 10John F. Lewis: [C: 031] deployment::server: move backup code into role [puppet] - 10https://gerrit.wikimedia.org/r/223448 (owner: 10Dzahn) [21:32:37] (03CR) 10John F. Lewis: [C: 031] move 'include mysql' into role deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/223447 (owner: 10Dzahn) [21:33:23] (03CR) 10John F. 
Lewis: [C: 031] releases::reprepro: move class into autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/223450 (owner: 10Dzahn) [21:34:33] RECOVERY - puppet last run on mw1014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:35:44] robh: I think they're adding the 'abide delivery preferences' to -owner which is nice [21:35:45] (03PS1) 10Mattflaschen: Remove Flow_test and Flow_test_talk overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223453 (https://phabricator.wikimedia.org/T104279) [21:36:03] oh, so it would inherit the list prefrences? [21:36:11] though if all owners go 'I don't want this spam' then mailman-owner@lists.wikimedia.org will get it which is Mark :p [21:36:22] no - only the users delivery preferences [21:36:43] oh, hrmm, non ideal [21:36:46] the whole system is being ripped out form what I can tell like the user of an admin/site/master password is going [21:37:09] it's cool to hear that they finally switch to account-based permissions [21:37:14] ideally mailman would internally have a blacklist for non delivery [21:37:17] as opposed to "_the_ site password" :p [21:37:21] indeed user based permissions are awesome [21:37:30] users will be 'delegated' superuser access by account instead of a password [21:38:01] I imagine we'll end up with more folks with the right [21:38:15] now its super restrictive just due to accountability [21:39:13] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster, and 2 others: Setup a dedicated mediawiki host in Beta Cluster that we can use for security scanning - https://phabricator.wikimedia.org/T72181#1436051 (10dduvall) Paired with @demon and @thcipriani in rewriting the patch as much of the Puppet c... [21:39:15] JohnFLewis, robh: so when are we getting the new version of mailman? 
:p [21:39:23] (03PS2) 10Dzahn: mira - deploy codfw - adjust roles to be like tin [puppet] - 10https://gerrit.wikimedia.org/r/223444 [21:39:44] well, the plan is this quarter to overhaul mail, mailman, otrs [21:39:50] Krenair: 3, when ever its stable and in jessie (or the next version :p) so more than 6 months for sure :P [21:40:06] and my understanding is our mailman upgrade path is something like new os, upgrade to newest versin of mailman2 [21:40:09] then migrate to 3 [21:40:25] robh: not this quarter though for 3 :( [21:40:29] but, we havent set anythign down yet really ;D [21:40:32] yes, that, first new 2.x on jessie [21:40:35] JohnFLewis: indeed [21:40:40] but i want it! [21:40:45] So I am being an optimist. [21:40:55] (03PS1) 10GWicke: Increase concurrent_writes to 128 [puppet] - 10https://gerrit.wikimedia.org/r/223454 [21:40:56] we aren't against it, we just won't include it as a goal. [21:41:05] we need 3.1 at least anyway [21:41:28] Krenair: if you have more questions, you now have the three folks working on this in channel, ask now! ;D [21:41:28] otherwise it will be neigh impossible to upgrade from 2.x as only 3.1 will have a converter [21:41:40] JohnFLewis: makes sense [21:41:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:41:46] (03PS1) 10BBlack: remove GeoTrust_Global_CA => absent [puppet] - 10https://gerrit.wikimedia.org/r/223455 [21:41:48] (03PS1) 10BBlack: chain fixes: ganglia, icinga, lists, rt, tendril, wikitech [puppet] - 10https://gerrit.wikimedia.org/r/223456 [21:42:16] (03PS2) 10GWicke: Increase concurrent_writes to 128 [puppet] - 10https://gerrit.wikimedia.org/r/223454 [21:42:18] robh, overhauling otrs? [21:42:24] how the mail gets routed there? 
[21:42:24] (03CR) 10Dzahn: [C: 032] "note how i'm not touching tin in any way - and mira is still being setup - imho both additional roles (releasers can upload and labsdb man" [puppet] - 10https://gerrit.wikimedia.org/r/223444 (owner: 10Dzahn) [21:42:39] cleaning up the addresses pointing to otrs and making sure otrs knows which addresses those actually are? [21:42:54] Krenair: https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q1_Goals#Technical_Operations [21:43:24] upgrade version is whats listed [21:43:35] ah [21:43:37] though i imagine whoever handles point on that will then be comfortable enough to tackle other items [21:43:48] how does fr-tech depend on otrs? [21:44:01] jeff green was the last opsen to touch otrs. [21:44:04] (03PS2) 10Dzahn: move 'include mysql' into role deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/223447 [21:44:05] and he is in fr ;D [21:44:14] * Krenair facepalm [21:44:16] so we are depending on his support during the process [21:44:23] he isnt heading it though [21:44:33] so its a legit cross-departmental dependency [21:44:48] odd one, but legit. [21:44:49] so you couldn't just have put that you rely on a specific person in fr-tech rather than the whole fr-tech team? :p [21:44:52] PROBLEM - puppet last run on sodium is CRITICAL Puppet has 2 failures [21:45:04] nope, thats not how the goals listing works [21:45:23] I think that's a bit silly [21:45:27] i see most folks listing departments, not individuals. [21:45:38] you'll have to take that up with whoever designs the goals process [21:45:48] I don't feel strongly enough about the process to defend it ;D [21:45:52] (03CR) 10Dzahn: [C: 032] move 'include mysql' into role deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/223447 (owner: 10Dzahn) [21:46:00] :) [21:46:36] that example is indeed funny though [21:46:50] robh: I've found heaven [21:46:52] (not as funny for jeff!) 
[21:47:06] mailman3 allows global banning of emails from subscribing and emailing! [21:47:18] including the owners emails? [21:47:30] sounds like exactly what we want if so. [21:47:32] not owners, though I'm still searching [21:47:37] it may be owners [21:47:45] I just stopped after the line 'global bans' :p [21:48:01] had to stop reading and fist pump in the air and cheer, i get it =] [21:48:10] * mutante secretly adds exim alias, mailman3-owner: johnflewis [21:48:21] at least you are at home right? when i am in the office and do that its odd. [21:48:55] I am at home, mutante secretly? I'll see the commit or it won't last long :p [21:49:16] JohnFLewis: :) you see everything [21:49:36] of course I do. I signed the paperwork did I not? [21:49:40] aliases are in the private repo [21:49:43] you wont see it ;D [21:49:48] he still gets the mail :) [21:49:49] (03PS1) 10GWicke: Increase compaction parallelism to 15 [puppet] - 10https://gerrit.wikimedia.org/r/223457 [21:49:50] just magically start getting emails [21:50:03] on the list [21:50:18] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1436081 (10bd808) [21:50:30] robh: all the commits go to ops@lists ;) [21:51:10] oh, i suppose it does [21:51:17] i thought it was more restrictive than that on those emails [21:51:31] it won't show the diff [21:51:50] inserts innocent sounding commit message [21:52:15] "Adding myself to random-ops-alias@" [21:52:23] diff: adding JohnFLewis to mailman3-owner [21:52:52] Code-Review: +1 [21:53:01] Code review? [21:53:07] In puppet-private? [21:53:08] meh :p [21:53:12] yeah [21:53:14] I'm not in the ops team but somehow I doubt it. [21:53:25] Put the repo in gerrit! 
[21:53:30] lol [21:54:04] if you dont mind the spam [21:54:21] the main reason to not have the exim aliases in public [21:56:03] !log Restarted hhvm on mw1003 "Fatal error: Function already defined: wmfLoadInitialiseSettings in /srv/mediawiki/wmf-config/CommonSettings.php on line 187" [21:56:08] Logged the message, Master [21:57:41] robh: looking at the model docs, looks like it genuinely will be a auth system backed up by a database. looks cool [21:57:58] it sounds sane [21:58:02] find the flaw! [21:58:05] mutante: Why does mira have base::firewall, but not tin? [21:58:42] flaw: -owners still get spam [21:58:46] hoo: want to see succesful deployment on mira while having it, then enable it on tin too [21:58:54] Oh, I see [21:59:02] Meh [21:59:03] or flaw: we'll end up annoying the DBAs and they'll quit ( ;) ) [21:59:56] (03PS2) 10Dzahn: deployment::server: move backup code into role [puppet] - 10https://gerrit.wikimedia.org/r/223448 [22:00:51] (03PS2) 10BBlack: remove GeoTrust_Global_CA => absent [puppet] - 10https://gerrit.wikimedia.org/r/223455 [22:00:57] (03CR) 10BBlack: [C: 032 V: 032] remove GeoTrust_Global_CA => absent [puppet] - 10https://gerrit.wikimedia.org/r/223455 (owner: 10BBlack) [22:01:07] (03PS2) 10BBlack: chain fixes: ganglia, icinga, lists, rt, tendril, wikitech [puppet] - 10https://gerrit.wikimedia.org/r/223456 [22:02:09] (03PS1) 10Hoo man: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 [22:02:23] mutante: ^ [22:02:53] (03CR) 10jenkins-bot: [V: 04-1] Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (owner: 10Hoo man) [22:02:55] meh [22:02:57] robh: in the final comment: there is a sytle system and a restful api [22:03:01] got a tab in there [22:03:16] all of that awesome stuff but still no moderation for -owner emails (/me is done with the docs now) [22:03:28] meh, lameeeee [22:03:32] well, lots of cool [22:03:35] but one final lame [22:03:44] 
mutante, hoo, bd808: To be honest, couldn't we rsync the entire mediawiki-staging over from tin to mira to prove it works? [22:03:52] we have a debian team, hack mailman to allow it and package it for apt.wm.o ;) [22:04:03] Krenair: yup [22:04:04] (03PS2) 10Hoo man: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 [22:04:22] and adding the magic to scap will just automate keeping it in sync (from either side) [22:04:22] and then only start actually using mira when scap is set up to do that properly? [22:04:45] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1436123 (10Eevans) One reoccurring issue seems to manifest as sharply increasing pending tasks for memtable flush and mutation stages:... [22:05:04] honestly, i don't like these constructs where first nodes are combined in a regex, but then followed by "if $::hostname" [22:05:17] How do you feel about us doing swat today from mira, bd808? [22:05:19] It's only temp. right? [22:05:41] it would be interesting to know the errors at least [22:05:49] or at least attempting [22:06:01] Krenair: it doesn't bother me, but yeah no idea in how many spectacular ways it will fail [22:06:14] we have no way to sync the staging stuff yet [22:06:27] No [22:06:37] not automatically, but we can do it manually to test mira's ability to deploy with base::firewall [22:06:43] i mean, we can run rsync [22:06:44] mutante: fetch the whole common export from tin [22:06:49] via rsync [22:06:58] mutante: Not really... w/o ssh agent forwarding [22:07:13] Yes [22:07:14] arr, that's right [22:07:15] mwdeploy user [22:07:35] SSH_AUTH_SOCK=/run/keyholder/proxy.sock [22:07:36] etc. 
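The manual sync idea above (rsync the staging tree from tin to mira, with SSH_AUTH_SOCK pointed at the keyholder proxy) could be sketched roughly as follows. This is only an illustration, not what scap or anyone in the channel actually ran: the function name `build_staging_sync`, the `/srv/mediawiki-staging` path and the default dry-run behaviour are assumptions; only the socket path `/run/keyholder/proxy.sock` and the host name come from the discussion.

```python
import os


def build_staging_sync(source_host, staging_dir,
                       auth_sock="/run/keyholder/proxy.sock",
                       dry_run=True):
    """Build (argv, env) for pulling staging_dir from source_host with rsync.

    Nothing is executed here; the caller decides whether to run the command.
    The function name, directory and dry-run default are hypothetical.
    """
    env = dict(os.environ)
    # Point ssh at the keyholder proxy socket instead of a personal agent.
    env["SSH_AUTH_SOCK"] = auth_sock

    path = staging_dir.rstrip("/") + "/"
    argv = ["rsync", "-a", "-e", "ssh"]
    if dry_run:
        # Only report what would be transferred.
        argv.append("--dry-run")
    argv += [f"{source_host}:{path}", path]
    return argv, env


if __name__ == "__main__":
    argv, env = build_staging_sync("tin", "/srv/mediawiki-staging")
    print(" ".join(argv))
    print("SSH_AUTH_SOCK=" + env["SSH_AUTH_SOCK"])
```

The returned pair can be handed to `subprocess.run(argv, env=env)` once the dry-run output looks sane; the command would still have to run as a user that is allowed to use the keyholder.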
[22:07:52] robh: I'm tempted to clean up the Wikimedia-mailing-lists work board by categorising issues into mailman v2 and mailmanv3 (as in things we can solve in v2 now or after the upgrade and things v3 is needed for) thoughs? [22:07:54] mutante: ^ [22:08:11] Krenair: is the service not on mira? or is the agent just not armed? [22:08:21] the agent should be armed [22:08:25] i did that once on mira [22:08:31] and saw the monitoring for that recover [22:08:58] (03CR) 10Eevans: [C: 031] Increase compaction parallelism to 15 [puppet] - 10https://gerrit.wikimedia.org/r/223457 (owner: 10GWicke) [22:09:10] Permission denied (publickey). [22:09:23] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=mira [22:09:33] "OK: Keyholder is armed with all configured keys. " [22:09:38] oh, wait [22:09:58] it does work [22:10:01] :) [22:10:11] JohnFLewis: wfm [22:10:15] JohnFLewis: do it [22:10:20] we'll need to know when we upgrade anyhow [22:10:55] (03PS3) 10Dzahn: deployment::server: move backup code into role [puppet] - 10https://gerrit.wikimedia.org/r/223448 [22:12:26] (03CR) 10Dzahn: [C: 032] "we do the same in a bunch of other roles already" [puppet] - 10https://gerrit.wikimedia.org/r/223448 (owner: 10Dzahn) [22:14:22] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:16:42] robh: shall I create a task for managing a global ban list for troublesome emails and mark the 2/3 tickets about global bans into that one? [22:17:11] that task can then fall under v3 and we can deal with the -owner issue at a later date in that process [22:21:06] (03CR) 10Thcipriani: [C: 031] "Definitely should be a scap target if we're going to use it as a backend." 
[puppet] - 10https://gerrit.wikimedia.org/r/223391 (https://phabricator.wikimedia.org/T72181) (owner: 10BryanDavis) [22:23:20] 6operations, 10Wikimedia-Mailing-lists: Ban *@utdliving.com from sending any email to the mailman server - https://phabricator.wikimedia.org/T68318#711042 (10JohnLewis) [22:23:22] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Blacklist badoo.com globally (★ fake emails and other spam) - https://phabricator.wikimedia.org/T48021#529316 (10JohnLewis) [22:24:27] (03CR) 10Eevans: [C: 031] Increase concurrent_writes to 128 [puppet] - 10https://gerrit.wikimedia.org/r/223454 (owner: 10GWicke) [22:26:42] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1436193 (10JohnLewis) [22:27:54] PROBLEM - puppet last run on mw2169 is CRITICAL Puppet has 1 failures [22:29:40] (03PS3) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (owner: 10Hoo man) [22:31:04] JohnFLewis: soudns better to file it that way yes [22:31:08] sorry, i was on phone [22:32:37] (03PS3) 10BBlack: chain fixes: ganglia, icinga, lists, rt, tendril, wikitech, gerrit [puppet] - 10https://gerrit.wikimedia.org/r/223456 [22:33:00] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1436209 (10JohnLewis) https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Fighting_spam_in_mailman is the documentation that is on wikitech. Better spam management will c... 
[22:33:17] (03PS4) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (owner: 10Hoo man) [22:33:44] (03CR) 10BBlack: [C: 032] chain fixes: ganglia, icinga, lists, rt, tendril, wikitech, gerrit [puppet] - 10https://gerrit.wikimedia.org/r/223456 (owner: 10BBlack) [22:34:31] (03PS5) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [22:35:43] (03PS2) 10BBlack: star.planet.wikimedia.org sha256 ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/220176 (owner: 10RobH) [22:36:51] robh: can you do me a favour? (doesn't need shell but whatever makes it easier for you) [22:37:08] (03CR) 10BBlack: [C: 032] star.planet.wikimedia.org sha256 ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/220176 (owner: 10RobH) [22:37:40] JohnFLewis: depends on the favor =] [22:37:48] always does :P [22:38:38] robh: https://meta.wikimedia.org/wiki/Mailing_lists/List_info -- can you check which languages list there don't exist in the mailman templates directory? (messages, in puppet modules/mailman/files/templates/ or sodium:/etc/mailman) [22:39:18] it'll be matching language codes as per default, nice way to kill a few minutes/seconds ;) [22:39:32] (03PS1) 10Dzahn: deployment::server: move releases::upload into role [puppet] - 10https://gerrit.wikimedia.org/r/223464 [22:39:40] JohnFLewis: im not sure what you are asking me to do? [22:39:55] just cehck two public lists? [22:40:00] for a diff? [22:40:42] robh: the language boxes on that page has a list of all languages. can you check if their language code (en for English, zh for Chinese etc.) exist in the mailman templates list [22:41:51] ok so this isnt not jsut hsell but just comparing two public lists of things? 
im not sure thats a great use of my time but you help me enough that i'll help you ;] [22:42:22] robh: well it'll reduce one more mailman bug :) [22:42:39] heh, is there someplace i can paste the output so you guys arent taking my word for it? [22:42:51] cuz there is already one on the page that isnt in the tempaltes directory [22:43:03] RECOVERY - puppet last run on mw2169 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:43:37] paste it anywhere, https://phabricator.wikimedia.org/T71858 may be useful to paste. just the language code, I'll do the rest after [22:43:39] robh: https://phabricator.wikimedia.org/paste/create/ [22:43:51] mutante: ... yes, i got that [22:43:59] i know that feature exists ;D [22:44:03] i meant the deep link works [22:44:05] i meant where to link said paste, heh [22:44:29] :p [22:44:53] (03PS2) 10Dzahn: deployment::server: move releases::upload into role [puppet] - 10https://gerrit.wikimedia.org/r/223464 [22:46:24] i dont think i get what you want cuz it seems lik eyou are just asking me to list off whats in a public puppet repo versus whats on a public wiki page? [22:46:35] nothing seems like something you need me to get to? [22:47:49] and on the https://meta.wikimedia.org/wiki/Mailing_lists/List_info languages list there [22:47:54] do you mean the language links at the top? [22:47:59] sorry man, im just confused =P [22:48:00] yeah [22:48:35] so yea to the link question or yea to: [22:48:36] and I thought this would streamline things while I work on trying to evaluate which mailman issues need to be open, when they can be done, who by, why and where :) [22:48:45] the link [22:48:59] (03CR) 10Dzahn: "anyone ever looked at mw logs older than 90 days??" [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [22:49:14] ok, so im comparing the links languages on the otp of the page to a public direcotry in puppet... ok [22:49:23] (03CR) 10Alex Monk: "Ori: What about the inline comments?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [22:50:02] robh: yeah. bet that wasn't in your 'job description' when you joined the WMF like 10 years ago ;) [22:50:37] (03PS2) 10Dzahn: annualreport: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223221 (https://phabricator.wikimedia.org/T104936) [22:50:46] 6operations, 10MediaWiki-ResourceLoader, 7HHVM, 5MW-1.26-release, and 4 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1436247 (10Legoktm) [22:52:21] man the page lists a shit ton more langcodes [22:53:09] it wasn't restricted to mailman-only languages either but I'll deal with that categorisation after [22:53:58] robh: awesome thanks [22:54:04] so we ripped out language support during an upgrade for all those? [22:54:09] that seems super crappy =[ [22:54:17] expect a patch in I'm going to guess, 10 minutes? [22:54:25] to add them all back in? =] [22:54:32] anyway - look at the workboard for mailing lists - it doesn't scroll down :D [22:54:47] it all fits on screen between folds [22:54:54] my ocd approves. [22:55:06] robh: not all are mailman-translated so it seems counterproductive to do at times but hey - maybe mailman3 ;) [22:55:10] I'll do swat today [22:55:23] cool, i'll merge in said changes when you push them [22:55:26] backlog is just full of 'list requests waiting for responses' or 'is this a bug still' [22:56:09] oh, other thing, bd808: [22:56:28] sync-common will need to be able to choose which host to pull from, user-defined or at random? [22:56:29] or odd mbox permissions =P [22:56:40] Krenair: it already does that [22:56:46] oh. neat. okay. 
[22:58:02] Krenair: although it will need some updates when mira is fully online [22:58:11] but it's just config changes [22:58:31] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests, 7RESTBase-architecture: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1436286 (10GWicke) [22:58:33] bah [22:58:50] with no list of servers everything fetches from tin [22:59:01] but we can change that in -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap.cfg [22:59:07] bd808: [22:59:15] actually, I'll paste this [22:59:17] master_rsync is the default host [22:59:36] https://phabricator.wikimedia.org/P919 [23:00:02] logger timeout? [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150707T2300). Please do the needful. [23:00:05] csteipp James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] (03CR) 10CSteipp: [C: 04-1] "I'm planning to implement bryan's suggestion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [23:00:15] Ooh SWAT time [23:00:16] I'll take it [23:00:28] and it didn't actually sync the file [23:00:42] nor did it notify us here [23:00:51] Wait are you guys syncing stuff? 
[23:00:54] Well, [23:01:00] I was trying to sync a test file from mira [23:01:03] But it failed [23:01:03] so [23:01:06] back to tin [23:01:18] Pff [23:01:27] Also, csteipp's patch relies on something that's -1'd [23:01:47] OK so csteipp 1) isn't here for his SWAT, 2) listed a patch without listing its dependency, and 3) -1ed that dependency at exactly 4pm [23:01:52] So pretty much we just have James_F's patch to make VE on-by-default on wikitech [23:02:29] Yeah I'll do that one [23:02:43] And then I'll go downstairs and remind Chris that this isn't very nice [23:02:46] It's not the first time [23:02:57] (03CR) 10Catrope: [C: 032] Enable VisualEditor by default on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223371 (https://phabricator.wikimedia.org/T104961) (owner: 10Jforrester) [23:03:11] Well, I don't know if I'd say it isn't very nice [23:03:20] It's silly because we won't actually deploy his patch [23:03:23] (03Merged) 10jenkins-bot: Enable VisualEditor by default on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223371 (https://phabricator.wikimedia.org/T104961) (owner: 10Jforrester) [23:03:30] I meant more his not being on IRC [23:04:00] yeah, so we won't deploy it. meh [23:04:16] Although... who put it there [23:04:18] It's on his calendar [23:04:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [23:04:30] Oh yes he did [23:05:21] !log catrope Synchronized visualeditor-default.dblist: Enable VE by default on labswiki (duration: 00m 12s) [23:05:26] Logged the message, Master [23:06:19] WTF, I can't log into wikitech [23:06:43] ugh [23:06:47] try clearing cookies? [23:06:50] for wikitech only at least [23:06:59] (03PS1) 10John F. 
Lewis: mailman: add missing language templates [puppet] - 10https://gerrit.wikimedia.org/r/223466 (https://phabricator.wikimedia.org/T71858) [23:07:25] * JohnFLewis chants demonically "+2 +2" to robh [23:07:37] hm, nope [23:07:40] broken for me too [23:07:48] YuviPanda, wikitech login is broken. "Wikitech uses cookies to log in users. You have cookies disabled. Please enable them and try again." [23:07:55] was this the error given when nutcracker is broken? [23:08:19] JohnFLewis: ahh, so many trailing spaces! [23:08:22] ;D [23:08:40] really? [23:08:48] gerrit shows them [23:08:49] * JohnFLewis hates mediawiki translation extension [23:09:05] https://gerrit.wikimedia.org/r/#/c/223466/1/modules/mailman/files/templates/he/listinfo.html [23:09:37] i mean, it wont break anything [23:09:42] they are just ugly in gerrit [23:10:20] RoanKattouw, any luck? [23:10:25] JohnFLewis: im not going to be pedantic, i plan to +2 in merge unless you have ocd forcing you to fix now ;D [23:10:47] +2 then :p [23:10:57] (03CR) 10RobH: [C: 032] mailman: add missing language templates [puppet] - 10https://gerrit.wikimedia.org/r/223466 (https://phabricator.wikimedia.org/T71858) (owner: 10John F. 
Lewis) [23:11:02] I also have some puppet repo changes I want merged [23:11:25] https://gerrit.wikimedia.org/r/#/c/214037/ has been sitting around for a while [23:12:01] Krenair: Nope, still can't log in [23:12:03] robh: its taken a year to solve that bug, quite quickly compared to most other mailman bugs but still, a year - wow [23:12:09] I guess we'll never know if VE really is default on wikitechwiki [23:12:29] well, it looked default to me before I logged out to test this issue [23:12:48] There was a preference about disabling VE instead of enabling it [23:12:50] RoanKattouw: default for me now [23:13:08] Cool tahnks [23:14:03] with that, I shall bid you all a farewell and cya tomorrow [23:14:50] JohnFLewis: so puppet has cert failures (due to unrelated issue fixed earlier today) but since its failing that part, it simply doesnt reload or restart ssl services [23:14:54] or replace content [23:15:00] Confirmed it's working for me too. [23:15:02] but the other changes appear live from puppet run [23:15:09] RoanKattouw: editing https://wikitech.wikimedia.org/wiki/Deployments?veaction=edit is gross with VE :( [23:15:20] cool, i'm working on the puppet error for the cert stuff, cuz eww. [23:15:28] okay [23:15:45] bd808: Yes :( I still use wikitext for that one too [23:15:47] Maybe it needs some metadata added for those templates? [23:15:52] Partly [23:16:28] But most of it is that the structure of it is bascially {{echo|}} {{echo|One row of data}} {{echo|Another row of data}} ...... {{echo|
}} [23:16:40] 6operations, 10Traffic, 7HTTPS, 7Mobile: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1436332 (10BBlack) We're already sending rel=canonical to the desktop sites from both of them as well, like we do for `.m.`. The traffic logs look similar to... [23:16:44] Do we really need all those templates for anything but timing? [23:16:45] bd808: Lua is gross. [23:17:00] James_F: wikitext is gross [23:17:05] Those templates all get grouped together because you need all of them before you get balanced HTML output [23:17:06] bd808: +2. [23:17:14] (or at least sensible HTML output) [23:17:38] Right. I've heard subbu rant about that and possible solutions [23:17:43] Or.... wait no [23:17:50] (03CR) 10CSteipp: [C: 04-2] "Need to wait on this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [23:17:50] Actually in this case it surrounds the whole table with a transclusion [23:17:51] That is a way it COULD be structured, and that would be a problem [23:18:05] The way it's ACTUALLY structured, is {{schedule|item1|item2|item3|...|item18}} [23:18:37] surrounds was the wrong word [23:18:50] csteipp: Yeah we figured that change probably wasn't going out because its dependency was -1ed [23:19:11] but it's balanced, single template with other templates nested inside [23:19:20] csteipp: Also, please be on IRC at 4pm when you have a SWAT scheduled; otherwise the deployer will be sad and then move on without you [23:19:23] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail [23:19:51] RoanKattouw: Sorry about that [23:21:19] the templates in templates thing is mostly to add metadata for jouncebot [23:21:20] (03CR) 10Dzahn: [C: 032] annualreport: add role on bromine [puppet] - 10https://gerrit.wikimedia.org/r/223221 (https://phabricator.wikimedia.org/T104936) (owner: 10Dzahn) [23:21:38] (03PS3) 10Dzahn: annualreport: add role on bromine 
[puppet] - 10https://gerrit.wikimedia.org/r/223221 (https://phabricator.wikimedia.org/T104936) [23:21:48] maybe I'll get bored enough to try and come up with something that VE likes [23:26:42] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 6 failures [23:31:42] I'm not sure a wiki is really the best tool for scheduling things ... we really should move that to phabricator [23:31:45] tgr: alas, the central wiki autocreation didn't quite get us to a new user being able to follow an OAuth link all the way through account creation and back to the OAuth app. [23:32:00] (the https://wikitech.wikimedia.org/wiki/Deployments page) [23:32:59] ragesoss: can you elaborate? [23:33:14] it's not the wiki, it's the templates [23:33:18] twentyafterfour: https://www.mediawiki.org/wiki/Extension:Calendar ? [23:33:24] twentyafterfour: Do you think that will really be better (phab)? [23:34:13] bd808: yes, I do think so [23:34:16] not until we have due dates on tickets [23:34:24] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [23:34:37] mutante: phab has events, which are like tickets with due dates [23:35:18] they support projects, subscribers, attendees, policies, comment threads, and can be set as recurring [23:35:32] PROBLEM - RAID on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:35:37] twentyafterfour: that sounds good [23:35:46] ah part of the calendar stuff I haven't played with [23:35:52] https://phabricator.wikimedia.org/E18/4 [23:36:06] there were (and still are) a few bugs but it's getting very close to stable [23:36:07] that could work for SSL cert expiries [23:37:03] RECOVERY - puppet last run on sodium is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:37:12] RECOVERY - RAID on eventlog1001 is OK no disks configured for RAID [23:38:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:39:42] twentyafterfour: good point, that is nice indeed, yes [23:42:13] mutante: I like it, besides the bugs. (the worst one right now is, when creating a new event, it 503's and loses your form input if you provide even the slightest invalid input) [23:43:17] twentyafterfour: like leap seconds days :) [23:43:33] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:30] when you type 7:00 into the start time it auto-fills 8:0 in the end-time, but 8:0 isn't a valid time and bombs with an exception when you submit teh form :( [23:45:23] hrmm. yea, combined with auto-fill , hah :) [23:46:29] kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [23:46:37] kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.0199990467e-20 [23:47:19] so it's either 0 or it's not , but in both cases it's critical?? 
[23:47:57] cp3041 - HTTPS - Return code of 113 is out of bounds [23:49:23] (CR) BryanDavis: [C: 2] Increment deployment stats after sync-wikiversions [tools/scap] - https://gerrit.wikimedia.org/r/223236 (https://phabricator.wikimedia.org/T104635) (owner: 20after4) [23:49:44] (Merged) jenkins-bot: Increment deployment stats after sync-wikiversions [tools/scap] - https://gerrit.wikimedia.org/r/223236 (https://phabricator.wikimedia.org/T104635) (owner: 20after4) [23:54:05] Krenair: the failure on your scap attempt from mira was connecting to neon to announce the sync [23:54:12] !log kafka brokers 1018 & 1021 were demoted; i have triggered a leader election and they are leaders again [23:54:18] ah [23:54:19] Logged the message, Master [23:54:31] so we either need some firewalls opened or a tcpircbot in codfw [23:54:33] RECOVERY - Kafka Broker Messages In on analytics1018 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1794.89672896 [23:54:56] thanks jgage [23:55:04] something in iptables on neon? [23:55:06] (PS4) Elee: added year into logging [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) [23:55:14] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2385.38764012 [23:55:17] Krenair: quite possibly yes [23:55:31] can ops check that? [23:55:33] what does it try to run? [23:55:44] (CR) Elee: "Okay emacs is doing something really weird for me, hold on."
[debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) (owner: Elee) [23:55:51] it tries to send a message to a service on neon that announces things here [23:56:00] mutante: it tries to connect to port 9200 to send a tcpircbot message [23:56:02] I hate emacs [23:56:04] via logmsgbot [23:56:18] bd808: confirmed, i see the iptables rule for that port, can fix [23:57:59] (PS5) Elee: added year into logging [debs/adminbot] - https://gerrit.wikimedia.org/r/223046 (https://phabricator.wikimedia.org/T85803) [23:59:25] mutante, are these not puppetised?