[01:30:46] ori: about? [01:32:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0] [01:36:24] Looking for Failed to mmap persistent RDS region [01:36:27] http://irc.cakephp.nu/hhvm/2014-05-22 [01:36:30] Find IRC logs with tim in it [01:46:05] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [01:57:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [02:00:11] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed [02:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:02:07] grrrrr [02:05:49] ostriches: I've got oss-performance running. finally [02:09:33] https://github.com/hhvm/oss-performance/pull/57#issuecomment-160360650 [02:26:02] error: Cannot update the ref 'refs/remotes/origin/master': unable to append to .git/logs/refs/remotes/origin/master: Permission denied [02:26:02] From https://gerrit.wikimedia.org/r/p/mediawiki/core [02:26:02] ! 2bfde35..870ccaf master -> origin/master (unable to update local ref) [02:26:02] Updating core FAILED. [02:27:14] YuviPanda: Still about? [02:27:23] Reedy: yes but about to get in a car [02:27:29] Reedy: I'll be back in like 10mins [02:27:37] heh, ok, not urgent :) [02:28:00] !log l10nupdate failed because some git objects owned by 997:l10nupdate [02:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:34] chown -R l10nupdate: /var/lib/l10nupdate/mediawiki [02:28:39] ^ I need that running as root on tin please [02:58:50] Reedy: ok doing [02:58:55] that was more than 10min heh [03:01:13] !log run chown -R l10nupdate: /var/lib/l10nupdate/mediawiki for Reedy on tin [03:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:01:43] thanks [03:14:09] dear MW 1.26 on hhvm, why do you seem to be giving error 500? :( [03:24:12] Reedy: what's up? [03:24:26] delayed much? ;) [03:24:35] I'm quite a few steps further on than I was [03:24:51] But I seem to be getting an error 500 from hhvm on mw 1.26 [03:24:58] 1.24 is ok [03:25:03] what are you trying to do? [03:25:13] https://github.com/hhvm/oss-performance/pull/57 [03:25:37] Probably only read/skim the last couple of comments [03:27:15] I'm currently trying to work out how to reuse the hhvm bytecode cache or similar and work out why it seems to be giving me error 500 when trying to poke MW [03:28:26] wow, thanks for doing this [03:29:55] i'll clone it and give it a shot and see if i run into the same issue [03:30:14] ori: have you used it before? [03:30:23] The setup overheard is a PITA [03:30:34] mostly, to get a version 2 of siege [03:30:43] hence me being on 12.04 [03:31:47] I guess it's not impossible I'm hitting some old hhvm bug [03:34:00] woo bd808 https://phabricator.wikimedia.org/D65 :D [03:34:25] I'm working on the puppet changes to go with it now [03:34:25] * YuviPanda should use diffusion for something too [03:34:46] bd808: <3 that you've fixed some of the fixmes I left in the shell script [03:35:04] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures [03:35:05] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 70442 MB (3% inode=99%) [03:35:06] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Puppet has 1 failures [03:35:22] ori: ?? 
I can give you access to the vm I'm using if it's easier [03:35:49] gwicke: ^^ restbase alert - is that compaction running slowly again? [03:35:54] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:43] ok so it's got 80G or so left [03:37:13] if it goes below 60 I'll page gwicke [03:37:44] PROBLEM - puppet last run on mw2051 is CRITICAL: CRITICAL: Puppet has 1 failures [03:38:11] * Reedy hacks more debugging into the thing [03:38:56] /usr/bin/hhvm '-m' 'server' '-p' '8092' '-v' 'AdminServer.Port=8093' '-v' 'Server.Type=fastcgi' '-v' 'Server.DefaultDocument=index.php' '-v' 'Server.ErrorDocument404=index.php' '-v' 'Server.SourceRoot=/tmp/hhvm-nginxQglP5I/mediawiki-1.26.0' '-v' 'Eval.Jit=1' '-d' 'pid='\''/tmp/hhvm-nginxQglP5I/hhvm.pid'\''' '-c' '/home/reedy/oss-performance/base/../conf/php.ini' '-v' 'Repo.Authoritative=true' '-v' 'Repo.Central.Path=/tmp/hhvm- [03:38:57] nginxQglP5I/hhvm.hhbc' '-v' 'Server.FileCache=/tmp/hhvm-nginxQglP5I/static.content' '-v' 'Server.SourceRoot=/tmp/hhvm-nginxQglP5I/mediawiki-1.26.0' [03:38:58] There we go [03:42:13] hmm it's at 78 [03:42:20] imma page gwicke and let him know [03:43:39] done [03:48:50] (03PS1) 10BryanDavis: l10nupdate: replace ssh key with new scap script [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) [03:50:45] (03CR) 10BryanDavis: [C: 04-1] "Blocked on https://phabricator.wikimedia.org/D65" [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [03:52:59] ori: http://hhvm-oss-performance.default.reedy.uk0.bigv.io:8090/ [04:00:06] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [04:02:05] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:02:54] RECOVERY - puppet last run on mw2051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:02:54] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:03:37] damn it hhvm [04:03:40] why u no log errors [04:08:45] ori: Found it [04:08:46] MediaWiki requires the PSR-3 logging library to be present. This library is not embedded directly in MediaWiki's git repository and must be installed separately by the end user. Please see mediawiki.org for help on installing the required components. [04:08:49] FUCK YOU HHVM [04:09:26] bd808: http://hhvm-oss-performance.default.reedy.uk0.bigv.io:8090/ [04:09:29] I blame you :P [04:11:36] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 70358 MB (3% inode=99%) [04:11:55] (03CR) 10BryanDavis: "I think that T119165 explains (b). The TL;DR is that the root level sync of /srv/mediawiki-staging between tin and mira uses uids rather t" [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [04:14:55] Reedy: did you install from a tarball? [04:15:01] Yup [04:15:02] it should have psr-3 [04:15:09] Yeah, I think hhvm is doing something cooky [04:15:11] *kooky [04:15:26] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 70378 MB (3% inode=99%) [04:15:49] Where's that explicit check... 
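Note: the 02:00-03:01 exchange above is about LocalisationUpdate's git pull failing because some objects under the checkout were owned by a raw uid (997) rather than the l10nupdate user, hence "unable to append to .git/logs/refs/...: Permission denied". A minimal sketch of the diagnosis and fix; the path and the chown command come straight from the log, while the find invocation is only illustrative:

    # List anything in the checkout not owned by the l10nupdate user
    # (these are the objects that break 'git pull' for l10nupdate).
    find /var/lib/l10nupdate/mediawiki -not -user l10nupdate -ls
    # The fix Reedy asked to be run as root on tin:
    chown -R l10nupdate: /var/lib/l10nupdate/mediawiki
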
[04:16:09] I think it is in mwexceptionhandler now [04:16:24] it just sniffs for thing blowing up [04:16:26] if ( !interface_exists( '\Psr\Log\LoggerInterface' ) ) { [04:16:38] So it thinks that interface doesn't exit [04:18:21] Reedy: it should be in $IP/vendor/psr/log/Psr/Log/LoggerInterface.php [04:18:28] Yeah, it is [04:18:36] It's hhvm in repo authorative mode IIRC [04:18:50] Hang on [04:18:50] https://github.com/facebook/hhvm/issues/5834 [04:19:01] oh, maybe its not building the repo correctly [04:20:07] "Changing the line interface_exists( '\Psr\Log\LoggerInterface' ) to interface_exists( 'Psr\Log\LoggerInterface' ) in the file includes/debug/logger/LoggerFactory.php and running the hhvm-repo-mode command again solves the problem." [04:20:19] hmm.. ok [04:20:30] I see that in the report [04:21:06] hhvm's parser being picky I guess [04:21:55] I'd be prone to call that an HHVM bug [04:22:26] Looks very much to be [04:22:33] Let me try making that "fix" [04:28:12] bd808: yup, removing the leading \ fixes it [04:28:59] Now I get ERR_TOO_MANY_REDIRECTS [04:29:00] lol [04:29:58] reedy@hhvm-oss-performance:/tmp/hhvm-nginx7QltZa/mediawiki-1.26.0$ curl -L http://hhvm-oss-performance.default.reedy.uk0.bigv.io:8090/ [04:29:58] curl: (47) Maximum (50) redirects followed [04:30:12] the leading \ is canonical naming. Bad hhvm, no cookie [04:31:00] Reedy: crappy rewrite rule? [04:31:45] No rewrites [04:31:57] I think I just need to update the LocalSettings.php to match MW updates [04:32:29] > echo $wgArticlePath; [04:32:29] index.php?title=$1 [04:32:29] > echo $wgScriptPath; [04:32:34] /mediawiki [04:32:59] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836473 (10bd808) The combination of D65 and https://gerrit.wikimedia.org/r/255916 should make `l10nupdate-1` less unique when it comes to syncing things. As a bo... [04:36:45] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 70584 MB (3% inode=99%) [04:48:11] bd808: ori: Woo. All fixed up, bar the stupid hhvm bug :( [04:48:54] Reedy: we can patch mw-core, or you can file a bug against hhvm, or I guess both [04:49:02] Well, the bug is there against hhvm [04:49:18] it wouldn't be the first tweak we made to core for hhvm [04:49:28] Is it worth making the patch more widely? [04:49:46] for some strange reason, they include the full tarball in their repo [04:50:05] RECOVERY - Disk space on restbase1009 is OK: DISK OK [04:50:11] So I can just patch that one line (noting in the commit summary) until we get an upstream fix, and we fix MW core too [04:50:17] !log restarted cassandra on restbase1009 to avoid it running out of disk space; had large compaction (~2TB) at 80% and only 64G disk space left [04:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:50:57] Reedy: oh, if you can just "fix"it in their little test toy that sounds good to me [04:51:04] thanks gwicke [04:51:39] YuviPanda: yw; we really need to expand disk space in those nodes [04:51:53] see https://phabricator.wikimedia.org/T119659 [04:52:37] YuviPanda: thanks for letting me know! [04:53:18] gwicke: yw! [04:54:16] (03CR) 10Dereckson: "If you find a consensus, and need these groups, we can deploy it the December 7 week." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52) [04:55:08] (03CR) 10Ori.livneh: "Paladox, you're the only one who seems to care about this. We're not going to stick with Gitblit anyhow, so just drop it already." [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [04:55:14] (03Abandoned) 10Ori.livneh: Re enable tags [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [04:56:07] (03CR) 10Ori.livneh: "No, stop messing around with these." [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) (owner: 10Paladox) [04:56:13] (03Abandoned) 10Ori.livneh: Show more then 5 commits per repo page [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) (owner: 10Paladox) [04:57:50] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1836499 (10bd808) [04:58:12] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1834606 (10bd808) p:5Triage>3High [05:03:12] (03CR) 10Ori.livneh: [C: 04-2] "This strikes me as fundamentally the wrong approach to this problem. Don't make it easier to intervene manually; make manual intervention " [puppet] - 10https://gerrit.wikimedia.org/r/255695 (https://phabricator.wikimedia.org/T119718) (owner: 10Filippo Giunchedi) [05:06:51] bd808: I landed your scap change; want me to merge the Puppet patch and run puppet on tin? [05:07:36] hmmm... maybe we should test it manually first? [05:08:16] did you actually land it or just approve it? [05:08:25] bd808: What would you consider better? Modifying the tarball, and then tar/gz it up again, or just str_replace on the file we care about after untarring? [05:08:53] I'm leaning towards the latter. Shipping a pristine tarball to their repo seems a lot lot nicer [05:09:22] Hmm. It's now 5am [05:09:25] Reedy: yeah, either str_replace or a proper patch file seems nicer [05:09:37] lol [05:09:54] bd808: I just approved it [05:10:03] it's only 10pm here but I got sidetracked from whatever I was doing 5 hours ago [05:10:04] *accepted [05:10:19] ori: cool. I'll merge it and we can get the code deployed. [05:11:07] how has using phab / arc for code review been working out? [05:11:17] Their current way of applying patches is optional if you pass a flag [05:11:19] * Reedy grimaces [05:11:56] Reedy: it might not be worth it if it is complicated to do [05:12:07] (03PS5) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) [05:12:08] MediaWiki's workload doesn't fundamentally change from minor version to minor version [05:12:21] we're still loading wikitext from the database and parsing it [05:12:33] It's what, 3 lines of php to read a file, modify it, and shove it back out again? [05:12:34] if ($options->applyPatches) { [05:12:35] self::PrintProgress('Applying patches'); [05:12:35] $target->applyPatches(); [05:12:35] } [05:12:46] ori: arc is growing on me, but full of "oops what do I do now" differences [05:13:36] the nicest thing about gerrit is that it is basically just git. 
arc is utterly not just git [05:14:37] Reedy: does their test suite compare against php7 now? [05:14:46] * bd808 is kind of interested in that [05:15:02] It can, yeah [05:16:04] http://hhvm.com/blog/9293/lockdown-results-and-hhvm-performance [05:16:09] I thought there was a newer blog post [05:17:16] bd808: let me know when to merge [05:17:32] * YuviPanda futzes with more docker containers excitedly [05:17:44] ori: k. I'm going to test in beta cluster first (crazy I know) [05:18:17] bd808: is it alright if I go out for a bit and do it later, then? [05:18:20] or would you rather be around? [05:18:34] or should this just wait? [05:18:58] It could wait, I'll pull my -1 off the puppet patch if I get it to work in beta [05:19:18] also, can someone remind me wtf localisation updates need to go out automatically rather than with scap? [05:19:43] is l10nupdate basically me, deploying random shit on the weekend for no good reason? [05:19:45] because translators want to see their work (and do it for good reason) [05:20:03] * YuviPanda re-instates icinga check for ori [05:20:04] developers want to see their work too but they wait for the train [05:20:29] In the world of weekly trains it may be less urgent [05:20:32] i don't know why translators need gratification more immediately than developers do [05:20:46] also it does backports that scap wouldn't do [05:20:59] (master messages to older branches) [05:21:41] blergh [05:22:36] it's one of those things that needs a complete do-over [05:22:39] no time [05:23:45] ok, i'm out, good night [05:24:42] (03CR) 10Dereckson: [C: 04-1] Enable new user groups on gu.wikipedia.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52) [05:34:30] 05:34 [05:34:39] I'm out too now I think :P [05:36:19] * YuviPanda hates his current time zone [05:56:39] (03CR) 10BryanDavis: "Cherry-picked to beta cluster for testing. Good thing I did because it reminded me why we hadn't done this yet. The keyholder agent is onl" [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [06:06:56] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1836542 (10bd808) D65 is merged and deployed to beta cluster. It is not deployed to production yet. In... [06:09:35] PROBLEM - configured eth on lvs1009 is CRITICAL: eth3 reporting no carrier. [06:10:44] PROBLEM - configured eth on lvs1007 is CRITICAL: eth3 reporting no carrier. [06:11:16] PROBLEM - configured eth on lvs1008 is CRITICAL: eth3 reporting no carrier. 
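Note: the 04:08-04:28 and 05:08-05:12 discussion above concerns HHVM's Repo.Authoritative mode failing to resolve interface_exists( '\Psr\Log\LoggerInterface' ) when the name has a leading backslash (https://github.com/facebook/hhvm/issues/5834), and Reedy's idea of just doing a str_replace on the unpacked tarball instead of re-rolling it. A rough sketch of that one-line workaround, assuming the tarball is unpacked to $MEDIAWIKI_DIR and that the check lives in the file named in the upstream report; how this would hook into oss-performance's applyPatches step is not shown:

    MEDIAWIKI_DIR=/tmp/mediawiki-1.26.0   # placeholder unpack location
    # Drop the leading backslash so the repo-mode bytecode build can resolve
    # the interface; doubled backslashes are shell/sed escaping for one '\'.
    sed -i "s|interface_exists( '\\\\Psr|interface_exists( 'Psr|" \
        "$MEDIAWIKI_DIR/includes/debug/logger/LoggerFactory.php"
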
[06:30:44] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: puppet fail [06:30:54] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:55] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:55] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:26] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:45] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:24] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [06:33:06] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:06] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:42:04] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [06:56:15] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:26] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:56:56] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:58:05] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:16] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:16] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:58:25] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:36] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:24] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:10:25] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /var 70497 MB (3% inode=99%) [09:47:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0] [10:01:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:55:13] (03PS6) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) [11:17:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 88.89% of data above the critical 
threshold [5000000.0] [11:29:05] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [11:34:45] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [11:56:35] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:34:35] RECOVERY - Disk space on restbase1008 is OK: DISK OK [13:38:04] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [150.0] [13:38:35] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [13:39:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0] [13:48:16] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:49:44] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:55:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:56:06] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail [14:23:44] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:50:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [14:50:24] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [14:54:57] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837026 (10thcipriani) >>! In T119746#1836542, @bd808 wrote: > I think there is some configurability in... 
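Note: alerts like "CRITICAL: 88.89% of data above the critical threshold [5000000.0]" report the fraction of recent datapoints for a Graphite metric that exceed a fixed value. The following is only an illustrative re-creation of that calculation, not the actual Icinga plugin in use; the Graphite URL and metric path are placeholders:

    METRIC="kafka.kafka1012.ReplicaMaxLag"   # placeholder metric path
    CRIT=5000000
    curl -s "http://graphite.example.org/render?target=${METRIC}&from=-10min&format=json" |
      jq --argjson crit "$CRIT" '
        .[0].datapoints
        | map(select(.[0] != null))                      # drop null samples
        | (map(select(.[0] > $crit)) | length) / length * 100
      '                                                  # % of points above CRIT
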
[15:00:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [15:00:44] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail [15:02:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [15:08:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [15:10:25] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: puppet fail [15:28:04] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [15:37:54] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:30] (03Abandoned) 10Andrew Bogott: Remove labs_ldap_dns_ip_override hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/255495 (owner: 10Andrew Bogott) [17:26:27] (03CR) 1020after4: [C: 031] l10nupdate: replace ssh key with new scap script [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [17:41:12] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837158 (10bd808) >>! In T119746#1837026, @thcipriani wrote: > This is somewhat puppetized in the `keyh... [18:07:54] 6operations, 7Easy, 5Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1837205 (10Aklapper) @Elee: Any news here? Are you still working on this (as you're set as assignee)? [18:39:18] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837241 (10thcipriani) >>! In T119746#1837158, @bd808 wrote: > I think what I would like here is for th... [18:56:43] (03PS5) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [18:58:29] (03CR) 10jenkins-bot: [V: 04-1] etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [19:00:47] <_joe_> pffft jenkins schmenkins [19:51:36] !log importing user.user_touched (s4) from dbstore1002 to sanitarium. s4 lab will be affected for some minutes. [19:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:01] (03PS1) 10Ricordisamoa: maintain-replicas: add cx_translations and cx_translators [software] - 10https://gerrit.wikimedia.org/r/255943 [20:11:01] (03CR) 10Jcrespo: "Instead of creating a patch with a comment "I don't know what I am doing", can you create a ticket on Phabricator with what you are trying" [software] - 10https://gerrit.wikimedia.org/r/255943 (owner: 10Ricordisamoa) [20:28:04] (03CR) 10Jcrespo: "That sounded badly, I forgot the emoticon :-)" [software] - 10https://gerrit.wikimedia.org/r/255943 (owner: 10Ricordisamoa) [20:28:43] !log importing user.user_touched (s5) from dbstore1002 to sanitarium. s5 lag on labs replicas will be higher for some minutes. 
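Note: the "!log importing user.user_touched (sN) from dbstore1002 to sanitarium" entries describe refreshing a drifted column on the sanitarium host (which feeds the labs replicas) from dbstore1002. The exact procedure is not in the log, and it only touched user_touched; the following is just a generic sketch of copying one table between MariaDB hosts, with the target host and wiki database as placeholders:

    WIKI_DB=commonswiki          # placeholder wiki database (s4 in the log)
    mysqldump -h dbstore1002 --single-transaction "$WIKI_DB" user \
        > /tmp/"$WIKI_DB"_user.sql
    mysql -h SANITARIUM_HOST "$WIKI_DB" < /tmp/"$WIKI_DB"_user.sql
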
[20:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:58] !log importing user.user_touched (s6) from dbstore1002 to sanitarium. s6 lag on labs replicas will be higher for some minutes. [20:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:28] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1837350 (10Paladox) 3NEW [21:01:08] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1837358 (10Peachey88) SVN is two (or will be shortly) revision control systems ago, Do we really still need these redirects at all? [21:09:27] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1837362 (10Paladox) Not really. This is only a suggestion since some website still use the old revision of the extension. [21:21:02] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1837380 (10Peachey88) I'm sure people would still use gopher if we kept it around, I'm not sure that's a use case for keeping these domain redirects around. [21:23:38] (03CR) 10Jcrespo: [C: 04-1] "This will not work, see ticket." [software] - 10https://gerrit.wikimedia.org/r/255943 (https://phabricator.wikimedia.org/T119847) (owner: 10Ricordisamoa) [21:25:14] !log importing user.user_touched (s7) from dbstore1002 to sanitarium. s7 lag on labs replicas will be higher for some minutes. [21:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:27:57] jynus: you might want to look at https://phabricator.wikimedia.org/T119841 [22:28:37] low priority [22:28:55] category links gets cleaned up once a week [22:29:13] will look at it if it doesn't [22:30:32] jynus: I suspect that there is some database corruption involved [22:31:04] Its not normal issues, since 01 and 03 are giving different results [22:31:34] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [22:31:49] it is just that *links tables are not very reliable [22:32:13] jynus: they should be [22:32:21] if the job queue fails, it can get outdated [22:32:30] and a cache refresh fixes them [22:32:38] true, but this has nothing to do with the queue [22:33:15] our replicas should have the same data, but dont [22:33:25] oh, I agree [22:33:33] but that is almost impossible for labs [22:33:45] jynus: ?? [22:33:47] due to the filtering [22:34:00] jynus: No, you miss my meaning [22:34:32] jynus: labsdb1001 should be identical to labsdb1003 [22:34:38] regardless of the filtering [22:34:41] I agree [22:35:10] In this case labsdb1001 gives A and labsdb1003 returns B [22:36:11] which is why I suspect corruption [22:36:20] no corruption [22:36:25] data drift [22:36:30] it is not the same [22:37:56] what do you mean by data drift? [22:38:07] data is different [22:38:47] if there was data corruption, Innodb would signal and stop [22:39:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [22:39:05] jynus: why? according to the process and replication 01 and 03 should have identical data [22:39:21] according to the process? [22:40:24] jynus: we apply replication and filtering. 
each machine goes thru the same process with the same data. The results should be the same [22:40:25] we replicate in statement format ver unsafe statements, data changes are mere suggestions :-) [22:41:02] a single lock here or there makes the data drift [22:41:26] how do we fix those drifts? most users assume that there isnt drift [22:41:29] specially when data has been deleted in the first place compared to the master [22:41:54] now? doing small adjustments every now and then [22:42:22] what do you mean adjustments? [22:42:43] for ever? switching to row-based replication and creating all tables with primary keys and doing only safe queries [22:42:56] Betacommand, reimports/resyncs [22:43:03] ah [22:43:12] how often do those happen? [22:43:32] when nothing else is broken elsewhere [22:53:25] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1837436 (10jcrespo) [22:54:45] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [22:57:15] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:20:55] (03PS2) 10BryanDavis: l10nupdate: replace ssh key with new scap script [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) [23:39:58] (03CR) 10BryanDavis: "Tested core functional change on beta cluster:" [puppet] - 10https://gerrit.wikimedia.org/r/255916 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [23:48:29] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1837472 (10bd808) I updated `keyholder::agent` in https://gerrit.wikimedia.org/r/255916 to support asso... [23:54:26] ori: I'm not sure why, but I think the redis refactoring has broken the Trebuchet redis status returner in both the beta cluster and production. All the trebuchet deploys I've done lately report 0 successes. [23:54:42] was there new firewall stuff involved in the things you moved around? [23:58:50] bd808: I didn't actually modify the trebuchet redis [23:59:16] hmmm... there goes that theory then
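
Note: on the replication-drift exchange around 22:40-22:43 above: under statement-based replication, statements whose affected rows are not fully determined by the statement text can apply differently on each replica, which is how labsdb1001 and labsdb1003 can end up answering the same query differently. A toy illustration (the table name is made up, and this is not the labsdb configuration):

    # Unsafe under binlog_format=STATEMENT: without an ORDER BY, *which* 10
    # rows get deleted depends on each server's execution plan, so replicas
    # applying the same statement can diverge.
    mysql -e "DELETE FROM some_links_table LIMIT 10;"
    # The longer-term fix jynus mentions: row-based replication plus primary
    # keys on every table, so changes ship as exact row images instead of
    # statements to re-execute.
    mysql -e "SET GLOBAL binlog_format = 'ROW';"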