[00:10:14] (CR) Ori.livneh: [C: 1] graphite: introduce local c-relay (4 comments) [puppet] - https://gerrit.wikimedia.org/r/181080 (owner: Filippo Giunchedi)
[00:21:32] PROBLEM - RAID on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:08] RECOVERY - RAID on neon is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[00:29:55] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:33:39] (PS1) Springle: repool db1072. depool db1066 (odd replication error). [mediawiki-config] - https://gerrit.wikimedia.org/r/181219
[00:34:15] (CR) Springle: [C: 2] repool db1072. depool db1066 (odd replication error). [mediawiki-config] - https://gerrit.wikimedia.org/r/181219 (owner: Springle)
[00:35:03] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:35:33] !log springle Synchronized wmf-config/db-eqiad.php: repool db1072 warm up. depool db1066 replication error (duration: 00m 05s)
[00:35:39] Logged the message, Master
[00:39:38] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[00:48:19] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:19:57] <^d> Huh, gerritadmin?
[01:22:11] <^d> Krenair: Hmm, what about it?
[01:23:06] the bot that goes around commenting on phab tickets that someone uploaded a related gerrit change
[01:23:59] <^d> I haven't touched that in ages, qchris did the work to refactor it for Phab
[01:24:27] <^d> What're you wanting from it?
[01:25:59] I wondered if we could make it show the owner
[01:26:05] rather than individual patchset author
[01:26:19] <^d> I don't remember. That code's pretty janky.
[01:26:29] because the way it attributes people who just add "Bug: TNNNNN" to other people's commits is weird
[01:26:47] <^d> It's definitely more than just a config change iirc.
[01:26:50] yes
[01:27:27] <^d> In other Phab news, they fixed our "commits to repos won't show up" bug today. :)
[01:27:34] <^d> So hopefully we'll get that in our next code pull.
[01:31:54] ^d: in *other* phab news, want to deploy a patch for me?… :P
[01:32:12] <^d> On the friday before christmas?
[01:32:15] <^d> You're joking, right?
[01:32:23] https://gerrit.wikimedia.org/r/179407
[01:32:33] <^d> Heck, I shouldn't even /be/ on irc anymore, I'm on vacation dammit!
[01:32:44] not my fault no one has deployed it yet :P
[01:32:57] <^d> puppet? since when do you think I'm a root :p
[02:31:54] btw, superm401 / spagewmf are going to deploy a bug fix shortly. They have my blessing. (It's to address an issue that is breaking some userscripts)
[02:32:11] (PS1) Andrew Bogott: Reorder package install for qemu a bit. [puppet] - https://gerrit.wikimedia.org/r/181230
[02:32:15] greg-g, not only user scripts. One of them is to fix a bug that makes PageCuration/PageTriage unusable.
[02:32:22] Which is an extension.
[02:33:04] that too, yeah
[02:33:08] :)
[02:33:19] greg-g, I guess I still need to force the Tidy thing? :(
[02:35:53] (CR) Andrew Bogott: [C: 2] Reorder package install for qemu a bit. [puppet] - https://gerrit.wikimedia.org/r/181230 (owner: Andrew Bogott)
[02:37:38] * greg-g 's bus is almost home... goes
[02:44:12] PROBLEM - Host virt1012 is DOWN: PING CRITICAL - Packet loss = 100%
[02:50:02] RECOVERY - Host virt1012 is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms
[02:57:30] Hi Experts, I would like to read the sampled web request log, i.e. api.log on fluorine and sampled-1000.log on erbium, for research purposes. Is it possible? How should I proceed? I'd appreciate your advice!
[02:57:39] !log mattflaschen Synchronized php-1.25wmf12/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/: Fix to PageTriage not to use jQuery live (duration: 00m 05s)
[02:57:45] Logged the message, Master
[02:57:56] !log mattflaschen Synchronized php-1.25wmf13/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/: Fix to PageTriage not to use jQuery live (duration: 00m 07s)
[02:57:58] Logged the message, Master
[02:59:03] !log mattflaschen Synchronized php-1.25wmf12/resources/src/jquery.tipsy/jquery.tipsy.js: Fix "live" deprecated live mode of jQuery tipsy (duration: 00m 05s)
[02:59:05] Logged the message, Master
[02:59:09] meng: Hey, you're going to want to contact some of our research/analytics team to discuss that. I can send you emails via PM
[02:59:15] !log mattflaschen Synchronized php-1.25wmf13/resources/src/jquery.tipsy/jquery.tipsy.js: Fix "live" deprecated live mode of jQuery tipsy (duration: 00m 05s)
[02:59:17] Logged the message, Master
[02:59:47] thanks Jamesofur!
[03:04:54] superm401: that tipsy example JS from T69989 works OK on mw.org
[03:05:22] spagewmf, thanks for testing.
[03:05:55] well, it logs a deprecation warning
[03:05:56] PROBLEM - configured eth on virt1012 is CRITICAL: Connection refused by host
[03:06:17] PROBLEM - DPKG on virt1012 is CRITICAL: Connection refused by host
[03:06:17] PROBLEM - Disk space on virt1012 is CRITICAL: Connection refused by host
[03:06:37] PROBLEM - dhclient process on virt1012 is CRITICAL: Connection refused by host
[03:06:52] PROBLEM - puppet last run on virt1012 is CRITICAL: Connection refused by host
[03:07:01] PROBLEM - RAID on virt1012 is CRITICAL: Connection refused by host
[03:07:04] PROBLEM - salt-minion processes on virt1012 is CRITICAL: Connection refused by host
[03:15:14] RECOVERY - configured eth on virt1012 is OK: NRPE: Unable to read output
[03:15:22] RECOVERY - DPKG on virt1012 is OK: All packages OK
[03:15:22] RECOVERY - Disk space on virt1012 is OK: DISK OK
[03:15:46] RECOVERY - dhclient process on virt1012 is OK: PROCS OK: 0 processes with command name dhclient
[03:15:51] RECOVERY - puppet last run on virt1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:16:19] RECOVERY - RAID on virt1012 is OK: OK: no RAID installed
[03:17:09] Reedy, can you make me an admin on https://test.wikipedia.org/wiki/Special:NewPagesFeed ? I need to test PageTriage, and it looks like it's misconfigured so only admin has 'patrol'
[03:17:16] Mattflaschen (WMF)
[03:23:09] Patroller might be enough too, if it exists.
[03:28:33] legoktm, ^
[03:28:55] (if yr still around)
[05:50:44] (Krinkle handled.)
[05:52:27] (Matt left..)
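[The PROBLEM/RECOVERY lines throughout this log follow the standard Nagios/NRPE plugin convention: a check's exit code encodes its state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of that convention in Python; the check name and thresholds below are illustrative only, not taken from the actual virt1012 checks:]

```python
# Nagios-style plugin convention: exit code encodes the check state.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def check_disk_free(percent_free, warn=20, crit=10):
    """Return (exit_code, message) in the Nagios plugin style.

    Thresholds are hypothetical: below `crit` is CRITICAL, below
    `warn` is WARNING, otherwise OK.
    """
    if percent_free < crit:
        code = 2
    elif percent_free < warn:
        code = 1
    else:
        code = 0
    return code, f"{STATES[code]}: {percent_free}% free"

code, msg = check_disk_free(5)
print(msg)  # CRITICAL: 5% free
```

[Icinga turns a non-zero exit code into a PROBLEM notification and a later zero exit code into the matching RECOVERY, which is exactly the pairing seen above.]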
[06:17:51] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:18:31] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:18:31] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:31:33] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:31:41] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:34:31] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:53] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:57] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:11] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:22] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:52] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:52] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:36] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:45:57] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:49:09] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:49:10] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:19:47] RECOVERY - salt-minion processes on virt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:29:08] PROBLEM - salt-minion processes on virt1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:18:07] (CR) 20after4: [C: 1] admin: grant twentyafterfour gallium [puppet] - https://gerrit.wikimedia.org/r/181211 (owner: John F. Lewis)
[13:38:34] (CR) Hashar: [C: 1] "Deploying a Zuul change is done directly on gallium for now: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Update_configurati" [puppet] - https://gerrit.wikimedia.org/r/181211 (owner: John F. Lewis)
[14:05:24] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail
[14:20:42] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:01:43] Hi there, where can I find some resources about caching in the wikipedia platform?
[18:03:02] jazgot: Is there anything in particular you want to know?
[18:03:38] jazgot: You might start at https://www.mediawiki.org/wiki/Manual:Cache
[18:11:34] JohnLewis, in general I'm looking for information about infrastructure and how you handle huge loads of traffic, in particular what technologies are used for code and object cache
[18:13:39] bd808, does this apply equally to the mediawiki engine and wikipedia?
[18:18:14] jazgot: Yes, insofar as wikipedia is a MediaWiki deployment. In WMF's prod server cluster we use memcached for $wgMainCacheType, CDB localization cache files, and two layers of Varnish for statically cacheable assets
[18:19:06] The production MediaWiki configuration is shown at https://github.com/wikimedia/operations-mediawiki-config
[18:23:00] The PHP5 servers that remain in the prod cluster use APC byte code caching, but the majority of servers have been converted to HHVM, which has its own byte code cache system (hhbc), described a bit at http://hhvm.com/blog/4061/go-faster
[18:25:43] I can't find a great page describing our Varnish configuration, but it is similar to the older Squid caching layer -- https://wikitech.wikimedia.org/wiki/Squids
[18:25:48] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[18:44:01] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:48:31] bd808, thanks for the resources, I'm just looking for ideas which I could adopt in my company
[18:49:37] jazgot: https://www.mediawiki.org/wiki/User:Aaron_Schulz/How_to_make_MediaWiki_fast is a good think to read through
[18:49:45] *thing
[18:51:23] I found this article https://wikitech.wikimedia.org/wiki/Load_balancing_architecture, there are 6 TCP hops in the request route, isn't that a lot? I know most requests will stop at varnish but still...
[18:53:06] The LVS layers are basically a transparent pass-through. And these hosts are all in the same data center.
[18:53:57] The varnish frontend-to-backend hop is generally communication within the same rack (and sometimes the same physical host)
[18:55:19] But honestly, that's how a service tier is scaled. I don't know how you could pull any of those layers out and still have a separation of duties among services
[19:43:21] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail
[19:58:16] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[20:08:22] (PS1) John F. Lewis: etherpad: move behind misc-lb.eqiad [puppet] - https://gerrit.wikimedia.org/r/181268
[20:08:32] (PS1) John F. Lewis: etherpad->misc-web-lb.eqiad [dns] - https://gerrit.wikimedia.org/r/181269
[20:29:24] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[20:40:15] anyone have a list of RT #s and where they ended up in phabricator?
[20:45:06] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:45:40] !log restarted webperf service statsd-mw-js-deprecate on hafnium. It seems it did not send metrics to statsd after an EventLogging restart.
[20:46:36] Krenair: In a status update from the migration, it was said that RT issues start at https://phabricator.wikimedia.org/T78842 and end at https://phabricator.wikimedia.org/T84827
[20:46:53] But I am not sure if that is (still) correct.
[20:47:03] are they all in the same order?
[20:47:37] I don't know.
[20:48:14] rt10 is T78850, rt 12 is T78851
[20:48:31] So there are at least holes.
[20:48:32] @seen Jamesofur
[20:48:41] what was the wm-bot command again?
[20:49:26] "@seen" looks right. But wm-bot did not react to my !log above. So maybe wm-bot is having issues?
[20:49:59] it worked in #mediawiki and #wikimedia-tech
[20:50:42] qchris: logmsgbot is the one that does !logs, i think
[20:51:02] yeah
[20:51:03] MatmaRex: Argh. Right. Sorry. My bad.
[20:51:12] but it didn't react either
[20:51:16] so that's probably not good
[20:51:37] (wm-bot has different configurations on different channels, and i can never tell which commands work in the given one)
[20:53:09] Krenair: @seen is a private message command.
[20:53:23] /msg wm-bot @seen Jamesofur
[20:53:27] worked for me.
[20:53:51] it also worked in -tech
[20:54:08] or was it #mediawiki. one of the two
[20:54:57] k
[20:57:36] qchris, pm
[20:58:04] !log restarted webperf "ve" service on hafnium. It seems it did not send metrics to statsd after an EventLogging restart.
[22:20:22] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail
[22:24:16] (PS1) BryanDavis: beta: honor log sampling for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/181349
[22:24:18] (PS1) BryanDavis: monolog: honor log sampling for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/181350
[22:27:10] (PS2) BryanDavis: monolog: honor log sampling for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/181350
[22:27:12] (PS3) BryanDavis: monolog: enable for group0 + group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/181130
[22:29:02] (CR) BryanDavis: [C: -1] "Needs I6d79f27e59cbd5ed0dd441707b328179102cb2f0 which in turn needs Icd14fc8c44ca9eef0f3f5cc4f1d1d8b68d517f07 in MW on group0+group1." [mediawiki-config] - https://gerrit.wikimedia.org/r/181130 (owner: BryanDavis)
[22:29:54] (CR) BryanDavis: [C: -1] "Needs Icd14fc8c44ca9eef0f3f5cc4f1d1d8b68d517f07 on group0." [mediawiki-config] - https://gerrit.wikimedia.org/r/181350 (owner: BryanDavis)
[22:30:26] (CR) BryanDavis: [C: -1] "Needs Icd14fc8c44ca9eef0f3f5cc4f1d1d8b68d517f07 to be merged first." [mediawiki-config] - https://gerrit.wikimedia.org/r/181349 (owner: BryanDavis)
[22:35:24] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:45:30] (PS7) BryanDavis: logstash: parse json encoded hhvm fatal errors [puppet] - https://gerrit.wikimedia.org/r/179759
[22:45:32] (PS5) BryanDavis: logstash: Parse apache syslog messages [puppet] - https://gerrit.wikimedia.org/r/179480
[23:18:55] (CR) Ori.livneh: [C: 1] logstash: Parse apache syslog messages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/179480 (owner: BryanDavis)
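[The "honor log sampling for logstash" patches above, and the sampled-1000.log file mentioned earlier, both rest on the same idea: keep only one out of every N log events so that high-volume channels stay affordable. A minimal deterministic 1-in-N sketch in Python; the class name and interface are hypothetical, not the actual monolog/wmf-config implementation:]

```python
class SampledLog:
    """Keep one out of every `rate` messages (deterministic 1-in-N sampling)."""
    def __init__(self, rate):
        self.rate = rate
        self.count = 0   # messages seen so far
        self.kept = []   # messages actually retained

    def log(self, message):
        self.count += 1
        # Retain every rate-th message; the rest are dropped.
        if self.count % self.rate == 0:
            self.kept.append(message)

log = SampledLog(rate=1000)
for i in range(10_000):
    log.log(f"request {i}")
print(len(log.kept))  # 10
```

[Counts read off a sampled stream must be multiplied back by the rate (here, each kept line represents ~1000 requests); production setups often use a random rather than counter-based choice so the sample is not biased by periodic traffic.]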