[00:00:05] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150205T0000). Please do the needful. [00:00:39] mwahah empty SWAT [00:01:00] andrewbogott: where is virt1000 physically? [00:01:13] both in eqiad [00:01:21] virt1000 is in the labs network, silver is not [00:01:22] you might be able to use nice to influence which things are killed first by OOMkiller [00:01:32] " if the task has nice value above zero, its score doubles " [00:01:54] springle: But as long as you talk to both via the public ip they should connect just fine... [00:01:59] springle: is that what you meant? [00:02:33] mutante: thanks! With any luck we will… not have to think about how to optimize OOMkiller behavior :) [00:03:42] you can explicitly set oomkill preference as well [00:04:18] with /proc/<pid>/oom_adj [00:05:16] (and that gets simpler for daemons under systemd, as you can add OOMScoreAdjust= to their unit file: http://www.freedesktop.org/software/systemd/man/systemd.exec.html) [00:06:02] !log xtrabackup clone virt1000 to silver [00:06:10] Logged the message, Master [00:08:03] andrewbogott: that's fine. just missing my racktables pw on this laptop, and wondered where the boxes were [00:08:38] springle: so what happens next? [00:08:53] !log Added 'dduvall' to integration group ACL on Gerrit [00:08:57] Logged the message, Master [00:09:25] andrewbogott: it's running. now we wait. once it finishes there is a backup preparation step to get silver replicating from virt1000, then we decide to flip the switch [00:09:36] (03PS1) 10Dzahn: ci/jenkins: add public key for VE sync to puppet [puppet] - 10https://gerrit.wikimedia.org/r/188708 (https://phabricator.wikimedia.org/T84731) [00:09:56] springle: cool. I’m also not 100% sure that the wiki is set up properly, so there will be a ‘test’ step in there someplace :) [00:10:11] any idea how long the replication takes? hours, days? [00:10:45] andrewbogott: hehe np. i would think hours. we can do a production shard clone in hours, and they're much larger [00:11:27] andrewbogott: silver can keep replicating until the wiki is ready i guess [00:11:32] cool [00:11:54] OK, I’m going to take a break, will check back in this evening. [00:11:56] Thank you! [00:12:03] yw [00:12:52] (03PS2) 10Dzahn: ci/jenkins: add public key for VE sync to puppet [puppet] - 10https://gerrit.wikimedia.org/r/188708 (https://phabricator.wikimedia.org/T84731) [00:15:28] 3operations: Re: Need to add fundraising contractor to email list, Phabricator T87672 - https://phabricator.wikimedia.org/T87674#1016500 (10Krenair) [00:22:56] (03PS3) 10Springle: use m1-master CNAME [puppet] - 10https://gerrit.wikimedia.org/r/188508 [00:23:49] (03CR) 10Springle: [C: 032] use m1-master CNAME [puppet] - 10https://gerrit.wikimedia.org/r/188508 (owner: 10Springle) [00:25:51] 3operations: Re: Need to add fundraising contractor to email list, Phabricator T87672 - https://phabricator.wikimedia.org/T87674#1016555 (10chasemp) >>! In T87674#998834, @Aklapper wrote: > This is a Phab task about a Phab task. Could someone help me understand? delightfully meta! [00:29:36] 3operations: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1016586 (10Dzahn) [00:33:06] 3operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1016598 (10Dzahn) Cool, thanks.
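For reference on the OOM-killer tuning discussed above (nice pushing a task's badness score up, the per-process /proc/<pid>/oom_adj knob, and systemd's OOMScoreAdjust=), a minimal sketch; the PID and unit name are placeholders, and on newer kernels oom_score_adj (range -1000..1000) supersedes the older oom_adj (-17..15):

    # Inspect and lower a process's OOM-killer preference (placeholder PID 1234).
    cat /proc/1234/oom_score                  # current badness score
    echo -500 > /proc/1234/oom_score_adj      # newer interface: -1000 (never kill) .. 1000
    echo -15  > /proc/1234/oom_adj            # legacy interface mentioned in the log: -17 .. 15

    # The same for a systemd-managed daemon, via a drop-in (hypothetical unit "mydaemon"):
    mkdir -p /etc/systemd/system/mydaemon.service.d
    printf '[Service]\nOOMScoreAdjust=-500\n' > /etc/systemd/system/mydaemon.service.d/oom.conf
    systemctl daemon-reload && systemctl restart mydaemon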
So that means they are both in puppet, just without the redis role, intentionally. That should probably wait for the patch above first. Because that role also... [00:36:23] do you have an opinion on including "standard" only on nodes directly or inside roles [00:38:35] I would tend to think we'd have all nodes be in some role, and all roles inherit from a base role, and that base role include standard [00:38:42] but I think we're a few seperate steps from such a thing [00:40:15] in this case we want to test a few things before applying the role [00:40:35] but since that role includes admin and we dont in site.pp, it's like.. you don't even have vim, just old vi [00:40:38] 3MediaWiki-Core-Team, operations: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1016638 (10tstarling) I think the reason rotation is done daily is because logrotate is a daily cron job, and does not support a shorter rotation period. It could be replaced by 50 lines of your favourit... [00:40:43] (and none of the other standard packages) [00:41:08] s/admin/standard/ ..gee [00:42:25] well the current count is we're using it in 28 roles from manifests/roles/, 82 times in site.pp, and 3 times in modules for some reason [00:42:38] hah, yea :) it's well mixed [00:45:41] (03PS1) 10Dzahn: include standard on rbf codfw nodes [puppet] - 10https://gerrit.wikimedia.org/r/188713 [00:46:35] (03PS2) 10Dzahn: include standard on rbf codfw nodes [puppet] - 10https://gerrit.wikimedia.org/r/188713 (https://phabricator.wikimedia.org/T86898) [00:47:35] (03PS1) 10Ori.livneh: Update my (=ori's) deployment shell helpers [puppet] - 10https://gerrit.wikimedia.org/r/188714 [00:47:42] (03CR) 10Dzahn: [C: 032] include standard on rbf codfw nodes [puppet] - 10https://gerrit.wikimedia.org/r/188713 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [00:47:46] andrewbogott_afk: fyi when you return, silver db is up and replicating. we still need to puppetize /etc/mysql/my.cnf and review grants [00:50:54] phaste was the thing to paste into phab from any shell? [00:51:15] (03PS1) 10Dzahn: add base::firewall on codfw redis nodes [puppet] - 10https://gerrit.wikimedia.org/r/188715 [00:51:22] (03CR) 10jenkins-bot: [V: 04-1] add base::firewall on codfw redis nodes [puppet] - 10https://gerrit.wikimedia.org/r/188715 (owner: 10Dzahn) [00:54:32] !log truncated redis input queues for logstash on all 3 hosts to see if cluster can keep up now with 3 elasticsearch writer threads [00:54:41] Logged the message, Master [00:54:43] (03PS2) 10Dzahn: add base::firewall on codfw redis nodes [puppet] - 10https://gerrit.wikimedia.org/r/188715 [00:56:13] 3operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1016695 (10Dzahn) added "standard" in https://gerrit.wikimedia.org/r/#/c/188713/ and ran puppet on both, so they installed all of the default stuff, and we have a usable editor and all that... [00:56:54] 3MediaWiki-Core-Team, operations: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1016697 (10ori) >>! In T88393#1016638, @tstarling wrote: > I think the reason rotation is done daily is because logrotate is a daily cron job, and does not support a shorter rotation period. It could be... 
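The usage counts quoted above (28 roles, 82 entries in site.pp, 3 in modules) are the sort of thing a quick grep over an operations/puppet checkout yields; a rough sketch only, assuming the include is written literally as "include standard":

    # Where is "standard" included, and how often?
    grep -c 'include standard' manifests/site.pp
    grep -rl 'include standard' manifests/roles/ | wc -l
    grep -rl 'include standard' modules/ | wc -l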
[00:58:42] (03PS3) 10Dzahn: add base::firewall on codfw redis nodes [puppet] - 10https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) [01:00:14] (03PS1) 10BryanDavis: logstash: move apifeatureusage output after default [puppet] - 10https://gerrit.wikimedia.org/r/188718 [01:02:56] (03CR) 10Gage: [C: 032] logstash: move apifeatureusage output after default [puppet] - 10https://gerrit.wikimedia.org/r/188718 (owner: 10BryanDavis) [01:06:01] (03PS1) 10Dzahn: redisdb: add ferm::service for redis-server [puppet] - 10https://gerrit.wikimedia.org/r/188719 (https://phabricator.wikimedia.org/T86898) [01:07:01] (03CR) 10Dzahn: "also see Change-Id: I0957b96b99d525d" [puppet] - 10https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [01:08:41] (03PS1) 10Ori.livneh: mw-log-cleanup: find and compress uncompressed rotated files [puppet] - 10https://gerrit.wikimedia.org/r/188720 [01:08:45] Tim-away: ^ [01:09:54] !change 178873 | ori [01:09:55] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [01:09:59] * mutante misses that bot [01:10:18] * ori reviews [01:11:39] thanks, and ignore the path conflict.. just lemme know if the general move seems ok :p [01:11:50] also made a bug for it because it touches them all [01:12:37] (03PS1) 10BryanDavis: Limit runJobs output to warning and higher severity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188722 [01:16:16] (03CR) 10MaxSem: "Omg autoloader invocation:P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188722 (owner: 10BryanDavis) [01:16:37] mutante: it looks good, but i think you should move the includes from site.pp into a small role [01:17:16] (03CR) 10Ori.livneh: "it looks good, but i think you should move the includes from site.pp into a small role" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [01:17:27] MaxSem: ah true. should I use a string constant instead? [01:18:06] The `\Psr\Log\LogLevel::WARNING` == 'warning' [01:18:07] hmmmmmm [01:18:17] yep,sounds right [01:18:22] and even more readable [01:18:30] cool I'll amend [01:18:37] ori: thanks, i'll look [01:18:44] why does gerrit use flash btw? [01:18:51] clipboard access [01:19:00] ah:) my browser just blocked that [01:19:00] for the "copy url" feature [01:20:04] http://caniuse.com/#feat=clipboard shows the standard api as being supported for 85% of users, so it's kinda antiquated now, but not long ago flash was the best way to do this. 
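The mw-log-cleanup change above ("find and compress uncompressed rotated files") boils down to a one-liner of roughly this shape; the log directory and filename pattern here are illustrative, not necessarily what the actual patch uses:

    # Compress rotated MediaWiki logs that were left uncompressed.
    find /srv/mw-log -type f -name '*.log-2*' ! -name '*.gz' -mtime +1 -exec gzip {} +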
[01:20:36] (03PS2) 10BryanDavis: Limit runJobs output to warning and higher severity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188722 [01:21:57] (03PS1) 10Dzahn: switch servermon to misc-web [dns] - 10https://gerrit.wikimedia.org/r/188723 (https://phabricator.wikimedia.org/T88427) [01:22:00] (03PS3) 10Dzahn: misc-web-lb changes to support servermon.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/188389 (https://phabricator.wikimedia.org/T88427) (owner: 10RobH) [01:22:09] (03CR) 10jenkins-bot: [V: 04-1] misc-web-lb changes to support servermon.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/188389 (https://phabricator.wikimedia.org/T88427) (owner: 10RobH) [01:22:52] (03PS1) 10Ori.livneh: Set $wgUDPProfilerHost to service alias rather than hard-code IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188724 [01:22:53] ^ AaronSchulz [01:23:46] 3MediaWiki-Core-Team, operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1016814 (10Dzahn) p:5Triage>3Normal [01:24:14] (03CR) 10Aaron Schulz: [C: 031] Set $wgUDPProfilerHost to service alias rather than hard-code IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188724 (owner: 10Ori.livneh) [01:24:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:26:42] 3Continuous-Integration, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1016833 (10Dzahn) yea, let's keep using .deb instead of bundler. in reply to "removed a few stupid checks", we can also disable them. afaict each chec... [01:26:57] (03PS2) 10Ori.livneh: Set $wgUDPProfilerHost to service alias rather than hard-code IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188724 [01:27:08] (03CR) 10Ori.livneh: [C: 032] Set $wgUDPProfilerHost to service alias rather than hard-code IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188724 (owner: 10Ori.livneh) [01:27:13] (03Merged) 10jenkins-bot: Set $wgUDPProfilerHost to service alias rather than hard-code IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188724 (owner: 10Ori.livneh) [01:29:19] (03PS2) 10Ori.livneh: Update my (=ori's) deployment shell helpers [puppet] - 10https://gerrit.wikimedia.org/r/188714 [01:30:51] (03CR) 10Ori.livneh: [C: 032] Update my (=ori's) deployment shell helpers [puppet] - 10https://gerrit.wikimedia.org/r/188714 (owner: 10Ori.livneh) [01:37:24] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [01:40:57] that's me, got pulled into an unscheduled meeting [01:40:59] syncing now [01:41:41] !log ori Synchronized wmf-config/CommonSettings.php: I7b270eb8a: Set $wgUDPProfilerHost to service alias rather than hard-code IP (duration: 00m 05s) [01:41:43] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [01:41:48] Logged the message, Master [01:49:03] 3Scrum-of-Scrums, operations, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1016879 (10GWicke) [01:51:31] jouncebot: reload [01:51:41] jouncebot: refresh [01:51:43] I refreshed my knowledge about deployments. [01:51:55] thanks jouncebot. 
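The "Unmerged changes on repository mediawiki_config on tin" alert above clears once the merged change is pulled into the staging copy and synced out, which is what ori then does; roughly, assuming the scap tooling of the time:

    # On the deployment host (tin): pull the merged config change, then sync it.
    cd /srv/mediawiki-staging
    git pull
    sync-file wmf-config/CommonSettings.php 'Set $wgUDPProfilerHost to service alias rather than hard-code IP'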
you're pretty cool [02:03:04] 3Services, operations: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1016913 (10GWicke) [02:19:27] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s) [02:19:34] Logged the message, Master [02:20:34] !log LocalisationUpdate completed (1.25wmf14) at 2015-02-05 02:19:31+00:00 [02:20:38] Logged the message, Master [02:20:50] (03PS1) 10Kaldari: Adding original language of this work campaign for WikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 [02:30:30] (03PS1) 10Ori.livneh: Set a statsd-compatible $wgStatsFormatString [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188734 [02:33:58] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s) [02:34:04] Logged the message, Master [02:35:05] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-05 02:34:02+00:00 [02:35:09] Logged the message, Master [02:36:26] (03CR) 10Ori.livneh: [C: 032] Set a statsd-compatible $wgStatsFormatString [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188734 (owner: 10Ori.livneh) [02:36:32] (03Merged) 10jenkins-bot: Set a statsd-compatible $wgStatsFormatString [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188734 (owner: 10Ori.livneh) [02:37:20] i got a 502 bad gateway error while trying to preview an edit, and it was just a standard nginx one, not the usual wmf one [02:37:51] !log ori Synchronized wmf-config/CommonSettings.php: Ia59e654e8: Set a statsd-compatible $wgStatsFormatString (duration: 00m 07s) [02:37:54] Logged the message, Master [02:38:20] bblack: ^ (jackmcbarn's message) [02:41:34] (03PS2) 10Ori.livneh: mw-log-cleanup: find and compress uncompressed rotated files [puppet] - 10https://gerrit.wikimedia.org/r/188720 [02:41:43] (03CR) 10Ori.livneh: [C: 032 V: 032] mw-log-cleanup: find and compress uncompressed rotated files [puppet] - 10https://gerrit.wikimedia.org/r/188720 (owner: 10Ori.livneh) [02:46:03] PROBLEM - puppet last run on gold is CRITICAL: CRITICAL: Puppet has 1 failures [02:46:44] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Puppet has 1 failures [02:46:54] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 2 failures [02:47:04] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [02:47:14] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Puppet has 1 failures [02:47:24] PROBLEM - puppet last run on platinum is CRITICAL: CRITICAL: Puppet has 1 failures [02:47:24] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures [02:47:43] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: Puppet has 1 failures [02:47:53] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [02:51:49] jackmcbarn: a standard nginx 502 would suggest that the nginx SSL terminator was not able to speak with varnish. I pinged bblack because he has been upgrading both tiers in the past few days and is therefore most likely to have a clue about why you saw the 502. 
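For the plain-nginx 502 jackmcbarn reports above, one quick check is to repeat the request and look at which layer answered; a diagnostic sketch (the headers named are typical of a varnish-behind-nginx stack, not taken from this log):

    curl -sI 'https://en.wikipedia.org/wiki/Main_Page' | egrep -i '^(HTTP/|server:|x-cache|via:)'
    # A bare "Server: nginx" 502 with no X-Cache/Via headers suggests the SSL
    # terminator answered directly because it could not reach varnish.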
[02:51:59] thanks for reporting it [03:03:04] RECOVERY - puppet last run on platinum is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [03:03:44] RECOVERY - puppet last run on gold is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [03:03:54] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [03:04:24] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [03:04:25] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [03:04:34] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [03:04:45] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [03:05:14] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [03:06:04] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [03:26:28] ori / jackmcbarn: it's possible it was an isolated transient thing, we'll have to grep logs unless it's reproducible I think [03:27:15] I don't know, offhand, of any reason that should be happening. Not much has been happening with the caches cluster the past couple of hours, I've just been editing docs. [03:46:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [03:47:04] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [03:52:07] 3Services, operations: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1017016 (10bd808) Can basic testing setup be added to the template/skeleton project as well? At $DAYJOB-1 we made a skeleton project and used Phing (our PHP build tool of choice)... [04:03:54] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:08:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:09:03] anyone know where/how these addresses are configured? https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/requesttracker/files/rt.aliases [04:15:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:17:29] 3Scrum-of-Scrums, operations, RESTBase, Services: Create a /revision/{revision} entry point - https://phabricator.wikimedia.org/T88652#1017048 (10GWicke) 3NEW a:3GWicke [04:19:53] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [04:23:30] 3Scrum-of-Scrums, operations, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1017063 (10GWicke) [04:23:43] 3Scrum-of-Scrums, operations, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#21142 (10GWicke) [04:26:57] 3Scrum-of-Scrums, operations, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1017075 (10GWicke) [04:28:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:28:37] springle: I’m back. 
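bblack's "we'll have to grep logs" above would look something like the following with salt; the nginx log path and the match are assumptions for illustration only:

    # Count 502s on the SSL terminators around the reported time window.
    salt -t 30 'cp1*' cmd.run "grep -c ' 502 ' /var/log/nginx/access.log 2>/dev/null || true"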
I’m trying to access the wiki on silver and it says ‘You don't have permission to access /wiki/ on this server.’ My guess is that’s an Apache thing and not a mysql thing though… [04:28:43] * andrewbogott digs in logs [04:31:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:37:34] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [04:40:18] well, crap. As usual I am baffled by apache configs [04:40:29] andrewbogott: the silver db is read-only, fwiw. is silver wiki using same db user/pass as virt1000 did? [04:41:07] The error (so far) is reading an index file. So not even hitting the db yet. [04:41:13] But, yes, should be identical passwords. [04:41:26] springle: If you feel like looking at the Apache config I welcome your help. [04:42:42] (03PS1) 10BBlack: disable cp1064 upload backend, tuning issues... [puppet] - 10https://gerrit.wikimedia.org/r/188744 [04:42:59] (03CR) 10BBlack: [C: 032 V: 032] disable cp1064 upload backend, tuning issues... [puppet] - 10https://gerrit.wikimedia.org/r/188744 (owner: 10BBlack) [04:43:01] hm, apache2.conf is very different between silver and virt1000, probably trusty vs. precise [04:44:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:47:10] andrewbogott: the most likely search results i find talk about a "Require all granted" clause for apache 2.4 [04:47:25] but i'm not that up to date with this stuff [04:48:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:49:12] springle: that was exactly it! Now on to the next problem :) [04:49:31] heh [04:49:39] (03PS1) 10Springle: m2-master CNAME to dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/188745 [04:50:21] (03CR) 10Springle: [C: 032] m2-master CNAME to dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/188745 (owner: 10Springle) [04:55:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Feb 5 04:54:00 UTC 2015 (duration 53m 59s) [04:55:10] Logged the message, Master [05:01:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [05:06:22] (03PS1) 10Andrew Bogott: Support the new 'Require all granted' rule on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/188746 [05:07:34] (03CR) 10Andrew Bogott: [C: 032] Support the new 'Require all granted' rule on Trusty. 
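The fix springle points to is the Apache 2.2 → 2.4 access-control change (Precise ships 2.2, Trusty ships 2.4): the old Order/Allow directives are replaced by Require. In outline:

    # Apache 2.2 (Precise) style inside a <Directory> block:
    #   Order allow,deny
    #   Allow from all
    # Apache 2.4 (Trusty) equivalent:
    #   Require all granted
    # After editing the vhost, sanity-check and reload:
    apache2ctl configtest && service apache2 reload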
[puppet] - 10https://gerrit.wikimedia.org/r/188746 (owner: 10Andrew Bogott) [05:09:33] (03PS1) 10Andrew Bogott: Include the private wikitech password config on silver [puppet] - 10https://gerrit.wikimedia.org/r/188747 [05:10:42] (03PS2) 10Andrew Bogott: Include the private wikitech password config on silver [puppet] - 10https://gerrit.wikimedia.org/r/188747 [05:12:08] (03CR) 10Andrew Bogott: [C: 032] Include the private wikitech password config on silver [puppet] - 10https://gerrit.wikimedia.org/r/188747 (owner: 10Andrew Bogott) [05:13:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [05:19:43] (03PS1) 10Andrew Bogott: Second attempt at adding 'Require all granted' [puppet] - 10https://gerrit.wikimedia.org/r/188750 [05:20:44] (03CR) 10Andrew Bogott: [C: 032] Second attempt at adding 'Require all granted' [puppet] - 10https://gerrit.wikimedia.org/r/188750 (owner: 10Andrew Bogott) [05:28:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [05:36:24] (03PS1) 10Andrew Bogott: Puppetize some symlinks that are on virt1000 and seem important. [puppet] - 10https://gerrit.wikimedia.org/r/188751 [05:37:03] (03CR) 10jenkins-bot: [V: 04-1] Puppetize some symlinks that are on virt1000 and seem important. [puppet] - 10https://gerrit.wikimedia.org/r/188751 (owner: 10Andrew Bogott) [05:38:06] (03PS2) 10Andrew Bogott: Puppetize some symlinks that are on virt1000 and seem important. [puppet] - 10https://gerrit.wikimedia.org/r/188751 [05:39:19] (03CR) 10Andrew Bogott: [C: 032] Puppetize some symlinks that are on virt1000 and seem important. [puppet] - 10https://gerrit.wikimedia.org/r/188751 (owner: 10Andrew Bogott) [05:47:24] springle: ok, I’m going to have to wait until tomorrow and bug some Mediawiki folks about the mw config bugs I’m seeing now. It’s pretty clear that if I were to do a ‘sync-common’ on wikitech that it would break there as well. [05:47:44] Meanwhile… let’s get the db into shape. Is there a standard puppet pattern for setting up my.cnf? [05:59:28] andrewbogott: this ia stock ubuntu mysql, so i guess mysql::server. if you want to try switching to mariadb, then we add a mariadb role [06:00:22] switching might be fine, or it might introduce some unknown surprise variables for openstack :) idk [06:02:31] oh man, switching to mysql::server is going to be messy [06:02:43] I guess that’s configured via hiera? [06:02:57] no clue :) i tend to ignore it [06:03:53] ok… well, anyway, I can work on this without you. What about grants — are there decisions to make there? [06:03:53] production is very simple in comparison. mariadb roles that just deploy a few things from erb [06:06:54] andrewbogott: pm link [06:07:13] we need to decide if those apps will be connecting from local or remote clients [06:07:28] hm... [06:08:08] on silver, many of those (glance, keystone, neutron, nova, puppet) are moot, and can be dropped. [06:08:21] The wiki dbs only need to be visible to localhost, as far as I can think. [06:08:47] do you plan to leave the non-wiki stuff on virt1000? [06:10:04] yes [06:10:09] virt1000 will remain as an openstack controller [06:10:21] and silver will just be the wikitech wiki [06:10:30] um… virt1000 will remain the puppetmaster too [06:10:37] Unless even /that/ turns out to be too much for it :( [06:10:56] This isn’t a full migration from virt1000 but rather an attempt to split responsibilities. 
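Since the wiki database on silver only needs to be reachable from localhost (per the discussion above), the grants can be scoped to local connections; a sketch with placeholder database, user and password names, not the actual wikitech grants:

    mysql -e "GRANT ALL PRIVILEGES ON labswiki.* TO 'wikiadmin'@'localhost' IDENTIFIED BY 'secret';"
    mysql -e "SELECT user, host FROM mysql.user WHERE user='wikiadmin';"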
[06:12:03] gotcha [06:12:25] andrewbogott: then i think grants are fine [06:12:41] ok [06:12:47] also, why is switching to mysql::server a big deal? [06:13:10] Switching silver isn’t, I’m just scared to do it on virt1000 [06:13:17] if silver is only handling wiki, doe sit need to use that openstack::database-server::mysql [06:13:20] oh [06:13:48] If I fork that code a bit more I can probably use mysql::server on silver and leave virt1000 be. [06:14:03] That’s probably the right approach for now. [06:14:22] Since (as you note) it’s weird to use these ‘openstack’ classnames on silver if it’s not going to have any openstack on it. [06:14:38] since if silver is only wiki and we know wiki likes mariadb, how about i just add a mariadb role for it? [06:15:05] works for me, if it’s easy [06:15:22] if we start deploying to it, it should be consistent with the rest of the cluster anyway [06:16:13] yep [06:16:27] ok, doing that [06:17:33] andrewbogott: hey! you still up? [06:17:44] YuviPanda: for a few more minutes. What’s up? [06:17:58] ah, was going to ask how the migration is going but I'll just read backlog instead :) [06:18:09] so 'tis ok [06:18:37] YuviPanda: it’s going ok. My current issue is with some mediawiki config settings that don’t seem to work on silver. [06:18:50] I suspect that if I were to apply them (via sync-common) on virt1000 that they’d break there too [06:19:13] …basically, assumptions in the wmf-standard config that we need to make exceptions for. [06:19:21] aaah [06:19:33] If you’re interested in hunting them down… I can set you on the path [06:19:44] Otherwise I will bug bd808|BUFFER tomorrow [06:20:49] nah, I think I've enough beta stuff on my hands... [06:20:57] 'k [06:20:59] and don't want to get involved in something and basically just leave immediately [06:21:45] YuviPanda: oh, another thing that’s trivial that I haven’t done yet — we could increase the time between puppet runs to 40 mins. That’ll substantially reduce the load on virt1000 and I bet no one will care [06:21:59] noooo :) [06:22:00] well [06:22:04] we can do that as a temp. measure [06:22:09] but ideally we want to match prod.. [06:22:24] I guess, although I can’t imagine how that difference could possibly matter [06:22:37] some things depend on it [06:22:40] like, shinken for example [06:22:47] new host config is generated every puppet run [06:22:49] really? On the puppet interval? [06:22:58] so I'll just have to wait longer [06:23:11] I should perhaps move it to a cron instead. [06:23:12] hm. Ok, if it’s not free, then probably not needed. [06:24:01] andrewbogott: can remove role::nova::manager and openstack::database-server::mysql from silver? [06:24:11] or does it still need the former somehow [06:24:35] nova::manager is where the wiki comes from [06:24:40] so please leave that [06:24:44] hehe [06:25:13] as long as you arrange for the password to stay the same, you can remove openstack::database-server::mysql and passwords::openstack::nova [06:25:35] "the" password? [06:25:36] well, and wikitech::wiki::passwords since that was a mistake and never should’ve been there in the first place [06:26:03] $controller_mysql_root_pass [06:26:25] Hm… probably the wiki doesn’t depend on that, I’m not sure. [06:26:32] Mostly I was blindly reproducing virt1000 [06:26:33] why does it need to know root pw? 
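Stretching the labs puppet run interval to 40 minutes, as floated above, is a single agent setting; a sketch assuming the stock puppet agent daemon rather than a cron-driven run:

    # /etc/puppet/puppet.conf on the agents:
    #   [agent]
    #   runinterval = 2400    # seconds (40 min); puppet's default is 1800
    sed -i 's/^runinterval.*/runinterval = 2400/' /etc/puppet/puppet.conf
    service puppet restart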
tsounds dangerous [06:27:29] openstack::database-server::mysql knows it to set it when creating the db in the first place [06:27:55] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:01] and drop it in /root/.my.cnf I believe [06:28:24] mariadb::config will clash with that [06:28:31] or override [06:28:53] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:04] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:24] springle: ok, I’m looking through puppet… pretty sure nothing depends on any particular password on silver [06:30:34] so, I think mariadb::config can do its worst, shouldn’t matter [06:33:17] (03PS1) 10Springle: mariadb config for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/188754 [06:34:49] springle: right now mysql data is in /srv/mysql [06:34:54] yep [06:34:57] your patch has it in /srv/sqldata [06:35:08] I don’t care, just so you’re aware that it needs moving [06:35:13] is that a problem? [06:35:14] ok [06:35:39] sqldata just brings it into line with everything else around here [06:35:53] though it isn't a lovely name [06:36:14] (03CR) 10Andrew Bogott: [C: 031] "Looks good as long as you remember to stop everything and mv /srv/mysql /srv/sqldata at some strategic time." [puppet] - 10https://gerrit.wikimedia.org/r/188754 (owner: 10Springle) [06:36:44] andrewbogott: strategic time will be any second now. that ok? [06:36:49] yep! [06:36:56] (03CR) 10Springle: [C: 032] mariadb config for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/188754 (owner: 10Springle) [06:38:18] springle: I’m about 5 minutes from bedtime. Any last requests? [06:39:22] andrewbogott: nope. sleep well [06:39:54] ok — thanks for wrangling all this. With any luck I’ll get the wiki working tomorrow. [06:40:07] :) [06:43:09] !log upgrade silver to mariadb 10 [06:43:14] Logged the message, Master [06:44:21] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1017154 (10Yuhong) This is a myth. "RapidSSL SHA256 CA - G3" is definitely SHA256. The other intermediate is for compatibility with older clients such as XP/Server 2003 (but note that XP has root certificate update enab... [06:45:14] 3operations: The certificate chains of newly installed SHA256 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1017156 (10Yuhong) This is a myth. "RapidSSL SHA256 CA - G3" is definitely SHA256. The other intermediate is for compatibility with older clients such as XP/Server 2003 (but note... 
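The "strategic time" step agreed above (stop everything, move the datadir so it matches the puppetized config, start again) is roughly:

    service mysql stop
    mv /srv/mysql /srv/sqldata
    # /etc/mysql/my.cnf (or the puppetized config) must now point datadir at /srv/sqldata.
    service mysql start
    mysqladmin ping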
[06:45:35] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:44] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:47:05] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:43:01] (03PS1) 10Springle: point dbproxy1001 at m1 db1001 db1016 [puppet] - 10https://gerrit.wikimedia.org/r/188756 [07:46:45] (03CR) 10Springle: [C: 032] point dbproxy1001 at m1 db1001 db1016 [puppet] - 10https://gerrit.wikimedia.org/r/188756 (owner: 10Springle) [07:51:06] (03PS3) 10Giuseppe Lavagetto: mediawiki: use lru pcre cache for all mediawiki hhvm installations [puppet] - 10https://gerrit.wikimedia.org/r/188531 [07:51:19] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use lru pcre cache for all mediawiki hhvm installations [puppet] - 10https://gerrit.wikimedia.org/r/188531 (owner: 10Giuseppe Lavagetto) [07:51:27] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki: use lru pcre cache for all mediawiki hhvm installations [puppet] - 10https://gerrit.wikimedia.org/r/188531 (owner: 10Giuseppe Lavagetto) [08:40:58] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not escape urls in the catchall redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/188762 [08:41:35] 3operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1017349 (10Joe) [08:42:03] 3operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010014 (10Joe) Patch is up for review here: https://gerrit.wikimedia.org/r/188762 [08:44:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't want to add anything new at first. We should first get things to work, then make them better before we go to prod. Adding new thin" [puppet] - 10https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [09:03:01] greetings [09:03:11] hello [09:03:21] <_joe_> heya godog [09:04:07] ciao _joe_ [09:04:22] <_joe_> I am hating the puppet DSL, as usual [09:04:39] <_joe_> how difficult it is to do simple things [09:05:22] (03CR) 10Alexandros Kosiaris: [C: 031] "Needs to be coordinated though with the puppet change" [dns] - 10https://gerrit.wikimedia.org/r/188723 (https://phabricator.wikimedia.org/T88427) (owner: 10Dzahn) [09:07:39] 3Continuous-Integration, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1017388 (10hashar) From my comment on https://gerrit.wikimedia.org/r/#/c/188375/ ---- Ubuntu provides a package for puppet-lint 1.1.0 ( http://packag... [09:08:12] (03CR) 10Hashar: [C: 04-1] "Other comments have been made on T88430 , I have copy pasted my above comment there. Lets see what is the preferred approach." [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T88430) (owner: 10Hashar) [09:10:42] (03CR) 10Alexandros Kosiaris: "For now varnish is going to be the parsoid dual layer varnishes but in non-caching setup (same as for citoid/cxserver). 
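On the NE (noescape) fix tracked in T88359 above: without that flag mod_rewrite re-escapes characters that are already percent-encoded in the request, so a catch-all redirect to HTTPS can turn e.g. %2F into %252F. A minimal illustration of the flag, not the actual production rule:

    # In the HTTP vhost:
    #   RewriteEngine on
    #   RewriteRule ^(.*)$ https://%{HTTP_HOST}$1 [R=301,NE,L]
    apache2ctl configtest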
The SSL terminatio" [dns] - 10https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194) (owner: 10Filippo Giunchedi) [09:45:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 641 [09:50:15] RECOVERY - check_mysql on db1008 is OK: Uptime: 40408 Threads: 3 Questions: 167043 Slow queries: 383 Opens: 1270 Flush tables: 2 Open tables: 64 Queries per second avg: 4.133 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:04:22] (03PS2) 10Florianschmidtwelzow: mediawikiwiki: Allow sysop to add and remove themself from translationadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) [10:06:52] (03PS1) 10Yuvipanda: toollabs: Add generic webgrid node type [puppet] - 10https://gerrit.wikimedia.org/r/188776 (https://phabricator.wikimedia.org/T1102) [10:07:30] (03PS2) 10Yuvipanda: toollabs: Add generic webgrid node type [puppet] - 10https://gerrit.wikimedia.org/r/188776 (https://phabricator.wikimedia.org/T1102) [10:07:37] (03CR) 10Yuvipanda: [C: 032] toollabs: Add generic webgrid node type [puppet] - 10https://gerrit.wikimedia.org/r/188776 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [10:08:04] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1017454 (10faidon) [10:08:05] 3operations, Continuous-Integration: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1017451 (10faidon) 5Open>3Resolved a:3faidon It took me all of 3 minutes to reprepro includedeb/dsc puppet-lint into our apt for both precise &... [10:08:08] (03CR) 10Yuvipanda: [V: 032] toollabs: Add generic webgrid node type [puppet] - 10https://gerrit.wikimedia.org/r/188776 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [10:09:19] 3operations: revisit what percentiles are calculated by txstatsd - https://phabricator.wikimedia.org/T88662#1017457 (10fgiunchedi) 3NEW [10:09:37] (03PS2) 10Filippo Giunchedi: gdash: deprecate 75percentile and median [puppet] - 10https://gerrit.wikimedia.org/r/188573 (https://phabricator.wikimedia.org/T88662) [10:16:12] (03PS1) 10Yuvipanda: toollabs: Add nodejs support to webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/188779 (https://phabricator.wikimedia.org/T1102) [10:16:32] (03PS2) 10Yuvipanda: toollabs: Add nodejs support to webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/188779 (https://phabricator.wikimedia.org/T1102) [10:16:45] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Add nodejs support to webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/188779 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [10:47:25] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1017533 (10fgiunchedi) >>! In T88412#1016386, @Tgr wrote: >> I'm sure this is known already but it'd seem more logical for such errors to return a 400 to the client since... 
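The db1008 SLOW_SLAVE alert above keys off Seconds_Behind_Master; checking it by hand on the replica looks like:

    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'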
[11:05:28] (03PS6) 10Giuseppe Lavagetto: mediawiki: allow using a different web user than apache [puppet] - 10https://gerrit.wikimedia.org/r/187259 (https://phabricator.wikimedia.org/T78076) [11:14:26] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: allow using a different web user than apache [puppet] - 10https://gerrit.wikimedia.org/r/187259 (https://phabricator.wikimedia.org/T78076) (owner: 10Giuseppe Lavagetto) [11:16:12] 3Services, operations: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1017567 (10fgiunchedi) +1 also I think it'd make sense to capture what metadata we need in a (per-repo?) file so templates and other artifacts can be regenerated at will [11:18:19] (03PS3) 10Filippo Giunchedi: gdash: deprecate 75percentile and median [puppet] - 10https://gerrit.wikimedia.org/r/188573 (https://phabricator.wikimedia.org/T88662) [11:20:36] (03PS4) 10Filippo Giunchedi: gdash: deprecate 75percentile and median [puppet] - 10https://gerrit.wikimedia.org/r/188573 (https://phabricator.wikimedia.org/T88662) [11:21:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "merging for now to unbreak dashboards, can be revisited later" [puppet] - 10https://gerrit.wikimedia.org/r/188573 (https://phabricator.wikimedia.org/T88662) (owner: 10Filippo Giunchedi) [11:27:24] (03PS4) 10Giuseppe Lavagetto: labstore: do not explicitly declare the apache user existence [puppet] - 10https://gerrit.wikimedia.org/r/187686 (https://phabricator.wikimedia.org/T78076) [11:28:48] (03CR) 10Giuseppe Lavagetto: [C: 032] labstore: do not explicitly declare the apache user existence [puppet] - 10https://gerrit.wikimedia.org/r/187686 (https://phabricator.wikimedia.org/T78076) (owner: 10Giuseppe Lavagetto) [11:32:39] (03PS4) 10Giuseppe Lavagetto: maintenance: allow choosing the web user [puppet] - 10https://gerrit.wikimedia.org/r/187687 (https://phabricator.wikimedia.org/T78076) [11:34:12] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: allow choosing the web user [puppet] - 10https://gerrit.wikimedia.org/r/187687 (https://phabricator.wikimedia.org/T78076) (owner: 10Giuseppe Lavagetto) [11:37:17] (03PS4) 10Giuseppe Lavagetto: beta: allow defining the web user. [puppet] - 10https://gerrit.wikimedia.org/r/187688 (https://phabricator.wikimedia.org/T78076) [11:38:08] (03CR) 10Giuseppe Lavagetto: [C: 032] beta: allow defining the web user. [puppet] - 10https://gerrit.wikimedia.org/r/187688 (https://phabricator.wikimedia.org/T78076) (owner: 10Giuseppe Lavagetto) [11:48:24] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [11:48:25] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [11:52:34] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [11:52:35] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
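A quick way to confirm which user the app servers actually serve as, relevant to the apache → www-data switch being staged in the patches above (a generic command, not the check used here):

    ps -C apache2,hhvm -o user= -o comm= | sort | uniq -c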
[11:54:55] (03PS1) 10Filippo Giunchedi: graphite: move gdash+performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188788 (https://phabricator.wikimedia.org/T85909) [11:55:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: move gdash+performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188788 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [12:06:03] 3Continuous-Integration, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1017628 (10hashar) No FUD intended. I just wanted to tied the puppet lint version within the source repository and Gemfile/bundler is a way to achieve... [12:09:49] (03PS4) 10Filippo Giunchedi: Make gdash's uWSGI config.ru Ruby 1.9-compatible [puppet] - 10https://gerrit.wikimedia.org/r/188069 (https://phabricator.wikimedia.org/T85909) (owner: 10Ori.livneh) [12:09:54] (03PS1) 10Yuvipanda: toollabs: Fix errors with nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188790 (https://phabricator.wikimedia.org/T1102) [12:09:56] (03PS1) 10Yuvipanda: beta: Make web user be www-data instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/188791 (https://phabricator.wikimedia.org/T78076) [12:09:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Make gdash's uWSGI config.ru Ruby 1.9-compatible [puppet] - 10https://gerrit.wikimedia.org/r/188069 (https://phabricator.wikimedia.org/T85909) (owner: 10Ori.livneh) [12:10:04] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Fix errors with nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188790 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [12:10:13] (03CR) 10jenkins-bot: [V: 04-1] beta: Make web user be www-data instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/188791 (https://phabricator.wikimedia.org/T78076) (owner: 10Yuvipanda) [12:10:15] 3Continuous-Integration, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1017647 (10hashar) [12:10:40] (03PS2) 10Yuvipanda: toollabs: Fix errors with nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188790 (https://phabricator.wikimedia.org/T1102) [12:10:56] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Fix errors with nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188790 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [12:10:58] (03CR) 10Filippo Giunchedi: "I checked permissions on tungsten for /srv/deployment/gdash/gdash/lib and it seems readable already by www-data, merging" [puppet] - 10https://gerrit.wikimedia.org/r/188069 (https://phabricator.wikimedia.org/T85909) (owner: 10Ori.livneh) [12:11:10] (03PS2) 10Yuvipanda: beta: Make web user be www-data instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/188791 (https://phabricator.wikimedia.org/T78076) [12:11:15] YuviPanda: good to merge? [12:11:20] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1017650 (10hashar) Faidon has backported puppet-lint 1.1.0 from Ubuntu Vivid in our apt for both Precise and Trusty. I am upgrading puppet-lint on the CI slaves. [12:11:22] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Make web user be www-data instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/188791 (https://phabricator.wikimedia.org/T78076) (owner: 10Yuvipanda) [12:11:25] godog: yup! [12:11:29] godog: can you merge ^ as well? 
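Moving the lint options into .puppet-lint.rc (the change under review above) means puppet-lint picks them up however it is invoked; a sketch with example flags rather than the repository's actual list:

    # .puppet-lint.rc is read from the working directory, one option per line, e.g.:
    #   --no-80chars-check
    #   --no-autoloader_layout-check
    puppet-lint manifests/
    puppet-lint modules/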
[12:11:37] ack, done [12:11:56] (03CR) 10Hashar: "Faidon has backported puppet-lint 1.1.0 from Ubuntu Vivid in our apt for both Precise and Trusty. I am upgrading puppet-lint on the CI sla" [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T88430) (owner: 10Hashar) [12:18:29] (03PS2) 10Hashar: Move puppet-lint options to .puppet-lint.rc [puppet] - 10https://gerrit.wikimedia.org/r/188375 [12:19:11] _joe_: those icinga mw* alerts are you, right? [12:19:25] <_joe_> paravoid: which ones? [12:19:33] WARNING: Puppet is currently disabled, last run 1 hour ago with 0 failures [12:19:35] <_joe_> oh did I forgot to reenable puppet [12:19:36] <_joe_> right [12:19:37] <_joe_> sorry [12:19:47] <_joe_> I was working with yuvi on converting beta [12:19:55] <_joe_> and forgot to reenabling it [12:19:56] <_joe_> meh [12:20:08] <_joe_> I'll do in a few [12:20:12] :) [12:21:02] (03CR) 10Hashar: "I have removed all the bundle / Gemfile stuff, so this patch is just about moving puppet-lint options from the rakefile to .puppet-lint.rc" [puppet] - 10https://gerrit.wikimedia.org/r/188375 (owner: 10Hashar) [12:21:03] <_joe_> paravoid: we're moving beta to www-data, see if someone screams, then move to prod [12:21:58] <_joe_> the mediawikis are ok, basically [12:28:13] (03PS1) 10KartikMistry: WIP: cxserver: Use different registry for beta and production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [12:30:04] !log Upgrading Jenkins and restarting it [12:30:09] Logged the message, Master [12:38:18] (03PS1) 10Giuseppe Lavagetto: mw1018 uses www-data as its main user [puppet] - 10https://gerrit.wikimedia.org/r/188797 [12:40:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "mw1018 is already running as www-data. This should allow me to enable puppet on that host" [puppet] - 10https://gerrit.wikimedia.org/r/188797 (owner: 10Giuseppe Lavagetto) [12:40:38] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1017764 (10Chmarkine) >>! In T73156#1017154, @Yuhong wrote: > This is a myth. "RapidSSL SHA256 CA - G3" is definitely SHA256. The other intermediate is for compatibility with older clients such as XP/Server 2003 (but no... [12:41:25] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:25] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:34] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:35] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:45] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:55] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:56] <_joe_> wat? 
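The "Puppet is currently disabled" warnings above come from agents that were administratively disabled and not re-enabled; for reference, the relevant commands:

    puppet agent --disable "converting beta to www-data"
    # ... manual work ...
    puppet agent --enable
    puppet agent --test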
[12:42:09] !log cp*/amssq*: salt rm /etc/logrotate.d/varnishkafka-frontend-stats to fix cronspam [12:42:14] Logged the message, Master [12:42:15] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures [12:42:24] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures [12:42:25] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 1 failures [12:42:54] <_joe_> oh it's just the demented check [12:44:35] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:44:41] (03PS2) 10KartikMistry: WIP: cxserver: Use different registry for beta and production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [12:47:49] (03PS3) 10KartikMistry: cxserver: Use different registry for beta and production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [12:48:28] (03PS4) 10KartikMistry: cxserver: Use different registry for beta and production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [12:55:55] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:56:05] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:56:15] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:56:25] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:56:26] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [12:56:45] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:57:05] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:57:15] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:58:03] (03PS1) 10Yuvipanda: beta: Remove mediawiki03 from dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/188803 [12:58:12] 3operations, MediaWiki-Database: Add a "datasets" database to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T85277#1017818 (10Aklapper) @Springle can you comment on this and set priority you think? [13:00:52] (03PS2) 10Yuvipanda: beta: Remove mediawiki03 from dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/188803 [13:01:18] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Remove mediawiki03 from dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/188803 (owner: 10Yuvipanda) [13:02:00] (03PS5) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [13:08:50] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1017835 (10Merl) [13:11:24] (03CR) 10Yuvipanda: [C: 04-1] "I would prefer it if we used hiera instead of realm branching for all these." [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [13:11:54] (03CR) 10Yuvipanda: [C: 04-2] "Oh, this is in a module. -2, no realm branching in modules please. 
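The !log entry above ("salt rm /etc/logrotate.d/varnishkafka-frontend-stats") corresponds to a fleet-wide one-off of roughly this shape; the target globs are from the log, the exact salt invocation is assumed:

    salt 'cp*' cmd.run 'rm -f /etc/logrotate.d/varnishkafka-frontend-stats'
    salt 'amssq*' cmd.run 'rm -f /etc/logrotate.d/varnishkafka-frontend-stats'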
We are trying to get rid of all the ones we have, should not introduce " [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [13:12:51] 3Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1017846 (10Krenair) [13:14:29] godog: what's with graphite1002/tungsten alerts? [13:14:55] they're ~50% of our unhandled alerts right now :) [13:16:39] RECOVERY - Disk space on analytics1027 is OK: DISK OK [13:18:17] paravoid: you might like: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1423142289.335&target=deployment-prep.deployment-salt.puppetmaster.cherrypicked_commits.value [13:18:35] lol [13:18:42] which one is that? [13:18:46] oh [13:18:48] it's zero now [13:18:50] sorry, misread [13:18:51] yes :D [13:18:53] we killed them all [13:18:55] congrats! [13:18:58] :D [13:19:07] there's a cherry pick on scap now, though. [13:19:19] can't merge until prod moves to www-data too [13:19:25] or i could make it configurable [13:19:28] wondering if that's worth it [13:19:43] thanks to tireless yuvi :] [13:20:46] it's not [13:21:19] PROBLEM - Host d-i-test is DOWN: PING CRITICAL - Packet loss = 100% [13:21:27] that would be moi [13:21:28] ignore [13:24:27] jesus fucking c hrist, nodejs [13:24:41] 3operations, MediaWiki-Database: Add a "datasets" database to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T85277#1017855 (10Springle) Created on analytics-store, granted to research user. Are we still duplicating such things on s1-analytics-slave (db1047)? We'll soon replicate analytics-sto... [13:24:51] npm start won't work if you have a package'd version of nodejs, and you have to use their own packagemanager to install node to get it to work properly?! [13:24:52] wtf [13:27:28] 3operations, MediaWiki-Database: Add a "datasets" database to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T85277#1017858 (10Ironholds) I don't think we are? I mean, I've not used that in...months. Might be worth an email to research-internal though, to check. [13:29:10] (03PS3) 10Hashar: Move puppet-lint options to .puppet-lint.rc [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T87132) [13:29:29] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:29:52] (03PS1) 10Hashar: puppet-lint: ignore some var in single quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) [13:32:11] yououuuo puppet-lint errors all solved / ignored \O/ [13:32:22] (03PS1) 10Faidon Liambotis: monitoring: fix logo image location [puppet] - 10https://gerrit.wikimedia.org/r/188806 [13:32:43] (03CR) 10Faidon Liambotis: [C: 032] monitoring: fix logo image location [puppet] - 10https://gerrit.wikimedia.org/r/188806 (owner: 10Faidon Liambotis) [13:33:23] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1017876 (10hashar) https://gerrit.wikimedia.org/r/188375 brings it a bunch of ignore we had in the rakefile and expose them in .puppet-lint.rc. That let us discards a lot of warnings and errors. ht... [13:34:13] (03CR) 10Hashar: "operations-puppet-puppetlint-lenient pass now since there is no more any puppet-lint error (or they are ignored)." 
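One common reason "npm start" misbehaves with a distro-packaged Node on Ubuntu of that era (possibly what is being cursed above, though the log does not say) is that the package installs the interpreter as /usr/bin/nodejs while npm scripts expect "node"; the usual workarounds:

    which node nodejs
    # Debian/Ubuntu workaround of the time: nodejs-legacy provides /usr/bin/node,
    # or a local symlink does the same job.
    apt-get install nodejs-legacy
    # or: ln -s /usr/bin/nodejs /usr/local/bin/node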
[puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [13:50:46] paravoid: yep fallout from the graphite migration, taking a look [13:51:21] godog: graphite1001 is also full of alerts, but those are acknowledged [13:52:32] indeed, host is in downtime [13:52:41] so what is actually in production now? [13:52:48] graphite1001, right? [13:54:05] correct, 1001 [13:54:18] we definitely need working monitoring for what's in prod [13:55:32] I agree, I've been taking notes on the various whackamole games I've been playing yesterday when moving from tungsten [14:05:26] PROBLEM - puppet last run on platinum is CRITICAL: Connection refused by host [14:05:47] PROBLEM - RAID on platinum is CRITICAL: Connection refused by host [14:05:47] PROBLEM - DPKG on platinum is CRITICAL: Connection refused by host [14:06:06] PROBLEM - dhclient process on platinum is CRITICAL: Connection refused by host [14:06:07] PROBLEM - Disk space on platinum is CRITICAL: Connection refused by host [14:06:07] PROBLEM - salt-minion processes on platinum is CRITICAL: Connection refused by host [14:06:17] PROBLEM - configured eth on platinum is CRITICAL: Connection refused by host [14:06:49] (03CR) 10Faidon Liambotis: [C: 04-1] Strongswan: IPsec Puppet module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [14:07:22] 3ops-esams, operations: Upgrade cp3011-3014 with 10G cards - https://phabricator.wikimedia.org/T88684#1017923 (10mark) 3NEW [14:09:25] YuviPanda: So, should I use: hieradata/labs/deployment-prep/common.yaml ? [14:09:48] YuviPanda: see how large registry stuff can be handled? :( [14:11:18] kart_: link me to the changeset again? [14:12:57] (03PS1) 10Filippo Giunchedi: Depend on python-carbon [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188807 [14:14:03] (03PS2) 10Filippo Giunchedi: Depend on graphite-carbon [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188807 [14:14:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Depend on graphite-carbon [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188807 (owner: 10Filippo Giunchedi) [14:16:18] YuviPanda: https://gerrit.wikimedia.org/r/#/c/188796/ [14:17:29] _joe_: multiple mw* alerts, should I just blindly restart or do you want to take a look? [14:17:37] I have a meeting in 10' so no time to investigate [14:17:59] <_joe_> looking [14:18:11] (03CR) 10Yuvipanda: "So the config in the config file can be rendered as json, and info for it can be passed as a parameter? Then we can pass the info via hier" [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [14:19:16] <_joe_> paravoid: uhm maybe transient alerts? [14:19:29] probably.. [14:20:27] <_joe_> yes, checked the event log [14:22:22] PROBLEM - Host platinum is DOWN: PING CRITICAL - Packet loss = 100% [14:22:40] lies [14:22:53] akosiaris: what is platinum? [14:23:11] openstack test host? [14:23:12] openstack evaluation [14:23:27] not sure why icinga says down though [14:23:48] I am logged in and playing with openstack firewalling though [14:24:40] <_joe_> and that looks like a good candidate [14:27:35] (03CR) 10Yuvipanda: "Ugh, that was more cryptic than I intended :)" [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [14:27:37] kart_: ^ more details! [14:28:04] :) [14:32:52] 3operations, Labs: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1017943 (10yuvipanda) Sorry for the delayed response. 
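The hiera route suggested for the cxserver registry (instead of realm branching inside the module) would look roughly like this: the module takes the value as a class parameter and Beta overrides it in the per-project file kart_ mentions. The key name and value here are hypothetical:

    # Beta (deployment-prep) override, picked up by hiera automatic parameter lookup:
    cat >> hieradata/labs/deployment-prep/common.yaml <<'EOF'
    cxserver::registry: 'registry-beta.yaml'
    EOF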
I don't know if we have done this before (@andrew?), but it's trivial to do. However, because we have never done this before, I do... [14:32:59] kart_: hope that's more useful? [14:34:19] !log upload txstatsd 1.0.0-3 to trusty-wikimedia [14:34:26] Logged the message, Master [14:37:44] (03PS12) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [14:41:54] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1017945 (10mark) [14:45:48] (03PS1) 10Filippo Giunchedi: Depend on python-twisted-web [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188811 [14:46:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Depend on python-twisted-web [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188811 (owner: 10Filippo Giunchedi) [14:46:32] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1017951 (10mark) >>! In T76560#1013360, @bd808 wrote: > Some implementation related wisdom spotted on irc: > ``` > [23:20:49] seems like every 2-3 years, we go through a cycle of: > [23:20:59] 1. Replace... [14:47:55] (03PS1) 10Filippo Giunchedi: actually bump version in changelog [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188812 [14:48:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] actually bump version in changelog [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/188812 (owner: 10Filippo Giunchedi) [14:59:27] (03PS1) 10coren: Labs: Make sure manage-nfs-volumes is running [puppet] - 10https://gerrit.wikimedia.org/r/188814 (https://phabricator.wikimedia.org/T88669) [15:01:05] YuviPanda: ^^ [15:05:14] (03CR) 10Yuvipanda: [C: 031] "LGTM, unsure about the icinga check (haven't written one of those checks before)" [puppet] - 10https://gerrit.wikimedia.org/r/188814 (https://phabricator.wikimedia.org/T88669) (owner: 10coren) [15:06:47] (03CR) 10coren: [C: 032] "This should work." [puppet] - 10https://gerrit.wikimedia.org/r/188814 (https://phabricator.wikimedia.org/T88669) (owner: 10coren) [15:09:34] (03PS1) 10Filippo Giunchedi: graphite/txstatsd: fix require_packages vs package [puppet] - 10https://gerrit.wikimedia.org/r/188815 [15:12:03] _joe_: ^^ should be fixed once and for all [15:12:58] * YuviPanda goes afk for food [15:17:15] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "moving dependencies where they belong is a good idea in general. But I don't see the need to switch back to "package" in general." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188815 (owner: 10Filippo Giunchedi) [15:17:33] <_joe_> godog: I'm not sure why you wanted to move away from require_packages [15:17:54] <_joe_> the problem is the dependency between packages expressed in puppet, which is of course lame and wrong [15:18:35] exactly, we don't need it anymore and I _think_ it is what caused the dependency cycle in the commit message [15:19:10] <_joe_> but require_packages has another goal [15:19:27] <_joe_> you can have multiple classes needing to declare the same package [15:20:34] indeed, and there's no need for that anymore in those classes [15:21:32] <_joe_> but your change is correct anyways, and is not bad per se, so just comment that on the CR, I'm willing to revert to +1 [15:22:11] ack, also note that the Class['packages::'] construct is used only there [15:22:27] <_joe_> yes I know [15:22:41] <_joe_> that was the only way to declare dependencies between packages [15:22:53] <_joe_> basically require_package foo does the following [15:23:15] <_joe_> attaches an empty packages::foo class to the node scope [15:23:29] <_joe_> only if it does not exist [15:23:49] <_joe_> and injects the package {'foo': } define inside of it [15:24:05] <_joe_> and finishes by including it in the current scope [15:24:22] <_joe_> so it's us trying to play with puppet internals [15:24:45] (03CR) 10Filippo Giunchedi: graphite/txstatsd: fix require_packages vs package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188815 (owner: 10Filippo Giunchedi) [15:25:04] yeah that's what I wasn't sure about when I saw the cycle [15:30:53] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4260 MB (3% inode=94%): [15:31:13] anyways _joe_ if that looks good I'll go ahead [15:42:04] (03PS1) 10coren: Labs: User guard manage-nfs-volumes-deamon [puppet] - 10https://gerrit.wikimedia.org/r/188817 (https://phabricator.wikimedia.org/T88579) [15:42:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Again, this is probably perfectly fine, but I'd like to minimize changes for now, so put this on hold for now." [puppet] - 10https://gerrit.wikimedia.org/r/188719 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [15:43:00] (03PS4) 10Giuseppe Lavagetto: redisdb: add codfw monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/188274 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [15:43:29] 3operations, Labs: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1018041 (10Chad) We have done it before, we have docs for it (see my previous comment). You need 3 separate accesses to make this change (meaning only a root can do all of it): * LDAP... [15:44:08] (03CR) 10Giuseppe Lavagetto: [C: 032] "This is the right way to go for now, unless we rewrite monitoring::group from scratch, which doesn't seem like a good idea to me." [puppet] - 10https://gerrit.wikimedia.org/r/188274 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [15:45:40] (03CR) 10coren: [C: 032] "Tested and WAD" [puppet] - 10https://gerrit.wikimedia.org/r/188817 (https://phabricator.wikimedia.org/T88579) (owner: 10coren) [15:46:43] _joe_: Okay if I push a0429e8? 
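A rough Puppet-level sketch of what the require_package('foo') walkthrough above amounts to ('foo' is a placeholder; the real require_package is a custom function in the repo, this only mirrors the effect described):

    # generated once at node scope, and only if it does not already exist
    class packages::foo {
        package { 'foo':
            ensure => present,
        }
    }

    # ...the calling class then effectively just gets
    include packages::foo

Because the wrapper class is only ever created once, several classes can all call require_package('foo') without duplicate-declaration errors, and ordering between packages can be written as Class['packages::foo'] -> Class['packages::bar'], which is the Class['packages::...'] construct mentioned above.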
[15:46:44] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This approach won't work, alas, as it will just be evaluated on neon, where $::site is just 'eqiad'" [puppet] - 10https://gerrit.wikimedia.org/r/188275 (https://phabricator.wikimedia.org/T86894) (owner: 10Dzahn) [15:47:24] godog: any luck with those graphite alerts? [15:47:24] 0b27942 even. "redisdb: add codfw monitoring group" [15:47:41] <_joe_> Coren: yes go on [15:48:02] <_joe_> what was the other one? [15:48:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [15:48:37] paravoid: yep I was looking at those, currently the kafka ones that require a jmxtrans restart I think [15:48:50] ? [15:48:54] kafka? [15:49:09] CRITICAL: Not all configured Carbon instances are running. [15:49:14] Difference between raw and validated EventLogging overall message rates [15:49:17] etc. [15:49:37] swift container availability, which I don't understand why it's there in the first place [15:50:11] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1018065 (10Nemo_bis) Of course not as important as the requirements by Mark and Tim above, but the proposed HTML has some issues which make it not an improvement. Also, it doesn't meet goals (1) and (2) in the task descrip... [15:50:30] I swat I swat! [15:50:34] * anomie sees nothing for SWAT this morning [15:50:49] <_joe_> manybubbles: mmmh you may encounter problems on mw1018 [15:50:57] <_joe_> in case, tell me [15:51:42] paravoid: for graphite1001 I'm looking at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=graphite1001 [15:51:43] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: reporter/statsd consumer/server-side-events-log consumer/mysql-m2-master consumer/client-side-events-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events forwarder/8422 forwarder/8421 [15:52:23] _joe_: good to go for https://gerrit.wikimedia.org/r/#/c/188815/ ? [15:52:45] _joe_: k. I'm having a slow internet day, it seems. still loading the deployments page to check what is scheduled [15:52:56] nothing! [15:52:58] victory! [15:53:42] paravoid: graphite1002 isn't in service so kinda low priority for those [15:54:32] and tungsten? [15:54:32] (03CR) 10Giuseppe Lavagetto: [C: 031] graphite/txstatsd: fix require_packages vs package [puppet] - 10https://gerrit.wikimedia.org/r/188815 (owner: 10Filippo Giunchedi) [15:57:09] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1018086 (10Joe) a:3Joe [15:57:13] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [15:57:20] paravoid: tungsten I'm looking at kafka/jmxtrans, eventlogging is next too [15:58:11] anyways, I think it is safe to bounce jmxtrans on kafka/analytics boxes but no idea how to confirm for sure [15:58:51] (03PS2) 10Filippo Giunchedi: graphite/txstatsd: fix require_packages vs package [puppet] - 10https://gerrit.wikimedia.org/r/188815 [15:59:00] paravoid: You see it as a separate issue? 
It'll still be cleaned up as a side effect of the latter [15:59:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite/txstatsd: fix require_packages vs package [puppet] - 10https://gerrit.wikimedia.org/r/188815 (owner: 10Filippo Giunchedi) [15:59:25] no [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150205T1600). Please do the needful. [16:00:18] nfs-common in Debian ships with a single init script, /etc/init.d/nfs-common and a /etc/default/nfs-common to configure which daemons you want it to spawn [16:00:22] * ^d takes empty swat [16:00:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:00:29] Ubuntu's version split those into separate upstart services [16:00:34] <^d> Swat over! [16:00:38] thus Service['idmapd'] doesn't exist in Debian [16:00:56] paravoid: Right, but we expect to strip idmapd entirely don't we? [16:01:30] dunno? depends on how the other ticket goes? [16:01:34] they're related, but not duplicate [16:03:14] paravoid: Yeah, okay, it makes sense to keep them separate for now. [16:03:15] !log re-enabled puppet on graphite1001, bounce uwsgi [16:03:21] Logged the message, Master [16:03:31] you weren't sure if we should disable idmapd in that other ticket, right? [16:04:03] No, I agree that disabling idmapd is probably the right thing to do, I'm just not sure how to work around the secondary issues this would cause. [16:04:40] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: reporter/statsd consumer/server-side-events-log consumer/mysql-m2-master consumer/client-side-events-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events forwarder/8422 forwarder/8421 [16:05:11] The ID range allocation is work but easy, but I don't think there is a sane way around the uid-allocated-by-dpkg issue other than "don't use NFS with those uids" [16:05:37] Which, honestly, sucks balls. [16:11:10] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [16:11:59] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:14:09] RECOVERY - DPKG on labmon1001 is OK: All packages OK [16:14:58] !log bounce jmxtrans on analytics1012 [16:15:05] Logged the message, Master [16:16:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds [16:17:19] (03CR) 10GWicke: "Okay, sounds good if this speeds up the first iteration." [dns] - 10https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194) (owner: 10Filippo Giunchedi) [16:17:46] Reedy: g'morn, when do you want to do wmf16? [16:19:55] 3ops-codfw, operations: Update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018143 (10Joe) 3NEW [16:20:31] 3Beta-Cluster, operations: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076#1018151 (10bd808) a:5bd808>3yuvipanda Handing this off to @yuvipanda as the owner of seeing the change through. Thanks Yuvi! 
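A minimal, hypothetical sketch of how a manifest could cope with the nfs-common difference described above (not the actual operations/puppet code): Ubuntu's upstart split provides a separate idmapd job that Puppet can manage directly, while on Debian the single /etc/init.d/nfs-common script decides whether to spawn rpc.idmapd based on a NEED_IDMAPD-style setting in /etc/default/nfs-common, so that file, not a Service['idmapd'] resource, is the knob.

    if $::operatingsystem == 'Ubuntu' {
        # the separate upstart job exists only on Ubuntu
        service { 'idmapd':
            ensure => running,
        }
    } else {
        # Debian: no Service['idmapd']; /etc/init.d/nfs-common starts (or skips)
        # rpc.idmapd according to /etc/default/nfs-common, so manage that file
        # and the single nfs-common service instead.
        service { 'nfs-common':
            ensure => running,
        }
    }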
[16:20:58] 3ops-codfw, operations: Update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018156 (10Joe) [16:20:59] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1018155 (10Joe) [16:21:10] !log bounce jmxtrans on analytics1018, analytics1021 and analytics1022 [16:21:16] Logged the message, Master [16:21:33] (03PS6) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [16:32:54] 3operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1018184 (10Joe) [16:32:57] 3operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1018182 (10Joe) 5Open>3Resolved a:3Dzahn [16:39:57] (03PS1) 10Giuseppe Lavagetto: memcached: add puppet resources so that the role can be applied in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188822 [16:40:18] RECOVERY - dhclient process on platinum is OK: PROCS OK: 0 processes with command name dhclient [16:40:28] RECOVERY - Host platinum is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [16:40:28] RECOVERY - Disk space on platinum is OK: DISK OK [16:40:28] RECOVERY - puppet last run on platinum is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:40:37] RECOVERY - salt-minion processes on platinum is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:40:37] RECOVERY - DPKG on platinum is OK: All packages OK [16:40:37] RECOVERY - configured eth on platinum is OK: NRPE: Unable to read output [16:40:47] RECOVERY - Host cp1063 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [16:40:48] RECOVERY - RAID on platinum is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:43:48] PROBLEM - Host cp1063 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:54] (03PS1) 10BBlack: Bump to 3.0.6plus-wm5 for 2x patches, jessie+ only [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/188825 [16:49:37] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [16:50:08] RECOVERY - Host cp1063 is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [16:51:45] can someone check vanadium? [16:51:51] seems to have a full / [16:54:18] PROBLEM - DPKG on cp1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:55:27] RECOVERY - DPKG on cp1063 is OK: All packages OK [16:55:33] 2015-02-05 16:55:06,255 Unable to decode: ?%7B%22event%22%3A%7B%22duration%22%3Anull%7D%2C%22clientValidated%22%3Af [16:55:36] alse%2C%22revision%22%3A7536956%2C%22schema%22%3A%22Popups%22%2C%22webHost%22%3A%22ru.wikipedia.org%22%2C%22wiki%22 [16:55:39] %3A%22ruwiki%22%7D; [16:55:45] (None is not of type u'integer' [16:55:52] 82G logs full of that [16:56:50] nuria: around? [16:56:55] or milimetric? [16:57:03] paravoid: in meeting 10 mins? [16:57:19] EL trouble [16:59:33] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1018242 (10Qgil) I have updated https://wikitech.wikimedia.org/wiki/Volunteer_NDA Now THIS is the process. Feedback and fine tuning is still welcome. We might find details... 
[17:00:00] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1018245 (10Qgil) [17:00:24] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#11021 (10Qgil) [17:01:06] events with duration "null" are coming in; as far as I understand they're being decoded in Python to a None type which fails schema validation [17:01:35] multiple of such errors are being logged per second and quickly filled up logs [17:03:13] PROBLEM - puppet last run on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:03:44] PROBLEM - Varnish HTTP upload-backend on cp1063 is CRITICAL: Connection refused [17:04:52] PROBLEM - RAID on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:43] RECOVERY - RAID on cp1063 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:05:51] 3ops-codfw, operations: Update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018271 (10RobH) a:3RobH Claiming until I update this ticket with the IP addresses for papaul to use for their mgmt. [17:07:48] (03CR) 10Phuedx: [C: 031] Adding original language of this work campaign for WikiGrok (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [17:08:03] RECOVERY - Varnish HTTP upload-backend on cp1063 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.009 second response time [17:10:34] PROBLEM - Host cp1063 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:43] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:57] manybubbles: thanks for the detailed logging last night! I'm going to bounce elastic1002 in 15m and see how that does [17:11:53] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [17:12:03] PROBLEM - Host mc1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:03] PROBLEM - Host mc1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:07] <_joe_> mmm what? [17:12:13] <_joe_> oh ffs [17:12:13] PROBLEM - Host cp1059 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:23] I'm getting 503s on and off on svwp... [17:12:33] PROBLEM - Host cp1061 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:33] _joe_ i believe that was me...trying to fix the power [17:12:43] ditto on enwiki [17:12:43] PROBLEM - Host cp1067 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:44] PROBLEM - Host cp1062 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:44] PROBLEM - Host cp1058 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:44] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:44] PROBLEM - Host mc1012 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:44] https://www.irccloud.com/pastebin/WSPhuorc [17:12:45] cp1063 is an intentional reboot of mine, ignore it [17:12:48] but the rest, are not! 
[17:12:49] <_joe_> oh shit [17:12:53] PROBLEM - Host cp1068 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:58] <_joe_> this is bad BAD [17:13:03] PROBLEM - Host mc1014 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:03] PROBLEM - Host mc1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:03] PROBLEM - Host mc1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:04] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:04] PROBLEM - Host mc1015 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:04] PROBLEM - Host mc1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:04] PROBLEM - Host cp1060 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:13] PROBLEM - Host mc1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:13] PROBLEM - Host mc1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:13] PROBLEM - Host mc1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:13] PROBLEM - Host mc1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:13] PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:14] PROBLEM - Host cp1069 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:14] PROBLEM - Host mc1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:15] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:15] PROBLEM - Host mc1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:16] PROBLEM - Host cp1064 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:23] fuck [17:13:23] PROBLEM - Host mc1016 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:23] PROBLEM - Host mc1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:43] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: No route to host [17:13:51] cmjohnson: what's going on? [17:14:03] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: No route to host [17:14:12] PROBLEM - HHVM rendering on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:23] mark: trying to fix the power alarms...i don't think the power supplies on servers are set to be redundant [17:14:23] <_joe_> ook [17:14:42] PROBLEM - LVS HTTPS IPv4 on upload-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [17:14:43] what did you do? [17:14:48] and what's the status now? [17:14:49] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [17:14:55] PROBLEM - HHVM rendering on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:56] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:56] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [17:15:02] PROBLEM - HHVM rendering on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:02] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:02] PROBLEM - HHVM rendering on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:02] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:07] <_joe_> of course [17:15:09] Commons not good... [17:15:13] is commons hosed right nwo? 
[17:15:14] what the [17:15:16] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:17] <_joe_> everything is down [17:15:18] everything is [17:15:25] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:26] PROBLEM - HHVM rendering on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:26] PROBLEM - HHVM rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:26] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:26] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:34] Doing an upgrade ? [17:15:36] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:36] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:46] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 64155 bytes in 9.771 second response time [17:15:46] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:46] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:46] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:47] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:47] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:52] <_joe_> Qcoder00: no, power outage [17:15:55] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:55] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:56] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:56] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:56] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:56] all power was never truly disconnected. [17:16:05] RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.183 second response time [17:16:05] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:06] PROBLEM - HHVM rendering on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:06] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:06] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:06] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:06] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:07] PROBLEM - HHVM rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:07] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.205 second response time [17:16:07] <_joe_> thx guillom [17:16:09] cmjohnson: WHAT is the status now [17:16:09] Power outage? [17:16:15] do the servers have powers again? [17:16:15] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:16] Don't the WMF have UPS? 
[17:16:19] we can't help if we don't know what's going on [17:16:21] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:21] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:21] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:22] everything is powered on [17:16:22] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:22] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:23] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:23] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:25] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:25] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:25] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19040 bytes in 0.020 second response time [17:16:28] so are they rebooting? [17:16:35] PROBLEM - HHVM rendering on mw1146 is CRITICAL: Connection timed out [17:16:35] PROBLEM - HHVM rendering on mw1143 is CRITICAL: Connection timed out [17:16:35] RECOVERY - HHVM rendering on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.159 second response time [17:16:36] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:36] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:36] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:36] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [500.0] [17:16:46] PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:47] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:47] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:55] PROBLEM - Apache HTTP on mw1146 is CRITICAL: Connection timed out [17:16:56] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:56] PROBLEM - HHVM rendering on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:56] PROBLEM - HHVM rendering on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:56] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:56] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:56] PROBLEM - HHVM rendering on mw1133 is CRITICAL: Connection timed out [17:16:57] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:57] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:58] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:05] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:05] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:05] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[17:17:05] PROBLEM - HHVM rendering on mw1055 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.082 second response time [17:17:05] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:06] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:06] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:06] mark: the switch [17:17:11] went down [17:17:15] PROBLEM - Apache HTTP on mw1121 is CRITICAL: Connection timed out [17:17:15] RECOVERY - HHVM rendering on mw1038 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.190 second response time [17:17:15] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:16] PROBLEM - Apache HTTP on mw1140 is CRITICAL: Connection timed out [17:17:16] PROBLEM - HHVM rendering on mw1142 is CRITICAL: Connection timed out [17:17:16] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:19] what switch? [17:17:22] asw2 [17:17:24] asw2-a5 [17:17:26] PROBLEM - HHVM rendering on mw1067 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 6.816 second response time [17:17:26] PROBLEM - HHVM rendering on mw1121 is CRITICAL: Connection timed out [17:17:26] PROBLEM - HHVM rendering on mw1130 is CRITICAL: Connection timed out [17:17:26] PROBLEM - HHVM rendering on mw1231 is CRITICAL: Connection timed out [17:17:26] PROBLEM - HHVM rendering on mw1148 is CRITICAL: Connection timed out [17:17:26] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:26] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:27] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:27] PROBLEM - HHVM rendering on mw1189 is CRITICAL: Connection timed out [17:17:28] PROBLEM - Apache HTTP on mw1136 is CRITICAL: Connection timed out [17:17:29] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:29] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:29] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:30] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:31] <_joe_> cmjohnson: any way to power back the mc servers? [17:17:32] is it back up? [17:17:38] Wuh oh [17:17:38] yes [17:17:41] powering up now [17:17:50] cmjohnson: did the servers lose power too? [17:17:51] it was the switch [17:18:02] talk in full sentences please [17:18:21] no the servers didn't lose power, it was the switch. Apparently the cables were loose and when I was in there they disconnected [17:18:28] ok, good [17:18:29] that's good [17:18:30] ok [17:18:31] phew [17:18:33] switch is powering up now [17:18:39] which scs is it connected to? [17:18:42] Jeff_Green: ok hehe so it's not just us [17:18:44] a8? [17:18:50] <_joe_> oh just the switch? [17:18:56] paravoid: you gonna log in?
[17:19:04] <_joe_> so it's just going to be a lot of restarting of appservers [17:19:05] scs-a [17:19:08] just to make sure it's booting [17:19:09] PROBLEM - HHVM busy threads on mw1190 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [115.2] [17:19:09] PROBLEM - HHVM busy threads on mw1204 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [17:19:10] RECOVERY - HHVM rendering on mw1043 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.184 second response time [17:19:10] PROBLEM - Apache HTTP on mw1126 is CRITICAL: Connection timed out [17:19:11] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:11] PROBLEM - HHVM rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:12] PROBLEM - Apache HTTP on mw1207 is CRITICAL: Connection timed out [17:19:15] PROBLEM - Apache HTTP on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:15] PROBLEM - HHVM busy threads on mw1134 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:19:15] PROBLEM - HHVM busy threads on mw1144 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:19:16] PROBLEM - HHVM rendering on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:16] PROBLEM - HHVM rendering on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:16] PROBLEM - HHVM rendering on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:16] PROBLEM - HHVM rendering on mw1236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:17] RECOVERY - HHVM rendering on mw1188 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.204 second response time [17:19:17] PROBLEM - HHVM rendering on mw1207 is CRITICAL: Connection timed out [17:19:18] PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.086 second response time [17:19:18] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:19:19] PROBLEM - HHVM queue size on mw1225 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:19:19] PROBLEM - HHVM busy threads on mw1197 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [17:19:32] PROBLEM - Apache HTTP on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:35] PROBLEM - HHVM busy threads on mw1128 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:19:35] PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:35] PROBLEM - HHVM rendering on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:35] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:35] PROBLEM - HHVM rendering on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.120 second response time [17:19:35] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:35] RECOVERY - HHVM rendering on mw1067 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.188 second response time [17:19:36] RECOVERY - HHVM rendering on mw1095 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.208 second response time [17:19:36] RECOVERY - HHVM rendering on mw1090 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.191 second response time [17:19:37] RECOVERY - HHVM rendering on mw1100 is OK: HTTP OK: HTTP/1.1 
200 OK - 64154 bytes in 0.209 second response time [17:19:37] PROBLEM - HHVM rendering on mw1040 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.072 second response time [17:19:38] PROBLEM - HHVM busy threads on mw1198 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [17:19:49] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:54] hi. I'm guessing there's no need to add a report about an outage I'm seeing? [17:19:55] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 1.309 second response time [17:19:55] PROBLEM - HHVM rendering on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:55] PROBLEM - HHVM busy threads on mw1194 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [17:19:55] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:19:56] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [115.2] [17:19:56] PROBLEM - HHVM busy threads on mw1208 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [17:19:56] PROBLEM - HHVM busy threads on mw1206 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [17:19:57] PROBLEM - HHVM busy threads on mw1123 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4] [17:19:57] PROBLEM - HHVM busy threads on mw1148 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:19:57] PROBLEM - HHVM queue size on mw1197 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:19:58] PROBLEM - HHVM rendering on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:58] PROBLEM - Apache HTTP on mw1244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:59] RECOVERY - HHVM rendering on mw1035 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.266 second response time [17:19:59] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [17:20:02] abartov: Pretty sure we know [17:20:14] Right. 
:) [17:20:15] PROBLEM - HHVM busy threads on mw1142 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:20:15] PROBLEM - HHVM busy threads on mw1146 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:20:15] PROBLEM - HHVM busy threads on mw1199 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [17:20:16] RECOVERY - HHVM rendering on mw1214 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.211 second response time [17:20:16] PROBLEM - HHVM busy threads on mw1124 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:20:16] PROBLEM - HHVM busy threads on mw1126 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:20:17] abartov: right :P [17:20:25] PROBLEM - HHVM busy threads on mw1117 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [86.4] [17:20:25] PROBLEM - HHVM busy threads on mw1130 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:20:25] PROBLEM - HHVM queue size on mw1224 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:20:25] PROBLEM - HHVM busy threads on mw1196 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [17:20:26] PROBLEM - HHVM rendering on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.077 second response time [17:20:26] PROBLEM - HHVM rendering on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.074 second response time [17:20:26] RECOVERY - HHVM rendering on mw1056 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 1.161 second response time [17:20:26] PROBLEM - HHVM rendering on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.106 second response time [17:20:26] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.200 second response time [17:20:27] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:27] PROBLEM - HHVM rendering on mw1237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:31] Great enwp blew up [17:20:35] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.206 second response time [17:20:35] PROBLEM - HHVM rendering on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:36] PROBLEM - HHVM queue size on mw1227 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:20:36] PROBLEM - HHVM rendering on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:36] RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.190 second response time [17:20:36] PROBLEM - HHVM rendering on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:36] PROBLEM - Apache HTTP on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:37] PROBLEM - Apache HTTP on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:37] PROBLEM - HHVM queue size on mw1233 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:20:37] everything went, T13|mobile [17:20:38] PROBLEM - HHVM rendering on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:38] PROBLEM - HHVM queue size on mw1208 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:20:39] PROBLEM - HHVM busy threads on mw1111 is CRITICAL: CRITICAL: 33.33% of data above the critical 
threshold [86.4] [17:20:39] PROBLEM - HHVM busy threads on mw1202 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [115.2] [17:20:50] Kablooie [17:20:52] PROBLEM - Apache HTTP on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:52] PROBLEM - HHVM queue size on mw1221 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:20:52] PROBLEM - HHVM busy threads on mw1119 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:20:52] PROBLEM - HHVM rendering on mw1244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:52] PROBLEM - HHVM busy threads on mw1147 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [17:20:53] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.155 second response time [17:20:53] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 3030 bytes in 0.609 second response time [17:20:58] PROBLEM - HHVM queue size on mw1222 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [17:20:59] PROBLEM - HHVM rendering on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 2.912 second response time [17:20:59] PROBLEM - HHVM rendering on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:59] PROBLEM - HHVM rendering on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:08] PROBLEM - HHVM busy threads on mw1240 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [115.2] [17:21:08] RECOVERY - HHVM rendering on mw1082 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.201 second response time [17:21:08] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.188 second response time [17:21:08] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.227 second response time [17:21:08] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.196 second response time [17:21:08] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.198 second response time [17:21:08] PROBLEM - HHVM busy threads on mw1253 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:21:09] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.190 second response time [17:21:09] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.212 second response time [17:21:10] PROBLEM - HHVM rendering on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.074 second response time [17:21:10] PROBLEM - HHVM rendering on mw1068 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 4.215 second response time [17:21:11] RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.208 second response time [17:21:27] switch is up [17:21:28] PROBLEM - HHVM rendering on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:28] PROBLEM - HHVM busy threads on mw1127 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:21:28] PROBLEM - HHVM queue size on mw1230 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:21:29] PROBLEM - HHVM busy threads on mw1137 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:21:29] PROBLEM - HHVM 
queue size on mw1194 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:21:29] PROBLEM - HHVM queue size on mw1114 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:21:29] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.184 second response time [17:21:29] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.478 second response time [17:21:30] PROBLEM - HHVM rendering on mw1051 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.078 second response time [17:21:30] PROBLEM - HHVM rendering on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:31] RECOVERY - HHVM rendering on mw1077 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.179 second response time [17:21:31] PROBLEM - Apache HTTP on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:37] maybe we can kick icinga-wm for time being? it's like a spam bot now :( [17:21:39] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.201 second response time [17:21:39] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.092 second response time [17:21:40] PROBLEM - HHVM queue size on mw1206 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:21:40] PROBLEM - HHVM busy threads on mw1028 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:21:40] PROBLEM - HHVM busy threads on mw1106 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:21:40] PROBLEM - HHVM queue size on mw1190 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:21:40] PROBLEM - HHVM queue size on mw1202 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:21:41] PROBLEM - HHVM busy threads on mw1248 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [17:21:41] PROBLEM - HHVM queue size on mw1200 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:21:42] RECOVERY - HHVM rendering on mw1069 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 5.339 second response time [17:21:43] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.188 second response time [17:21:43] RECOVERY - HHVM rendering on mw1033 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.205 second response time [17:21:43] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66588 bytes in 0.180 second response time [17:21:44] FlorianSW: no [17:21:48] ok [17:21:48] RECOVERY - HHVM rendering on mw1073 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.279 second response time [17:21:48] PROBLEM - HHVM queue size on mw1115 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [17:21:48] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:48] RECOVERY - Host cp1063 is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms [17:21:48] PROBLEM - HHVM rendering on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.082 second response time [17:21:54] it's useful to us, and that's the primary purpose of this channel :) [17:22:08] PROBLEM - HHVM rendering on mw1042 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.150 second response time [17:22:08] RECOVERY - Host mc1009 is UP: PING OK - 
Packet loss = 0%, RTA = 0.48 ms [17:22:08] RECOVERY - Host mc1003 is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [17:22:08] RECOVERY - Host mc1004 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [17:22:09] RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [17:22:09] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [17:22:09] RECOVERY - Host mc1008 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [17:22:10] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [17:22:10] PROBLEM - Apache HTTP on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:11] RECOVERY - Host mc1015 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [17:22:11] RECOVERY - Host mc1014 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [17:22:12] RECOVERY - Host cp1068 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [17:22:12] PROBLEM - HHVM queue size on mw1193 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:22:13] PROBLEM - HHVM busy threads on mw1243 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [115.2] [17:22:14] <_joe_> ok [17:22:25] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:25] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:25] PROBLEM - HHVM queue size on mw1253 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:22:25] PROBLEM - HHVM busy threads on mw1093 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:22:25] PROBLEM - HHVM queue size on mw1199 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:22:25] RECOVERY - HHVM rendering on mw1068 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.189 second response time [17:22:25] PROBLEM - HHVM busy threads on mw1247 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:22:26] RECOVERY - Host mc1010 is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [17:22:26] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 3.46 ms [17:22:27] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19111 bytes in 0.060 second response time [17:22:31] PROBLEM - HHVM queue size on mw1228 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [17:22:31] PROBLEM - HHVM busy threads on mw1246 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:22:31] PROBLEM - HHVM rendering on mw1094 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 3.326 second response time [17:22:31] PROBLEM - HHVM busy threads on mw1056 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:22:31] PROBLEM - HHVM busy threads on mw1254 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:22:32] PROBLEM - HHVM busy threads on mw1245 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:22:48] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19080 bytes in 0.019 second response time [17:22:55] PROBLEM - HHVM queue size on mw1119 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:22:55] PROBLEM - HHVM busy threads on mw1058 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:22:56] PROBLEM - HHVM busy threads on mw1044 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold 
[86.4] [17:22:56] PROBLEM - HHVM queue size on mw1203 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [17:22:56] RECOVERY - HHVM rendering on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 1.218 second response time [17:22:56] RECOVERY - HHVM rendering on mw1045 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 4.364 second response time [17:22:58] PROBLEM - HHVM queue size on mw1198 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [17:22:58] PROBLEM - HHVM busy threads on mw1249 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [17:22:58] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:58] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:58] PROBLEM - HHVM queue size on mw1116 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [80.0] [17:22:59] PROBLEM - HHVM busy threads on mw1031 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:22:59] PROBLEM - HHVM busy threads on mw1072 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4] [17:23:08] PROBLEM - HHVM busy threads on mw1113 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:23:08] PROBLEM - HHVM queue size on mw1249 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [17:23:08] PROBLEM - HHVM busy threads on mw1252 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [17:23:08] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66428 bytes in 0.071 second response time [17:23:14] RECOVERY - HHVM rendering on mw1051 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.195 second response time [17:23:15] RECOVERY - HHVM rendering on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.211 second response time [17:23:15] PROBLEM - HHVM rendering on mw1066 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.167 second response time [17:23:15] PROBLEM - HHVM busy threads on mw1108 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4] [17:23:15] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19040 bytes in 0.057 second response time [17:23:21] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:21] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [17:23:21] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:21] PROBLEM - HHVM rendering on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:21] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:31] PROBLEM - HHVM busy threads on mw1060 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:23:31] PROBLEM - HHVM busy threads on mw1069 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:23:31] PROBLEM - HHVM busy threads on mw1242 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:23:31] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 32 minutes ago with 0 failures [17:23:31] PROBLEM - HHVM busy threads on mw1079 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] 
[17:23:31] PROBLEM - HHVM busy threads on mw1024 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:23:31] PROBLEM - HHVM busy threads on mw1257 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:23:32] PROBLEM - HHVM busy threads on mw1255 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [115.2] [17:23:32] PROBLEM - HHVM queue size on mw1246 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:23:33] PROBLEM - HHVM rendering on mw1043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.350 second response time [17:23:33] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.216 second response time [17:23:41] PROBLEM - HHVM rendering on mw1166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.087 second response time [17:23:41] PROBLEM - HHVM rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:41] PROBLEM - HHVM busy threads on mw1037 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:23:42] PROBLEM - HHVM rendering on mw1086 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 4.971 second response time [17:23:42] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:51] RECOVERY - HHVM rendering on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 8.695 second response time [17:23:51] PROBLEM - HHVM rendering on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:51] PROBLEM - HHVM busy threads on mw1181 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:23:51] PROBLEM - HHVM busy threads on mw1186 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:23:51] PROBLEM - HHVM busy threads on mw1071 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:23:51] PROBLEM - HHVM busy threads on mw1086 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:23:51] PROBLEM - HHVM busy threads on mw1097 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:23:52] PROBLEM - HHVM busy threads on mw1099 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:23:52] PROBLEM - HHVM busy threads on mw1109 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:23:53] PROBLEM - HHVM busy threads on mw1101 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:23:53] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:54] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:03] SInce I can't use Wikipedia for this, can someone with a subscription to britannica.com check what Wikipedia is? 
[17:24:08] PROBLEM - HHVM rendering on mw1038 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.087 second response time [17:24:08] PROBLEM - HHVM busy threads on mw1107 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [17:24:09] PROBLEM - HHVM queue size on mw1244 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [80.0] [17:24:09] PROBLEM - HHVM rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:09] PROBLEM - HHVM busy threads on mw1035 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:24:09] PROBLEM - HHVM queue size on mw1207 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [80.0] [17:24:11] PROBLEM - HHVM rendering on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:11] PROBLEM - HHVM rendering on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:11] PROBLEM - HHVM rendering on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:11] PROBLEM - HHVM rendering on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:11] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:12] PROBLEM - HHVM rendering on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.085 second response time [17:24:12] PROBLEM - HHVM rendering on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:22] PROBLEM - HHVM queue size on mw1256 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [17:24:22] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66456 bytes in 0.800 second response time [17:24:27] PROBLEM - HHVM busy threads on mw1090 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:24:27] PROBLEM - HHVM busy threads on mw1250 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [17:24:28] PROBLEM - HHVM busy threads on mw1100 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:24:28] PROBLEM - HHVM busy threads on mw1027 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:24:28] PROBLEM - HHVM busy threads on mw1168 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:24:28] PROBLEM - HHVM rendering on mw1089 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 2.807 second response time [17:24:28] RECOVERY - HHVM rendering on mw1103 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 3.454 second response time [17:24:29] RECOVERY - HHVM rendering on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.202 second response time [17:24:29] PROBLEM - HHVM rendering on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:30] PROBLEM - HHVM busy threads on mw1104 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:24:31] PROBLEM - HHVM queue size on mw1226 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:24:31] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.427 second response time [17:24:31] PROBLEM - HHVM rendering on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:32] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:36] Hi, https://dpaste.de/0SKg Here is the output of installation 
checking of pear as given at http://pear.php.net/manual/en/installation.checking.php [17:24:43] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:43] PROBLEM - HHVM rendering on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:51] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 0.222 second response time [17:24:51] PROBLEM - HHVM busy threads on mw1110 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:24:51] PROBLEM - HHVM rendering on mw1214 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.212 second response time [17:24:51] PROBLEM - HHVM busy threads on mw1059 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:24:51] PROBLEM - HHVM busy threads on mw1043 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:24:52] PROBLEM - HHVM rendering on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:52] PROBLEM - HHVM busy threads on mw1162 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:24:52] PROBLEM - HHVM busy threads on mw1046 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:24:52] PROBLEM - HHVM queue size on mw1248 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [80.0] [17:24:53] PROBLEM - HHVM rendering on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:53] PROBLEM - HHVM rendering on mw1056 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.330 second response time [17:24:54] PROBLEM - HHVM rendering on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:01] RECOVERY - HHVM rendering on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 66265 bytes in 0.174 second response time [17:25:01] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:02] PROBLEM - HHVM rendering on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:02] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 66265 bytes in 0.175 second response time [17:25:02] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.067 second response time [17:25:02] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 3080 bytes in 0.597 second response time [17:25:03] PROBLEM - HHVM rendering on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 9.608 second response time [17:25:03] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:04] PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:04] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:05] JacksonIsaac: I think you might be in the wrong channel? 
[17:25:05] PROBLEM - HHVM rendering on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:05] PROBLEM - HHVM busy threads on mw1177 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:25:11] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:12] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.168 second response time [17:25:12] PROBLEM - HHVM busy threads on mw1066 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:25:12] PROBLEM - HHVM busy threads on mw1087 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:25:12] PROBLEM - HHVM busy threads on mw1182 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [115.2] [17:25:12] PROBLEM - HHVM busy threads on mw1033 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:25:23] PROBLEM - HHVM rendering on mw1217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.078 second response time [17:25:23] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:24] Josve05a: wrong channel? [17:25:24] PROBLEM - HHVM busy threads on mw1164 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:25:24] PROBLEM - HHVM busy threads on mw1075 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:25:25] PROBLEM - HHVM queue size on mw1245 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:25:25] PROBLEM - HHVM busy threads on mw1022 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:25:26] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: puppet fail [17:25:26] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.557 second response time [17:25:27] PROBLEM - HHVM rendering on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.082 second response time [17:25:27] PROBLEM - HHVM rendering on mw1212 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.081 second response time [17:25:28] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:28] PROBLEM - HHVM queue size on mw1257 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [17:25:29] PROBLEM - HHVM queue size on mw1240 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:25:29] RECOVERY - HHVM rendering on mw1036 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 2.766 second response time [17:25:30] PROBLEM - HHVM rendering on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.080 second response time [17:25:31] RECOVERY - HHVM rendering on mw1074 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 2.288 second response time [17:25:31] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 64154 bytes in 1.846 second response time [17:25:31] PROBLEM - Apache HTTP on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:37] legoktm: bd808 asked me to join here [17:25:38] andre__: Joke on the servers being down... 
[17:25:39] it looks like we may want to send out a tweet per https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public - if so, let me know (or email communications@) [17:25:43] PROBLEM - HHVM busy threads on mw1218 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:25:43] PROBLEM - HHVM rendering on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:44] PROBLEM - HHVM busy threads on mw1103 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:25:44] PROBLEM - HHVM rendering on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 5.801 second response time [17:25:45] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.670 second response time [17:25:45] PROBLEM - HHVM rendering on mw1219 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.086 second response time [17:25:46] PROBLEM - HHVM queue size on mw1254 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:25:46] PROBLEM - HHVM rendering on mw1049 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.086 second response time [17:25:47] JacksonIsaac: just bad timing :) [17:25:49] HaeB: guillom is already on it [17:25:51] PROBLEM - HHVM busy threads on mw1034 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:25:52] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: puppet fail [17:25:52] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: puppet fail [17:25:52] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:52] PROBLEM - HHVM busy threads on mw1081 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:25:52] PROBLEM - HHVM rendering on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:52] PROBLEM - HHVM rendering on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.073 second response time [17:25:52] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.580 second response time [17:25:53] PROBLEM - HHVM busy threads on mw1023 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:25:53] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:54] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: puppet fail [17:25:54] PROBLEM - HHVM busy threads on mw1167 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:25:55] PROBLEM - HHVM busy threads on mw1092 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:25:55] PROBLEM - HHVM busy threads on mw1054 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:26:11] RECOVERY - HHVM rendering on mw1071 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 7.665 second response time [17:26:11] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [17:26:11] RECOVERY - HHVM rendering on mw1151 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.199 second response time [17:26:11] PROBLEM - HHVM busy threads on mw1080 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:26:11] PROBLEM - HHVM rendering on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:11] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [17:26:12] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.969 second response time [17:26:12] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: puppet fail [17:26:12] PROBLEM - HHVM busy threads on mw1042 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:26:13] PROBLEM - HHVM busy threads on mw1041 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:26:13] RECOVERY - HHVM rendering on mw1110 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 6.317 second response time [17:26:17] paravoid: cool [17:26:18] * JacksonIsaac icinga is raining with messages [17:26:21] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.763 second response time [17:26:21] PROBLEM - HHVM busy threads on mw1064 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:26:22] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:22] PROBLEM - HHVM busy threads on mw1057 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:26:22] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:26:22] HaeB: Feel free to do so; I just sent a message to the list [17:26:25] !log repooled cp1063 frontend-only [17:26:31] PROBLEM - HHVM busy threads on mw1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:26:31] PROBLEM - HHVM rendering on mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.082 second response time [17:26:31] RECOVERY - HHVM rendering on mw1094 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.187 second response time [17:26:32] PROBLEM - HHVM busy threads on mw1112 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [17:26:32] PROBLEM - HHVM rendering on mw1068 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.078 second response time [17:26:32] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.163 second response time [17:26:32] PROBLEM - HHVM busy threads on mw1212 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:26:33] RECOVERY - HHVM rendering on mw1109 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.958 second response time [17:26:33] PROBLEM - HHVM busy threads on mw1063 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:26:34] Logged the message, Master [17:26:34] RECOVERY - HHVM rendering on mw1089 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.186 second response time [17:26:34] PROBLEM - HHVM rendering on mw1082 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 5.732 second response time [17:26:41] RECOVERY - HHVM rendering on mw1087 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 2.583 second response time [17:26:41] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 6.217 second response time [17:26:42] PROBLEM - HHVM busy threads on mw1084 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:26:42] PROBLEM - HHVM busy threads on mw1213 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:26:42] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 4.741 second response time [17:26:43] PROBLEM - HHVM rendering on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [17:26:43] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.223 second response time [17:26:43] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.900 second response time [17:26:43] PROBLEM - HHVM busy threads on mw1082 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:26:44] PROBLEM - HHVM rendering on mw1045 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.370 second response time [17:26:51] PROBLEM - HHVM queue size on mw1242 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [17:26:51] PROBLEM - HHVM busy threads on mw1091 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:26:51] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2971 bytes in 0.040 second response time [17:26:57] PROBLEM - HHVM rendering on mw1183 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.078 second response time [17:26:57] PROBLEM - HHVM queue size on mw1237 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:26:57] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2930 bytes in 0.025 second response time [17:27:04] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:04] PROBLEM - HHVM rendering on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:05] PROBLEM - HHVM rendering on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:05] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 9.985 second response time [17:27:10] HaeB: Too late... https://twitter.com/JonatanGlad/status/563388304591429632 [17:27:14] RECOVERY - HHVM rendering on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.156 second response time [17:27:14] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 64147 bytes in 0.184 second response time [17:27:15] RECOVERY - HHVM rendering on mw1186 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.194 second response time [17:27:15] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2981 bytes in 0.096 second response time [17:27:24] PROBLEM - HHVM rendering on mw1064 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.818 second response time [17:27:24] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.097 second response time [17:27:24] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.952 second response time [17:27:24] PROBLEM - HHVM rendering on mw1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 4.536 second response time [17:27:24] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.539 second response time [17:27:32] is that really coming back up? 
ganglia does not look promising [17:27:35] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:27:35] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.176 second response time [17:27:36] PROBLEM - HHVM busy threads on mw1098 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [17:27:36] PROBLEM - HHVM rendering on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:36] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [17:27:41] I was first :P [17:27:44] PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 4.954 second response time [17:27:44] https://twitter.com/Wikimedia/status/563388375898411008 [17:27:44] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.775 second response time [17:27:45] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.084 second response time [17:27:45] PROBLEM - HHVM busy threads on mw1211 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:27:45] PROBLEM - HHVM busy threads on mw1185 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:27:45] RECOVERY - HHVM rendering on mw1215 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 2.179 second response time [17:27:45] PROBLEM - HHVM busy threads on mw1105 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:27:54] RECOVERY - HHVM rendering on mw1217 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.201 second response time [17:27:54] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.189 second response time [17:27:54] RECOVERY - HHVM rendering on mw1061 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.187 second response time [17:27:54] RECOVERY - HHVM rendering on mw1212 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.213 second response time [17:27:55] RECOVERY - HHVM rendering on mw1078 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 3.449 second response time [17:27:55] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 2.978 second response time [17:27:55] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.553 second response time [17:27:56] PROBLEM - HHVM queue size on mw1252 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [17:28:05] PROBLEM - HHVM busy threads on mw1175 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [115.2] [17:28:05] PROBLEM - HHVM rendering on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:05] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.149 second response time [17:28:05] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.091 second response time [17:28:05] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.225 second response time [17:28:05] PROBLEM - HHVM busy threads on mw1172 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [115.2] [17:28:06] PROBLEM - HHVM rendering on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:14] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP 
OK: HTTP/1.1 200 OK - 17115 bytes in 6.444 second response time [17:28:20] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.622 second response time [17:28:21] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.446 second response time [17:28:21] PROBLEM - HHVM busy threads on mw1184 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:28:21] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66428 bytes in 0.081 second response time [17:28:27] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.447 second response time [17:28:27] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.096 second response time [17:28:27] PROBLEM - HHVM queue size on mw1258 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [17:28:27] PROBLEM - HHVM queue size on mw1255 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [17:28:38] RECOVERY - HHVM rendering on mw1069 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.186 second response time [17:28:38] PROBLEM - HHVM busy threads on mw1019 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:28:38] PROBLEM - HHVM queue size on mw1241 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [17:28:47] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:47] RECOVERY - HHVM rendering on mw1064 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.197 second response time [17:28:47] RECOVERY - HHVM rendering on mw1048 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.190 second response time [17:28:47] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 2.018 second response time [17:28:47] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.129 second response time [17:28:48] RECOVERY - HHVM rendering on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 2.702 second response time [17:28:48] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 64181 bytes in 0.172 second response time [17:28:57] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.602 second response time [17:28:57] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:57] PROBLEM - HHVM rendering on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:57] PROBLEM - HHVM rendering on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:57] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:57] PROBLEM - HHVM rendering on mw1102 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.404 second response time [17:28:58] RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 64171 bytes in 3.383 second response time [17:28:58] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:04] <_joe_> paravoid: which one did you restart? 
[17:29:08] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:08] PROBLEM - HHVM busy threads on mw1077 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:29:08] RECOVERY - HHVM rendering on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 64171 bytes in 0.197 second response time [17:29:08] PROBLEM - HHVM rendering on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.080 second response time [17:29:08] PROBLEM - HHVM rendering on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:08] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:09] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 64171 bytes in 0.195 second response time [17:29:18] RECOVERY - HHVM rendering on mw1095 is OK: HTTP OK: HTTP/1.1 200 OK - 64171 bytes in 0.206 second response time [17:29:18] RECOVERY - HHVM rendering on mw1068 is OK: HTTP OK: HTTP/1.1 200 OK - 64171 bytes in 0.402 second response time [17:29:18] PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:19] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:19] PROBLEM - HHVM rendering on mw1085 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.074 second response time [17:29:19] PROBLEM - HHVM rendering on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 2.107 second response time [17:29:19] PROBLEM - HHVM rendering on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 2.788 second response time [17:29:19] PROBLEM - HHVM busy threads on mw1029 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:29:20] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.060 second response time [17:29:20] PROBLEM - HHVM rendering on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:37] PROBLEM - HHVM busy threads on mw1051 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:29:37] PROBLEM - HHVM rendering on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.067 second response time [17:29:37] PROBLEM - HHVM rendering on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] PROBLEM - HHVM rendering on mw1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] PROBLEM - HHVM busy threads on mw1188 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:29:47] PROBLEM - HHVM busy threads on mw1068 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:29:47] PROBLEM - HHVM rendering on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] PROBLEM - HHVM rendering on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] PROBLEM - HHVM busy threads on mw1179 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:29:48] PROBLEM - HHVM rendering on mw1214 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 2.337 second response time [17:29:48] PROBLEM - HHVM rendering on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.098 second response time [17:29:49] PROBLEM - HHVM 
rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:57] PROBLEM - HHVM rendering on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.078 second response time [17:29:57] PROBLEM - HHVM rendering on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.082 second response time [17:29:57] RECOVERY - HHVM rendering on mw1073 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.193 second response time [17:29:58] RECOVERY - HHVM rendering on mw1080 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.189 second response time [17:29:58] RECOVERY - HHVM rendering on mw1063 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 2.628 second response time [17:29:58] RECOVERY - HHVM rendering on mw1108 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 3.121 second response time [17:29:58] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 5.109 second response time [17:29:59] PROBLEM - HHVM rendering on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:59] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:00] PROBLEM - HHVM rendering on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:08] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:08] PROBLEM - HHVM busy threads on mw1161 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [115.2] [17:30:08] PROBLEM - HHVM busy threads on mw1049 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4] [17:30:08] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:17] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.936 second response time [17:30:17] PROBLEM - HHVM rendering on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:17] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:30:17] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.202 second response time [17:30:18] RECOVERY - HHVM rendering on mw1104 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.196 second response time [17:30:18] PROBLEM - HHVM rendering on mw1094 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.092 second response time [17:30:18] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.845 second response time [17:30:18] RECOVERY - HHVM rendering on mw1070 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 1.514 second response time [17:30:18] PROBLEM - HHVM rendering on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.801 second response time [17:30:19] 3ops-codfw, operations: Update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018355 (10RobH) a:5RobH>3Papaul Papaul, Please confirm the bios settings on mc2001-2008 and ensure they have the CPU logical processor enabled (hyperthreading). Then confir... 
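(The mc2001-2018 task above asks Papaul to confirm in the BIOS that logical processors, i.e. hyperthreading, are enabled. As an aside, once a box is booted there is also a quick OS-level cross-check: /proc/cpuinfo reports more "siblings" than "cpu cores" per package when hyperthreading is on. The small sketch below complements, rather than replaces, the BIOS check the task asks for.)

# Sanity check from a running host: hyperthreading is enabled when the number
# of logical siblings per physical package exceeds the number of physical cores.
def hyperthreading_enabled(cpuinfo_path="/proc/cpuinfo"):
    fields = {}
    for line in open(cpuinfo_path):
        if ":" in line:
            key, _, value = line.partition(":")
            fields.setdefault(key.strip(), value.strip())  # first CPU entry is enough
    return int(fields.get("siblings", 1)) > int(fields.get("cpu cores", 1))

print("hyperthreading enabled:", hyperthreading_enabled())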
[17:30:19] PROBLEM - HHVM rendering on mw1040 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.072 second response time [17:30:28] RECOVERY - HHVM rendering on mw1085 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.186 second response time [17:30:28] PROBLEM - HHVM rendering on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.074 second response time [17:30:28] PROBLEM - HHVM rendering on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 4.751 second response time [17:30:28] RECOVERY - HHVM rendering on mw1103 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.185 second response time [17:30:28] PROBLEM - HHVM busy threads on mw1165 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:30:29] PROBLEM - HHVM rendering on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:29] RECOVERY - HHVM rendering on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.207 second response time [17:30:29] RECOVERY - HHVM rendering on mw1105 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.244 second response time [17:30:37] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.818 second response time [17:30:37] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 2.355 second response time [17:30:37] PROBLEM - HHVM rendering on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:37] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:38] PROBLEM - HHVM busy threads on mw1215 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:30:38] PROBLEM - HHVM rendering on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:38] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.192 second response time [17:30:38] RECOVERY - HHVM rendering on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.213 second response time [17:30:38] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66437 bytes in 0.016 second response time [17:30:44] PROBLEM - HHVM rendering on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:50] 3ops-codfw, operations: Update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018359 (10RobH) p:5Normal>3High [17:30:57] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 2.907 second response time [17:30:57] PROBLEM - HHVM busy threads on mw1163 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:30:58] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:30:58] PROBLEM - HHVM rendering on mw1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.079 second response time [17:30:58] PROBLEM - HHVM busy threads on mw1217 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:30:58] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:58] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.060 second response time [17:31:08] PROBLEM - HHVM busy threads on mw1173 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:31:08] PROBLEM - HHVM 
rendering on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:08] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:08] RECOVERY - HHVM rendering on mw1086 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 3.086 second response time [17:31:09] PROBLEM - HHVM rendering on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.093 second response time [17:31:09] RECOVERY - HHVM rendering on mw1102 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.183 second response time [17:31:09] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2998 bytes in 0.381 second response time [17:31:27] RECOVERY - HHVM rendering on mw1098 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.200 second response time [17:31:28] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.586 second response time [17:31:28] PROBLEM - HHVM busy threads on mw1085 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:31:28] PROBLEM - HHVM busy threads on mw1021 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:31:28] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:37] PROBLEM - HHVM rendering on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.086 second response time [17:31:37] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 3030 bytes in 0.600 second response time [17:31:44] RECOVERY - HHVM rendering on mw1092 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.216 second response time [17:31:44] RECOVERY - HHVM rendering on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.228 second response time [17:31:44] RECOVERY - HHVM rendering on mw1059 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.194 second response time [17:31:44] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.569 second response time [17:31:44] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.190 second response time [17:31:47] PROBLEM - HHVM rendering on mw1087 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 2.160 second response time [17:31:47] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 1.624 second response time [17:31:47] PROBLEM - HHVM rendering on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:47] PROBLEM - HHVM rendering on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:47] PROBLEM - HHVM rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:47] PROBLEM - HHVM rendering on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:47] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:48] PROBLEM - HHVM rendering on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:08] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:14] RECOVERY - HHVM rendering on mw1183 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.193 second response time [17:32:14] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.191 second response time 
[17:32:14] RECOVERY - HHVM rendering on mw1051 is OK: HTTP OK: HTTP/1.1 200 OK - 64162 bytes in 0.188 second response time [17:32:14] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:15] PROBLEM - HHVM rendering on mw1077 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 1.257 second response time [17:32:15] PROBLEM - HHVM rendering on mw1174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.083 second response time [17:32:15] RECOVERY - HHVM rendering on mw1079 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 4.959 second response time [17:32:15] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:15] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.375 second response time [17:32:17] PROBLEM - HHVM rendering on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:17] PROBLEM - HHVM rendering on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:17] PROBLEM - HHVM rendering on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:27] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66460 bytes in 0.042 second response time [17:32:33] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:32:34] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [17:32:34] RECOVERY - HHVM rendering on mw1071 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 1.861 second response time [17:32:34] PROBLEM - HHVM rendering on mw1186 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 6.729 second response time [17:32:37] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.293 second response time [17:32:37] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 3040 bytes in 0.515 second response time [17:32:57] PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.736 second response time [17:32:58] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.313 second response time [17:32:58] PROBLEM - HHVM busy threads on mw1070 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:32:58] PROBLEM - HHVM rendering on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.072 second response time [17:33:07] PROBLEM - HHVM rendering on mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.078 second response time [17:33:07] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:07] Looks bad ^ [17:33:07] RECOVERY - HHVM rendering on mw1094 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 2.786 second response time [17:33:07] RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.184 second response time [17:33:08] PROBLEM - HHVM rendering on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.076 second response time [17:33:08] PROBLEM - HHVM rendering on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 2.436 second response time [17:33:08] PROBLEM - LVS HTTP IPv6 on 
text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2986 bytes in 0.307 second response time [17:33:14] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.211 second response time [17:33:14] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2988 bytes in 0.382 second response time [17:33:14] seriously, what's happening? [17:33:21] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.892 second response time [17:33:21] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:30] PROBLEM - HHVM busy threads on mw1176 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [115.2] [17:33:31] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.093 second response time [17:33:31] things are rebooting. the ops folks are working on it. [17:33:40] PROBLEM - HHVM busy threads on mw1151 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:33:40] RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.823 second response time [17:33:40] RECOVERY - HHVM rendering on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 64157 bytes in 8.624 second response time [17:33:40] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.414 second response time [17:33:40] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:41] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.470 second response time [17:33:41] RECOVERY - HHVM rendering on mw1150 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 3.601 second response time [17:33:41] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 66273 bytes in 0.268 second response time [17:33:41] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 66273 bytes in 0.414 second response time [17:33:42] RECOVERY - HHVM rendering on mw1099 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 9.274 second response time [17:33:42] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:43] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [17:33:54] PROBLEM - HHVM busy threads on mw1219 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:33:54] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:54] quiddity: what, servers or services on it? 
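(The "!log restarting HHVM on all appservers/API appservers in 10%/6s batches" entry just below describes the recovery action: restart roughly a tenth of the fleet at a time and wait six seconds between batches, so most backends stay pooled while each batch comes back. A minimal sketch of that pattern, assuming only a plain host list and SSH access; the log does not say which orchestration tool ops actually used.)

import math, subprocess, time

# Illustrative rolling restart in 10% batches with a 6 s pause between them.
# "appservers.txt" and the restart command are assumptions for the sketch.
def rolling_restart(hosts, batch_pct=10, pause=6, cmd="sudo service hhvm restart"):
    batch_size = max(1, int(math.ceil(len(hosts) * batch_pct / 100.0)))
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        procs = [subprocess.Popen(["ssh", host, cmd]) for host in batch]  # one batch in parallel
        for proc in procs:
            proc.wait()
        time.sleep(pause)  # let the batch rejoin before touching the next one

hosts = [line.strip() for line in open("appservers.txt") if line.strip()]
rolling_restart(hosts)

(Keeping each batch small caps how many backends are cold at once, which is consistent with the way RECOVERY and fresh socket-timeout lines keep interleaving below while the restart sweeps across the fleet.)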
[17:33:55] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.211 second response time [17:33:55] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 4.083 second response time [17:33:56] SPF|Cloud: a report will be made public later [17:33:56] RECOVERY - HHVM rendering on mw1186 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.345 second response time [17:33:56] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.221 second response time [17:33:57] SPF|Cloud: Power outage [17:33:57] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.049 second response time [17:33:57] RECOVERY - HHVM rendering on mw1179 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.317 second response time [17:33:58] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.200 second response time [17:33:58] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 2.316 second response time [17:33:58] !log restarting HHVM on all appservers/API appservers in 10%/6s batches [17:34:00] RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [17:34:01] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [17:34:01] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [17:34:01] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.156 second response time [17:34:01] RECOVERY - HHVM rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.151 second response time [17:34:01] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 66267 bytes in 9.127 second response time [17:34:02] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 66268 bytes in 9.594 second response time [17:34:02] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.048 second response time [17:34:06] Logged the message, Master [17:34:14] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [17:34:14] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [17:34:15] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.104 second response time [17:34:15] RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.216 second response time [17:34:16] RECOVERY - HHVM rendering on mw1110 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.345 second response time [17:34:20] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.532 second response time [17:34:20] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 66268 bytes in 2.348 second response time [17:34:20] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.246 second response time [17:34:20] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.281 second response time [17:34:20] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.128 second response time [17:34:20] RECOVERY - Apache HTTP 
on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.138 second response time [17:34:20] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.361 second response time [17:34:21] RECOVERY - HHVM rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 2.004 second response time [17:34:21] RECOVERY - Apache HTTP on mw1250 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.680 second response time [17:34:22] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 4.641 second response time [17:34:22] PROBLEM - HHVM busy threads on mw1021 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:34:23] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.097 second response time [17:34:30] SPF|Cloud: Looks like a lot of stuff is recovering now [17:34:34] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.335 second response time [17:34:34] RECOVERY - HHVM rendering on mw1078 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.332 second response time [17:34:35] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.482 second response time [17:34:35] RECOVERY - HHVM rendering on mw1164 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.788 second response time [17:34:36] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64332 bytes in 1.216 second response time [17:34:37] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.128 second response time [17:34:37] RECOVERY - HHVM rendering on mw1093 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.246 second response time [17:34:37] RECOVERY - HHVM rendering on mw1037 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.345 second response time [17:34:38] RECOVERY - HHVM rendering on mw1087 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.593 second response time [17:34:38] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 66274 bytes in 1.687 second response time [17:34:39] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 66267 bytes in 1.767 second response time [17:34:39] PROBLEM - HHVM busy threads on mw1025 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:34:41] RECOVERY - HHVM rendering on mw1089 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 1.070 second response time [17:34:41] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 66268 bytes in 2.109 second response time [17:34:41] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 66274 bytes in 2.160 second response time [17:34:41] RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 1.361 second response time [17:34:42] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.983 second response time [17:34:42] PROBLEM - HHVM rendering on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:49] yeah, enwiki is throwing 503 Service Unavailable errors, but thanks for the update [17:34:57] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.105 second response time [17:34:57] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.466 second response time [17:34:57] PROBLEM - HHVM busy 
threads on mw1020 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:34:57] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 9.677 second response time [17:34:57] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 64673 bytes in 9.795 second response time [17:34:57] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.332 second response time [17:34:57] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.675 second response time [17:35:00] PROBLEM - HHVM rendering on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:00] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time [17:35:00] RECOVERY - HHVM rendering on mw1035 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.171 second response time [17:35:00] RECOVERY - HHVM rendering on mw1084 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.205 second response time [17:35:00] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.077 second response time [17:35:00] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.621 second response time [17:35:01] RECOVERY - HHVM rendering on mw1049 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.211 second response time [17:35:02] SPF|Cloud: we know :) [17:35:12] RECOVERY - HHVM rendering on mw1044 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.690 second response time [17:35:12] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.953 second response time [17:35:13] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.139 second response time [17:35:13] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:14] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.172 second response time [17:35:14] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.200 second response time [17:35:15] PROBLEM - HHVM busy threads on mw1183 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:35:20] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:20] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:20] RECOVERY - HHVM rendering on mw1088 is OK: HTTP OK: HTTP/1.1 200 OK - 66273 bytes in 0.169 second response time [17:35:20] RECOVERY - HHVM rendering on mw1149 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.223 second response time [17:35:20] RECOVERY - HHVM rendering on mw1033 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.334 second response time [17:35:21] RECOVERY - HHVM rendering on mw1046 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.168 second response time [17:35:21] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.211 second response time [17:35:21] RECOVERY - HHVM rendering on mw1108 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.172 second response time [17:35:22] RECOVERY - HHVM rendering on mw1080 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.254 second response time [17:35:22] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.540 second response time [17:35:23] 
RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 64157 bytes in 3.414 second response time [17:35:23] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [17:35:24] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.171 second response time [17:35:24] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 66267 bytes in 0.230 second response time [17:35:35] anyone looking? [17:35:37] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.212 second response time [17:35:40] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:44] coming back [17:35:51] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time [17:35:51] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.082 second response time [17:35:51] RECOVERY - HHVM rendering on mw1038 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.412 second response time [17:35:51] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66473 bytes in 1.268 second response time [17:35:51] aude: ops are on it [17:35:53] aude: yes, being worked on [17:35:53] <_joe_> aude: we all are [17:35:57] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 4.089 second response time [17:35:57] RECOVERY - HHVM rendering on mw1100 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.192 second response time [17:35:57] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66479 bytes in 0.481 second response time [17:36:03] RECOVERY - HHVM rendering on mw1118 is OK: HTTP OK: HTTP/1.1 200 OK - 64157 bytes in 3.726 second response time [17:36:03] PROBLEM - HHVM rendering on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:03] PROBLEM - HHVM rendering on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:03] PROBLEM - Apache HTTP on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:14] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.118 second response time [17:36:14] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [17:36:15] ok [17:36:24] RECOVERY - HHVM rendering on mw1097 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.205 second response time [17:36:24] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:36:24] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:24] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:36:33] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.191 second response time [17:36:33] RECOVERY - HHVM rendering on mw1083 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.188 second response time [17:36:34] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.585 second response time [17:36:34] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:43] PROBLEM - HHVM rendering on mw1184 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [17:36:43] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:36:43] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.243 second response time [17:36:43] RECOVERY - HHVM rendering on mw1043 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.179 second response time [17:36:44] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:36:53] PROBLEM - HHVM rendering on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:53] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.103 second response time [17:36:54] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:03] RECOVERY - HHVM rendering on mw1042 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.196 second response time [17:37:04] RECOVERY - HHVM rendering on mw1055 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.181 second response time [17:37:04] RECOVERY - HHVM rendering on mw1036 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.221 second response time [17:37:13] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:23] PROBLEM - Apache HTTP on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:23] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.203 second response time [17:37:23] RECOVERY - HHVM rendering on mw1047 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.181 second response time [17:37:34] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.187 second response time [17:37:34] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.191 second response time [17:37:34] PROBLEM - Apache HTTP on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:34] PROBLEM - Apache HTTP on mw1253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:35] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 1.576 second response time [17:37:35] PROBLEM - Apache HTTP on mw1227 is CRITICAL: Connection timed out [17:37:43] PROBLEM - HHVM rendering on mw1224 is CRITICAL: Connection timed out [17:37:44] PROBLEM - HHVM rendering on mw1235 is CRITICAL: Connection timed out [17:37:44] PROBLEM - Apache HTTP on mw1223 is CRITICAL: Connection timed out [17:37:44] PROBLEM - HHVM rendering on mw1200 is CRITICAL: Connection timed out [17:37:45] PROBLEM - HHVM rendering on mw1194 is CRITICAL: Connection timed out [17:37:45] PROBLEM - HHVM rendering on mw1222 is CRITICAL: Connection timed out [17:37:54] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:54] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:54] PROBLEM - Apache HTTP on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:54] PROBLEM - HHVM rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:54] PROBLEM - HHVM rendering on mw1237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:54] PROBLEM - HHVM rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:54] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:55] PROBLEM - HHVM rendering on 
mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:55] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:56] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:56] PROBLEM - HHVM rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:57] PROBLEM - HHVM rendering on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:08] PROBLEM - Apache HTTP on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:08] PROBLEM - Apache HTTP on mw1145 is CRITICAL: Connection timed out [17:38:08] hmm... [17:38:09] PROBLEM - Apache HTTP on mw1123 is CRITICAL: Connection timed out [17:38:09] PROBLEM - HHVM rendering on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - HHVM rendering on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - Apache HTTP on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - Apache HTTP on mw1236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:13] PROBLEM - Apache HTTP on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:14] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:14] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:15] PROBLEM - HHVM rendering on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:15] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:16] PROBLEM - Apache HTTP on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - HHVM rendering on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - HHVM rendering on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:33] PROBLEM - Apache HTTP on mw1134 is CRITICAL: Connection timed out [17:38:34] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:34] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:35] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:35] PROBLEM - HHVM rendering on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:36] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:53] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:53] PROBLEM - HHVM rendering on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:53] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:53] 
PROBLEM - HHVM rendering on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:54] PROBLEM - HHVM busy threads on mw1169 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:39:03] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:03] PROBLEM - HHVM rendering on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:03] PROBLEM - HHVM rendering on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:04] PROBLEM - HHVM rendering on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:13] PROBLEM - Apache HTTP on mw1214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:13] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:13] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:14] PROBLEM - HHVM rendering on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:14] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:14] PROBLEM - HHVM busy threads on mw1187 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:39:14] PROBLEM - HHVM rendering on mw1134 is CRITICAL: Connection timed out [17:39:23] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:24] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.233 second response time [17:39:24] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:24] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:24] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:39:33] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:33] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: puppet fail [17:39:34] PROBLEM - HHVM rendering on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:34] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:34] RECOVERY - HHVM rendering on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 5.611 second response time [17:39:34] PROBLEM - HHVM rendering on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:34] PROBLEM - HHVM rendering on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:34] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:44] PROBLEM - HHVM rendering on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:44] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 5.978 second response time [17:39:44] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.184 second response time [17:39:44] PROBLEM - HHVM busy threads on mw1178 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:39:44] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:44] PROBLEM - HHVM rendering on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:44] PROBLEM - HHVM rendering on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:45] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [17:39:45] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:46] PROBLEM - HHVM rendering on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:53] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:53] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:53] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 5.229 second response time [17:39:54] PROBLEM - HHVM busy threads on mw1096 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [17:39:55] Um..... Hello? [17:40:04] PROBLEM - HHVM busy threads on mw1167 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:40:04] PROBLEM - HHVM rendering on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:04] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:04] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:04] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:04] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:04] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:05] PROBLEM - HHVM rendering on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:05] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:06] PROBLEM - HHVM rendering on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:13] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:14] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:14] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:14] PROBLEM - Apache HTTP on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:21] Skeeball93: they're well aware ;) [17:40:24] PROBLEM - HHVM rendering on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 7.523 second response time [17:40:24] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:24] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:24] PROBLEM - HHVM rendering on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:24] PROBLEM - HHVM rendering on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:24] PROBLEM - HHVM rendering on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:24] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:25] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:25] RECOVERY - HHVM busy threads on mw1088 is OK: OK: Less than 30.00% above the threshold [57.6] [17:40:26] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 0.738 second response time [17:40:26] PROBLEM - HHVM busy threads on mw1064 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:40:33] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:34] PROBLEM - HHVM rendering on mw1110 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [17:40:35] PROBLEM - HHVM rendering on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.095 second response time [17:40:43] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.505 second response time [17:40:43] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:43] PROBLEM - HHVM rendering on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.075 second response time [17:40:44] PROBLEM - HHVM rendering on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:44] PROBLEM - HHVM rendering on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:44] PROBLEM - HHVM rendering on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:44] PROBLEM - HHVM rendering on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:44] PROBLEM - HHVM rendering on mw1068 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 4.498 second response time [17:40:45] PROBLEM - HHVM rendering on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:45] PROBLEM - HHVM rendering on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:48] 3ops-codfw, operations, hardware-requests, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1018399 (10RobH) p:5Triage>3Normal [17:40:48] I can see that [17:40:53] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 45 below the confidence bounds [17:40:53] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.966 second response time [17:40:54] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:54] PROBLEM - HHVM rendering on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:54] PROBLEM - HHVM rendering on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:54] PROBLEM - HHVM rendering on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:54] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 5.714 second response time [17:41:03] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.559 second response time [17:41:03] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.127 second response time [17:41:03] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:03] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:03] RECOVERY - HHVM busy threads on mw1051 is OK: OK: Less than 30.00% above the threshold [57.6] [17:41:04] PROBLEM - Redis on logstash1003 is CRITICAL: Connection timed out [17:41:04] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:04] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 66267 bytes in 0.192 second response time [17:41:05] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:05] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.509 second response time [17:41:07] 3WMF-Design, operations: Better WMF error pages - 
https://phabricator.wikimedia.org/T76560#1018401 (10VictorGrigas) Question: could we add a link to Kiwix.org during any outages? [17:41:09] 3ops-codfw, operations, hardware-requests: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#1018402 (10RobH) p:5Triage>3High [17:41:13] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.302 second response time [17:41:13] PROBLEM - HHVM rendering on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:14] PROBLEM - Apache HTTP on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:14] PROBLEM - HHVM rendering on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.097 second response time [17:41:14] PROBLEM - HHVM queue size on mw1081 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:41:23] RECOVERY - HHVM rendering on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 7.475 second response time [17:41:23] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:23] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 8.574 second response time [17:41:24] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:24] PROBLEM - HHVM rendering on mw1063 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 5.717 second response time [17:41:24] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 30.00% above the threshold [57.6] [17:41:24] PROBLEM - HHVM busy threads on mw1048 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:41:24] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 64156 bytes in 7.977 second response time [17:41:33] PROBLEM - HHVM rendering on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:33] PROBLEM - HHVM rendering on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:34] PROBLEM - HHVM rendering on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:34] PROBLEM - HHVM busy threads on mw1095 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:41:44] PROBLEM - HHVM rendering on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:44] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:45] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.653 second response time [17:41:45] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.190 second response time [17:41:45] PROBLEM - HHVM rendering on mw1036 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.102 second response time [17:41:45] RECOVERY - HHVM rendering on mw1212 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.416 second response time [17:41:45] PROBLEM - HHVM rendering on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.216 second response time [17:41:46] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:46] PROBLEM - HHVM rendering on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.975 second response time [17:41:47] RECOVERY - HHVM rendering on mw1109 is OK: HTTP OK: HTTP/1.1 200 OK - 66272 bytes in 2.196 second response time [17:41:47] PROBLEM - HHVM 
rendering on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.110 second response time [17:41:53] PROBLEM - HHVM rendering on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 6.833 second response time [17:41:53] PROBLEM - HHVM rendering on mw1070 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 6.921 second response time [17:41:53] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 7.721 second response time [17:41:54] RECOVERY - HHVM rendering on mw1068 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.760 second response time [17:41:54] PROBLEM - HHVM rendering on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:54] PROBLEM - Apache HTTP on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:54] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 3.003 second response time [17:41:54] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.714 second response time [17:41:55] PROBLEM - HHVM rendering on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:55] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.006 second response time [17:42:14] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:20] RECOVERY - HHVM busy threads on mw1081 is OK: OK: Less than 30.00% above the threshold [57.6] [17:42:20] PROBLEM - HHVM rendering on mw1084 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 6.290 second response time [17:42:21] PROBLEM - HHVM busy threads on mw1061 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:42:23] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 3.472 second response time [17:42:23] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:23] PROBLEM - HHVM rendering on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:23] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.017 second response time [17:42:34] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.191 second response time [17:42:34] PROBLEM - HHVM rendering on mw1034 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 3.368 second response time [17:42:34] RECOVERY - HHVM rendering on mw1080 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.258 second response time [17:42:35] RECOVERY - HHVM rendering on mw1071 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 7.529 second response time [17:42:35] PROBLEM - HHVM rendering on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:35] RECOVERY - HHVM rendering on mw1102 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.214 second response time [17:42:43] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:44] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.999 second response time [17:42:44] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:44] PROBLEM - HHVM rendering on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:54] PROBLEM - Apache HTTP on mw1058 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [17:42:54] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.202 second response time [17:42:54] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.986 second response time [17:43:03] RECOVERY - HHVM rendering on mw1094 is OK: HTTP OK: HTTP/1.1 200 OK - 64678 bytes in 6.396 second response time [17:43:03] RECOVERY - HHVM busy threads on mw1030 is OK: OK: Less than 30.00% above the threshold [57.6] [17:43:03] PROBLEM - HHVM rendering on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 0.326 second response time [17:43:03] RECOVERY - HHVM rendering on mw1091 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 8.436 second response time [17:43:04] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [17:43:13] RECOVERY - HHVM rendering on mw1090 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 8.956 second response time [17:43:13] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.442 second response time [17:43:13] PROBLEM - HHVM rendering on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:13] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2971 bytes in 0.161 second response time [17:43:19] RECOVERY - HHVM rendering on mw1099 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.910 second response time [17:43:19] PROBLEM - HHVM rendering on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:19] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.676 second response time [17:43:19] RECOVERY - HHVM busy threads on mw1032 is OK: OK: Less than 30.00% above the threshold [57.6] [17:43:19] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 2.906 second response time [17:43:20] PROBLEM - HHVM rendering on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:20] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2929 bytes in 0.004 second response time [17:43:25] RECOVERY - Redis on logstash1003 is OK: TCP OK - 3.004 second response time on port 6379 [17:43:25] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:26] PROBLEM - HHVM rendering on mw1079 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1476 bytes in 0.079 second response time [17:43:26] RECOVERY - HHVM rendering on mw1084 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.281 second response time [17:43:26] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:26] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:26] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:27] PROBLEM - HHVM rendering on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:27] PROBLEM - HHVM rendering on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:37] PROBLEM - HHVM busy threads on mw1162 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [115.2] [17:43:37] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 
Service Unavailable - 2981 bytes in 0.299 second response time [17:43:46] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:46] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.722 second response time [17:43:46] PROBLEM - HHVM rendering on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:46] PROBLEM - Apache HTTP on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:46] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:43:46] RECOVERY - HHVM rendering on mw1034 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.191 second response time [17:43:56] RECOVERY - HHVM rendering on mw1056 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.664 second response time [17:43:56] RECOVERY - HHVM rendering on mw1048 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 4.623 second response time [17:43:57] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 3038 bytes in 0.598 second response time [17:44:02] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.070 second response time [17:44:16] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.101 second response time [17:44:27] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.212 second response time [17:44:27] PROBLEM - HHVM rendering on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:28] PROBLEM - HHVM rendering on mw1082 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 6.616 second response time [17:44:28] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 2.069 second response time [17:44:36] RECOVERY - HHVM rendering on mw1095 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 8.777 second response time [17:44:36] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 6.225 second response time [17:44:36] PROBLEM - HHVM rendering on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:37] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 8.624 second response time [17:44:37] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66425 bytes in 0.016 second response time [17:44:43] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:46] RECOVERY - HHVM rendering on mw1079 is OK: HTTP OK: HTTP/1.1 200 OK - 66271 bytes in 1.173 second response time [17:44:46] RECOVERY - HHVM rendering on mw1097 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.906 second response time [17:44:56] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.844 second response time [17:44:58] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66460 bytes in 0.056 second response time [17:45:04] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [17:45:04] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:05] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.769 second response time 
[17:45:06] PROBLEM - HHVM rendering on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:16] paravoid: holaaa, free now [17:45:17] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66634 bytes in 0.735 second response time [17:45:23] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 9.953 second response time [17:45:23] PROBLEM - HHVM rendering on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:26] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:26] RECOVERY - HHVM rendering on mw1063 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 9.254 second response time [17:45:26] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:26] RECOVERY - HHVM rendering on mw1151 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.177 second response time [17:45:37] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.500 second response time [17:45:38] RECOVERY - HHVM rendering on mw1036 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.201 second response time [17:45:46] RECOVERY - HHVM rendering on mw1070 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.909 second response time [17:45:46] RECOVERY - HHVM rendering on mw1058 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 3.696 second response time [17:45:46] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:47] PROBLEM - HHVM queue size on mw1220 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [80.0] [17:45:47] PROBLEM - HHVM busy threads on mw1025 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [17:45:47] RECOVERY - HHVM rendering on mw1093 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.176 second response time [17:45:47] RECOVERY - HHVM rendering on mw1118 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.223 second response time [17:45:47] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:48] PROBLEM - HHVM rendering on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:48] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:56] PROBLEM - HHVM busy threads on mw1039 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:45:56] RECOVERY - HHVM rendering on mw1100 is OK: HTTP OK: HTTP/1.1 200 OK - 64157 bytes in 7.860 second response time [17:45:56] PROBLEM - HHVM rendering on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:56] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66428 bytes in 0.215 second response time [17:46:02] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1018407 (10gpaumier) >>! In T76560#1018401, @VictorGrigas wrote: > Question: could we add a link to Kiwix.org during any outages? kiwix.org is hosted externally, which is good because it'd be accessible if our sites are d... 
[17:46:02] PROBLEM - HHVM queue size on mw1116 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:46:03] PROBLEM - HHVM rendering on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:06] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.091 second response time [17:46:06] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:06] PROBLEM - HHVM rendering on mw1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:06] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [17:46:06] PROBLEM - HHVM rendering on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:17] PROBLEM - NTP on cp1063 is CRITICAL: NTP CRITICAL: Offset unknown [17:46:17] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.879 second response time [17:46:26] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.320 second response time [17:46:26] PROBLEM - HHVM rendering on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:26] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:26] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:26] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:27] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:27] PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:27] RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 1.269 second response time [17:46:35] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1018408 (10Technical13) I notice that the current Error: 503, Service Unavailable at Thu, 05 Feb 2015 17:36:37 GMT page isn't HTML5 compliant whereas it uses `
` in which all o... [17:46:36] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.362 second response time [17:46:36] PROBLEM - HHVM rendering on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:37] RECOVERY - HHVM rendering on mw1073 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 9.207 second response time [17:46:37] PROBLEM - HHVM busy threads on mw1049 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [17:46:37] PROBLEM - HHVM rendering on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:37] PROBLEM - HHVM queue size on mw1215 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:46:37] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.169 second response time [17:46:38] PROBLEM - HHVM rendering on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:38] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 8.886 second response time [17:46:39] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 9.352 second response time [17:46:39] PROBLEM - HHVM rendering on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:46] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:46] PROBLEM - Apache HTTP on mw1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:47] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:47] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:47] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:56] question from katherine: "Can we get a reason [for the outage]? Journalist inquiries coming in." (cc guillom, paravoid, joe) [17:46:57] PROBLEM - HHVM queue size on mw1214 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [17:46:57] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.636 second response time [17:47:06] RECOVERY - HHVM rendering on mw1082 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 4.609 second response time [17:47:06] RECOVERY - HHVM rendering on mw1087 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.178 second response time [17:47:06] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:07] RECOVERY - HHVM rendering on mw1078 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 5.960 second response time [17:47:07] PROBLEM - HHVM rendering on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:07] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.219 second response time [17:47:08] no idea [17:47:12] <^d> HaeB: Networking issue, fallout from that. 
[17:47:13] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.636 second response time [17:47:16] PROBLEM - HHVM rendering on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] PROBLEM - HHVM rendering on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:17] i see stuff int he logs like " database error has occurred." [17:47:17] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:17] PROBLEM - NTP on dataset1001 is CRITICAL: NTP CRITICAL: Offset unknown [17:47:19] !log ori Synchronized wmf-config/logging.php: Live hack: disable Logstash logging on suspicion that it is acting up (duration: 00m 05s) [17:47:24] ^ _joe_ [17:47:26] Logged the message, Master [17:47:26] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.536 second response time [17:47:26] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.164 second response time [17:47:27] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.147 second response time [17:47:27] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.953 second response time [17:47:27] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 64161 bytes in 2.263 second response time [17:47:27] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.604 second response time [17:47:27] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.777 second response time [17:47:28] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.032 second response time [17:47:28] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [17:47:29] RECOVERY - HHVM rendering on mw1077 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.248 second response time [17:47:29] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.301 second response time [17:47:29] paravoid: holaaaaaa [17:47:30] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 3.789 second response time [17:47:30] PROBLEM - HHVM rendering on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1477 bytes in 1.098 second response time [17:47:31] RECOVERY - HHVM rendering on mw1066 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 4.504 second response time [17:47:33] nuria: not now [17:47:39] paravoid: k [17:47:44] HaeB: I've already responded to her direct email. "Network issue" is all I know, and the people who know more are busy fixing it [17:47:46] MySQL server has gone away [17:48:00] nuria: site is down, all ops are working on ti [17:48:08] greg-g: waht??? [17:48:11] SORRY!!!! 
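(For context on the live hack ori logged above at 17:47:19: the quickest mitigation is to stop registering the Logstash handler so MediaWiki falls back to plain udp2log logging. The sketch below is illustrative only — the array layout, the $wmgEnableLogstashLogging switch and the host names are assumptions, not the actual contents of wmf-config/logging.php.)

```php
<?php
// Hypothetical excerpt of wmf-config/logging.php (names and structure are
// assumptions, not the real file). Emergency switch: flip to false and
// sync the file to stop shipping log events to Logstash while it is
// suspected of backing up the appservers.
$wmgEnableLogstashLogging = false;

$wmgMonologConfig = [ 'handlers' => [], 'loggers' => [] ];

// Plain-text logs keep flowing to the central log host as before.
$wmgMonologConfig['handlers']['udp2log'] = [
	'class' => \MediaWiki\Logger\Monolog\LegacyHandler::class,
	'args'  => [ 'udp://udplog.example.internal:8420' ],
];

if ( $wmgEnableLogstashLogging ) {
	// Skipped during the incident, so nothing gets queued towards Logstash.
	$wmgMonologConfig['handlers']['logstash'] = [
		'class' => \Monolog\Handler\RedisHandler::class,
		'args'  => [ /* redis client + queue key, elided */ ],
	];
}
```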
[17:48:13] :) :) [17:48:17] s'ok [17:48:30] back [17:48:41] wikidata is back! [17:48:48] mediawiki.org back [17:48:50] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 66272 bytes in 0.103 second response time [17:48:50] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [17:48:50] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 66273 bytes in 0.128 second response time [17:48:50] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.167 second response time [17:48:50] RECOVERY - HHVM rendering on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.168 second response time [17:48:51] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 66273 bytes in 0.151 second response time [17:48:51] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.171 second response time [17:48:52] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [17:48:52] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 64167 bytes in 0.166 second response time [17:48:53] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.210 second response time [17:48:53] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.148 second response time [17:48:54] RECOVERY - HHVM rendering on mw1214 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.160 second response time [17:48:54] RECOVERY - HHVM rendering on mw1237 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.140 second response time [17:48:55] RECOVERY - HHVM rendering on mw1064 is OK: HTTP OK: HTTP/1.1 200 OK - 64160 bytes in 0.168 second response time [17:49:07] looks like it's coming back online... [17:49:08] What's up with WikidatA? [17:49:10] * greg-g crosses fingers [17:49:15] but can't loading some css [17:49:18] hoo: wikidata + everything down [17:49:19] hoo: eerything broke [17:49:21] but is back [17:49:25] coming back [17:49:42] Ah ok [17:49:43] Central user log in [17:49:43] The provided authentication token is either expired or invalid. [17:49:54] aude: wikidata seems fine [17:49:55] can't login [17:49:55] The provided authentication token is either expired or invalid. [17:49:59] ^ that might be a memcached error [17:49:59] Josve05a: give it a while, things are coming back slowly [17:50:03] Gadgets on enwiki not working I guess [17:50:12] root cause? 
[17:50:13] dewiki looks good [17:50:14] except that the main page is not "Wikidata:Main Page" anymore but k [17:50:18] hoo: network issues [17:50:24] ah ok [17:50:36] I think some css files are missed in https://www.mediawiki.org/wiki/MediaWiki [17:50:50] at least I now know that the icinga alert works :D [17:51:01] there are still 296 "critical" hosts according to icinga, so it'll be a bit [17:51:17] hoo: :) [17:52:16] RECOVERY - HHVM busy threads on mw1042 is OK: OK: Less than 30.00% above the threshold [57.6] [17:52:17] RECOVERY - HHVM busy threads on mw1036 is OK: OK: Less than 30.00% above the threshold [57.6] [17:52:37] RECOVERY - HHVM queue size on mw1117 is OK: OK: Less than 30.00% above the threshold [10.0] [17:52:57] RECOVERY - HHVM busy threads on mw1113 is OK: OK: Less than 30.00% above the threshold [57.6] [17:53:08] (03PS1) 10Ori.livneh: Set $wmgUseMonologLogger to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188837 [17:53:17] RECOVERY - HHVM busy threads on mw1049 is OK: OK: Less than 30.00% above the threshold [57.6] [17:53:17] RECOVERY - HHVM busy threads on mw1109 is OK: OK: Less than 30.00% above the threshold [57.6] [17:53:17] RECOVERY - HHVM busy threads on mw1111 is OK: OK: Less than 30.00% above the threshold [57.6] [17:53:27] PROBLEM - MySQL Processlist on es1003 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 441 statistics [17:53:35] (03CR) 10Chad: [C: 032] Set $wmgUseMonologLogger to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188837 (owner: 10Ori.livneh) [17:53:36] RECOVERY - HHVM busy threads on mw1102 is OK: OK: Less than 30.00% above the threshold [57.6] [17:53:39] <^d> ori: Merged [17:53:41] (03Merged) 10jenkins-bot: Set $wmgUseMonologLogger to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188837 (owner: 10Ori.livneh) [17:53:45] thanks [17:53:51] <_joe_> ok [17:53:56] RECOVERY - HHVM queue size on mw1116 is OK: OK: Less than 30.00% above the threshold [10.0] [17:54:06] RECOVERY - HHVM busy threads on mw1034 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:07] RECOVERY - HHVM busy threads on mw1019 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:07] RECOVERY - HHVM busy threads on mw1028 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:15] <_joe_> so, 10 minutes of outage due to the switch going down, then 20 due to logging. 
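(The change merged above is the same mitigation made explicit one level up: $wmgUseMonologLogger — the name is taken directly from the patch — selects between the new Monolog-based logging and the old legacy logger. Roughly, and with the surrounding InitialiseSettings.php / CommonSettings.php structure assumed rather than copied from the real files:)

```php
<?php
// InitialiseSettings.php style entry (structure assumed): settings are keyed
// by name, with a 'default' applying to all wikis unless overridden.
$wgConf->settings['wmgUseMonologLogger'] = [
	'default' => false, // set to false during the incident (gerrit 188837)
];

// CommonSettings.php style consumer (assumed): choose the logging service
// provider based on the flag above.
if ( $wmgUseMonologLogger ) {
	$wgMWLoggerDefaultSpi = [
		'class' => \MediaWiki\Logger\MonologSpi::class,
		'args'  => [ $wmgMonologConfig ],
	];
} else {
	$wgMWLoggerDefaultSpi = [
		'class' => \MediaWiki\Logger\LegacySpi::class,
	];
}
```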
[17:54:17] RECOVERY - HHVM busy threads on mw1219 is OK: OK: Less than 30.00% above the threshold [76.8] [17:54:27] RECOVERY - HHVM busy threads on mw1097 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:38] !log ori Synchronized wmf-config/InitialiseSettings.php: I4f28205e6: Set $wmgUseMonologLogger to false (duration: 00m 06s) [17:54:45] Logged the message, Master [17:54:46] RECOVERY - HHVM busy threads on mw1035 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:47] RECOVERY - HHVM busy threads on mw1100 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:47] RECOVERY - HHVM queue size on mw1214 is OK: OK: Less than 30.00% above the threshold [10.0] [17:54:56] RECOVERY - HHVM busy threads on mw1039 is OK: OK: Less than 30.00% above the threshold [57.6] [17:54:57] RECOVERY - HHVM busy threads on mw1053 is OK: OK: Less than 30.00% above the threshold [57.6] [17:55:07] RECOVERY - HHVM busy threads on mw1237 is OK: OK: Less than 30.00% above the threshold [76.8] [17:55:07] RECOVERY - HHVM busy threads on mw1141 is OK: OK: Less than 30.00% above the threshold [57.6] [17:55:07] RECOVERY - HHVM queue size on mw1194 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:07] RECOVERY - HHVM busy threads on mw1170 is OK: OK: Less than 30.00% above the threshold [76.8] [17:55:07] RECOVERY - HHVM busy threads on mw1103 is OK: OK: Less than 30.00% above the threshold [57.6] [17:55:08] RECOVERY - HHVM busy threads on mw1151 is OK: OK: Less than 30.00% above the threshold [57.6] [17:55:08] RECOVERY - HHVM busy threads on mw1218 is OK: OK: Less than 30.00% above the threshold [76.8] [17:55:09] RECOVERY - HHVM queue size on mw1249 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:09] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1543 bytes in 0.763 second response time [17:55:10] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 30.00% above the threshold [76.8] [17:55:11] RECOVERY - HHVM queue size on mw1255 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:11] RECOVERY - HHVM queue size on mw1114 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:12] RECOVERY - HHVM queue size on mw1237 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:12] RECOVERY - HHVM busy threads on mw1108 is OK: OK: Less than 30.00% above the threshold [57.6] [17:55:13] RECOVERY - HHVM queue size on mw1258 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:13] RECOVERY - HHVM queue size on mw1254 is OK: OK: Less than 30.00% above the threshold [10.0] [17:55:22] what [17:55:30] oops. [17:55:42] he ran away [17:55:42] probably shouldn't autokick the monitor. 
[17:56:19] well that's a matter of some debate, for another time :) [17:56:19] <_joe_> yeah [17:56:26] :) [17:56:32] RECOVERY - HHVM busy threads on mw1163 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:32] RECOVERY - HHVM busy threads on mw1242 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:33] RECOVERY - HHVM queue size on mw1202 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:33] RECOVERY - HHVM busy threads on mw1217 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:34] RECOVERY - HHVM busy threads on mw1255 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:34] RECOVERY - HHVM queue size on mw1224 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:35] RECOVERY - HHVM busy threads on mw1257 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:35] RECOVERY - HHVM queue size on mw1081 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:37] RECOVERY - HHVM busy threads on mw1221 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:37] RECOVERY - HHVM queue size on mw1227 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:37] RECOVERY - HHVM busy threads on mw1202 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:37] RECOVERY - HHVM busy threads on mw1086 is OK: OK: Less than 30.00% above the threshold [57.6] [17:56:38] RECOVERY - HHVM busy threads on mw1087 is OK: OK: Less than 30.00% above the threshold [57.6] [17:56:38] RECOVERY - HHVM queue size on mw1225 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:47] RECOVERY - HHVM busy threads on mw1095 is OK: OK: Less than 30.00% above the threshold [57.6] [17:56:47] RECOVERY - HHVM busy threads on mw1083 is OK: OK: Less than 30.00% above the threshold [57.6] [17:56:47] RECOVERY - HHVM busy threads on mw1209 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:47] RECOVERY - HHVM queue size on mw1223 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:57] RECOVERY - HHVM busy threads on mw1225 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:57] RECOVERY - HHVM queue size on mw1222 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:57] RECOVERY - HHVM busy threads on mw1210 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:57] RECOVERY - HHVM busy threads on mw1247 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:57] RECOVERY - HHVM busy threads on mw1022 is OK: OK: Less than 30.00% above the threshold [57.6] [17:56:57] RECOVERY - HHVM busy threads on mw1240 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:57] RECOVERY - HHVM queue size on mw1232 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:58] RECOVERY - HHVM queue size on mw1231 is OK: OK: Less than 30.00% above the threshold [10.0] [17:56:58] RECOVERY - HHVM busy threads on mw1212 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:59] RECOVERY - HHVM busy threads on mw1253 is OK: OK: Less than 30.00% above the threshold [76.8] [17:56:59] RECOVERY - HHVM busy threads on mw1250 is OK: OK: Less than 30.00% above the threshold [76.8] [17:57:00] RECOVERY - HHVM queue size on mw1247 is OK: OK: Less than 30.00% above the threshold [10.0] [17:57:07] RECOVERY - HHVM busy threads on mw1258 is OK: OK: Less than 30.00% above the threshold [76.8] [17:57:07] RECOVERY - HHVM busy threads on mw1090 is OK: OK: Less than 30.00% above the threshold [57.6] [17:57:08] RECOVERY - HHVM busy threads on mw1203 is OK: OK: Less than 30.00% above the threshold [76.8] 
[17:57:08] RECOVERY - HHVM busy threads on mw1074 is OK: OK: Less than 30.00% above the threshold [57.6] [17:57:08] RECOVERY - HHVM queue size on mw1229 is OK: OK: Less than 30.00% above the threshold [10.0] [17:57:08] RECOVERY - HHVM busy threads on mw1082 is OK: OK: Less than 30.00% above the threshold [57.6] [17:57:08] RECOVERY - HHVM queue size on mw1226 is OK: OK: Less than 30.00% above the threshold [10.0] [17:57:16] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1018421 (10Nemo_bis) >>! In T76560#1018407, @gpaumier wrote: > kiwix.org is hosted externally, which is good because it'd be accessible if our sites are down, but we'd likely kill the site by causing a deluge of traffic to... [17:57:26] RECOVERY - HHVM busy threads on mw1094 is OK: OK: Less than 30.00% above the threshold [57.6] [17:57:37] RECOVERY - HHVM busy threads on mw1220 is OK: OK: Less than 30.00% above the threshold [76.8] [17:58:07] RECOVERY - HHVM busy threads on mw1244 is OK: OK: Less than 30.00% above the threshold [76.8] [17:58:07] RECOVERY - HHVM queue size on mw1220 is OK: OK: Less than 30.00% above the threshold [10.0] [17:59:18] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:59:27] RECOVERY - HHVM busy threads on mw1241 is OK: OK: Less than 30.00% above the threshold [76.8] [17:59:47] RECOVERY - HHVM busy threads on mw1137 is OK: OK: Less than 30.00% above the threshold [57.6] [17:59:47] RECOVERY - HHVM busy threads on mw1252 is OK: OK: Less than 30.00% above the threshold [76.8] [18:01:44] !log restarting nutcracker on all appservers [18:01:50] Logged the message, Master [18:02:27] RECOVERY - MySQL Processlist on es1003 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [18:04:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:05:16] RECOVERY - NTP on cp1063 is OK: NTP OK: Offset 0.01252090931 secs [18:05:37] RECOVERY - HHVM busy threads on mw1201 is OK: OK: Less than 30.00% above the threshold [76.8] [18:06:11] <_joe_> !log restarting nutcracker on api appservers [18:06:16] Logged the message, Master [18:07:16] RECOVERY - HHVM busy threads on mw1132 is OK: OK: Less than 30.00% above the threshold [57.6] [18:07:47] RECOVERY - HHVM busy threads on mw1190 is OK: OK: Less than 30.00% above the threshold [76.8] [18:08:17] RECOVERY - HHVM busy threads on mw1194 is OK: OK: Less than 30.00% above the threshold [76.8] [18:08:37] RECOVERY - HHVM busy threads on mw1231 is OK: OK: Less than 30.00% above the threshold [76.8] [18:08:48] <_joe_> !log restarting nutcracker on jobrunners [18:08:51] Logged the message, Master [18:10:34] (03CR) 10Kaldari: Adding original language of this work campaign for WikiGrok (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [18:12:24] (03CR) 10Andrew Bogott: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [18:13:49] (03CR) 10Phuedx: Adding original language of this work campaign for WikiGrok (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [18:13:55] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1018449 (10Technical13) >>! In T76560#1018407, @gpaumier wrote: >>>! In T76560#1018401, @VictorGrigas wrote: >> Question: could we add a link to Kiwix.org during any outages? 
> > kiwix.org is hosted externally, which is g... [18:20:20] Still getting this error https://dpaste.de/CVhN :( [18:23:39] In vagrant (for those worried it was about production) [18:23:45] :) [18:24:17] https://www.reddit.com/r/wikipedia/comments/2uw10x/wikipedia_is_down/ [18:25:18] RECOVERY - Kafka Broker Messages In Per Second on tungsten is OK: OK: No anomaly detected [18:25:41] <^d> Bsadowski1: https://www.reddit.com/r/wikipedia/comments/2uw10x/wikipedia_is_down/coc8au6 [18:25:41] <^d> :) [18:26:57] etherpad seems to be down [18:27:04] we often use it for the Metrics meeting [18:27:18] <^d> Up for me. [18:27:39] up for me as well [18:27:49] (03CR) 10Phuedx: [C: 04-1] "English label for P364 is "original language of this work". Boo." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [18:28:02] wow, quite the storm... [18:28:49] <^d> There you are! [18:29:10] <^d> I started moving elastic stuff to hiera in a WIP yesterday :) [18:29:29] <^d> Can you at least tell me if I'm doing it right or if I'm batshit? [18:29:43] ^d: hey! sure. link? [18:29:47] <^d> https://gerrit.wikimedia.org/r/#/c/188702/ [18:35:43] 3ops-codfw, operations: rename and setup base hardware settings for WMF3298 (zinc/sterope) - https://phabricator.wikimedia.org/T88624#1018556 (10Papaul) a:5Papaul>3RobH WMF3298 is not in codfw it is in eqiad . [18:37:06] (03PS1) 10BBlack: re-pool cp1063 backend T84809 [puppet] - 10https://gerrit.wikimedia.org/r/188844 [18:37:28] (03CR) 10BBlack: [C: 032 V: 032] re-pool cp1063 backend T84809 [puppet] - 10https://gerrit.wikimedia.org/r/188844 (owner: 10BBlack) [18:40:00] ^d: right, so the entire ::configuration thing should go [18:40:37] <^d> I figured, but wouldn't that require a full refactor in one go? [18:41:15] ^d: why refactor? [18:41:39] ^d: so basically, you make them *all* parameters of role::elasticsearch::server [18:41:47] oh [18:41:48] wait [18:41:48] <^d> Ah yeah, remove the abstraction [18:41:50] let me see this fully [18:43:01] 3ops-codfw, operations: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018582 (10RobH) a:5Papaul>3RobH [18:43:33] 3ops-codfw, operations: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018143 (10RobH) a:5RobH>3Papaul Ok, the task info has now been updated to fully reflect both the mgmt ip assignments and the new rack locations for m... [18:43:44] something wrong deployed, commons contains a new broken link in the siebar which i can't remove :/ [18:43:55] "Current events" [18:44:03] no change at MediaWiki:Sidebar [18:45:55] 3operations: detail hardware requests policy and procedure on wikitech/officewiki - https://phabricator.wikimedia.org/T87626#1018601 (10RobH) doing one better and creating a getting help from ops document for wikitech. draft: https://wikitech.wikimedia.org/wiki/User:RobH/ops_requests_draft [18:46:40] James_F: Do you have a few minutes? I’m looking at changes you made to wmf-config/CommonSettings-labs.php [18:46:51] Reedy: around? [18:47:00] andrewbogott: Sure? [18:47:11] James_F: A bunch of them haven’t been deployed on wikitech yet. I’m setting up a new wikitech mirror with the latest config... [18:47:14] andrewbogott: Am setting up Metrics meeting, but sort-of around. [18:47:17] andrewbogott: i think there was something deployed to commons which changes the sidebar... *sigh*** [18:47:18] and parsoid stuff is broken. 
[18:47:43] Steinsplitter: the wikitech sidebar? [18:47:56] James_F: so, I could use your help untangling that. But probably after the meeting I guess :) [18:47:56] andrewbogott: parsoid was broken in wikitech too, I think. I remember getting errors when trying to use ve [18:48:13] YuviPanda, yes, I suspect that James_F tried to fix it but accidentally broke it some more instead. [18:48:15] : Wikimedia Commons, ther is a link "Community portal" in the siedebar since a few minutes [18:48:18] and it was not added by hand [18:48:20] eh, ok! [18:48:27] Steinsplitter: you probably want marktraceur [18:48:29] and i can't remove them using MediaWiki:Sidebar [18:48:39] yes O_O, probably [18:48:44] YuviPanda: and of course James_F doesn’t have any actual login on virt1000 he would have no way of knowing [18:48:52] :D [18:48:56] greg-g yeah [18:49:07] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [18:49:13] Reedy: now that the outage is over, we can get back on the train :) [18:49:36] James_F: the good news is that the new wikitech host will be more isolated from labs so I can give y’all logins there. [18:49:45] Yay [18:49:53] Reedy: wanna do it during the metrics meeting (ie: now-ish)? [18:50:16] Actually I guess I could do that /now/ so that you can help debug this. One second… [18:50:16] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:50:19] Steinsplitter, since the outage? There may still be weirdness... [18:50:25] Not branched or anything yet. That's fine for me in a few minutes [18:50:35] Reedy: cool [18:52:04] Steinsplitter: yeah, nothing was deployed today to commons. Yesterday at 20:33 UTC we put Commons (and other non-wikipedias) to wmf15. [18:52:16] andrewbogott: That sounds good. :-) [18:52:19] 3Beta-Cluster, operations: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076#1018624 (10yuvipanda) P262 contains commands that were used during the migration [18:52:28] aude: Reedy is going to do wmf16 soon [18:52:39] greg-g: thanks [18:52:55] : do you have a iddea how to sole this? [18:53:14] why do you put mark's name in brackets? :) [18:53:15] 3operations, ops-eqiad: rebalance memcached in eqiad - https://phabricator.wikimedia.org/T88710#1018631 (10RobH) 3NEW a:3Christopher [18:53:28] 3operations, ops-eqiad: rebalance memcached in eqiad - https://phabricator.wikimedia.org/T88710#1018639 (10RobH) Chris: Can you list off the 10G capable racks in eqiad? [18:54:04] my client. need to fix this when i have moor time :/ [18:54:52] :) [18:55:22] (03PS1) 10Andrew Bogott: Give deployers login on silver. [puppet] - 10https://gerrit.wikimedia.org/r/188849 [18:56:07] greg-g: ok [18:56:29] (03CR) 10Andrew Bogott: [C: 032] Give deployers login on silver. [puppet] - 10https://gerrit.wikimedia.org/r/188849 (owner: 10Andrew Bogott) [18:58:30] Gonna start branching now [18:58:54] ok, serious. The complete navbar is overwritten [18:59:02] the help link, all O_O [18:59:21] Reedy, James_F, you should have logins on silver.wikimedia.org now. That’s the host that will soon become wikitech. [18:59:29] But now I’m going to watch the meeting [18:59:33] Permission denied (publickey). 
[18:59:35] ;) [18:59:42] um… puppet still running [18:59:42] and there is no relevant MW edit in the last week which causes this edit [18:59:44] guess it hasnt propogated :) [18:59:58] aww, /tmp got cleaned on bast1001 [19:00:04] Reedy, greg-g, legoktm: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150205T1900). Please do the needful. [19:00:08] marktraceur: ^ do you have any ideas on that issue from Steinsplitter ? [19:00:19] Reedy: ok, how about now? [19:00:27] 3Tool-Labs: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#1018656 (10yuvipanda) 3NEW a:3coren [19:00:30] Not currently, greg-g [19:00:32] andrewbogott: yup, WFM :) [19:00:34] * legoktm is ready to do the needful [19:00:35] Probably a config issue [19:00:37] cool [19:00:39] andrewbogott: Still denied for me. [19:00:47] marktraceur: :/ anyone else you think we should ask? [19:01:10] James_F: are you in deployers? [19:01:16] greg-g: I can look at it, I suppose [19:01:17] Well, anyway, you have other things to worry about right now :) [19:01:18] andrewbogott: Aha, no. :-) [19:01:25] i think so, something is overwriting MediaWiki:Sidebar ... :/ [19:01:26] Yeah. Let's worry later [19:01:39] andrewbogott: don't tempt me to make James_F a deployer :P [19:01:46] Steinsplitter: on what wiki? commons? [19:01:52] yes [19:01:57] greg-g: I've considered making a request… [19:02:07] 3operations, ops-eqiad: rebalance memcached in eqiad - https://phabricator.wikimedia.org/T88710#1018665 (10Cmjohnson) Robh asked in IRC channel about which servers were 10G. asw-a5 - ex4500 all mc and cp1058-1070, ms-fe1001-1002 asw-c8 - ex4500 - cp 1045-cp1057, ms-fe1003 and ms-fe1004 asw-d6 - ex4550 - empty a... [19:02:29] might be a cache issue.../me checks [19:02:48] * Reedy is branching all the things [19:04:55] Steinsplitter: a bad cache entry was stuck...looks correct in English now [19:05:20] (03PS1) 10Yuvipanda: tools: Fix exec params in nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188850 (https://phabricator.wikimedia.org/T1102) [19:05:22] except it's wrong in other languages...ugh. [19:05:26] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix exec params in nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188850 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [19:05:34] legoktm: thx [19:05:39] (03PS2) 10Yuvipanda: tools: Fix exec params in nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188850 (https://phabricator.wikimedia.org/T1102) [19:06:13] (03CR) 10Yuvipanda: [C: 032] tools: Fix exec params in nodejs starter tool [puppet] - 10https://gerrit.wikimedia.org/r/188850 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [19:06:33] I guess I just have to loop over every language code and delete it from memcache? [19:06:48] legoktm: we could restart all of memcached again? :) [19:07:03] * legoktm slaps YuviPanda [19:07:10] too soon? [19:09:34] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [19:09:56] !log clearing bad sidebar memcache entries on commonswiki [19:10:03] Logged the message, Master [19:11:01] Steinsplitter: all fixed now [19:11:14] thanks legoktm [19:11:39] thanks legoktm :) you are verry helpful. 
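An aside on the fix above: the "loop over every language code and delete it from memcache" that legoktm describes ([19:06:33], logged at [19:09:56]) can be done in one pass from eval.php. A minimal sketch, assuming the sidebar is cached under the wfMemcKey( 'sidebar', <language code> ) key that Skin::buildSidebar() uses when $wgEnableSidebarCache is set -- the log doesn't confirm the exact key, so treat that as an assumption:

```php
// Minimal sketch -- run via: mwscript eval.php --wiki=commonswiki
// Assumes the per-language sidebar cache key written by Skin::buildSidebar().
$cache = wfGetMainCache();
foreach ( Language::fetchLanguageNames() as $code => $name ) {
	$cache->delete( wfMemcKey( 'sidebar', $code ) );
}
```

Language::fetchLanguageNames() returns every language code MediaWiki knows about, which is why looping over it beats hunting down individual bad entries one at a time.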
i buy you a virtual beer :P [19:11:44] :) [19:13:49] 3Tool-Labs: Puppetize adding new hosts to OGE - https://phabricator.wikimedia.org/T88712#1018701 (10yuvipanda) 3NEW a:3coren [19:14:40] 3Tool-Labs: Puppetize adding a host to a particular queue - https://phabricator.wikimedia.org/T88713#1018712 (10yuvipanda) 3NEW a:3coren [19:15:10] (03PS1) 10coren: Tool Labs: hosts templates for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/188856 (https://phabricator.wikimedia.org/T88712) [19:16:29] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1018725 (10Tgr) >>! In T88412#1017533, @fgiunchedi wrote: > what's the best phab project to report #1? I haven't paid much attention to other issues beside #2 to be hone... [19:17:02] Steinsplitter: You said you can't change it by editing the sidebar message? Because https://commons.wikimedia.org/wiki/MediaWiki:Sidebar has portal in it. [19:17:45] ah, but it is now fixed by lego <3 [19:17:49] Ah. [19:17:54] gj legoktm [19:19:32] (03CR) 10Yuvipanda: [C: 04-1] "Better!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [19:25:20] https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf16/Changelog [19:25:24] Big log is big :/ [19:26:48] YuviPanda: https://gerrit.wikimedia.org/r/#/c/188856/ [19:26:51] Reedy: Gosh. Boo to high-risk deployments. [19:27:15] :/ [19:27:35] Coren: nice. what's unrestricted vs vmem? [19:28:13] YuviPanda: unrestriced have no limits and leave the resource management to the node "owner". This is used for dedicated nodes that have odd requirements. [19:28:20] aaah [19:28:21] right [19:28:23] makes sense [19:29:04] catscan for instance. [19:29:29] right [19:29:38] Hm. @lsbdistcodename [19:29:41] (03CR) 10Yuvipanda: [C: 031] Tool Labs: hosts templates for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/188856 (https://phabricator.wikimedia.org/T88712) (owner: 10coren) [19:29:47] I wonder, does that work fine with Jessie? [19:30:07] we don't have jessie hosts yet :) [19:30:10] and I suspect it would work fine... [19:30:53] (03CR) 10coren: [C: 032] "We'll need to keep an eye on this use of @lsbdistcodename to make sure that doesn't break once we have Jessie instances." [puppet] - 10https://gerrit.wikimedia.org/r/188856 (https://phabricator.wikimedia.org/T88712) (owner: 10coren) [19:31:16] Who owns the monolog live hack? [19:31:23] (03PS1) 10coren: Tool Labs: queue templates for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/188858 (https://phabricator.wikimedia.org/T88713) [19:31:32] YuviPanda: ^^ same deal, for the queues. [19:32:16] Fatal error: Uncaught exception 'Exception' with message '/srv/mediawiki-staging/wikiversions.json did not decode to an associative array. [19:32:32] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188859 [19:32:34] (03PS1) 10Reedy: testwiki to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188860 [19:32:41] YuviPanda: Ohwait. Forgot a fix. 
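On the wikiversions.json fatal quoted at [19:32:16]: json_decode() returns null for an empty or malformed file, so that message fires whenever the staging copy isn't a valid JSON object. A hedged sketch of the shape of that check (the real code lives in the multiversion tooling and may differ):

```php
// Illustrative only: the kind of sanity check that produces the error quoted above.
$path = '/srv/mediawiki-staging/wikiversions.json';
$data = json_decode( file_get_contents( $path ), true );
if ( !is_array( $data ) ) {
	throw new Exception( "$path did not decode to an associative array." );
}
```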
[19:32:43] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188859 (owner: 10Reedy) [19:32:46] (03CR) 10Yuvipanda: [C: 04-1] Tool Labs: queue templates for gridengine (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188858 (https://phabricator.wikimedia.org/T88713) (owner: 10coren) [19:32:48] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188859 (owner: 10Reedy) [19:32:54] oh, I failed [19:33:29] (03PS2) 10Reedy: testwiki to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188860 [19:33:34] gj Reedy [19:33:49] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188860 (owner: 10Reedy) [19:33:53] (03Merged) 10jenkins-bot: testwiki to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188860 (owner: 10Reedy) [19:34:09] YuviPanda: Hm. I don't think having the parameter name differ from the actual value name is a good idea. I'd rather have slightly odd parameter names whose mapping is transparent to the actual gridengine tunables. [19:34:28] (03PS1) 10Reedy: Disable wmgUseMonologLogger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188861 [19:34:35] who owns that ^^? [19:35:04] Coren: hmm, in that case we should document them better (where is s_rt coming from?) and also set the default in the class itself, rather than in the template [19:35:16] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [19:36:16] (03Abandoned) 10Reedy: Disable wmgUseMonologLogger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188861 (owner: 10Reedy) [19:37:07] (03PS1) 10Reedy: wikipedias to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188862 [19:37:09] (03PS1) 10Reedy: group0 to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188863 [19:37:33] !log reedy Started scap: testwiki to 1.25wmf16 [19:37:41] Logged the message, Master [19:38:30] YuviPanda: You know, they are common tunables for queue-wide limits but I'm not actually using them for any extant queue atm. Maybe just remove the ability to tweak them for now and put it back in (more) cleanly if we ever end up needing them? [19:38:59] The idea, of course, was to not specify them unless they were specifically overriden in puppet. [19:39:02] Coren: sure. if there's no way to tune them (currently there seems to be no way to set them at all?) we shouldn't be having them in templates. [19:39:10] Reedy: are you also going to deploy GlobalUserPage today? [19:39:13] Coren: right, but overriden how? wikitech variables? [19:39:17] legoktm: yeah, can do :) [19:39:20] YuviPanda: Yeah, that was the intent. [19:39:23] awesome :D [19:39:26] (03PS3) 10Dzahn: remove deprecated legalpad.wm service [puppet] - 10https://gerrit.wikimedia.org/r/187050 [19:39:26] YuviPanda: Also hiera now [19:39:27] oh, crap [19:39:27] did standard-noexim recently get removed ? [19:39:29] Coren: right, we should consider those deprecated. hieraaaa! [19:39:30] l10n update [19:39:32] globaluserpage? [19:39:37] sounds awesome [19:39:44] Coren: you can't use hiera without having them be params of a class (or explicitly using hiera()). [19:39:47] YuviPanda: Lemme just axe them for now [19:39:47] it doesn't inject into global space [19:40:00] plus this uses @varname, which doesn't check global ones, IIRC [19:40:05] mutante: yup, I killed it a few weeks ago [19:40:09] mutante: why? 
[19:40:12] legoktm: It's not branched in all versions is it? [19:40:19] mutante: standard has a param to make it standard-noexim now [19:40:32] Reedy: no, you just created the 16 branch. [19:40:40] YuviPanda: because it makes me have to do rebases of older gerrit stuff [19:40:47] mutante: heh :) [19:40:50] YuviPanda: it seems it was already replaced with just regular "admin" [19:40:55] hmm? [19:40:56] arg. i mean "standard" [19:40:57] (03CR) 10Reedy: [C: 04-1] "extension-list needs setting to version specific" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [19:40:59] yeah [19:41:00] (03PS2) 10coren: Tool Labs: queue templates for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/188858 (https://phabricator.wikimedia.org/T88713) [19:41:01] !log reedy scap aborted: testwiki to 1.25wmf16 (duration: 03m 27s) [19:41:05] Logged the message, Master [19:41:08] mutante: hiera was used to turn it off for some roles [19:41:30] Reedy: er, how do we do that? [19:41:32] mutante: Ibfa6e218735b5fa233ea50b2c1e2f641d712f9ca [19:41:42] (or 2f5cde0ed76bd61229aefd12221a87ff40a1d7c8) [19:41:56] extension-list-1.25wmf16 [19:42:18] YuviPanda: oh, ok! [19:42:26] (03PS1) 10Reedy: Add extension-list-1.25wmf16 for GlobalUserPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188865 [19:42:33] (03CR) 10Yuvipanda: [C: 031] "we should probably abstract them out a little more (lots of repetition!) but a good start!" [puppet] - 10https://gerrit.wikimedia.org/r/188858 (https://phabricator.wikimedia.org/T88713) (owner: 10coren) [19:42:34] !log reedy Started scap: testwiki to 1.25wmf16 [19:42:51] legoktm: that'll ensure it's in localisation cache for 1.25wmf16 [19:42:56] and not error out everywhere ;) [19:42:58] so now if 2 nodes have the same roles, they could still be different [19:43:25] 3ops-codfw, operations: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1018885 (10Papaul) Rebalance complete and racktable updated. [19:43:29] YuviPanda: The repetition is a bit annoying; but that's because stupid gridengine isn't willing to accept 'not specified' for stuff you don't care to use. :-( [19:43:45] Coren: right, but we can use one template and just change the bits we want to [19:43:55] (03CR) 10coren: [C: 032] "With this, the queue configuration should build properly." [puppet] - 10https://gerrit.wikimedia.org/r/188858 (https://phabricator.wikimedia.org/T88713) (owner: 10coren) [19:43:59] Coren: but we can do that later, yeah :) [19:44:26] Coren: oooh, so qconf -Aq for tools-webgrid-generic (both queue and host) would work now? [19:44:56] YuviPanda: The problem with that is that we /might/ want to create a new queue type that specifies one of them, and templating them an extra level would make them more complex. We can adjust the balance as use cases pop up. [19:45:04] yeah [19:45:14] repetition vs lots of ifs [19:45:40] Coren: can you try adding the generic queue and node now? [19:45:56] I'm forcing a pupper run to regenerate the config now. [19:45:59] puppet* [19:46:00] Coren: cool [19:46:50] Coren: final step is to have puppet call qconf itself, I guess. [19:46:56] should be done carefully [19:47:23] YuviPanda: I don't want to do that yet. For a while, at least, we should do the qconf ourselves and make sure nothing breaks, and keep an eye on the generated configs. 
[19:47:34] Coren: yeah, but we should set a date and just do it then [19:47:52] (03CR) 10Dzahn: [C: 032] "confirmed deprecated" [puppet] - 10https://gerrit.wikimedia.org/r/187050 (owner: 10Dzahn) [19:48:46] YuviPanda: I'm not so much worried about "when" as I am about "test most cases". I'll start doing comparisons between the generated config and the expected config. [19:48:58] cool :) [19:49:31] Oh, bah. [19:50:26] (03PS1) 10coren: Tool Labs: fixes to the queue templates [puppet] - 10https://gerrit.wikimedia.org/r/188867 [19:50:30] YuviPanda: ^^ [19:52:05] (03PS2) 10Chad: WIP: Begin converting Elasticsearch configuration to use hiera [puppet] - 10https://gerrit.wikimedia.org/r/188702 [19:52:06] hah, missed that [19:52:10] <^d> YuviPanda: Right direction now ^? [19:52:22] (03CR) 10Yuvipanda: [C: 031] Tool Labs: fixes to the queue templates [puppet] - 10https://gerrit.wikimedia.org/r/188867 (owner: 10coren) [19:52:32] (03CR) 10coren: [C: 032] Tool Labs: fixes to the queue templates [puppet] - 10https://gerrit.wikimedia.org/r/188867 (owner: 10coren) [19:53:05] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:53:25] * Coren lulz at the race condition. [19:53:36] "There are 0 unmerged changes" [19:54:26] it probably complains that staging was bypassed :) [19:54:43] (03CR) 10Yuvipanda: [C: 04-1] "You can put beta config in hieradata/labs/deployment-prep/common.yaml, and 'common' (to both labs / prod) config in the default for the el" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188702 (owner: 10Chad) [19:54:44] ^d: not quite :) [19:55:01] YuviPanda: FYI, the host config worked fine. [19:55:09] Coren: whee [19:55:25] <^d> What about non-deployment-prep labs instances? [19:55:30] <^d> This class supports that case too [19:56:18] ^d: they can override their hiera files when they want... [19:56:37] ^d: and if you want some things to be common for *everything* in labs... [19:56:41] there's a way for that too [19:56:43] let me look [19:57:04] ^d: right, you can put things in hieradata/labs.yaml [19:57:29] ^d: but idk, is any project in labs actually using ES atm? [19:57:41] <^d> We've used it time to time for testing things. [19:57:43] <^d> Why wouldn't I want prod stuff in hiera as well? [19:57:50] (03PS1) 10coren: Tool Labs: more tweaks to the queue templates [puppet] - 10https://gerrit.wikimedia.org/r/188872 (https://phabricator.wikimedia.org/T88713) [19:57:56] YuviPanda: Moar tweaks ^^ [19:57:56] ^d: well, basically it's like this: [19:58:05] 1. common to prod and labs -> default param [19:58:10] 2. just for prod -> prod hiera [19:58:17] 3. just for beta -> beta hiera [19:58:24] <^d> Mmmk, got it [19:58:57] Coren: hmm, is processors UNDEFINED needed? [19:59:14] <^d> So prod hiera isn't common, but role/common? [19:59:18] YuviPanda: Yes. Annoyingly. 
error: required attribute "processors" is missing [19:59:24] lol gridengine [19:59:36] * ^d goes to grab lunch, will figure out after [19:59:42] ^d: http://wikitech.wikimedia.org/wiki/Hiera [20:00:01] ^d: 'common' in that context pertains to only 'common across data centers' and doesn't refer to 'common across labs and prod' [20:00:06] 3Wikimedia-General-or-Unknown, operations: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1018985 (10Krinkle) 3NEW [20:00:12] ^d: nothing outside of the labs/ hierarchy is actually loaded in labs atm [20:00:15] (is confusing, I know) [20:00:26] (03CR) 10Yuvipanda: [C: 031] "lolOGE" [puppet] - 10https://gerrit.wikimedia.org/r/188872 (https://phabricator.wikimedia.org/T88713) (owner: 10coren) [20:00:49] (03CR) 10coren: [C: 032] "Yeay gridengine." [puppet] - 10https://gerrit.wikimedia.org/r/188872 (https://phabricator.wikimedia.org/T88713) (owner: 10coren) [20:02:12] 3Incident-20150205-SiteOutage, Wikimedia-Logstash, operations: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1019000 (10chasemp) 3NEW [20:03:27] 3Incident-20150205-SiteOutage, operations: Nutcracker needs to automatically recover from MC failure - https://phabricator.wikimedia.org/T88730#1019008 (10chasemp) [20:03:55] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail [20:04:35] <_joe_> mount.nfs: access denied by server while mounting GRRRR [20:04:48] 3Tool-Labs: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#1019015 (10scfc) In general I would like this to move from the filesystem-based stuff to hiera to keep it simple. The ping-pong where one instance writes something to the filesystem and then another is very cool :-), b... [20:04:53] 3Tool-Labs: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733#1019016 (10yuvipanda) 3NEW a:3coren [20:05:00] YuviPanda: qconf -Aq /data/project/.system/gridengine/etc/queues/webgrid-generic [20:05:00] root@tools-login.eqiad.wmflabs added "webgrid-generic" to cluster queue list [20:05:12] Coren: wheee [20:05:41] ^ 4002 is just generic puppet http stuff [20:05:55] YuviPanda: One thing I have *no* idea how to do, or even if it's possible: gridengine-exec needs to be restarted on nodes /after/ they have been added/ [20:06:04] Coren: heh, was just going to point that out. [20:07:10] 3Incident-20150205-SiteOutage, Wikimedia-Logstash, operations: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1019025 (10bd808) Monolog logging is currently disabled in the prod cluster via . We should not re-... [20:07:15] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:07:16] Coren: I restarted [20:07:45] RECOVERY - Disk space on dataset1001 is OK: DISK OK [20:07:58] YuviPanda: I see. Yeay. We now have a node and queue whose config was created by puppet. Ima do some diffs to see how closely the generated configs match the real ones. 
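Since Coren plans to diff the puppet-generated configs against what gridengine actually has, here is a throwaway sketch of that comparison for the queue added above. The generated-config path and queue name come from the log; everything else is assumption, and plain diff(1) against qconf -sq output does the same job:

```php
// Hedged sketch: compare the puppet-generated queue definition with the live one.
$queue = 'webgrid-generic';
$generated = file( "/data/project/.system/gridengine/etc/queues/$queue", FILE_IGNORE_NEW_LINES );
$live = explode( "\n", trim( shell_exec( 'qconf -sq ' . escapeshellarg( $queue ) ) ) );
foreach ( array_diff( $live, $generated ) as $line ) {
	echo "only in live config: $line\n";
}
foreach ( array_diff( $generated, $live ) as $line ) {
	echo "only in generated config: $line\n";
}
```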
[20:08:04] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:08:44] <_joe_> !log re-exported nfs exports on dataset1001, remounted /mnt/data on snapshot1001 [20:08:47] Coren: \o/ cool [20:08:50] (03PS1) 10Yuvipanda: tools: Install npm and nodesjs-legacy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/188875 (https://phabricator.wikimedia.org/T1102) [20:08:52] Logged the message, Master [20:08:53] Coren: ^ +1? [20:09:15] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:09:22] 3ops-codfw, operations: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1019034 (10Papaul) complete mc2001 port ge-2/0/0 mc2002 port ge-2/0/1 mc2003 port ge-2/0/2 mgmt setup, BIOS configuration and test complete. [20:10:14] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:10:19] YuviPanda: What's nodejs-legacy do? [20:10:31] (03CR) 10Andrew Bogott: [C: 04-1] "I think this is a partial reconstruction of an earlier patch... it removes three classes but only adds one" [puppet] - 10https://gerrit.wikimedia.org/r/188612 (owner: 10John F. Lewis) [20:10:33] Coren: adds a 'node' executable. [20:10:35] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:10:35] instead of 'nodejs' [20:10:39] well [20:10:41] a symlink [20:11:14] (03CR) 10John F. Lewis: "True! Let me look at that :)" [puppet] - 10https://gerrit.wikimedia.org/r/188612 (owner: 10John F. Lewis) [20:11:24] (03CR) 10Andrew Bogott: [C: 032] base: move syslogs/remote-syslogs to manifests [puppet] - 10https://gerrit.wikimedia.org/r/188611 (owner: 10John F. Lewis) [20:11:47] (03CR) 10coren: [C: 031] "Sane." [puppet] - 10https://gerrit.wikimedia.org/r/188875 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [20:12:36] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1019048 (10RobH) So we still don't have @LuisV_WMF's approval for this task. Which means this is not really legal approved yet? [20:12:48] Coren: \o/ [20:12:50] am off to sleep now [20:12:55] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 84 data above and 9 below the confidence bounds [20:12:56] (03PS2) 10Yuvipanda: tools: Install npm and nodesjs-legacy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/188875 (https://phabricator.wikimedia.org/T1102) [20:13:05] Coren: do keep notes in the phab tasks [20:13:22] * YuviPanda goes off [20:13:25] night [20:13:41] (03CR) 10Yuvipanda: [C: 032] tools: Install npm and nodesjs-legacy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/188875 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [20:16:54] 3Tool-Labs: Puppetize adding a host to a particular queue - https://phabricator.wikimedia.org/T88713#1019053 (10coren) At this time, it looks like configuration for nodes and queues is generated correctly (in /data/project/.system/gridengine/etc), but it is not applied automatically - the qconf statements are st... 
[20:16:57] 3Wikimedia-General-or-Unknown, operations: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1019055 (10Aklapper) @Krinkle: Sufficiently covered by T86655 already? [20:17:39] (03PS2) 10John F. Lewis: base: move instance-upstarts to manifest [puppet] - 10https://gerrit.wikimedia.org/r/188612 [20:17:43] 3Tool-Labs: Puppetize adding new node to OGE - https://phabricator.wikimedia.org/T88712#1019057 (10coren) [20:18:10] 3Incident-20150205-SiteOutage, Wikimedia-Logstash, operations: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1019059 (10bd808) p:5Triage>3High [20:19:44] (03PS1) 10Chad: Remove role::elasticsearch::config abstraction [puppet] - 10https://gerrit.wikimedia.org/r/188877 [20:19:50] 3Tool-Labs: Puppetize adding new node to OGE - https://phabricator.wikimedia.org/T88712#1019063 (10coren) New node config is generated properly but not applied automatically (can be done with a qconf -Ae ) Once we are confident about turning on qconf from puppet, this should be automatic. There remai... [20:20:25] (03CR) 10Chad: "Will make things like I3998769c easier" [puppet] - 10https://gerrit.wikimedia.org/r/188877 (owner: 10Chad) [20:21:11] !log reedy Finished scap: testwiki to 1.25wmf16 (duration: 38m 37s) [20:21:18] Logged the message, Master [20:22:53] slow scap was slow :( [20:23:41] (03CR) 10Reedy: [C: 032] Add extension-list-1.25wmf16 for GlobalUserPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188865 (owner: 10Reedy) [20:23:44] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:25:17] (03PS2) 10Reedy: Enable GlobalUserPage on test* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [20:25:48] c'mon jenkins [20:26:40] (03Merged) 10jenkins-bot: Add extension-list-1.25wmf16 for GlobalUserPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188865 (owner: 10Reedy) [20:26:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [20:26:55] (03PS3) 10Reedy: Enable GlobalUserPage on test* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [20:27:13] (03PS1) 10coren: Tool Labs: nodejs-legacy is for Trusty only [puppet] - 10https://gerrit.wikimedia.org/r/188878 [20:27:48] (03CR) 10Dzahn: [C: 032] "removing legalpad service (it's an app in regular phab instead)" [dns] - 10https://gerrit.wikimedia.org/r/187054 (owner: 10Dzahn) [20:27:59] (03CR) 10coren: [C: 032] "Trivial package fix." 
[puppet] - 10https://gerrit.wikimedia.org/r/188878 (owner: 10coren) [20:28:04] (03PS4) 10Reedy: Enable GlobalUserPage on test* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [20:28:29] (03PS2) 10Reedy: wikipedias to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188862 [20:29:02] (03CR) 10Reedy: [C: 032] wikipedias to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188862 (owner: 10Reedy) [20:29:07] (03Merged) 10jenkins-bot: wikipedias to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188862 (owner: 10Reedy) [20:29:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf15 [20:29:31] Logged the message, Master [20:29:48] (03PS2) 10Reedy: group0 to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188863 [20:30:01] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188863 (owner: 10Reedy) [20:30:06] (03Merged) 10jenkins-bot: group0 to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188863 (owner: 10Reedy) [20:30:31] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf16 [20:30:35] Logged the message, Master [20:30:45] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:21] (03PS5) 10Reedy: Enable GlobalUserPage on test* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [20:31:26] (03CR) 10Reedy: [C: 032] Enable GlobalUserPage on test* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [20:31:31] (03Merged) 10jenkins-bot: Enable GlobalUserPage on test* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187888 (https://phabricator.wikimedia.org/T72576) (owner: 10Legoktm) [20:31:36] (03PS3) 10Chad: Move beta elasticsearch config into hiera [puppet] - 10https://gerrit.wikimedia.org/r/188702 [20:32:10] (03PS3) 10Reedy: Limit runJobs output to warning and higher severity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188722 (owner: 10BryanDavis) [20:32:17] (03CR) 10Reedy: [C: 032] Limit runJobs output to warning and higher severity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188722 (owner: 10BryanDavis) [20:32:22] (03Merged) 10jenkins-bot: Limit runJobs output to warning and higher severity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188722 (owner: 10BryanDavis) [20:32:44] PROBLEM - Varnish HTTP upload-backend on cp1064 is CRITICAL: Connection refused [20:33:26] (03CR) 10Yuvipanda: [C: 031] Remove role::elasticsearch::config abstraction [puppet] - 10https://gerrit.wikimedia.org/r/188877 (owner: 10Chad) [20:33:51] So, Reedy, interested in helping me sort out what’s going on with the wikitech config? Or should I wait for James_F since he wrote the latest patches? 
[20:34:41] I can probably hack around things myself but it seems like it would be more polite to try to honor the spirit of those changes :) [20:34:46] !log reedy Synchronized php-1.25wmf16/includes/EditPage.php: Id376f9e75c43c5bd0fa910b04d066e6aa37c73d1 (duration: 00m 07s) [20:34:51] Logged the message, Master [20:35:24] !log reedy Synchronized wmf-config/: GlobalUserPage and I33a855cecfbe25003fe9e4f5e2fab2f928c79da4 (duration: 00m 08s) [20:35:28] Logged the message, Master [20:35:35] legoktm: ^^ [20:35:43] woot [20:35:57] 3Scrum-of-Scrums, operations, RESTBase, hardware-requests: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1019092 (10Cmjohnson) These have arrived on-site. What are the requirements for racking? Do these need to be spread across rows and/or racks? [20:38:14] RECOVERY - Varnish HTTP upload-backend on cp1064 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.009 second response time [20:39:05] (03PS1) 10Chad: Move jq package to module, all elasticsearch machines should have it [puppet] - 10https://gerrit.wikimedia.org/r/188881 [20:39:10] Reedy: hmm, I don't see it on https://test.wikipedia.org/wiki/Special:Version ... [20:40:31] legoktm@terbium:~$ mwscript eval.php --wiki=testwiki [20:40:31] > var_dump(class_exists('GlobalUserPage')); [20:40:31] bool(false) [20:41:12] damn it [20:41:18] i didn't git pull on tin, i did it locally [20:41:30] !log reedy Synchronized wmf-config/: GlobalUserPage and I33a855cecfbe25003fe9e4f5e2fab2f928c79da4 (duration: 00m 05s) [20:41:34] heh [20:41:34] Logged the message, Master [20:41:35] PROBLEM - Varnish HTTP upload-backend on cp1064 is CRITICAL: Connection refused [20:41:49] https://test2.wikipedia.org/wiki/User:Legoktm yaaay [20:42:10] !log reedy Synchronized wmf-config/: GlobalUserPage and I33a855cecfbe25003fe9e4f5e2fab2f928c79da4 (duration: 00m 07s) [20:42:15] Logged the message, Master [20:42:17] grrr [20:42:44] !log mw1092 giving file has vanished: "/wmf-config/.InitialiseSettings.php.KSg3AF" (in common) [20:42:50] Logged the message, Master [20:43:58] <_joe_> Reedy: if you have any issues deploying to mw1018, lemme know [20:44:12] it's not complained [20:45:10] <_joe_> ok [20:46:12] thanks Reedy :) [20:48:34] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:51:04] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 99 data above and 0 below the confidence bounds [20:53:45] RECOVERY - Varnish HTTP upload-backend on cp1064 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.026 second response time [20:55:41] (03CR) 10Kaldari: [C: 04-1] "Need to think about label length" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [20:57:12] !log radon - reinstalling, scheduled downtime [20:57:18] Logged the message, Master [20:57:20] (03PS1) 10BBlack: re-enable cp1064 backend for testing [puppet] - 10https://gerrit.wikimedia.org/r/188882 [20:58:14] (03CR) 10BBlack: [C: 032 V: 032] re-enable cp1064 backend for testing [puppet] - 10https://gerrit.wikimedia.org/r/188882 (owner: 10BBlack) [20:59:17] !log cp1064 upload b ackend re-enabled in cache.pp; if upload-related 503s ensue later today and I'm not around, feel free to re-disable it [20:59:20] Logged the message, Master [20:59:57] !log reedy Synchronized php-1.25wmf16: (no message) (duration: 00m 52s) [21:00:01] Logged the message, Master [21:00:49] andrewbogott: I should be good to help you in a few. 
Sorry, deploy was taking longer than expected :) [21:01:02] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1019139 (10Qgil) WMF-Legal has not approved this process formally yet, but I believe this is only because Luis has been so busy (he had this task assigned with High priority... [21:01:05] Reedy: no problem — I didn’t realize you were mid-deploy [21:02:28] 3hardware-requests, ops-codfw, operations, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1019148 (10mark) If we can wait for virt (2 months) then I think we should. Let's not do this work twice. [21:02:31] 3Scrum-of-Scrums, operations, RESTBase, hardware-requests: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1019149 (10GWicke) These need to be spread across rows and (ideally) racks. Our replica placement is by row, with the goal of having one copy of each bit of data in a separate row. [21:03:14] !log reedy Synchronized php-1.25wmf16/extensions/CheckUser/: (no message) (duration: 00m 07s) [21:03:22] Logged the message, Master [21:03:39] (03CR) 10Andrew Bogott: [C: 031] "I don't understand, and yet I approve" [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [21:04:16] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1019161 (10Andrew) Thank you for waiting on a proper .deb package :) I've approved those patches but will wait until Antoine is back @ work before merging. [21:04:35] andrewbogott: ok, what's up? :) [21:04:43] (03CR) 10Dzahn: ""the last error we had in the repository?" ??? i doubt that" [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [21:04:56] Reedy: so… log in to silver, and have a look at the apache error log [21:05:12] haha, alright [21:05:16] I take those errors to mean that we’re part way but not all the way to supporting VE on that box, and hence, breakage [21:05:34] That box is meant to be a dupe of virt1000 (aka wikitech), it is the future home of the deployer-enabled wikitech [21:05:56] (03PS1) 10RobH: setting mc2001-2018 mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/188884 [21:06:39] Reedy: I have my local machine’s /etc/hosts hacked so that wikitech.wikimedia.org points to silver’s IP — it’s probably easier if you just let me hit ‘reload’ when you want to see the latest errors. [21:07:33] (03CR) 10Dzahn: [C: 031] "oh, -lenient DOES pass, that's great. i guess i was just fixing remainin warnings that are in -strict" [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [21:08:26] !log reedy Purged l10n cache for 1.25wmf13 [21:08:30] Logged the message, Master [21:08:41] Reedy: see what I mean? [21:08:49] Nope, not looked yet, sorry :P [21:09:01] oops, didn’t mean to nag [21:09:13] "ignore a lot of unwanted errors and warnings we are not interested in [21:09:20] ^ well, that's why [21:09:58] andrewbogott: it's alright, i've got a friend over and just a bit busy etc. 
:) [21:10:42] 3hardware-requests, ops-codfw, operations, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1019165 (10Legoktm) [21:13:05] (03CR) 10Dzahn: "could you elaborate a bit on the "ignore a lot of" [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [21:14:19] (03CR) 10Dzahn: "re: " --no-puppet_url_without_modules-check" . that will fail on role manifests but is actually good under /modules/" [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [21:25:18] andrewbogott: Can you loosen the permissions on /var/log/apache2/? [21:25:23] I can't view without root/sudo [21:25:24] yep [21:25:43] better? [21:27:04] (03PS1) 10Mjbmr: Add alias for previous project namespace (fawikibooks)\nBug: T60655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188887 [21:27:05] 3ops-eqiad, operations: please wipe disks of radon - https://phabricator.wikimedia.org/T88740#1019191 (10Dzahn) 3NEW a:3Cmjohnson [21:27:32] (03PS2) 10Mjbmr: Add alias for previous project namespace (fawikibooks) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188887 (https://phabricator.wikimedia.org/T60655) [21:27:59] andrewbogott: nope... [21:28:01] ls: cannot open directory /var/log/apache2/: Permission denied [21:28:14] Reedy: now? [21:28:17] Probably it was the dir [21:28:30] 3ops-eqiad, operations: please wipe disks of radon - https://phabricator.wikimedia.org/T88740#1019203 (10Dzahn) https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=2195 [21:28:44] andrewbogott: might need to do the files too [21:28:45] -????????? ? ? ? ? ? error.log [21:28:55] It’s a+r [21:29:19] reedy@silver:~$ tail -f /var/log/apache2/error.log [21:29:19] tail: cannot open ‘/var/log/apache2/error.log’ for reading: Permission denied [21:29:21] hm, looking, I can’t read it either :) [21:29:25] hahaha [21:30:03] 3ops-eqiad, operations: please wipe disks of radon - https://phabricator.wikimedia.org/T88740#1019191 (10Dzahn) [21:30:07] 3Phabricator, operations: Delete LikeLifer username - https://phabricator.wikimedia.org/T87092#1019210 (10Qgil) @chasemp or anyone in #operations, can you act on this, please? [21:30:53] sudo :p [21:31:04] PHP Fatal error: Class 'Memcached' not found [21:31:07] Reedy: ^ [21:31:33] (03CR) 10RobH: [C: 032] setting mc2001-2018 mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/188884 (owner: 10RobH) [21:31:35] <_joe_> mutante: where? [21:31:37] /srv/mediawiki/php-1.25wmf15/includes/objectcache/MemcachedPeclBagOStuff.php on line 61 [21:31:40] on silver [21:31:45] <_joe_> oh, ok [21:31:51] <_joe_> sorry, I panicked [21:32:16] heh, just replying to that above :p didn't mean cluster, yea [21:35:19] (03PS1) 10RobH: misallocated zinc to sterope, undoing [dns] - 10https://gerrit.wikimedia.org/r/188889 [21:36:55] (03CR) 10RobH: [C: 032] misallocated zinc to sterope, undoing [dns] - 10https://gerrit.wikimedia.org/r/188889 (owner: 10RobH) [21:38:40] 3Phabricator, operations: Delete LikeLifer username - https://phabricator.wikimedia.org/T87092#1019229 (10Dzahn) a:3Dzahn [21:38:46] Reedy: pm'd [21:39:19] 3WMF-Design, operations: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1019233 (10Nirzar) ``` I see the current proposal HTML page is 28 KB ``` I am looking into reducing this. It should be around 12kb @nemo_bis ``` b) I don't see it. ``` It's under view details. if someone expresses inte... 
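On the Class 'Memcached' fatal at [21:31:04]: MemcachedPeclBagOStuff needs the pecl memcached extension, which silver didn't have yet; the php5-memcached patch that follows below is the fix. Nothing clever is needed to confirm it, just the obvious check from eval.php or php -a:

```php
// The fatal means the pecl 'memcached' extension isn't loaded for this PHP.
var_dump( extension_loaded( 'memcached' ) ); // false on silver before the package was installed
var_dump( class_exists( 'Memcached' ) );     // the class MemcachedPeclBagOStuff tries to instantiate
```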
[21:39:50] 3hardware-requests, ops-codfw, operations, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1019243 (10RobH) [21:39:51] 3ops-codfw, operations: rename and setup base hardware settings for WMF3298 (zinc/sterope) - https://phabricator.wikimedia.org/T88624#1019241 (10RobH) 5Open>3declined yep, wrong system, so i rolled back my changes, and im declining this ticket [21:39:55] 3Phabricator, operations: Delete LikeLifer username - https://phabricator.wikimedia.org/T87092#983768 (10Dzahn) thanks to ^d for the how to: ``` ./remove destroy @LikeLifer IMPORTANT: OBJECTS WILL BE PERMANENTLY DESTROYED! There is no way to undo this operation or ever retrieve this data. These 1 object(s... [21:40:26] 3hardware-requests, ops-codfw, operations, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1019245 (10RobH) 5Open>3declined Server allocation declined (stalled for when misc virt systems are in place), so I'm declining this hardware request. [21:40:38] 3ops-eqiad, operations: please wipe disks of radon - https://phabricator.wikimedia.org/T88740#1019247 (10Cmjohnson) in process 2 500GB disk will not finish wiping until tomorrow [21:41:07] 3Phabricator, operations: Delete LikeLifer username - https://phabricator.wikimedia.org/T87092#1019248 (10Dzahn) 5Open>3Resolved [21:44:24] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [21:46:27] (03PS1) 10Andrew Bogott: Include php5-memcached on Trusty openstack-manager hosts [puppet] - 10https://gerrit.wikimedia.org/r/188893 [21:47:06] (03CR) 10Andrew Bogott: [C: 032] Include php5-memcached on Trusty openstack-manager hosts [puppet] - 10https://gerrit.wikimedia.org/r/188893 (owner: 10Andrew Bogott) [21:48:43] (03CR) 10Dzahn: "just needs a manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/188389 (https://phabricator.wikimedia.org/T88427) (owner: 10RobH) [21:49:49] (03Abandoned) 10Dzahn: mediawiki: replace hardcoded eqiad with $site [puppet] - 10https://gerrit.wikimedia.org/r/188275 (https://phabricator.wikimedia.org/T86894) (owner: 10Dzahn) [21:53:25] (03PS1) 10Dzahn: mediawiki: add codfw monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/188895 [21:54:23] (03PS1) 10Andrew Bogott: Add some more php5 packages to Trusty wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/188896 [21:55:03] (03CR) 10Andrew Bogott: [C: 032] Add some more php5 packages to Trusty wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/188896 (owner: 10Andrew Bogott) [21:56:14] (03PS3) 10Dzahn: ci/jenkins: add public key for VE sync to puppet [puppet] - 10https://gerrit.wikimedia.org/r/188708 (https://phabricator.wikimedia.org/T84731) [21:59:41] 3operations: Fwd: [AffCom] "Brazil UG is no more" - https://phabricator.wikimedia.org/T88748#1019342 (10emailbot) [22:03:51] Reedy: what would you say is the right Project for installing an extension on a wiki? [22:04:09] Operations and mwcore seem both wrong. i would have thought platform , but that is archived [22:04:44] Wikimedia-Extension-setup and Release-Engineer? [22:04:49] <^d> Yeah that [22:04:58] 'k, thanks [22:05:03] * greg-g nods [22:05:25] Engineering [22:05:54] <^d> mutante: I'm assuming you mean T88748? 
[22:06:08] edit conflict:p [22:06:14] i meant to add the same ones you just added [22:06:19] but effectively removed you :p [22:06:24] because i had the tab already open [22:06:32] yes, that one [22:08:36] <^d> I removed NDA and such, it's not actually a private ops ticket. [22:09:11] no worries i wasn't sure having legal's email even if obfuscated [22:09:18] better safe than sorry I figured [22:09:41] ah, so he used the original RT address [22:09:42] got it [22:10:34] "...and you are done!" [22:10:36] if only [22:11:37] (03PS1) 10Andrew Bogott: Clarify difference between nova::manager and nova::controller roles [puppet] - 10https://gerrit.wikimedia.org/r/188924 [22:13:22] (03CR) 10Andrew Bogott: [C: 032] Clarify difference between nova::manager and nova::controller roles [puppet] - 10https://gerrit.wikimedia.org/r/188924 (owner: 10Andrew Bogott) [22:21:05] (03CR) 10Dzahn: [C: 032] "just puppetizing existing setup - not supposed to change anything" [puppet] - 10https://gerrit.wikimedia.org/r/188708 (https://phabricator.wikimedia.org/T84731) (owner: 10Dzahn) [22:23:28] (03CR) 10Dzahn: "watched on gallium. the only change was that the file mode went from 0600 to 0400" [puppet] - 10https://gerrit.wikimedia.org/r/188708 (https://phabricator.wikimedia.org/T84731) (owner: 10Dzahn) [22:25:26] 3operations: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1019483 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/188895/ [22:25:32] greg-g: not the best week for the wikis, was it ? :/ Wish you and the wikis all the best :) [22:30:39] 3operations: Gather email addresses of Phab users who have filed VisualEditor tasks, to inform them about upcoming triages - https://phabricator.wikimedia.org/T88741#1019514 (10Dzahn) a:3Dzahn [22:31:48] matanya: thanks much sir [22:36:12] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:25] <_joe_> mutante: which distro you used for redis hosts? [22:38:56] <_joe_> because memcached are simple enough that I'm shooting @jessie [22:41:15] _joe_: it was still Ubuntu, but happy to reinstall [22:41:29] <_joe_> trusty or precise? [22:41:31] did bblack change the default the other day? [22:41:43] <_joe_> if it's trusty and was trusty, it's ok [22:41:54] <_joe_> but we should not install precises tbh [22:42:04] <_joe_> it would be a sure waste of time [22:42:10] hmm.. it's precise [22:42:14] but just because it was default [22:42:22] <_joe_> no default is trusty [22:42:43] ehmm, but i didnt make a change in netboot [22:43:01] (03PS1) 10Mjbmr: Add autopatrolled user group for dawikiquote T88591 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188928 [22:43:09] looks [22:44:03] <_joe_> trusty [22:44:07] <_joe_> it's trusty [22:44:32] (03PS2) 10Mjbmr: Add autopatrolled user group for dawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188928 (https://phabricator.wikimedia.org/T88591) [22:44:37] rbf? [22:44:44] deb http://ubuntu.wikimedia.org/ubuntu/ precise main restricted [22:44:45] <_joe_> yes [22:44:58] <_joe_> rbf2001 is trusty [22:45:03] <_joe_> rbf1001 is precise [22:45:11] <_joe_> so yes, let's go with jessie [22:45:11] arg, i'm on the wrong box, yea, sorry [22:45:18] wrong DC :p [22:45:23] ok [22:45:35] <_joe_> we sould also look at changes in redis between the version we have and the one on jessie [22:45:46] <_joe_> I think it's backwards compatible [22:45:52] <_joe_> but we'd better check [22:46:04] so reinstall anytime? 
[22:46:29] yea [22:49:50] (03PS1) 10Reedy: Enable Parsoid on labswiki (for silver migration) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188931 [22:50:00] (03PS1) 10Dzahn: use jessie for redis hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188932 [22:51:51] (03CR) 10Andrew Bogott: [C: 031] Enable Parsoid on labswiki (for silver migration) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188931 (owner: 10Reedy) [22:51:58] (03CR) 10Reedy: [C: 032] Enable Parsoid on labswiki (for silver migration) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188931 (owner: 10Reedy) [22:52:05] (03Merged) 10jenkins-bot: Enable Parsoid on labswiki (for silver migration) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188931 (owner: 10Reedy) [22:53:09] !log reedy Synchronized wmf-config/InitialiseSettings.php: Enable Parsoid on wikitech (duration: 00m 05s) [22:53:16] Logged the message, Master [22:54:25] (03PS1) 10Ori.livneh: vbench: compute & print measures of central tendency for total runtime [puppet] - 10https://gerrit.wikimedia.org/r/188933 [22:54:47] (03PS2) 10Ori.livneh: vbench: compute & print measures of central tendency for total runtime [puppet] - 10https://gerrit.wikimedia.org/r/188933 [22:55:24] (03PS3) 10Ori.livneh: vbench: compute & print measures of central tendency for total runtime [puppet] - 10https://gerrit.wikimedia.org/r/188933 [22:55:37] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: compute & print measures of central tendency for total runtime [puppet] - 10https://gerrit.wikimedia.org/r/188933 (owner: 10Ori.livneh) [22:58:43] <_joe_> mutante: sorry for that, but the risk is the same and we move towards jessie/debian anyway [22:59:03] <_joe_> check if the redis puppet module has some damn upstart script [23:00:00] <_joe_> In that case I'll subject the author to a 10-hour loop of lennart's presentation on the future of systemd at fosdem. It's slideless. [23:00:16] <_joe_> (It was a nice talk btw, because it was slideless) [23:00:19] Trying for a third time... [23:00:20] Hello, ops. I am interested in uploading 14 million files (~1.5 TB) to Wikimedia Commons. Who do I need to clear this with so that I don't accidentally kill Commons? [23:01:23] harej: lol. [23:01:28] <_joe_> harej: hi! I'm not the expert in commons capacity, but I can relay the message [23:01:40] Open a phab ticket, and make sure godog is CC'd [23:01:49] godog? is that like updog? [23:02:03] whats updog? [23:02:10] not much dog what's up with you [23:02:33] <_joe_> harej: godog is one of the ops, and is the one that can probably give part of the answer to your question [23:02:44] okay. and i should ask via phabricator? [23:02:52] <_joe_> harej: do you have an user on phabricator? [23:02:56] I do! [23:03:18] <_joe_> OK! then the best way to get a good answer is open a ticket in project operations [23:03:27] Alright, thanks! [23:03:39] <_joe_> we're pretty spread across timezones, and it's pretty late for most of us [23:03:47] <_joe_> (me included in fact :)) [23:04:38] Fair enough. I'll proceed with Phabricator [23:04:50] <_joe_> thanks! [23:05:04] (03PS1) 10Andrew Bogott: 'Require all granted' for images on 2.4. [puppet] - 10https://gerrit.wikimedia.org/r/188935 [23:06:35] (03CR) 10Andrew Bogott: [C: 032] 'Require all granted' for images on 2.4. 
[23:07:59] (03PS1) 10Ori.livneh: Migrate role::performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188937
[23:08:10] (03PS2) 10Ori.livneh: Migrate role::performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188937
[23:08:17] (03CR) 10Ori.livneh: "godog, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/188937 (owner: 10Ori.livneh)
[23:08:29] (03CR) 10Ori.livneh: [C: 032 V: 032] Migrate role::performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188937 (owner: 10Ori.livneh)
[23:10:58] springle: ping me when you're around?
[23:14:34] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1019654 (10GWicke)
[23:15:26] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#21142 (10GWicke)
[23:26:34] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1019686 (10Harej) 3NEW
[23:31:27] Reedy: "Would a 14 million entry XML file be too big?" :)
[23:31:56] yes, even a 1 byte XML file would be too big
[23:34:24] what about one nibble? will that be fine?
[23:35:04] can i represent a single bit, as an instruction to be fed into your processor? perhaps to shift a register?
[23:35:57] you're still using XML? we're coming for ya!
[23:36:05] Whatever GLAM Wiki Toolset requires
[23:36:11] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1019727 (10Dzahn) For special cases like this it might also be an option to send a hard disk.
[23:37:20] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1019730 (10bd808) The two easiest paths forward for this are to switch from Monolog\Handler\RedisHandler to either: * Monolog\Handler\SyslogUdp...
[23:37:21] mutante: Considering harej is in DC, he could almost just meet up with cmjohnson1 to pass it on
[23:37:33] for real
[23:37:42] exceeept the files aren't totally local? they're at internet archive
[23:37:44] ah :) nice
[23:37:49] lol
[23:37:57] why do we need 1.5TB of files from IA? :P
[23:38:08] eh, so we could just download from archive directly?
[23:38:20] GWToolset supports that, roughly, I think
[23:39:12] mutante: that's the plan at the moment
[23:45:15] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1019742 (10Harej) I happen to live in DC, so it wouldn't take that much effort for me to hand off a hard disk to someone in Ashburn. Except I don't h...
[23:45:37] MaxSem: it does my heart good to smell XML disdain
[23:45:38] hi it's me
[23:45:41] but wikibugs isn't pinging me?
[23:46:00] harej: so instead of the actual files you could also just provide the links to all those files?
[23:46:06] on archive.org ?
[23:46:09] Yes. And meta-data!
[23:46:15] gotcha, ok
[23:46:33] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:46:45] And if you want something other than XML, just let me know. The metadata file is going to be produced specifically for this purpose.
[23:47:42] i'm not sure yet what makes the most sense, just wanted to understand the options. let's paste that to the ticket
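(A side note on the "14 million entry XML file" concern above: whatever schema GWToolset ends up wanting, the metadata does not have to arrive as one monolithic file. A rough Python sketch of splitting a huge metadata file into batches while streaming it; the <record> element name, batch size, and file names are placeholder assumptions, not the actual GWToolset format.)

    # Rough sketch: stream a very large XML metadata file and write it back out
    # in fixed-size batches, so no single upload or review has millions of entries.
    # The element name, batch size, and file names are illustrative assumptions.
    import xml.etree.ElementTree as ET

    SOURCE = "metadata.xml"   # hypothetical multi-million-record input
    BATCH_SIZE = 10_000       # records per output file; arbitrary choice

    def flush(batch, file_no):
        with open(f"metadata-{file_no:04d}.xml", "w", encoding="utf-8") as out:
            out.write("<records>\n" + "".join(batch) + "</records>\n")

    batch, file_no = [], 0
    for _event, elem in ET.iterparse(SOURCE, events=("end",)):
        if elem.tag != "record":          # assumed per-item element name
            continue
        batch.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()                      # keep memory flat while streaming
        if len(batch) == BATCH_SIZE:
            file_no += 1
            flush(batch, file_no)
            batch = []
    if batch:
        flush(batch, file_no + 1)

Each output file then stays at a size that one GWToolset run, or a plain sanity check in a text editor, can actually cope with.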
[23:48:45] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1019743 (10Reedy) Yes, a 14 million entry xml file would be too big. I'm not sure how big files have actually been uploaded using the GWT, maybe @dan...
[23:52:00] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1019746 (10Dzahn) So it turns out all those files are on archive.org. So it's an option to just provide all the download links to those files instead...
[23:52:04] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1019747 (10chasemp) For the love of consistency I would vote rsyslog. It's very standard and fairly easily debugable and we have to live with...
[23:53:17] !log Updated Wikimania Scholarships to 0852585 (re-enable language selection) + local hack in trebuchet repo to remove incomplete translations
[23:53:24] Logged the message, Master
[23:53:45] 3operations, ops-eqiad: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1019756 (10Cmjohnson) The system board has been replaced and error cleared. There is a temp idrac license on it now. The permanent license needs to be added and dhcpd file fixed with new MAC address.
[23:56:06] 3operations, Phabricator: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#1019762 (10RobH) Yes, but it should ONLY relay into the ops-datacenter site projects, not #operations itself. I realize thats what we talked about, but just calling it out intentionally. @chasemp so are all the is...
[23:58:01] 3operations, Phabricator: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#1019763 (10RobH) Just to clarify, since reviewing this task doesn't quite make it clear WHY @emailbot needs this. Example: System X has a failed mainboard, so we send in a support request to our Vendor. Since it i...
[23:58:11] 3operations, Phabricator: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#1019764 (10RobH) a:5RobH>3chasemp
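(On the T88732 "decouple logging" thread quoted above: the attraction of the SyslogUdp/rsyslog route is that UDP logging is fire-and-forget, so a dead or slow log collector cannot stall the application the way a synchronous Redis handler can. Here is the same idea sketched with Python's standard library rather than Monolog/PHP; the relay address and facility are placeholders, not the production setup.)

    # Sketch of fire-and-forget logging over UDP syslog: each record becomes one
    # datagram, so nothing blocks (or takes the app down) if the collector is gone.
    import logging
    import logging.handlers

    handler = logging.handlers.SysLogHandler(
        address=("syslog-relay.example.org", 514),  # hypothetical rsyslog/logstash relay
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    handler.setFormatter(logging.Formatter("app: %(levelname)s %(message)s"))

    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("sent as a single UDP datagram; silently dropped if the relay is down")

Losing a few log lines when the collector misbehaves is the trade-off the task is weighing against having the logging pipeline drag the application down with it.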