[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161123T0000). Please do the needful.
[00:00:04] tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:16] * MaxSem can deploy
[00:00:46] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:01:43] (CR) MaxSem: [C: 2] Fix EmailAuth beta cluster enabling hack [mediawiki-config] - https://gerrit.wikimedia.org/r/323078 (https://phabricator.wikimedia.org/T151015) (owner: Gergő Tisza)
[00:02:01] (CR) jenkins-bot: [V: -1] Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:02:17] (Merged) jenkins-bot: Fix EmailAuth beta cluster enabling hack [mediawiki-config] - https://gerrit.wikimedia.org/r/323078 (https://phabricator.wikimedia.org/T151015) (owner: Gergő Tisza)
[00:03:51] (PS3) Paladox: Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:03:58] (CR) Paladox: "Fixed https://integration.wikimedia.org/ci/job/operations-puppet-rake-jessie/109/console" [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:04:00] !log maxsem@tin Synchronized wmf-config/CommonSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/323078/ (duration: 00m 49s)
[00:04:01] (CR) Paladox: [C: 1] Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:04:06] tgr, ^
[00:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:47] thx MaxSem - I need to wait 10m for the beta sync, right?
[00:04:53] paladox: you are too fast, heh
[00:05:08] LOL
[00:05:10] something like that, yeah
[00:05:11] yep
[00:06:41] (CR) jenkins-bot: [V: -1] Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:07:46] heh
[00:09:06] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:11:37] (PS4) Dzahn: Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:11:59] (PS5) Paladox: Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:12:03] mutante ^^ woops
[00:12:07] did it at the same time
[00:12:25] ^ that takes skill therefore i applad
[00:12:35] applaud*
[00:12:37] (PS6) Paladox: Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:15:13] ^ it was vandalised
[00:15:25] your welcome for fixing it :P
[00:16:24] Zppix what was it prevously?
[00:16:45] It had The topic for -operations is: then blah
[00:16:53] but it was added by an uncloaked/unknown user
[00:16:57] edited rather
[00:17:17] oh
[00:17:21] Zppix: ooh, thanks for that
[00:17:28] mutante anytime <3
[00:17:49] hey if you cannot help with servers might as well help topic vandalism reverts :P
[00:18:27] Dzahn
[00:18:27] Uploaded patch set 4.
[00:18:27] 16:11
[00:18:27] Paladox
[00:18:27] Patch Set 5: Published edit on patch set 4.
[00:18:30] 16:11
[00:18:37] arg, that was supposed to be just 2 lines
[00:18:46] but what a timing , yes
[00:18:50] Yeh lol
[00:18:55] i didnt mean to do that
[00:19:08] (PS2) Eevans: enable instance restbase2011-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086)
[00:19:18] just was the timing, if i was a few secs slower i could have not press the big red button
[00:19:19] (CR) Dzahn: [C: 2] Phab: Remove pointless variable assignment [puppet] - https://gerrit.wikimedia.org/r/323081 (owner: Chad)
[00:19:22] (CR) Eevans: [C: 1] enable instance restbase2011-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:19:28] and there we have jenking agreeing now
[00:19:37] mutante, please stop spamming xD
[00:19:42] (CR) Eevans: "Good to go." [puppet] - https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:19:44] :)
[00:20:02] (PS3) Eevans: enable instance restbase2011-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086)
[00:21:06] (CR) Dzahn: [C: 2] enable instance restbase2011-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322958 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:21:10] (CR) Paladox: "@Dzahn we could merge this, but we will need a future patch that will do the user and password for the db." [puppet] - https://gerrit.wikimedia.org/r/323082 (owner: 20after4)
[00:21:14] mutante: thank you sir
[00:21:27] (PS4) Andrew Bogott: openstack: split nova.pp into one class per file (autoload layout) [puppet] - https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:25:12] zppix: ok urandom: no problem (saved one line to make up for spam)
[00:25:23] heh
[00:25:37] * Zppix slaps mutante with a big red brick
[00:26:15] urandom: and ..now it's active on the master
[00:26:57] Zppix: the blue brick please
[00:27:05] that was a default irc cmd
[00:27:09] client^
[00:27:28] heh, default should be "trout" forever
[00:27:36] mutante theres that one as well
[00:27:44] mutante: yup
[00:28:20] (CR) Andrew Bogott: [C: 2] openstack: split nova.pp into one class per file (autoload layout) [puppet] - https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:28:24] (PS5) Andrew Bogott: openstack: split nova.pp into one class per file (autoload layout) [puppet] - https://gerrit.wikimedia.org/r/322939 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:28:46] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:30:27] andrewbogott: thanks! :)
[00:30:50] I'm still chasing the tip :(
[00:31:36] yea, i saw, it's a bit slow
[00:31:44] now :)
[00:32:24] i hit submit
[00:37:47] (PS2) Dzahn: phabricator: Move mysql hostnames to hiera [puppet] - https://gerrit.wikimedia.org/r/323082 (owner: 20after4)
[00:38:09] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/4654/iridium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/323082 (owner: 20after4)
[00:43:59] (CR) Dzahn: [C: -2] "blocked on https://phabricator.wikimedia.org/T144667" [puppet] - https://gerrit.wikimedia.org/r/322907 (owner: Dzahn)
[00:45:27] (CR) 20after4: "thanks dzahn!" [puppet] - https://gerrit.wikimedia.org/r/323082 (owner: 20after4)
[00:47:04] (PS1) Andrew Bogott: Openstack: move wikistatus settings into hiera [puppet] - https://gerrit.wikimedia.org/r/323095
[00:47:54] (CR) jenkins-bot: [V: -1] Openstack: move wikistatus settings into hiera [puppet] - https://gerrit.wikimedia.org/r/323095 (owner: Andrew Bogott)
[00:48:14] (PS1) Andrew Bogott: wikitechstatusconfig: Add some dummy entries [labs/private] - https://gerrit.wikimedia.org/r/323096
[00:49:30] (PS2) Andrew Bogott: Openstack: move wikistatus settings into hiera [puppet] - https://gerrit.wikimedia.org/r/323095
[00:50:00] (CR) Andrew Bogott: [C: 2 V: 2] wikitechstatusconfig: Add some dummy entries [labs/private] - https://gerrit.wikimedia.org/r/323096 (owner: Andrew Bogott)
[00:52:48] (CR) Dzahn: [C: 1] "i think it's ok now and require_package works with and without [ ] around multiple packages" [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: Dzahn)
[00:54:06] RECOVERY - DPKG on mwdebug1001 is OK: All packages OK
[00:54:20] ^ that was me, fixed
[00:55:45] Operations, Ops-Access-Requests: Shell access to californium for bd808 - https://phabricator.wikimedia.org/T151424#2816784 (bd808)
[00:58:31] (CR) Andrew Bogott: [C: 2] Openstack: move wikistatus settings into hiera [puppet] - https://gerrit.wikimedia.org/r/323095 (owner: Andrew Bogott)
[01:01:06] andrewbogott: you are probably aware, but on silver/californium it seems the root cause is that $openstack_version is not set
[01:02:01] the path is built like modules/openstack/${openstack_version}/nova/ and becomes modules/openstack//nova/wikistatus
[01:02:10] PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused
[01:02:19] mutante: yeah, makes sense, I'll look
[01:03:38] is tools-lab being worked on atm?
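The broken path reported at 01:02:01 is what string interpolation produces when the version variable is empty. A minimal Python sketch of that effect follows; the helper names are hypothetical and this is not the Puppet code from the patch, only an illustration of why an unset variable yields `//` rather than an error.

```python
def nova_module_path(openstack_version: str, component: str) -> str:
    """Hypothetical helper: interpolate the version into the module path."""
    return f"modules/openstack/{openstack_version}/nova/{component}"


def checked_nova_module_path(openstack_version: str, component: str) -> str:
    """Same path, but fail loudly on an unset version instead of emitting '//'."""
    if not openstack_version:
        raise ValueError("openstack_version is not set")
    return nova_module_path(openstack_version, component)


if __name__ == "__main__":
    # With a version set, the path is well-formed.
    print(nova_module_path("liberty", "wikistatus"))
    # With an empty version, interpolation silently collapses to a double slash.
    print(nova_module_path("", "wikistatus"))
```

The checked variant shows one possible design choice: rejecting an empty value at the point of use, so the failure names the missing variable rather than a missing file.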
[01:06:44] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused eevans Bootstrapping.
[01:07:50] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:09:01] Zppix: could you be more specific?
[01:09:11] but strange, because i see it being used in lots of places, and looks the same
[01:09:31] mutante: yeah, that's just what I was going to say :(
[01:09:49] all i see is whitespace .. hmm
[01:10:07] in theory it's set here: common/openstack.yaml:openstack::version: 'liberty'
[01:11:07] ostriches, around?
[01:11:15] and it's like that in all the other classes under ./nova/ ..
[01:12:06] fyi we're doing a high-priority unscheduled VE deployment soon
[01:12:20] it'll just be some JS changes
[01:12:31] mutante: and it works on other hosts. That has to be a clue...
[01:13:00] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:13:25] !log Updated striker to c546f4c (T151409)
[01:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:36] T151409: ToolsAdmin: Error 500 for ZppixBot repo listing - https://phabricator.wikimedia.org/T151409
[01:14:00] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:14:32] Krenair: I might be.
[01:14:53] ostriches, okay, just wanted to let you know since Greg has been idle for a while
[01:15:10] We have a bug that involves edit notices not showing in VE
[01:15:23] * ostriches approves or something
[01:15:24] Have fun
[01:15:24] meaning you don't get e.g. warned that you are exposing IP upon editing
[01:16:02] andrewbogott: silver and californium.. and they use different roles .. how is it on labtestweb2001
[01:16:38] mutante: puppet is disabled there; I'm developing still.
[01:17:07] ok
[01:18:52] (PS1) Andrew Bogott: Remove some redundant includes of openstack::nova::hooks [puppet] - https://gerrit.wikimedia.org/r/323101
[01:19:11] mutante: I don't know if that will help anything, but it's correct and should simplify
[01:19:21] the version gets set in 3 places for californium
[01:20:07] regex.yaml, common/openstack.yaml and hosts/californium.yaml
[01:20:12] (CR) Andrew Bogott: [C: 2] Remove some redundant includes of openstack::nova::hooks [puppet] - https://gerrit.wikimedia.org/r/323101 (owner: Andrew Bogott)
[01:20:20] yet it does not get any of them? uhmm
[01:24:45] (PS1) Andrew Bogott: Explicit hiera lookup for openstack_version [puppet] - https://gerrit.wikimedia.org/r/323102
[01:24:47] mutante: ^ ?
[01:25:20] andrewbogott: californium has a hosts file that sets the openstack_version a second time, but it it still doesnt get it
[01:25:36] andrewbogott: what's an example of a host where it works?
[01:25:46] i think now it's about the role names
[01:25:52] I don't really understand what the relationship is between $::global_variable and something set in hiera that isn't looked up elsewhere...
[01:25:59] it works on e.g. labvirt1001
[01:26:03] or labcontrol1001
[01:26:03] silver and californium dont have roles that start with "openstack"
[01:26:18] but the value is set in common/openstack.yaml
[01:26:25] (CR) Andrew Bogott: [C: 2] Explicit hiera lookup for openstack_version [puppet] - https://gerrit.wikimedia.org/r/323102 (owner: Andrew Bogott)
[01:26:55] hmm, it would have fit the labvirt example but not labcontrol
[01:27:45] mutante: ok, that would do it
[01:28:05] Although I don't know why this worked before :(
[01:28:16] My patch fixed it in the first place but it just hit the same issue afterwards
[01:28:27] andrewbogott: i would have said let's see what happens when we set the openstack::version for just hosts/silver.yaml...
[01:28:41] but then i saw that californium has just that...
[01:28:56] (PS1) Papaul: CODFW: Add prod DNS for prometheus200[3-4] Bug: T151338 [dns] - https://gerrit.wikimedia.org/r/323104 (https://phabricator.wikimedia.org/T151338)
[01:28:58] andrewbogott: oh, weird
[01:29:57] (PS1) Andrew Bogott: Set openstack::version explicitly, everywhere [puppet] - https://gerrit.wikimedia.org/r/323105
[01:29:58] mutante: how about ^ ?
[01:30:58] andrewbogott: or even just common.yaml ?
[01:31:05] sure, ok
[01:31:52] (PS2) Andrew Bogott: Set openstack::version explicitly, everywhere [puppet] - https://gerrit.wikimedia.org/r/323105
[01:33:00] compiles that
[01:33:17] (CR) Andrew Bogott: [C: 2] Set openstack::version explicitly, everywhere [puppet] - https://gerrit.wikimedia.org/r/323105 (owner: Andrew Bogott)
[01:34:17] hmm. andrew, that fails in compiler
[01:34:17] mutante: nope, still fails
[01:34:19] on some
[01:34:25] but different error
[01:34:25] same as before I think?
[01:34:34] http://puppet-compiler.wmflabs.org/4656/
[01:34:34] oh, you're right
[01:34:41] this clearly doesn't work AT ALL how I thought
[01:34:55] Could not find data item openstack::version in any Hiera data file
[01:34:59] before it was more like
[01:35:05] "i found stuff but set it to ''"
[01:35:07] (PS1) Andrew Bogott: Revert "Set openstack::version explicitly, everywhere" [puppet] - https://gerrit.wikimedia.org/r/323106
[01:36:09] (CR) Andrew Bogott: [C: 2] Revert "Set openstack::version explicitly, everywhere" [puppet] - https://gerrit.wikimedia.org/r/323106 (owner: Andrew Bogott)
[01:36:50] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[01:37:10] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:37:30] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:37:47] James_F, oh, jenkins merged the backports
[01:37:51] can deploy now
[01:37:58] Go for it.
[01:38:00] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:38:10] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[01:38:29] (PS1) Andrew Bogott: One more explicit lookup for openstack_version [puppet] - https://gerrit.wikimedia.org/r/323107
[01:39:19] (CR) jenkins-bot: [V: -1] One more explicit lookup for openstack_version [puppet] - https://gerrit.wikimedia.org/r/323107 (owner: Andrew Bogott)
[01:40:04] (PS2) Andrew Bogott: One more explicit lookup for openstack_version [puppet] - https://gerrit.wikimedia.org/r/323107
[01:40:40] (PS1) Papaul: CODFW: ADD DHCP entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323108 (https://phabricator.wikimedia.org/T151338)
[01:41:01] James_F, hm, these new mwdebug hosts seem to scap pull quite slowly
[01:41:22] (CR) Andrew Bogott: [C: 2] One more explicit lookup for openstack_version [puppet] - https://gerrit.wikimedia.org/r/323107 (owner: Andrew Bogott)
[01:41:32] 53 seconds. It feels like that's longer than usual
[01:42:00] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[01:42:00] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:42:15] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2816847 (Papaul)
[01:42:20] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:43:04] James_F, seems to work for me on mwdebug1001. okay to push out to users?
[01:43:24] Krenair: Do it.
[01:44:22] !log krenair@tin Synchronized php-1.29.0-wmf.3/extensions/VisualEditor/modules/ve-mw: https://gerrit.wikimedia.org/r/323080 and https://gerrit.wikimedia.org/r/323103 (duration: 00m 49s)
[01:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:45] James_F, ^
[01:45:10] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[01:45:28] andrewbogott: ah :)
[01:45:36] mutante: that didn't actually fix it
[01:45:39] it's still wrong in places
[01:45:49] hmm. ok. ...
[01:45:54] My code really counts on being able to read that as a global :(
[01:46:10] Krenair: Yup, looks good in production.
[01:46:49] there was an error syncing
[01:47:02] mw2092.codfw.wmnet returned [70]: 01:44:12 Copying to mw2092.codfw.wmnet from mw2080.codfw.wmnet
[01:47:11] rsync: failed to set times on "/srv/mediawiki/php-1.29.0-wmf.3": Read-only file system (30)
[01:47:19] andrewbogott: i noticed that no other variable in common.yaml has "::" they are all just one word or separated by underscore.. this might not mean anything though
[01:47:21] Operations, ops-codfw, DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2816874 (Papaul) p:Triage>High
[01:47:30] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:47:53] !log mw2092 seems broken
[01:47:55] mutante: any idea what change actually provoked this problem?
[01:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:14] Debian GNU/Linux 8 auto-installed on Tue Aug 30 07:29:09 UTC 2016.
[01:48:14] -bash: /etc/bash_completion: Input/output error
[01:48:14] -bash: __git_ps1: command not found
[01:48:14] Connection to mw2092.codfw.wmnet closed.
[01:48:34] do you think it's 0eec66392a554612ee03ba3b01f9f34ce7f27340 ?
[01:49:26] andrewbogott: no, i dont think that because the alerts in icinga were older
[01:49:34] unfortunately since they recovered now ...
[01:49:38] i dont see the time anymore
[01:49:44] looks in history
[01:52:03] Krenair: mw2092 r/o? Hrm....
[01:52:06] (also cannot SSH, weird)
[01:52:16] Well, it connects then drops immediately
[01:52:18] yeah
[01:53:00] I got an IO error when my .bashrc tried to enable bash completion
[01:53:37] needs ops to investigate
[01:53:42] andrewbogott:
[01:53:44] Krenair: File a Task?
[01:53:52] Krenair: We should depool I suppose.
[01:54:04] from scap?
[01:54:21] Oh, I guess it's codfw, doesn't matter.
[01:54:29] I was thinking from lvs.
[01:54:39] andrewbogott: in icinga.log i can see that it was broken with the same error at 1479841109
[01:54:54] Time (UTC)Tue Nov 22 18:58:29 2016 UTC
[01:55:20] We might urgently have to fail-over to codfw, though, so we should try to keep it deployable.
[01:55:38] that's before you merged the split
[01:56:17] James_F: Well I don't have the ability to depool right now. Needs some scap changes :)
[01:56:23] andrewbogott: but something today a couple hours ago.. it seems
[01:56:24] Either way we'll need a root
[01:56:30] * James_F nods.
[01:56:40] Just being picky. :-)
[01:56:48] hm, apparently git timestamps record when the patch was written and not when it was merged?
[01:57:07] andrewbogott: Yes. The merge commit will have the timestamp of when it was merged.
[01:57:30] do we have merge commit on ops/puppet?
[01:57:48] you merged at 16.28 PST
[01:57:59] i'm trying to connect to that mw2092 now
[01:59:09] looks like disk fail
[01:59:16] mw2092 login: root
[01:59:16] [1263230.537241] blk_update_request: I/O error, dev sda, sector 663148816
[01:59:42] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2816885 (Papaul) @fgiunchedi for the partman do you want for us to use raid1-gpt.cfg?
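The timestamp arithmetic in this exchange (the icinga.log epoch at 01:54:39 and the "16.28 PST" merge time at 01:57:48) can be checked with a short Python sketch. It assumes the log timestamps are UTC and models PST as a fixed UTC-8 offset; the function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST modelled as a fixed UTC-8 offset (no DST handling).
PST = timezone(timedelta(hours=-8), "PST")


def epoch_to_utc(epoch: int) -> str:
    """Render a Unix epoch the way the pasted Icinga line shows it."""
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%a %b %d %H:%M:%S %Y UTC")


def utc_to_pst_clock(moment: datetime) -> str:
    """Convert a timezone-aware UTC datetime to an HH:MM wall-clock time in PST."""
    return moment.astimezone(PST).strftime("%H:%M")


if __name__ == "__main__":
    # Epoch quoted from icinga.log in the chat.
    print(epoch_to_utc(1479841109))
    # The +2 on the nova.pp split appears in the log at 00:28:20 (assumed UTC).
    merged = datetime(2016, 11, 23, 0, 28, 20, tzinfo=timezone.utc)
    print(utc_to_pst_clock(merged))
```

Under those assumptions the first alert (18:58:29 UTC on Nov 22) precedes the merge by several hours, which matches the conclusion reached at 01:55:38.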
[02:01:07] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2814695 (Dzahn) He uploaded a new partman recipe for prometheus https://gerrit.wikimedia.org/r/#/c/323056/
[02:03:09] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2816895 (Papaul) @Dzahn Thanks
[02:03:36] mutante papaul yeah I'll just merge that recipe
[02:03:38] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2092.codfw.wmnet
[02:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:49] untested, might or might not work
[02:03:52] ostriches: ^ depooled
[02:04:00] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:04:03] (PS2) Filippo Giunchedi: install_server: add prometheus partman [puppet] - https://gerrit.wikimedia.org/r/323056 (https://phabricator.wikimedia.org/T151338)
[02:04:42] !log depooled mw2092 because it had I/O errors, dev sda
[02:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:00] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:06:08] (CR) Filippo Giunchedi: [C: 2] install_server: add prometheus partman [puppet] - https://gerrit.wikimedia.org/r/323056 (https://phabricator.wikimedia.org/T151338) (owner: Filippo Giunchedi)
[02:06:30] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[02:06:32] (CR) Dzahn: [C: 2] CODFW: Add prod DNS for prometheus200[3-4] Bug: T151338 [dns] - https://gerrit.wikimedia.org/r/323104 (https://phabricator.wikimedia.org/T151338) (owner: Papaul)
[02:07:50] (PS2) Dzahn: CODFW: ADD DHCP entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323108 (https://phabricator.wikimedia.org/T151338) (owner: Papaul)
[02:08:17] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2816898 (fgiunchedi) @papaul yeah `prometheus.cfg` is merged now, let's try that! Might not work on first try though
[02:10:13] (PS1) Gergő Tisza: Use 'exception' channel in logstash, kill 'exception-json' [mediawiki-config] - https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849)
[02:10:20] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[02:11:24] Operations, ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2816914 (Dzahn)
[02:12:02] Operations, ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2816903 (Krenair) When I tried to log in earlier: `-bash: /etc/bash_completion: Input/output error`
[02:12:35] Operations, ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2816918 (Dzahn)
[02:14:01] (CR) Dzahn: [C: 2] CODFW: ADD DHCP entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323108 (https://phabricator.wikimedia.org/T151338) (owner: Papaul)
[02:15:30] that's a lot of shouting :P
[02:15:55] (PS1) Andrew Bogott: Include ::openstack in the common role [puppet] - https://gerrit.wikimedia.org/r/323113
[02:16:04] (PS3) Filippo Giunchedi: Remove beta::config [puppet] - https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: Alex Monk)
[02:21:00] andrewbogott: the timestamp looks like https://gerrit.wikimedia.org/r/#/c/322943/ might be responsible
[02:21:14] only from the time, not even the code i looked at
[02:21:26] but it's like 7.5 hours ago
[02:21:29] (PS1) Papaul: CODFW: Add partman entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323114 (https://phabricator.wikimedia.org/T151338)
[02:21:55] mutante: yeah, I think it's the removal of 'require openstack' that broke things
[02:22:21] jeeeeeeenkiiiins
[02:22:36] andrewbogott: oooh! ok
[02:22:56] godog: in my mind it's always "leeeeroy" :)
[02:23:10] lol
[02:23:20] (PS2) Andrew Bogott: Require ::openstack before the common role [puppet] - https://gerrit.wikimedia.org/r/323113
[02:23:50] hehehe mutante true, looking forward to a patch to play a sound in gerrit while waiting
[02:23:56] gerrit_soundboard
[02:25:59] srsly it has been 10 min now
[02:26:05] i know, your favorite is sad_trombone
[02:26:28] you know.. and we just merged a change the other day that makes running the lint checks much faster..
[02:27:42] godog: it just finished
[02:28:48] *nod* yeah the actual checks were fast, I guess queuing
[02:28:53] (CR) Filippo Giunchedi: [C: 2] Remove beta::config [puppet] - https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: Alex Monk)
[02:32:37] (PS3) Andrew Bogott: Require ::openstack before the common role [puppet] - https://gerrit.wikimedia.org/r/323113
[02:33:01] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[02:33:40] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 14m 13s)
[02:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:39] (PS2) Dzahn: CODFW: Add partman entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323114 (https://phabricator.wikimedia.org/T151338) (owner: Papaul)
[02:38:48] (CR) Dzahn: [C: 2] CODFW: Add partman entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323114 (https://phabricator.wikimedia.org/T151338) (owner: Papaul)
[02:41:45] (CR) Dzahn: [V: 2] CODFW: Add partman entries for prometheus200[3-4] Bug: T151338 [puppet] - https://gerrit.wikimedia.org/r/323114 (https://phabricator.wikimedia.org/T151338) (owner: Papaul)
[02:42:10] only did that because it wasnt touching a manifest
[02:49:43] (PS6) Krinkle: static.php should use deployed branch for invalid hashes [mediawiki-config] - https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: Brion VIBBER)
[02:53:14] (PS4) Andrew Bogott: Require ::openstack before the common role [puppet] - https://gerrit.wikimedia.org/r/323113
[03:05:47] (CR) Andrew Bogott: [C: 2] Require ::openstack before the common role [puppet] - https://gerrit.wikimedia.org/r/323113 (owner: Andrew Bogott)
[03:06:44] (CR) Krinkle: [C: 1] sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: Merlijn van Deen)
[03:08:05] can single wikis ask to have magic links turned off or it will be done only globally on all wmf wikis at once?
[03:10:45] Operations, OCG-General, Reading-Community-Engagement, Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2816962 (Tgr) This approach seems backwards. Shouldn't we have a community consultation first, before mak...
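The diagnosis at 02:21:55 and the "Require ::openstack before the common role" patch suggest an ordering dependency: code that reads the version as a global only works if something has set it first. The Python toy model below illustrates that idea only. It assumes, without confirmation from the log, that evaluating the `openstack` class is what copies the hiera value into the global; the class names and the `evaluate` function are hypothetical.

```python
# Value quoted in the chat at 01:10:07; the dict stands in for hiera data.
HIERA = {"openstack::version": "liberty"}


def evaluate(class_order):
    """Evaluate toy 'classes' in order and collect the paths they would build."""
    scope = {}  # stands in for top-scope (global) variables
    paths = []
    for name in class_order:
        if name == "openstack":
            # Assumption: this class is what populates the global.
            scope["openstack_version"] = HIERA["openstack::version"]
        elif name == "wikistatus":
            # Reads the global; an unset value silently becomes "".
            version = scope.get("openstack_version", "")
            paths.append(f"modules/openstack/{version}/nova/wikistatus")
    return paths


def explicit_lookup_path():
    """Alternative: look the value up directly, independent of evaluation order."""
    return f"modules/openstack/{HIERA['openstack::version']}/nova/wikistatus"


if __name__ == "__main__":
    print(evaluate(["wikistatus", "openstack"]))  # consumer runs too early
    print(evaluate(["openstack", "wikistatus"]))  # dependency evaluated first
    print(explicit_lookup_path())
```

In this model, both requiring the class first and doing an explicit lookup avoid the empty path, which is consistent with the two kinds of patches tried in the log, though the real Puppet behaviour may differ.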
[03:13:01] (PS1) Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[03:13:50] (CR) jenkins-bot: [V: -1] wmfkeystonehooks: Maintain project page on wikitech [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: Andrew Bogott)
[03:16:09] (PS2) Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[03:20:19] !log prometheus200[3-4] signing puppet certs, salt-key, initial run
[03:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:24:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 745.25 seconds
[03:28:00] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:30:21] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2816966 (Papaul) @fgiunchedi does this look good? root@prometheus2003:~# fdisk -l /dev/sda Disk /dev/sda: 1.5 TiB, 1599741100032 bytes, 3124494336 sectors Units: sectors of 1 * 512 =...
[03:31:41] Danny_B: I'm not sure. I think we'd be okay with a single wiki turning it off if they have consensus
[03:32:25] (CR) Krinkle: "bump" [puppet] - https://gerrit.wikimedia.org/r/314519 (owner: Elukey)
[03:33:01] assumption: all the magic links replaced with alternative -> no need to use them anymore -> better to turn them off rather than allowing them and having to correct them now and then until global switch
[03:40:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 137.87 seconds
[03:57:30] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[03:58:20] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:26:20] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[05:17:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=951.00 Read Requests/Sec=244.60 Write Requests/Sec=6.80 KBytes Read/Sec=30666.80 KBytes_Written/Sec=203.20
[05:30:10] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=163.60 Read Requests/Sec=187.70 Write Requests/Sec=121.80 KBytes Read/Sec=5644.00 KBytes_Written/Sec=1240.00
[05:49:40] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:17:40] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:35:20] RECOVERY - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is OK: TCP OK - 0.036 second response time on 10.192.32.154 port 9042
[06:45:00] PROBLEM - Check systemd state on mw2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:45:40] PROBLEM - Check systemd state on mw2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:46:30] PROBLEM - Check systemd state on mw2083 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:47:40] PROBLEM - Check systemd state on mw2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:49:30] PROBLEM - Check systemd state on mw2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:56:30] PROBLEM - Check systemd state on mw2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:05:44] (PS1) Dzahn: admin: create group striker-admins, add bd808 [puppet] - https://gerrit.wikimedia.org/r/323121 (https://phabricator.wikimedia.org/T151424)
[07:14:36] !log Stopping replication on db1052 (depooled) for maintenance - T150960
[07:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:49] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960
[07:17:40] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[07:29:25] !log Stopping replication on db1095 (depooled) for maintenance - T150960
[07:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:38] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960
[07:46:34] !log Stopping MySQL db2070 for maintenance - T149553
[07:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:45] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[07:49:00] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:56:30] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:52] (CR) Ema: [C: 2] Remove Varnishkafka APT pinning [puppet] - https://gerrit.wikimedia.org/r/322911 (https://phabricator.wikimedia.org/T150660) (owner: Ema)
[07:57:58] (PS2) Ema: Remove Varnishkafka APT pinning [puppet] - https://gerrit.wikimedia.org/r/322911 (https://phabricator.wikimedia.org/T150660)
[07:58:01] (CR) Ema: [V: 2] Remove Varnishkafka APT pinning [puppet] - https://gerrit.wikimedia.org/r/322911 (https://phabricator.wikimedia.org/T150660) (owner: Ema)
[08:18:00] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:24:30] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:59:02] Operations, Ops-Access-Requests, Patch-For-Review: Shell access to californium for bd808 - https://phabricator.wikimedia.org/T151424#2817174 (Volans) p:Triage>Normal
[09:01:41] Operations, Patch-For-Review: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933#2817175 (Volans) p:Triage>Normal
[09:02:34] Operations, Labs, Patch-For-Review: grafana-labs.wikimedia.org doesn't reflect grafana-labs-admin.wikimedia.org - https://phabricator.wikimedia.org/T143556#2817176 (Volans) p:Triage>Normal
[09:03:02] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2817177 (Volans) p:Triage>Normal
[09:04:38] Operations, puppet-compiler: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#2817178 (Volans) p:Triage>High I'll try to have a look at it today.
[09:14:53] PROBLEM - MariaDB Slave Lag: s1 on db1052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6525.04 seconds [09:15:23] ^ me - looks like downtime finished [09:16:07] ok [09:16:09] :( [09:16:12] sorry :( [09:16:19] I thought the operation would take less time [09:16:57] np [09:35:44] (03PS1) 10Giuseppe Lavagetto: jobrunner: fix logrotate rules under systemd [puppet] - 10https://gerrit.wikimedia.org/r/323127 [09:35:56] <_joe_> volans: ^^ [09:38:34] (03CR) 10Volans: [C: 04-1] "there is a typo, otherwise looks good" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [09:39:41] <_joe_> volans: where would the trailing spaces be? [09:39:56] <_joe_> I'm not sure I got that comment [09:40:11] look at the diff in gerrit, red block _joe_ [09:40:12] in the else [09:40:17] the else line [09:40:41] <_joe_> volans: uhm interesting the inline view that I use [09:40:48] <_joe_> instead of the side-by-side one [09:40:59] <_joe_> didn't show those [09:41:10] maybe you have the ignore spaces in diff [09:41:12] <_joe_> it's also pretty strange given I use whitespace-mode in emacs [09:41:26] <_joe_> volans: no I mean gerrit [09:41:28] <_joe_> https://gerrit.wikimedia.org/r/#/c/323127/1/modules/mediawiki/templates/jobrunner/logrotate-jobchron.conf.erb,unified [09:41:52] nice gerrit [09:42:15] <_joe_> now you see what had me confused [09:42:43] I thought there was an option to ignore spaces in the diffs and that was per-mode, I might have dreamed though :-P [09:44:24] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2817203 (10matmarex) There are nearly two million jobs queued now, and while https://grafana.wikimedia.org/dashboard/db/job-queue-health doesn't provide breakdown by type... 
[09:47:19] (03PS2) 10Giuseppe Lavagetto: jobrunner: fix logrotate rules under systemd [puppet] - 10https://gerrit.wikimedia.org/r/323127 [09:49:55] (03Abandoned) 10Gehel: granting access to analytics-privatedata-users for user discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/322282 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [09:50:52] 06Operations, 10ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2816903 (10Volans) This host is gone for now, I've added a 3 months scheduled downtime and disabled notifications on Icinga. Looks like it is a single disk host or a 2 disk host where only one was used, without RAID. In... [09:50:58] (03CR) 10Giuseppe Lavagetto: jobrunner: fix logrotate rules under systemd (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [09:51:49] <_joe_> volans: care to re-review? [09:52:26] sure [09:53:18] hi. could anyone run maintenance/showJobs.php for me on commons, and give me the output? [09:53:39] (this is for https://phabricator.wikimedia.org/T151196) [09:55:03] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2817221 (10Joe) [09:55:07] actually: maintenance/showJobs.php --group [09:55:33] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2813405 (10Joe) 05Open>03Resolved [09:55:53] (03CR) 10Volans: [C: 04-1] "Sorry, I've looked to quickly before... 
there is another bit missing" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [09:57:51] MatmaRex: if something I can help with, sure, but I need a bit more context ;) [10:00:00] volans: hmm, i think you need to do this: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Run_a_maintenance_script_on_a_wiki assuming this is up-to-date [10:00:24] so, ssh into terbium, and run `mwscript showJobs.php --wiki=commonswiki --group` [10:00:47] volans: or do you mean context as in why i need this? i'm trying to find out why the queue is clogged [10:01:30] 06Operations: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#1466972 (10Volans) One of those just failed today, the only sign from Icinga is that puppet failed: T151427 Many of those were reimaged to Jessie but apparently were not modified to use the second disk to have a... [10:01:49] MatmaRex: ok, give me a sec [10:01:59] (03PS3) 10Giuseppe Lavagetto: jobrunner: fix logrotate rules under systemd [puppet] - 10https://gerrit.wikimedia.org/r/323127 [10:04:53] <_joe_> volans: I think somewhere on terbium there must be a script i prepared last time we had to deal with this [10:05:22] MatmaRex: you should have received a notification of a paste from Phab ;) [10:05:36] looks like htmlCacheUpdate [10:06:08] volans: thanks, i see it [10:07:39] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#2817244 (10Gehel) [10:08:05] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325#2817245 (10Gehel) [10:08:17] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2817246 
(10Gehel) [10:08:29] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): move data to /srv for the cirrus / elasticsearch clusters - https://phabricator.wikimedia.org/T151328#2817247 (10Gehel) [10:09:04] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [10:09:15] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2817249 (10matmarex) Actually, it seems that 'categoryMembershipChange' jobs are being processed processed quickly. There's a backlog of 'htmlCacheUpdate' jobs. {P4497} [10:10:53] MatmaRex: let me know if you need anything else [10:12:15] (03CR) 10Alexandros Kosiaris: jobrunner: fix logrotate rules under systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [10:15:17] (03CR) 10ArielGlenn: "Daniel, see the ticket (spoiler: no we haven't). 
And the first thing we would try to do is play with young-gen and old-gen sizes, rather " [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [10:18:05] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2817273 (10Ankry) Let me know it is needed to slow down or stop temporarily the category removal [10:21:45] (03CR) 10Hashar: [C: 031] "That rule has some messy history, looks like the deployment targets now have a security.conf rule "scap-allow-mwdeploy" which would super" [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [10:23:18] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2817280 (10Gilles) [10:23:33] !log stopping replication to dbstore1001 to change its masters [10:23:39] jouncebot: next [10:23:40] In 123 hour(s) and 36 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161128T1400) [10:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2795498 (10Gilles) [10:24:25] !log Stopping replication on the following m3 hosts for maintenance - db1048, dbstore1002 (m3 instance), db2012 - T151384 [10:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:32] T151384: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384 [10:26:49] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - 
https://phabricator.wikimedia.org/T150760#2817298 (10Gilles) Now that I've separate the temp case into its own task, I see that the remaining ones look like an encoding problem. I notice one charact... [10:28:20] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s when the original has a ? in its filename - https://phabricator.wikimedia.org/T150760#2817315 (10Gilles) [10:28:38] (03PS1) 10Gehel: Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) [10:28:47] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s when the original has a ? in its filename - https://phabricator.wikimedia.org/T150760#2795498 (10Gilles) [10:28:49] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2817332 (10Gilles) [10:28:51] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2817330 (10Gilles) [10:29:57] (03CR) 10Gehel: "This follows the discussion on https://phabricator.wikimedia.org/T151063" [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [10:35:39] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2817344 (10Gilles) [10:35:59] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2817280 (10Gilles) [10:39:51] volans: one more request: mwscript showJobs.php --wiki=commonswiki --list --type=htmlCacheUpdate --limit=100 --status=unclaimed [10:40:05] MatmaRex: sure [10:40:30] 06Operations, 10Traffic: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2817348 (10ema) [10:42:07] 06Operations, 
10Traffic: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2817361 (10ema) p:05Triage>03Normal [10:43:16] MatmaRex: done :) [10:45:16] (03PS1) 10Ema: cache_upload: stop graphiq.com buggy javascript [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) [10:46:01] <_joe_> MatmaRex: looks like you were right [10:46:32] <_joe_> all changes I looked at are removal of the uploadwizard category [10:47:12] _joe_: it's not the categorymembership stuff though. just plain old htmlcacheupdate [10:47:31] <_joe_> which is triggered by the category change, it seems [10:47:39] _joe_: based on the paste volans just gave me, it's from WikiPage::onArticleEdit [10:48:15] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2817399 (10matmarex) Sample of the queued jobs: {P4498} [10:48:19] which is just the most boring code path ever [10:49:11] and not as easy to add a check to ignore the edits removing the category [10:49:33] (03PS1) 10Marostegui: site.pp: m3 has the wrong db master entry [puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) [10:52:19] (03CR) 10Jcrespo: [C: 031] site.pp: m3 has the wrong db master entry [puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) (owner: 10Marostegui) [10:52:21] <_joe_> MatmaRex: yup [10:52:43] Invalidate caches of articles which include this page [10:52:44] DeferredUpdates::addUpdate( new HTMLCacheUpdate( $title, 'templatelinks' ) ); [10:52:44] Invalidate the caches of all pages which redirect here [10:52:44] DeferredUpdates::addUpdate( new HTMLCacheUpdate( $title, 'redirect' ) ); [10:52:57] i guess it could be smart to first check if there are any such pages, and not queue if there are none? 
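[editor's note] The question raised here, whether to check for dependent pages before queueing an HTMLCacheUpdate, can be sketched roughly as follows. This is an illustrative Python sketch, not MediaWiki code: `BACKLINKS`, `has_backlinks`, `on_edit_always_enqueue` and `on_edit_check_first` are hypothetical names, and the in-memory dictionary only stands in for a links-table lookup. It shows the trade-off only: the check saves a no-op job but adds a synchronous lookup to every edit.

```python
from collections import deque

# Hypothetical in-memory stand-ins; none of these names come from MediaWiki.
BACKLINKS = {
    "Template:Infobox": ["Page A", "Page B"],  # transcluded by two pages
    "File:Example.jpg": [],                    # nothing depends on this page
}
job_queue = deque()


def has_backlinks(title):
    """Synchronous lookup performed at edit time (assumed to be cheap)."""
    return bool(BACKLINKS.get(title))


def on_edit_always_enqueue(title):
    """Always queue; the job itself discovers whether there is work to do."""
    job_queue.append(("htmlCacheUpdate", title))


def on_edit_check_first(title):
    """Variant floated in the chat: skip queueing when nothing depends on the page."""
    if has_backlinks(title):
        job_queue.append(("htmlCacheUpdate", title))


on_edit_check_first("File:Example.jpg")   # no dependants, nothing queued
on_edit_check_first("Template:Infobox")   # has dependants, one job queued
print(len(job_queue))
```

With the always-enqueue variant both edits would have produced a job, and the one for the file page would have turned out to be a no-op when it ran.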
[10:54:02] <_joe_> no, it's smart to defer that to an asynchronous job [10:56:44] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2817405 (10Gilles) >>! In T66214#2816123, @GWicke wrote: > - Supply original format and size in the URL or metadata, and let the client choose between... [10:56:46] i dunno. it feels like a lot of overhead if there is nothing to do [10:57:39] (03CR) 10Giuseppe Lavagetto: jobrunner: fix logrotate rules under systemd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [10:57:41] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2817406 (10hashar) Somehow the instance deployment-apertium01 is back! ``` $ uptime 10:52:58 up 21 days, 15:25, 1 user, load average: 0.00, 0.09, 0.10... [10:57:45] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/4660/" [puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) (owner: 10Marostegui) [10:57:49] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: fix logrotate rules under systemd [puppet] - 10https://gerrit.wikimedia.org/r/323127 (owner: 10Giuseppe Lavagetto) [10:59:09] 06Operations, 10ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2817422 (10Volans) p:05Triage>03High Setting high because is a broken production host, @Papaul feel free to lower it if is in the list of soon-to-be-decom hosts [11:01:40] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/mediawiki_jobchron],File[/etc/logrotate.d/mediawiki_jobrunner] [11:02:00] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/logrotate.d/mediawiki_jobchron],File[/etc/logrotate.d/mediawiki_jobrunner] [11:02:01] _joe_: ^^^ [11:02:06] 07Puppet, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 07Beta-Cluster-reproducible: New instance have broken puppet configuration when using puppetmaster standalone - https://phabricator.wikimedia.org/T148929#2817428 (10hashar) Puppet provisions the Puppet_Internal_CA.crt file... [11:02:21] <_joe_> volans: yeah that's strange [11:02:31] <_joe_> let me see [11:02:36] Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/mediawiki/logrotate.d_mediawiki_jobchron [11:02:40] <_joe_> it's trusty hosts [11:03:14] <_joe_> volans: yeah it's a race ccondition [11:03:24] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2817432 (10Gilles) [11:03:26] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 503s on an image Mediawiki renders successfully - https://phabricator.wikimedia.org/T150761#2817430 (10Gilles) 05Open>03declined This one seems to be completely intermittent and very rare. Re-requesting those thumbnails works just fine. [11:03:29] looks like still looking for the old file [11:04:13] <_joe_> yeah as I said, race condition [11:04:30] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) (owner: 10Marostegui) [11:05:00] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:05:57] (03CR) 10Marostegui: "After discussing this change with a few folks on IRC we decided that it is important to have the correct master definitions on file." 
[puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) (owner: 10Marostegui) [11:06:00] RECOVERY - Check systemd state on mw2080 is OK: OK - running: The system is fully operational [11:06:16] (03PS2) 10Marostegui: site.pp: m3 has the wrong db master entry [puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) [11:06:30] RECOVERY - Check systemd state on mw2083 is OK: OK - running: The system is fully operational [11:06:30] RECOVERY - Check systemd state on mw2081 is OK: OK - running: The system is fully operational [11:06:30] RECOVERY - Check systemd state on mw2085 is OK: OK - running: The system is fully operational [11:06:40] yay [11:06:41] RECOVERY - Check systemd state on mw2084 is OK: OK - running: The system is fully operational [11:06:41] RECOVERY - Check systemd state on mw2082 is OK: OK - running: The system is fully operational [11:08:40] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:08:49] (03CR) 10Marostegui: [C: 032] site.pp: m3 has the wrong db master entry [puppet] - 10https://gerrit.wikimedia.org/r/323137 (https://phabricator.wikimedia.org/T151384) (owner: 10Marostegui) [11:11:37] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2817443 (10Gilles) [11:11:39] 06Operations, 06Performance-Team, 10Thumbor: Mediawiki 301s on some images that Thumbor renders succesfully - https://phabricator.wikimedia.org/T150755#2817441 (10Gilles) 05Open>03declined This is coming from a very obscure feature that perma-redirects to a shorter thumbnail URL when the one being encoun... [11:14:31] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2817446 (10akosiaris) With 21 days uptime ? 
I think it's just not deleted. otherwise this does not make sense. [11:15:34] hashar: this is crazy [11:15:50] we 've both deleted the VM.. at least once each [11:15:57] it's just not being deleted for some reaosn [11:15:59] reason* [11:16:58] akosiaris: yeah it somehow did not register in the openstack infra :( [11:17:04] apparently it is gone now [11:17:22] if it comes back, Shinken will probably notice it and alarm [11:17:22] "apparently" being the key word here [11:17:28] ;D [11:17:40] I looked in Horizon at the action log [11:17:48] and there was no delete action registered [11:18:03] fwiw I do clearly remembering checking that it no longer showed up in the horizon list [11:18:09] (03Abandoned) 10Gehel: maps - maps-test* servers are test servers [puppet] - 10https://gerrit.wikimedia.org/r/322927 (https://phabricator.wikimedia.org/T149643) (owner: 10Gehel) [11:18:12] ditto [11:19:25] and https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-apertium01.deployment-prep.eqiad.wmflabs shows our actions [11:19:49] hmmm [11:20:34] starting to smell like an openstack bug [11:20:46] openstack being deliberately very vague here [11:21:18] yeah [11:21:20] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2817448 (10hashar) From https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-apertium01.deployment-prep.eqiad.wmflabs > Sorry, this page was rece... [11:21:46] https://phabricator.wikimedia.org/T147210 has all the details, might be worth filling a new task [11:34:51] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 502s where Mediawiki doesn't - https://phabricator.wikimedia.org/T150757#2817469 (10Gilles) The first example: http://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Nuove_poesie_-_Rocco_Galdieri.djvu/page120-159px-Nuove_poesie_-_Rocco_Galdieri.djvu.jpg... 
[11:38:40] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 502s where Mediawiki doesn't - https://phabricator.wikimedia.org/T150757#2817485 (10Gilles) That file's actual weight isn't that big, 20MB, but it seems to be fairly high res: 7 746 × 11 667 pixels. It would be interesting to see if in the other instances... [11:46:00] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:46:22] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2817490 (10matmarex) So, these are just the bog-standard jobs generated in WikiPage::onArticleEdit(). The job queue is growing simply because we can't handle this rate of... [11:48:10] <_joe_> !log uploaded calico/kube-policy-controller:0.5.0 to the docker registry [11:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:32] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 502s where Mediawiki doesn't - https://phabricator.wikimedia.org/T150757#2817497 (10Gilles) Double-checking that specific file on an isolated thumbor instance, it would seem that it doesn't leak *that much* memory per run, but maybe it has a significant me... [11:55:34] _joe_: i'll ask you, since you talked earlier. is it possible to make the jobs be processed faster? retask some idle machines for this, or something? or should people just slow down with the edits? https://phabricator.wikimedia.org/T151196#2817490 [11:56:32] i expect that almost all of these jobs are no-ops… [11:57:15] <_joe_> MatmaRex: actually, the real culprit here is redis IIRC [11:57:28] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 502s where Mediawiki doesn't - https://phabricator.wikimedia.org/T150757#2817501 (10Gilles) I've tried some random JPG and there is a little leaking for that, but not nearly as much (at most 9ish kb). 
[12:00:20] <_joe_> MatmaRex: lemme look at increasing the workers for htmlCacheUpdate jobs [12:01:34] <_joe_> that seems like a good idea atm [12:01:48] <_joe_> (I won't comment on the fact that this requires a puppet update) [12:02:52] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 502s where Mediawiki doesn't - https://phabricator.wikimedia.org/T150757#2817506 (10Gilles) Looking at how many we get per day, it seems like OOM is the reasonable general explanation for those: ``` gilles@ms-fe1001:/var/log/swift$ cat server.log.1 | grep... [12:04:15] (03PS1) 10Giuseppe Lavagetto: jobrunner: temporarily double the html workers [puppet] - 10https://gerrit.wikimedia.org/r/323139 (https://phabricator.wikimedia.org/T151196) [12:04:19] <_joe_> MatmaRex: ^^ [12:05:33] alright. thanks [12:06:23] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 502s where Mediawiki doesn't - https://phabricator.wikimedia.org/T150757#2817524 (10Gilles) [12:06:25] 06Operations, 06Performance-Team, 10Thumbor: Thumbor OOMs too much - https://phabricator.wikimedia.org/T150643#2817526 (10Gilles) [12:06:57] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2795453 (10Gilles) [12:08:06] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2817530 (10Gilles) [12:08:09] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2817531 (10Gilles) [12:08:11] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2795453 (10Gilles) [12:08:22] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628088 (10Gilles) [12:08:24] 06Operations, 06Performance-Team, 10Thumbor: Reimplement various 
rate-limiting mechanisms in Thumbor - https://phabricator.wikimedia.org/T150745#2817533 (10Gilles) [12:08:28] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter - https://phabricator.wikimedia.org/T151067#2817532 (10Gilles) [12:08:31] 06Operations, 06Performance-Team, 10Thumbor: Reimplement various rate-limiting mechanisms in Thumbor - https://phabricator.wikimedia.org/T150745#2795243 (10Gilles) [12:08:34] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628094 (10Gilles) [12:08:36] 06Operations, 06Performance-Team, 10Thumbor: Implement PoolCounter support - https://phabricator.wikimedia.org/T151066#2817537 (10Gilles) [12:08:46] 06Operations, 06Performance-Team, 10Thumbor: Reimplement various rate-limiting mechanisms in Thumbor - https://phabricator.wikimedia.org/T150745#2795243 (10Gilles) [12:08:48] 06Operations, 06Performance-Team, 10Thumbor: Implement DC-local cache failure limiter - https://phabricator.wikimedia.org/T151065#2817540 (10Gilles) [12:08:50] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628111 (10Gilles) [12:09:01] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2791865 (10Gilles) [12:09:03] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2817543 (10Gilles) [12:09:05] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628594 (10Gilles) [12:09:17] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't render a few SVGs that Mediawiki can - https://phabricator.wikimedia.org/T150754#2817546 (10Gilles) [12:09:20] 06Operations, 
06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628724 (10Gilles) [12:09:28] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2791865 (10Gilles) [12:09:30] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2632006 (10Gilles) [12:09:39] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2791865 (10Gilles) [12:09:41] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s when the original has a ? in its filename - https://phabricator.wikimedia.org/T150760#2817552 (10Gilles) [12:09:44] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2632184 (10Gilles) [12:10:58] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2817558 (10Gilles) [12:11:00] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#2817557 (10Gilles) [12:11:02] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2817559 (10Gilles) [12:12:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/323139 (https://phabricator.wikimedia.org/T151196) (owner: 10Giuseppe Lavagetto) [12:13:17] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: temporarily double the html workers [puppet] - 10https://gerrit.wikimedia.org/r/323139 
(https://phabricator.wikimedia.org/T151196) (owner: 10Giuseppe Lavagetto) [12:13:47] 06Operations, 06Performance-Team, 10Thumbor: Improve Content-Disposition support in Thumbor - https://phabricator.wikimedia.org/T151072#2817565 (10Gilles) [12:14:02] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:14:07] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2817566 (10Gilles) [12:14:16] 06Operations, 06Performance-Team, 10Thumbor: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2817567 (10Gilles) [12:14:25] 06Operations, 06Performance-Team, 10Thumbor: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#2817568 (10Gilles) [12:18:35] (03CR) 10Alexandros Kosiaris: "A first test in https://puppet-compiler.wmflabs.org/4662/ is succesfull. A bigger test run to follow" [puppet] - 10https://gerrit.wikimedia.org/r/322898 (owner: 10Alexandros Kosiaris) [12:28:39] 06Operations, 07Availability, 13Patch-For-Review, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2817573 (10matmarex) For the record, @Ankry says he stopped his bot for now (which was responsible for about half of the edits, it looks like at lea... 
[12:48:59] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2817603 (10Gehel) p:05Triage>03High [12:49:06] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): move data to /srv for the cirrus / elasticsearch clusters - https://phabricator.wikimedia.org/T151328#2817604 (10Gehel) p:05Triage>03High [12:49:13] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325#2817605 (10Gehel) p:05Triage>03High [12:49:21] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#2817606 (10Gehel) p:05Triage>03High [12:57:46] (03PS1) 10Marostegui: sanitarium2.my.cnf: Disable parallel replication [puppet] - 10https://gerrit.wikimedia.org/r/323145 [12:59:02] (03PS1) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [12:59:54] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) (owner: 10Paladox) [13:00:49] (03PS2) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [13:00:51] (03PS1) 10Alexandros Kosiaris: grafana: Update the serverboard dashboard [puppet] - 10https://gerrit.wikimedia.org/r/323147 [13:01:26] (03PS2) 10Alexandros Kosiaris: grafana: Update the serverboard dashboard [puppet] - 10https://gerrit.wikimedia.org/r/323147 [13:02:07] (03CR) 10Marostegui: "We do have one m3 slave in codfw - 
db2012." [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [13:02:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] grafana: Update the serverboard dashboard [puppet] - 10https://gerrit.wikimedia.org/r/323147 (owner: 10Alexandros Kosiaris) [13:04:00] (03PS3) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [13:04:29] (03PS4) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [13:04:34] (03PS5) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [13:08:07] 06Operations, 07Availability, 13Patch-For-Review, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2810146 (10Joe) @matmarex since I increased the htmlCacheUpdate throughput by 100% the queue stopped increasing, and the number of error/timeouts ha... 
[13:13:22] (03PS6) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [13:16:57] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can error on some characters in the filename part of the request - https://phabricator.wikimedia.org/T151453#2817667 (10Gilles) [13:17:06] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2632316 (10Gilles) [13:21:15] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors when %0A is in the filename part of the request - https://phabricator.wikimedia.org/T151453#2817682 (10Gilles) [13:22:12] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on images Mediawiki 404s on - https://phabricator.wikimedia.org/T150753#2817683 (10Gilles) [13:22:56] RECOVERY - MariaDB Slave Lag: s1 on db1052 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [13:23:24] and page for the recovery >_< [13:32:43] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on images Mediawiki 404s on - https://phabricator.wikimedia.org/T150753#2817692 (10Gilles) While all the examples left here (now that I've removed the cases of {T151453}) 404 quickly without an issue now, I've found a recent example in the logs that m... [13:34:27] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on images Mediawiki 404s on - https://phabricator.wikimedia.org/T150753#2817695 (10Gilles) The TIF one is also one of those cases where Mediawiki treats it as a 404, but the file actually exists: https://commons.wikimedia.org/wiki/File:J%C3%A9r%C3%B4m... [13:37:58] (03CR) 10Alex Monk: "there were no sudo privileges requested, it's just login. 
so striker-users, I think" [puppet] - 10https://gerrit.wikimedia.org/r/323121 (https://phabricator.wikimedia.org/T151424) (owner: 10Dzahn) [13:39:06] (03PS1) 10Giuseppe Lavagetto: jobrunner: increase (again) the workers for htmlCacheUpdate jobs [puppet] - 10https://gerrit.wikimedia.org/r/323151 (https://phabricator.wikimedia.org/T151196) [13:42:24] I receive the page now [13:42:44] jynus: only now? strange I got it right away [13:45:44] <_joe_> volans, jynus I'm going to merge that change btw [13:46:24] (03CR) 10Volans: [C: 04-1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/323151 (https://phabricator.wikimedia.org/T151196) (owner: 10Giuseppe Lavagetto) [13:46:28] <_joe_> if this increase doesn't make the queue go down [13:46:31] <_joe_> uh? [13:46:36] <_joe_> -1 LGTM is new [13:46:38] <_joe_> :P [13:46:40] what? [13:46:41] haha [13:46:51] <_joe_> new paradigms in code review [13:46:53] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/323151 (https://phabricator.wikimedia.org/T151196) (owner: 10Giuseppe Lavagetto) [13:47:00] rotfl [13:47:03] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: increase (again) the workers for htmlCacheUpdate jobs [puppet] - 10https://gerrit.wikimedia.org/r/323151 (https://phabricator.wikimedia.org/T151196) (owner: 10Giuseppe Lavagetto) [13:47:05] sorry about that [13:47:11] -1 DLGTM [13:51:07] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on images Mediawiki 404s on - https://phabricator.wikimedia.org/T150753#2817710 (10Gilles) OK, that TIF seems to hang for no particular reason. I think this makes it a duplicate of T150746 because Mediawiki has no problem rendering that file fast when... 
[13:51:29] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on several images Mediawiki renders succesfully - https://phabricator.wikimedia.org/T150746#2795266 (10Gilles) [13:51:31] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on images Mediawiki 404s on - https://phabricator.wikimedia.org/T150753#2817715 (10Gilles) [13:51:36] (03CR) 10Jcrespo: [C: 031] sanitarium2.my.cnf: Disable parallel replication [puppet] - 10https://gerrit.wikimedia.org/r/323145 (owner: 10Marostegui) [13:51:55] (03PS1) 10Gehel: elasticsearch - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) [13:52:50] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on several images Mediawiki renders succesfully - https://phabricator.wikimedia.org/T150746#2795266 (10Gilles) At this point I've seen various reasons why Thumbor might hang, I think this scenario probably contains distinct individual bugs that need t... [13:54:10] 06Operations, 06Performance-Team, 10Thumbor: Thumbor chokes on a specific TIF file - https://phabricator.wikimedia.org/T151454#2817720 (10Gilles) [13:56:16] _joe_: we're increasing workers for htmlCacheUpdate? I thought that was one of our main purge limiters. [13:56:51] honestly I don't know which part of it limits what. I know something to that effect is what holds the average rate down though. 
[13:58:16] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on a specific GIF file - https://phabricator.wikimedia.org/T151455#2817738 (10Gilles) [13:58:29] (03PS2) 10Ema: cache_upload: stop graphiq.com buggy javascript [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) [13:58:43] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors when %0A is in the filename part of the request - https://phabricator.wikimedia.org/T151453#2817752 (10Gilles) a:05fgiunchedi>03Gilles [13:58:54] 06Operations, 06Performance-Team, 10Thumbor: Thumbor chokes on a specific TIF file - https://phabricator.wikimedia.org/T151454#2817753 (10Gilles) a:05fgiunchedi>03Gilles [13:59:10] (03CR) 10Jonas Kress (WMDE): "Needs manual rebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [14:08:18] (03PS5) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) [14:10:36] <_joe_> bblack: I'm looking at purge rates too [14:12:08] (03PS2) 10Gehel: elasticsearch - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) [14:16:32] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:18:09] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2817793 (10Gilles) [14:19:01] (03CR) 10Jcrespo: "I am still waiting to explain you the problematic with phabricator with master and slaves https://phabricator.wikimedia.org/T112776#256440" [puppet] - 10https://gerrit.wikimedia.org/r/323082 (owner: 1020after4) [14:19:25] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2817720 (10Gilles) Another example: https://commons.wikimedia.org/wiki/File:Tessie_Reynolds_02.tif [14:23:15] (03PS1) 10Gehel: elasticsearch - reimage elasticsearch relforge servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) [14:25:20] (03PS1) 10Gehel: elasticsearch - reimage elasticsearch cirrus / codfw servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323157 (https://phabricator.wikimedia.org/T151326) [14:26:47] 06Operations, 06Performance-Team, 10Thumbor: Thumbor original file download limit should be 4GB - https://phabricator.wikimedia.org/T151456#2817810 (10Gilles) [14:26:52] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:26:54] 06Operations, 06Performance-Team, 10Thumbor: Thumbor original file download limit should be 4GB - https://phabricator.wikimedia.org/T151456#2817810 (10Gilles) p:05Normal>03Low [14:27:01] 06Operations, 06Performance-Team, 10Thumbor: Thumbor original file download limit should be 4GB - https://phabricator.wikimedia.org/T151456#2817810 (10Gilles) [14:27:05] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2632421 (10Gilles) [14:28:33] (03PS1) 10Gehel: elasticsearch - reimage elasticsearch cirrus / eqiad servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323158 (https://phabricator.wikimedia.org/T151326) [14:28:40] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2817829 (10Gilles) [14:29:03] (03CR) 10Jonas Kress (WMDE): [C: 031] Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [14:29:50] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2817738 (10Gilles) Other example: https://commons.wikimedia.org/wiki/File:LOVE_A+T_LOVEigf.gif This one's original starts off animated and blanks itself. [14:30:51] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2817831 (10Gilles) This error is probably related and showed up when stopping the Thumbor instance: ``` 2016-11-23 14:29:53,444 8841 tornado.application:ERROR Future exception was never... 
[14:31:51] !log Stopping replication db1095 (not pooled) - maintenance - T150960 [14:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:03] T150960: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960 [14:32:52] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:33:26] !log rebooting, upgrading db1092 while it is depooled for maintenance [14:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:57] 06Operations, 06Performance-Team, 10Thumbor: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2817842 (10Gilles) This one: https://uk.wikipedia.org/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Jokie.gif Fails for what seems to be a different reason: ``` 2016-11-23 14:32:39,660 8841 thumbor:DEB... [14:37:54] !log elastic@eqiad: reindexing ruwiki from terbium, logs in ~dcausse/bm25_reindex/cirrus_log (T148344) [14:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:05] T148344: Search works incorrectly when the query contains words used as namespace names and a colon (:) - https://phabricator.wikimedia.org/T148344 [14:39:07] (03PS1) 10Eevans: enable instance restbase2012-a.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/323159 (https://phabricator.wikimedia.org/T151086) [14:44:32] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:44:56] 06Operations, 06Performance-Team, 10Thumbor: Nginx time limit should be a bit higher than Thumbor subprocess time limit - https://phabricator.wikimedia.org/T151459#2817875 (10Gilles) [14:45:20] (03CR) 10Eevans: "Reminder: Being the first instance of restbase2012, this one comes "Some Intervention Required"; Someone with root will need to log in aft" [puppet] - 10https://gerrit.wikimedia.org/r/323159 
(https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [14:47:17] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): move data to /srv for the cirrus / elasticsearch clusters - https://phabricator.wikimedia.org/T151328#2814443 (10Gehel) a:03Gehel [14:47:25] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325#2814372 (10Gehel) a:03Gehel [14:47:30] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2814411 (10Gehel) a:03Gehel [14:53:01] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2817901 (10Marostegui) This has been running fine for 6 days already. Once the deploys are un blocked, I will pool it back. [14:56:20] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): move data to /srv for the cirrus / elasticsearch clusters - https://phabricator.wikimedia.org/T151328#2817910 (10Gehel) The oldest elasticsearch servers (`elastic1017-1031`) have smaller SSD, configured as RAID... [14:58:23] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2817914 (10Gilles) [14:58:25] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on a few images Mediawiki 500s on - https://phabricator.wikimedia.org/T150756#2817912 (10Gilles) 05Open>03Resolved I believe I've found the few bugs where Thumbor was hanging for illegitimate reasons. Everything else here is a case of Thumbor taki... [14:58:52] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:01:35] (03PS2) 10Gehel: elasticsearch - codfw servers move to jessie and data on /srv [puppet] - 10https://gerrit.wikimedia.org/r/323157 (https://phabricator.wikimedia.org/T151326) [15:05:53] (03PS2) 10Gehel: elasticsearch - eqiad servers move to jessie and data on /srv [puppet] - 10https://gerrit.wikimedia.org/r/323158 (https://phabricator.wikimedia.org/T151326) [15:10:23] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2817960 (10Gilles) Likely cause for the first example: ``` 2016-11-23 15:09:40,841 8841 tornado.application:ERROR Future exception was never retrieved: Traceback (most recent call last)... [15:12:50] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2817963 (10Gilles) Same error coming from the second example. [15:20:42] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:21:39] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2817972 (10Gilles) Yet another instance: http://upload.wikimedia.org/wikipedia/commons/thumb/8/82/Hafnia_alvei.tif/lossy-page1-401px-Hafnia_alvei.tif.jpg [15:26:52] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:48:42] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:49:12] (03CR) 10BBlack: cache_upload: stop graphiq.com buggy javascript (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [15:50:48] !log elastic@eqiad: ruwiki reindex done (T148344) [15:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:59] T148344: Search works incorrectly when the query contains words used as namespace names and a colon (:) - https://phabricator.wikimedia.org/T148344 [15:52:42] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:21] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2818062 (10Papaul) [16:01:37] !log Stopping replication db2070 for maintenance - T149553 [16:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:50] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [16:02:11] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2814695 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi you can take over. 
Thanks [16:03:20] The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. [16:03:21] The system administrator who locked it offered this explanation: The database has been automatically locked while the replica database servers catch up to the master. [16:03:23] great [16:05:17] where did you get that? [16:09:03] I do not see anything in logstash (fyi) [16:09:06] 06Operations, 10ops-codfw: RAID degraded on ms-be2011 - https://phabricator.wikimedia.org/T149234#2818085 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete. [16:09:24] marostegui, I see several servers with lag on s2 and s4 [16:09:32] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag [16:10:21] I think the extra connections from the job queue can create extra pressure [16:10:41] and then there is db1053, which seems to have problems [16:11:02] RECOVERY - MegaRAID on ms-be2011 is OK: OK: optimal, 13 logical, 13 physical [16:11:20] but they are not big spikes [16:11:29] they are [16:11:37] see it is a log scale [16:11:45] db1053 could be just a disk issue [16:11:52] the others, a query pattern [16:14:15] db1053 has one disk with high predictive failure events but no media errors [16:14:27] (03CR) 10BryanDavis: [C: 031] "Looks like a sensible plan to me. Logstash warns on startup of JDK7 support being deprecated so it seems likely that this will be an uneve" [puppet] - 10https://gerrit.wikimedia.org/r/323154 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [16:15:38] I would either offline a disk there or depool it [16:15:56] mafk, hello? [16:16:05] ehm, yes sorry, wrong channel [16:16:09] too many tabs opened [16:16:14] I need an account disabled [16:16:27] it can wait [16:17:23] jynus: Probably better to depool it I would say, as it might not be the disk (the counter isn't growing) [16:17:29] jynus: you want me to depool it now?
[16:18:14] https://tendril.wikimedia.org/host/view/db1053.eqiad.wmnet/3306 [16:18:28] look at replication history [16:18:39] it is most likely the disk [16:18:43] interesting [16:19:41] ok, let me locate the exact disk [16:19:54] it is the fourth [16:20:16] yes [16:20:20] fifth, slot 4 [16:20:33] we lose nothing by offlining it [16:21:20] it is even documented: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Caused_by_hardware [16:21:42] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:22:00] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on several images Mediawiki renders succesfully - https://phabricator.wikimedia.org/T150746#2818132 (10Gilles) Looking at all these examples and recent occurrences, the issue is indeed intermittent, which would support the load theory that I need to l... [16:22:17] so: megacli -PDOffline -PhysDrv [32:4] -aALL agreed? [16:22:19] megacli -PDOffline -PhysDrv '[32:4]' -aALL [16:22:28] the quotes are important [16:22:42] or have to be escaped \[ [16:22:48] 06Operations, 06Performance-Team, 10Thumbor: Investigate what happens when cheap requests immediately follow a very expensive request on a Thumbor instance - https://phabricator.wikimedia.org/T150746#2818133 (10Gilles) [16:23:19] Never had to use them before, but sure. I will go ahead and set offline [16:23:21] (03CR) 10BryanDavis: "striker-users seems fine. If and when I find a need for elevated rights we can migrate to an -admin or -root group." [puppet] - 10https://gerrit.wikimedia.org/r/323121 (https://phabricator.wikimedia.org/T151424) (owner: 10Dzahn) [16:23:29] jynus: ok?
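Editorial note on the quoting point raised at 16:22: in a POSIX-style shell, an unquoted `[32:4]` is a glob bracket expression that matches any single-character file name among `3`, `2`, `:` and `4`. If such a file happens to exist in the working directory, the shell substitutes its name before megacli ever sees the argument; if none exists, bash by default passes the pattern through unchanged, which is probably why the unquoted form usually appears to work. The sketch below is only an illustration under those assumptions: it uses Python's `fnmatch` to mimic the bracket matching and `shlex` to show that the quoted and backslash-escaped forms both reach the program as the literal string. The directory listings and the helper `unquoted_expansion` are hypothetical, and nothing here invokes megacli.

```python
import fnmatch
import shlex

PATTERN = "[32:4]"  # the enclosure:slot argument exactly as typed in the log


def unquoted_expansion(pattern, directory_listing):
    """Mimic shell globbing of an unquoted word (hypothetical helper).

    Returns the sorted matching names, or the pattern itself when nothing
    matches, which is bash's default behaviour without nullglob/failglob.
    """
    matches = sorted(
        name for name in directory_listing if fnmatch.fnmatchcase(name, pattern)
    )
    return matches or [pattern]


# Hypothetical working directory that contains a file named "4":
# the unquoted argument would be silently replaced by that file name.
with_stray_file = unquoted_expansion(PATTERN, ["4", "notes.txt"])
print(with_stray_file)

# Hypothetical working directory with no single-character match:
# the pattern survives untouched.
without_stray_file = unquoted_expansion(PATTERN, ["notes.txt"])
print(without_stray_file)

# Quoting or escaping removes the glob meaning; the program receives
# the literal text in both cases.
quoted = shlex.split("megacli -PDOffline -PhysDrv '[32:4]' -aALL")
escaped = shlex.split(r"megacli -PDOffline -PhysDrv \[32:4\] -aALL")
print(quoted[3], escaped[3])
```

Either the single-quoted or the escaped form is safe regardless of what files sit in the current directory, which matches the advice given in the channel.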
[16:23:34] yes, log it [16:24:08] if I am wrong, I can take blame [16:24:20] !log Setting offline disk [32:4] on db1053 - looks like it is causing repl issues [16:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:41] done [16:25:04] an alert should go off soon [16:25:20] and task created on phab :) [16:25:26] XD [16:25:29] if you don't silence it [16:25:44] root@db1053:~# megacli -LDPDInfo -aAll | grep Degraded [16:25:44] State : Degraded [16:25:48] should be coming soon indeed :) [16:27:21] it still has spikes of lag, maybe [16:28:04] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2818140 (10GWicke) > you could pass it through an anchor that wouldn't make it back server side. I.e. restbase or the API would give this to the client... [16:28:12] we will see [16:30:08] (03PS3) 10Ema: cache_upload: stop graphiq.com buggy javascript [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) [16:31:31] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2818146 (10mark) >>! In T149911#2772370, @RobH wrote: > I'd like to allocate spare pool system WMF4726 for this request. It has the following specs: > > * Dual Intel® Xeon® P... [16:33:05] (03CR) 10Ema: cache_upload: stop graphiq.com buggy javascript (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [16:35:39] 06Operations, 06Performance-Team, 10Thumbor: Investigate what happens when cheap requests immediately follow a very expensive request on a Thumbor instance - https://phabricator.wikimedia.org/T150746#2818160 (10Gilles) The answer of the investigation is that things work as expected... I'm still at a loss to... 
[16:36:10] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#2818161 (10Gilles) [16:36:32] (03CR) 10BryanDavis: [C: 04-1] "Added Krinkle as a reviewer. I think he brought the execption-json channel to life originally. We can keep it on fluorine if there are oth" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [16:40:52] PROBLEM - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:40:53] ACKNOWLEDGEMENT - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T151465 [16:40:55] 06Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818166 (10ops-monitoring-bot) [16:41:15] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#2818170 (10Gilles) They happen a fair bit: ``` gilles@ms-fe1001:/var/log/swift$ cat server.log.1 | grep "Mediawiki: 200 Thumbor: 504" | wc -l 995... [16:41:36] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818173 (10Marostegui) [16:42:39] marostegui: you take care of this task? ^^^ [16:43:04] volans: yep! :) [16:43:15] thanks :) [16:43:29] no, thank you for creating the check and generating the task automatically!! :) [16:44:43] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818166 (10Marostegui) This is correct, disk `32:4` has been set as offline and needs to be replaced: ``` RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 3.271 TB... 
[16:44:46] I am not sure that fixed it [16:45:43] the graphs shows 1s spikes from time to time [16:48:06] 06Operations, 07Availability, 13Patch-For-Review, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2818186 (10Joe) So, even if I raised the number of jobs and the number of submitted htmlCacheUpdate jobs submitted for commonswiki increased: ```$ f... [16:48:21] 06Operations, 07Availability, 13Patch-For-Review, 07Performance, 15User-Joe: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2818190 (10Joe) [16:48:50] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#2818194 (10Gilles) There definitely seems to be a time correlation between 502s and 504s, this sort of thing happens a lot: ``` Nov 23 06:05:35 m... [16:49:01] (03PS1) 10Chad: Gerrit: Raise global limit for objects back to 100m [puppet] - 10https://gerrit.wikimedia.org/r/323179 [16:53:56] (03CR) 10Paladox: [C: 031] Gerrit: Raise global limit for objects back to 100m [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [16:54:22] (03CR) 10Mobrovac: [C: 031] Gerrit: Raise global limit for objects back to 100m [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [17:03:45] 06Operations: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822#2818224 (10fgiunchedi) [17:03:59] 06Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2818226 (10fgiunchedi) [17:07:32] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:07:55] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2818227 (10JKatzWMF) [17:09:19] 06Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2797837 (10Krenair) Be wary of T150058 and Elastic's use of this CA [17:09:26] 06Operations, 10DBA, 06Performance-Team, 07Availability, 07Wikimedia-Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2277856 (10fgiunchedi) re: certificate handling that @jcrespo mentioned, see also {T150822} for the related... [17:10:32] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:20:11] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#2818294 (10ovasileva) p:05Triage>03Normal [17:27:42] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:30:20] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2818302 (10JKatzWMF) @cscott an email bump caused me to just see your comment now- I apologize for the dela... [17:33:24] jouncebot: next [17:33:24] In 116 hour(s) and 26 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161128T1400) [17:33:39] 116 hours? That's not a very useful time interval jouncebot [17:34:02] Just shy of 7 days, ok thanks for nothing jouncebot [17:34:24] (03CR) 10Krinkle: [C: 04-1] "Introduced by Ori actually. 
I merely moved the code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [17:34:26] xD [17:34:42] no SWAT in a week? [17:34:56] mafk: Silly american holidays [17:35:18] Reedy: which ones are now? [17:35:40] Pretty much all of them [17:35:55] lol [17:35:59] mafk: Thanksgiving this week [17:36:04] GIVE ME YOUR THANKS [17:36:21] I have no thanks to give [17:36:24] !log Stopping MySQL on db2070 for maintenance - https://phabricator.wikimedia.org/T149553 [17:36:29] * mafk gives a slice of thanksgiving turkey to Reedy [17:36:37] ostriches: Any fucks? [17:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:48] Reedy: 0 fucks given. [17:36:52] Damn. [17:37:16] (03Abandoned) 10Alex Monk: Remove apache-level HTTPS redirects [puppet] - 10https://gerrit.wikimedia.org/r/322423 (owner: 10Alex Monk) [17:37:37] It's a good day for technical debt. [17:37:41] I mean bad. [17:37:44] bad day for it. [17:37:47] Let's kill it. [17:38:15] I'd like to see a real-time graph with number of turkey's killed across the globe. [17:38:23] And monitor it in Incinga [17:38:29] Krinkle: TO KICKSTARTER [17:38:33] * mafk wonders why EU based SWATers can't take care of SWAT those days. [17:38:34] Sounds like a great IOT project [17:38:37] per minute [17:39:20] I wonder of diamond works on Turkey? [17:40:08] Gotta depool them first from etcd [17:40:09] Hm. there's no JavasC [17:40:17] Hehe [17:40:18] https://www.npmjs.com/package/turkey [17:40:24] And it's... mediawiki related? [17:40:32] .... [17:40:42] loool [17:42:10] The formatting of the article in tky is terrible. Who would want that? [17:43:41] because real rockstars work in terminal [17:45:28] I have no objections to working in a terminal. I have objections to horrifically formatted and nigh unreadable text :p [17:48:19] <_joe_> mafk: that's not so simple, sadly. 
When you do changes, you risk breaking things, and breaking things means having to phone people during thanksgiving... [17:48:50] _joe_: if people that can fix things are all affected by thanksgiving then I agree [17:49:02] <_joe_> mafk: and not every issue presents itself during the EU day [17:49:09] * ostriches is boycotting thanksgiving this year :p [17:49:13] <_joe_> sometimes we do a deploy at 2 PM UTC [17:49:25] <_joe_> and we find out that something is broken at 2 AM UTC [17:49:42] _joe_: And you had 12 hours to sleep in between! Slacker!! ;-) [17:49:47] I can understand that [17:50:05] <_joe_> so it's just a way to avoid breaking things for a few days a year [17:50:30] I don't have anything urgent pending for SWAT so I'm fine. [17:55:13] (03PS2) 10Alex Monk: Consolidate all of the simple wikimedia.org VHosts into two [puppet] - 10https://gerrit.wikimedia.org/r/322425 [17:55:45] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:58:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Shell access to californium for bd808 - https://phabricator.wikimedia.org/T151424#2818395 (10Volans) @bd808 I guess that your existing access predates the policy of signing L3 (Acknowledgement of Wikimedia Server Access Responsibilities). Could you ple... [17:58:31] !log demon@tin Synchronized php-1.29.0-wmf.3/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: (no message) (duration: 00m 53s) [17:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:58] Can we remove mw2092.codfw.wmnet from the scap target list too? It's already depooled from lvs but that read-only filesystem makes scap barf every time. [18:01:15] ostriches: sure, let check it was already depooled [18:03:20] _joe_: depool is not enough for removal from scap targets? 
[18:03:41] <_joe_> volans: pooled=inactive [18:04:04] (03CR) 10Krinkle: Consolidate all of the simple wikimedia.org VHosts into two (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk) [18:04:09] right, I always mix them up, I didn't touched because was already depooled, let me fix it [18:04:27] !log volans@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2092.codfw.wmnet [18:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:45] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:05:02] volans: ty! [18:05:05] 06Operations, 06Performance-Team, 10Thumbor: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2817720 (10ori) OK, I was able to reproduce this with the following minimal case: ```lang=python from wand import image from wand.api import library img = image.Image() with open('Hafn... [18:05:47] ostriches: I've also run puppet on tin, so should be fixed now ;) sorry about that [18:05:56] No worries. Thanks so much :) [18:06:42] 06Operations, 10ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2818443 (10Volans) Set `set/pooled=inactive` to remove it from scap targets too [18:06:59] (03PS3) 10Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [18:07:46] (03PS4) 10Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [18:22:26] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Shell access to californium for bd808 - https://phabricator.wikimedia.org/T151424#2818471 (10bd808) >>! 
In T151424#2818395, @Volans wrote: > @bd808 I guess that your existing access predates the policy of signing L3 (Acknowledgement of Wikimedia Server... [18:22:29] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Shell access to californium for bd808 - https://phabricator.wikimedia.org/T151424#2818472 (10kaldari) +1 [18:24:15] 06Operations, 07Availability, 13Patch-For-Review, 07Performance, 15User-Joe: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2818475 (10Legoktm) > @aaron do you think we can do something to speed up the processing of a specific wiki's queue? I cannot find ind... [18:26:49] (03PS3) 10Alex Monk: Consolidate all of the simple wikimedia.org VHosts into two [puppet] - 10https://gerrit.wikimedia.org/r/322425 [18:33:43] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:41:09] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2818492 (10Gilles) a:03Gilles [18:41:40] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Traffic, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2703298 (10Gilles) [18:43:05] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699497 (10akosiaris) According to SoS, 5.3.0 iOS app has been shipped last week, so we should start seeing traffic for 0px requests dropping [18:45:07] (03PS1) 10Mobrovac: Trending Edits: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) [18:56:33] !log Shutting down db2034 for maintenance - T149553 [18:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:44] T149553: 
db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [18:57:16] (03PS1) 10Mobrovac: PDF Render: Create the service's admin group [puppet] - 10https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) [19:00:03] (03CR) 10Ppchelko: Trending Edits: Role and module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [19:00:08] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2818586 (10jcrespo) 05Open>03Resolved a:05jcrespo>03Cmjohnson The servers are working with... [19:01:26] (03PS2) 10Mobrovac: Trending Edits: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) [19:01:54] (03CR) 10Mobrovac: Trending Edits: Role and module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [19:04:04] (03CR) 10Ppchelko: [C: 031] Trending Edits: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/323194 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [19:14:45] (03CR) 10Gergő Tisza: "I don't know what the history is behind it, but it's the exact opposite: the 'exception' channel is the one that is only sent to fluorine," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [19:23:08] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2818653 (10AndyRussG) >>! In T149873#2767345, @aaron wrote: > Another idea is to add a cache-busting pa... 
[19:24:51] !log swift eqiad-prod: ms-be1027 to weight 2000 T136631 [19:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:02] T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 [19:25:32] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Shell access to californium for bd808 - https://phabricator.wikimedia.org/T151424#2818661 (10Volans) The 3 days waiting period will end Sat, Nov 26, 00:55 UTC [19:25:38] bblack: ema: Hi! Got a sec to chat about T151418 and T151419? There's a patch merged but undeployed for the former, hopeing for some feedback about safety thereof :) Thx in advance!! [19:25:38] T151419: Spike: CentralNotice: Is a Varnish banner/campaign quick flush switch feasible? - https://phabricator.wikimedia.org/T151419 [19:25:39] T151418: CentralNotice: If possible, reduce Varnish cache time for SpecialBannerLoader errors - https://phabricator.wikimedia.org/T151418 [19:29:59] (03CR) 10MaxSem: [C: 031] Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [19:34:32] (03CR) 10Filippo Giunchedi: [C: 032] enable instance restbase2012-a.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/323159 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [19:35:49] urandom: ^ I'll do the needful [19:45:33] urandom: seeing some WARN org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=b75f1470-f902-11e4-9236-9fbfa298c4b0 [19:45:51] urandom: in the system-a log, not remember seeing that before though [19:50:50] 06Operations, 10Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2818683 (10fgiunchedi) @bblack we'll need to translate the "dbnames" in `$private_wikis` to actual names used in urls, I don't think there can be a correspondence in the path alone, the hostname... 
[19:51:08] godog: yeah, that's 'normal' [19:51:39] ah ok, I've restarted cassandra [19:51:45] cassandra-a that is [19:51:55] did that work? [19:52:07] you probably have to delete the contents of /srv/cassandra-a/* [19:52:29] yeah did that and restarted, looks like it is bootstrapping again [19:52:41] godog: if it starts before scap has deployed the twcs jar, then it can't suss out the schema [19:52:58] this is a part of the first-instance chicken-egg problem [19:53:38] godog: i think it's still stuck; let me bump it [19:53:55] ok thanks! [19:54:16] puppet ran successfully tho, mh [19:54:39] nope, still won't go... can't find the jar [19:55:03] Caused by: java.lang.ClassNotFoundException: com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy [19:57:29] $ file /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar [19:57:32] /srv/deployment/cassandra/twcs/lib/cassandra-v2.2/TimeWindowCompactionStrategy-2.2.5.jar: ASCII text [19:57:34] ah, indeed looks like a case where the jar hasn't been hydrated by git-fat [19:59:27] grr [20:00:18] (03PS2) 10Smalyshev: Limit concurrent connections by client IP [puppet] - 10https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488) [20:00:39] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
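The `file` output above (a `.jar` reported as ASCII text) is the symptom of a jar that git-fat has not hydrated yet. A minimal sketch of how that check could be automated, assuming only that a real jar is a zip archive and a placeholder is not; the function names are hypothetical and not part of scap or git-fat:

```python
import zipfile


def looks_hydrated(jar_path):
    """Return True when jar_path is a real zip archive.

    A jar is a zip file, so a plain-text placeholder left behind by an
    unhydrated checkout fails this check. Missing files also return False.
    """
    return zipfile.is_zipfile(jar_path)


def unhydrated_jars(jar_paths):
    """Return the subset of jar_paths that are not valid zip archives."""
    return [path for path in jar_paths if not looks_hydrated(path)]
```

Running `unhydrated_jars` over the deployed jar directory before starting the service would flag the ClassNotFoundException case early, instead of discovering it in the Cassandra log.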
[20:00:52] (03PS1) 10Yuvipanda: tools: Fix typo in maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/323199 [20:01:09] PROBLEM - Restbase root url on restbase2012 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused [20:01:18] urandom: "fixed" it manually with git-fat checkout [20:01:19] PROBLEM - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused [20:01:39] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:01:42] godog: auh, cool [20:01:49] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:02:15] ACKNOWLEDGEMENT - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Filippo Giunchedi bootstrapping [20:02:15] ACKNOWLEDGEMENT - Restbase root url on restbase2012 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused Filippo Giunchedi bootstrapping [20:02:15] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused Filippo Giunchedi bootstrapping [20:02:15] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Filippo Giunchedi bootstrapping [20:02:15] ACKNOWLEDGEMENT - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Filippo Giunchedi bootstrapping [20:02:16] ACKNOWLEDGEMENT - restbase endpoints health on restbase2012 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.67, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection 
refused))) Filippo Giunchedi bootstrapping [20:02:39] RECOVERY - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-a valid until 2017-11-17 00:54:31 +0000 (expires in 358 days) [20:02:39] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [20:02:49] RECOVERY - cassandra-a service on restbase2012 is OK: OK - cassandra-a is active [20:03:30] (03CR) 10Yuvipanda: [C: 032] tools: Fix typo in maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/323199 (owner: 10Yuvipanda) [20:05:49] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:05:55] (03CR) 10Nuria: "The stats group currently doesn't hold any "umbrella" users, rather login shell users. Do we want to add this one? Not sure, @ottomata to" [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [20:06:49] RECOVERY - cassandra-a service on restbase2012 is OK: OK - cassandra-a is active [20:07:03] godog: now we have: java.lang.RuntimeException: Unable to gossip with any seeds [20:07:17] which means "i can't talk to anyone!" [20:08:39] odd, checking the firewall elsewhere [20:08:59] i can ping the few i've tried [20:09:39] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:11:39] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:11:50] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:12:39] RECOVERY - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-a valid until 2017-11-17 00:54:31 +0000 (expires in 358 days) [20:12:39] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [20:12:49] RECOVERY - cassandra-a service on restbase2012 is OK: OK - cassandra-a is active [20:13:19] godog: i locally hacked the list of seeds down to one host (1007-a), and it still failed :/ [20:16:39] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:16:39] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:16:49] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:18:01] urandom: ack, my current suspect is still the firewall, checking [20:20:21] i don't understand the code where this is excepting [20:20:40] i don't understand how it could ever work [20:20:49] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:21:49] RECOVERY - cassandra-a service on restbase2012 is OK: OK - cassandra-a is active [20:22:38] (03PS1) 10Odder: Add localized logo for Gujarati Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323236 (https://phabricator.wikimedia.org/T121853) [20:23:44] urandom: I've started cassandra-a back again but it failed, anyways the firewall looks good to me, I've masked the instance too [20:24:01] (03CR) 10Odder: "I ran optipng -o7 on all files before commiting this patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/323236 (https://phabricator.wikimedia.org/T121853) (owner: 10Odder) [20:24:49] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:27:59] 06Operations, 06Labs, 13Patch-For-Review: Setting up grafana should also setup Anonymous read-only access for the default org - https://phabricator.wikimedia.org/T143556#2818812 (10fgiunchedi) 05Open>03stalled [20:28:39] RECOVERY - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-a valid until 2017-11-17 00:54:31 +0000 (expires in 358 days) [20:28:39] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [20:28:49] RECOVERY - cassandra-a service on restbase2012 is OK: OK - cassandra-a is active [20:28:59] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:09] 06Operations, 06Labs, 13Patch-For-Review: Setting up grafana should also setup Anonymous read-only access for the default org - https://phabricator.wikimedia.org/T143556#2571642 (10fgiunchedi) [20:29:22] godog: seems to be moving now [20:29:24] AndyRussG: re T151418 and https://gerrit.wikimedia.org/r/#/c/322976 - what's the response code on the errors you're dealing with here? I would've assumed 5xx, which we don't cache. [20:29:25] T151418: CentralNotice: If possible, reduce Varnish cache time for SpecialBannerLoader errors - https://phabricator.wikimedia.org/T151418 [20:30:04] godog: that's basically a timeout (of 30s) performing a shadowround of gossip (for it to get all of the contact points) [20:30:15] and it took 30s each time to throw [20:30:19] bblack: no, it's 200. 
They're caught exceptions that we'd like to know about, but would like to have things transparently continue to work for the user :) [20:30:33] * urandom sigh [20:31:18] godog: so i was going to say that I live hacked that timeout to 60s and it was progressing, but that oddly enough it did not take 30s to do so (i.e. not, apparently, because of the increased timeout) [20:31:33] bblack: the main concern is a long-running CN bug related to MessageCache. The effect is that banners sometimes can't be retrieved by CN for 1-2 minutes following the banner's creation [20:31:47] but now it died because it thinks restbase2009-c.codfw.wmnet is down [20:32:06] godog: the old -Dcassandra.consistent.rangemovement=false warning [20:32:25] AndyRussG: when the exception is caught, there's no decent banner output for the user anyways right? [20:32:41] (There is a patch to potentially fix the probably root cause, we're discussing it later today, but we may not want to deploy it during the first week of the big fundraiser, i.e., next week. This patch is to make the caching times for those bad responses less.) [20:32:46] is it a 200 with no content? or? I would still think a 5xx is appropriate if it's an error [20:33:06] bblack: it's a 200 with content, that calls an error handler callback in the CN client code [20:33:25] Discussing this yesterday, I think we don't want to just not cache at all, because this is potentially zillions of requests per second [20:33:34] ok [20:33:49] Imagine starting Tuesday, banner loading on every pageview in most English-speaking countries.... [20:33:53] what's the intended (or current?) normal TTL and reduce TTL? [20:34:12] Normal TTL for anons is 10 minutes, and reduced TTL is 2 minutes. 
Modifiable via config settings [20:34:15] urandom: *sigh* [20:34:47] Eventually I think we might well want to get logged-ins all cached up as well, but that's another issue [20:34:54] (FR banners only go out to anons) [20:35:20] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2818833 (10fgiunchedi) Sort-of related, see also {T151444} for hot-linked urls that result in 404s with significant rate-per-second [20:36:09] So yeah, because config settings, I think this is pretty safe, i.e., if somehow the 2 minutes is too short in the case of a persistent error and we're overloading stuff, we can send up a config change to make it longer, and I think the effect is immediate, no? [20:36:25] relatively-immediate, yes [20:36:31] godog: feels like this has to be somehow net related [20:36:32] yeah [20:37:12] godog: though i cannot see how/why [20:37:29] I can give you a conceptual "+1 this seems like a fair bandaid for now until we fix everything related in a better way", but I have no business reviewing the actual PHP code :) [20:37:58] seems like it was already reviewed/merged though, just not deployed? [20:38:25] bblack: that sounds great! Yeah we'd just like to get a sense of how safe the intended functionality is from an cluster resources perspective [20:38:33] yeah it was reviewed and merged but not deployed [20:39:13] If it looks good it'd be nice to get it out by Tuesday [20:39:18] Or Monday, rather [20:39:40] (03CR) 10BBlack: [C: 031] "I'd say let's keep this stage and not merge yet. 
Maybe try to make some tech contact and/or wait for the problem to get worse before pull" [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema) [20:40:24] AndyRussG: sounds reasonable [20:40:41] bblack: K fantasmic, thx much :) [20:41:31] 06Operations, 06Performance-Team, 10Traffic, 07Regression: Investigate major HTTP 500 spike since 2016-09-23 - https://phabricator.wikimedia.org/T151078#2818855 (10Krinkle) 05Open>03Resolved a:03Krinkle Looks like that was it. It's coming back down now: {F4828618} Might take a while to return fully... [20:41:32] bblack: the other one, T151419, is basically to find out if there is such a think as a quick-kill-switch in case we have a loosing banner or some other mishap during our high-donation-volume times [20:41:33] T151419: Spike: CentralNotice: Is a Varnish banner/campaign quick flush switch feasible? - https://phabricator.wikimedia.org/T151419 [20:41:57] (it can get up to a pretty high dollar value per minute at times) [20:45:38] bblack: I have to be afk for about 45 minutes, but I'll be back on later, and will see any backscroll... thx in advance for any comments on that one, also pls lmk if u have questions!! :) [20:45:41] (03CR) 10MaxSem: "Well, that's what he proposed in https://phabricator.wikimedia.org/T151063#2811510" [puppet] - 10https://gerrit.wikimedia.org/r/323133 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [20:45:46] thx again! [20:46:19] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:48:30] !log demon@tin Started scap: pruning old deployment branches [20:48:40] urandom: odd indeed, so essentially communications are slower than expected (?) 
[20:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:49] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:51:27] godog: i dunno [20:52:49] 06Operations, 10ops-codfw: RAID degraded on ms-be2011 - https://phabricator.wikimedia.org/T149234#2818919 (10fgiunchedi) 05Open>03Resolved Rebuilding [20:57:59] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:59:05] AndyRussG|bassoo: Special:Banner is currently exempted from the varnish code that normally re-sets the user-facing Cache-Control to be 0s/no-cache for wiki pages. That exemption is old, I'm pretty sure it pre-dates me. But in any case that means your 10 minute TTLs are not just for varnish, they're also in browser caches. [20:59:51] AndyRussG|bassoo: unless we change that, anything we do in varnish to invalidate a banner is only going to affect fresh UAs, not ones that have already cached it and will keep it another few minutes. [21:02:08] AndyRussG|bassoo: that issue aside, we should be able to rig up something using mwScript to purge the banner-related URLs on demand when warranted, for now. 
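The caching discussion above can be sketched roughly as follows. The 10 minute and 2 minute values come from the conversation; the header layout and function names are assumptions for illustration, not CentralNotice's actual code. The second function illustrates bblack's point: a browser that already cached a banner keeps it until its own TTL runs out, whatever is purged from Varnish.

```python
NORMAL_TTL_SECONDS = 10 * 60  # normal TTL for anons, per the discussion
ERROR_TTL_SECONDS = 2 * 60    # reduced TTL for caught-error responses


def banner_cache_control(is_error_response,
                         normal_ttl=NORMAL_TTL_SECONDS,
                         error_ttl=ERROR_TTL_SECONDS):
    """Build a Cache-Control value; error responses get the shorter TTL.

    Assumption: the same TTL is sent to shared caches (s-maxage) and to
    browsers (max-age), since the exemption described above means the
    header reaches the user agent unchanged.
    """
    ttl = error_ttl if is_error_response else normal_ttl
    return "public, s-maxage={0}, max-age={0}".format(ttl)


def browser_seconds_remaining(cached_at, now, ttl):
    """Seconds a browser may keep serving its cached copy.

    A Varnish purge does not shorten this; only fresh user agents see
    the purged object immediately.
    """
    return max(0, cached_at + ttl - now)
```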
[21:07:44] !log demon@tin Finished scap: pruning old deployment branches (duration: 19m 14s) [21:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:26] 06Operations, 06Performance-Team, 05codfw-rollout: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2818971 (10Krinkle) [21:10:28] 06Operations, 06Performance-Team, 05codfw-rollout: Ensure maintainers of long-running scripts on terbium expect downtime for switchover - https://phabricator.wikimedia.org/T129258#2818968 (10Krinkle) 05Open>03Resolved a:03Krinkle (Old) [21:11:48] 06Operations, 10MediaWiki-General-or-Unknown, 10Monitoring, 06Performance-Team: edit.success in graphite never reached zero during codfw switchover - https://phabricator.wikimedia.org/T133177#2818974 (10Krinkle) 05Open>03declined Seems to be fine. We may wanna look at this next time we do a switch-over... [21:13:34] wow, does https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%D9%85%D8%B4%D8%A7%D8%B1%DA%A9%D8%AA%E2%80%8C%D9%87%D8%A7/Dexbot&uselang=en&namespace=8 take an excessively long time to load [21:14:50] guess there's no index on rev_namespace [21:14:59] * bawolff thought there was for some reason [21:17:39] 06Operations, 07Tracking: Silver anomalies - https://phabricator.wikimedia.org/T151486#2818993 (10Volans) [21:17:51] rev_namespace exists? [21:18:49] I see 0 entries there, btw [21:18:58] you meant rev_timestamp? 
[21:19:11] sure there is an index [21:19:15] CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp); [21:19:24] plus CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp); [21:26:08] 06Operations, 07Tracking: silver: /dev/md2 mounted twice - https://phabricator.wikimedia.org/T151489#2819044 (10Volans) [21:27:58] 06Operations, 10DBA, 07Tracking: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2819072 (10Volans) [21:28:39] Platonides: There is 0 entries, but it took like 5 minutes to tell me that [21:30:02] oh I see what you mean, there couldn't possibly be an index, since that's not even a column [21:30:40] It took me like 2-3 seconds. [21:30:45] (which is still kinda shitty) [21:32:34] it was quick for me [21:32:38] 06Operations, 07Tracking: silver: / partition low on space - https://phabricator.wikimedia.org/T151493#2819106 (10Volans) [21:32:47] probably because you warmed the index, bawolff [21:33:31] I guess this has always been slow, and I just never search for edits to a specific namespace that the user has never edited for users with lots of edits [21:34:44] yes, that would be slow if he has many edits but few in that ns [21:35:09] it will be using (rev_user_text,rev_timestamp), joining with page then looking at page_namespace [21:35:37] you'd want a rev_user_text,rev_namespace,rev_timestamp index [21:35:39] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:35:51] but that would require duplicating a namespace field for every revision [21:35:55] Its too bad mysql can't do things like have indexes that go across multiple tables [21:36:08] Seems like that would solve a lot of problems [21:36:14] (03CR) 10Dzahn: "where was it lowered? is there something to link to with?" [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [21:36:21] do other dbs have that? 
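A small sketch of the query shape being discussed, using an in-memory SQLite database. The schema is heavily simplified and the index name, user name, and helper functions are hypothetical; it only illustrates why a user with many edits and none in the requested namespace forces a walk over all of that user's revisions before returning an empty result.

```python
import sqlite3


def build_demo_db(n_revisions=1000):
    """Create a tiny page/revision schema with one prolific user in namespace 0."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER NOT NULL
        );
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER NOT NULL,
            rev_user_text TEXT NOT NULL,
            rev_timestamp INTEGER NOT NULL
        );
        -- The index the conversation says the query ends up using.
        CREATE INDEX usertext_timestamp
            ON revision (rev_user_text, rev_timestamp);
    """)
    conn.execute("INSERT INTO page (page_id, page_namespace) VALUES (1, 0)")
    conn.executemany(
        "INSERT INTO revision (rev_page, rev_user_text, rev_timestamp) "
        "VALUES (1, 'ExampleBot', ?)",
        [(ts,) for ts in range(n_revisions)],
    )
    return conn


def contributions_in_namespace(conn, user, namespace, limit=50):
    """Newest revisions by user in a namespace.

    The namespace lives on page, not revision, so every candidate revision
    must be joined to page before it can be filtered.
    """
    rows = conn.execute(
        "SELECT rev_id FROM revision "
        "JOIN page ON page_id = rev_page "
        "WHERE rev_user_text = ? AND page_namespace = ? "
        "ORDER BY rev_timestamp DESC LIMIT ?",
        (user, namespace, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

With matching rows the LIMIT lets the scan stop early; with none, nothing stops it short of the user's full edit history, which is the slow case described here.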
[21:36:31] I have no idea. Maybe postgress does [21:36:55] hehe, I also thought in postgres as a possibility [21:44:36] (03CR) 10Chad: "I0d2fafc72cc68a3cddc78a7a6d11710bdedccc1d" [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [21:46:07] bawolff: If only MW had acceptable PG support ;-) [21:50:31] (03CR) 10Krinkle: [C: 031] Consolidate all of the simple wikimedia.org VHosts into two [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk) [21:51:29] (03CR) 10Dzahn: "thanks! hmm, reading both changes, the linked one and this partial revert but neither really say why we made these changes. did it slow do" [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [21:54:29] (03CR) 10Chad: "Wasn't a performance issue nah. And yeah, we've started hitting some issues with 1 or 2 repos at the lower limit. I'll follow up on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [21:55:49] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Traffic, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2703298 (10Tgr) See T88412 for similar issues in the past. [21:56:40] (03PS2) 10Dzahn: Gerrit: Raise global limit for objects back to 100m [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [21:57:37] (03CR) 10Dzahn: [C: 032] Gerrit: Raise global limit for objects back to 100m [puppet] - 10https://gerrit.wikimedia.org/r/323179 (owner: 10Chad) [21:59:04] !log gerrit restarting for config change 323179 [21:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:57] done [22:02:15] mutante let me know if grrrit-wm fails to restart its feed after the restart of gerrit (ping me) [22:02:33] Zppix: ok! 
let's see [22:02:42] it should but incase know im here [22:02:52] (03PS1) 10Chad: Docroot cleanup: Remove old unused blank.gif from HTTPS tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 [22:02:54] (03CR) 10Dzahn: "moo" [puppet] - 10https://gerrit.wikimedia.org/r/322907 (owner: 10Dzahn) [22:02:57] Seems to work :) [22:03:04] yep, looks good [22:03:16] It should reconnect on it's on [22:03:17] own [22:03:18] :) [22:03:19] yay [22:03:34] paladox however it is technology isnt always reliable (just look at phab tasks) [22:03:39] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [22:03:46] yep, and it does not disconnect from IRC either, nice [22:03:49] Zppix what phab task? [22:04:15] paladox i meant tasks in general [22:04:28] i was making a point xD [22:04:29] ok [22:50:46] Heya - this image seems to be having issues... https://commons.wikimedia.org/wiki/File:Ansett_Airways%27_flying_boat_Beachcomber_on_Sydney_Harbour_(23495124061).jpg [22:51:43] what's wrong with it? [22:51:52] oh File not found: /v1/AUTH_mw/wikipedia-commons-local-public.82/8/82/Ansett_Airways%27_flying_boat_Beachcomber_on_Sydney_Harbour_%2823495124061%29.jpg [22:51:56] yeah, sorry [22:52:00] the file doesn't seem to exist [22:52:02] Platonides: just been deleeted anyway [22:52:05] oh [22:52:07] lol [22:52:30] !log cleanup older labs instances metrics from 'instances' hierarchy on graphite2001 [22:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:54] it was restored and still problem Platonides [22:54:02] Oh, now it's back! [22:54:03] broken rather [22:54:15] disregard [22:54:30] (i start saying stuff and it fixes itself... your welcome xD) [22:54:47] it still looks broken for me [22:54:59] Oh, I mean back as in not deleted. [22:55:03] It's still not working. 
:( [22:55:16] Deleting and undeleting wouldn't have fixed the problem [22:55:44] The original is missing for some reason; probably wants a ticket filing for someone swift savvy to look at [22:56:30] I deleted and undeleted just on the off chance it had been previously deleted and something had gone awry during the deletion [22:57:08] (03CR) 10BryanDavis: "> I don't know what the history is behind it, but it's the exact" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [22:57:40] Based on there being a small thumbnail in the file history, it had been successfully uploaded [22:57:55] !log phab2001 - installing vim upgrade [22:58:06] I don't see the small thumbnail [22:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:11] although I see the big one :P [22:58:28] Platonides: It was there before the delete [22:58:43] Delete would've purged it [23:00:06] agree it's successfully uploaded - duplicate detection is working on the still active file and preventing me from uploading a duplicate directly from Flickr [23:03:50] duplicate now at https://commons.wikimedia.org/wiki/File:Ansett_Airways%27_flying_boat_Beachcomber_on_Sydney_Harbour.jpeg and all is normal with the file, thumb and full size file. [23:04:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [23:05:24] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/#/c/313903/ ?" [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [23:05:39] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:05:48] (03PS3) 10Dzahn: Remove beta::deployaccess [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [23:05:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3747281 keys, up 23 days 14 hours - replication_delay is 0 [23:06:35] (03CR) 10Dzahn: "expected the rebase to make it disappear.. the other change is already merged.." [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [23:10:25] (03PS2) 10Gergő Tisza: Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) [23:10:27] (03PS1) 10Gergő Tisza: Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) [23:11:18] !log cleanup older labs instances metrics from 'instances' hierarchy on graphite1001 [23:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:04] (03PS2) 10Krinkle: Docroot cleanup: Remove old unused blank.gif from HTTPS tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323323 (owner: 10Chad) [23:14:53] (03PS1) 10BryanDavis: logstash: Move files from root to role module [puppet] - 10https://gerrit.wikimedia.org/r/323332 [23:14:55] (03PS1) 10BryanDavis: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 [23:22:37] (03CR) 10Krinkle: [C: 031] Do not send 'exception-json' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323330 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [23:23:05] (03CR) 10Krinkle: [C: 031] Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 
(https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [23:32:51] 06Operations: Silver anomalies - https://phabricator.wikimedia.org/T151486#2819482 (10Aklapper) [ Not a #tracking task per definition; removing tag ] [23:33:39] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [23:37:18] (03CR) 10Dzahn: "did this get reverted? file is still there?" [puppet] - 10https://gerrit.wikimedia.org/r/313903 (https://phabricator.wikimedia.org/T121721) (owner: 10Andrew Bogott) [23:37:48] (03CR) 10Dzahn: [C: 032] "that other one must have been reverted" [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [23:46:33] (03PS7) 10Paladox: Phabricator: Allow setting the mysql.user and mysql.pass in labs [puppet] - 10https://gerrit.wikimedia.org/r/323146 (https://phabricator.wikimedia.org/T139475) [23:49:44] (03CR) 10Dzahn: "Paladox: ping" [puppet] - 10https://gerrit.wikimedia.org/r/301849 (owner: 10Paladox) [23:50:38] (03Abandoned) 10Paladox: Rely on commits name instead of branch [puppet] - 10https://gerrit.wikimedia.org/r/301849 (owner: 10Paladox) [23:50:59] mutante could you review https://gerrit.wikimedia.org/r/323146 please? [23:51:25] (03PS2) 10Dzahn: Labs: Shinken alert for beta error rate [puppet] - 10https://gerrit.wikimedia.org/r/304263 (https://phabricator.wikimedia.org/T141785) (owner: 10Thcipriani) [23:52:16] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2819556 (10Florian) Cool :) Mostly Mentors Split the work in a was like "do at least (number of things) of the repetitive work to fulfill... [23:52:56] paladox: can you look at the old one from August? [23:53:08] mutante done, i've abandoned it [23:53:23] it requires a change in gerrit for what i want to work. 
[23:54:38] paladox: oh!, ok thanks [23:54:48] Your welcome :) [23:55:09] (03PS1) 10Filippo Giunchedi: graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) [23:55:51] paladox: re: mysql config in hiera. are the values set in hiera or not currently? [23:56:02] mutante nope [23:56:06] i set it as undef [23:56:15] so if it is undef it will use the default [23:56:44] for example it does [23:56:44] hiera('phabricator_app_user', undef) [23:56:57] (03CR) 10Dzahn: [C: 032] Labs: Shinken alert for beta error rate [puppet] - 10https://gerrit.wikimedia.org/r/304263 (https://phabricator.wikimedia.org/T141785) (owner: 10Thcipriani) [23:56:57] notice undef in there, that is the default [23:57:09] $phab_app_user = hiera('phabricator_app_user', undef) [23:57:21] then i do [23:57:40] if $phab_app_user == undef { $app_user = $passwords::mysql::phabricator::app_user } else { $app_user = $phab_app_user } [23:57:41] yes, i noticed that. that's why i asked [23:57:45] ok [23:59:04] mutante would you be able to run puppet compiler on that patch please?
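[editor's note] The fallback described at 23:56-23:57 (look the key up in hiera with undef as the default, then fall back to the value from the passwords class when nothing is set) can be sketched roughly as below. This is a minimal Python illustration of the logic only, not the Puppet code from the patch. The names `hiera_lookup`, `resolve_app_user` and `CLASS_DEFAULT_APP_USER`, along with the sample values, are hypothetical stand-ins.

```python
from typing import Mapping, Optional

# Hypothetical stand-in for $passwords::mysql::phabricator::app_user.
CLASS_DEFAULT_APP_USER = "class_default_user"


def hiera_lookup(data: Mapping[str, str], key: str,
                 default: Optional[str] = None) -> Optional[str]:
    """Return data[key] if present, otherwise the supplied default.

    Mirrors the shape of hiera('phabricator_app_user', undef), with
    None playing the role of undef.
    """
    return data.get(key, default)


def resolve_app_user(hiera_data: Mapping[str, str],
                     class_default: str = CLASS_DEFAULT_APP_USER) -> str:
    """Pick the hiera override when one is set, else the class default."""
    phab_app_user = hiera_lookup(hiera_data, "phabricator_app_user", None)
    if phab_app_user is None:
        # Nothing set in hiera: use the value from the passwords class.
        return class_default
    return phab_app_user


if __name__ == "__main__":
    print(resolve_app_user({}))
    print(resolve_app_user({"phabricator_app_user": "labs_user"}))
```

With an empty lookup table the class default is returned. With the key set (as it might be in a labs project), the override wins.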