[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T0000). Please do the needful.
[00:00:37] jouncebot: I'm still waiting for jenkins to finish the train :P
[00:01:49] ugh, long train day, anything need my special angry follow-up on, twentyafterfour ?
[00:01:50] (PS5) Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673)
[00:03:24] greg-g: no it
[00:03:27] it's fine
[00:03:34] k :)
[00:03:40] twentyafterfour ^^ the patch (that may improve things a bit)
[00:03:53] as you should be able to search with numbers with that patch.
[00:04:04] Krenair: after twentyafterfour is done, you want to handle the swat or you wish my help to deploy to focus on the testing part?
[00:04:21] I'll do it
[00:04:23] thanks though Dereckson
[00:04:24] * Dereckson nods.
[00:04:31] welcome
[00:05:23] paladox: that got a -1 from jcrespo, I don't intend to second guess him about mysql
[00:05:56] No, I mean that will improve things, I didn't say go around him. Things may be different now so I rebased the patch.
[00:07:07] Krenair, tsk tsk tsk, cutting in line, aren't we? :))) https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1150867&oldid=1150366
[00:07:42] One intermediate revision by one other user not shown
[00:08:57] paladox: in real life, you measure a real drop of performance if you start to index all the "a" "an" "at" "in", "to" (or in French "le" "la" "du" "au"), as there are a lot of non meaningful words to index
[00:09:33] paladox: it makes sense for asiatic languages to decrease the value to 1, so you can search ideograms and kanjis
[00:09:47] oh
[00:09:58] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[00:10:19] paladox: for other languages, 3 is *really* a good compromise index size vs. minimal number of letters to consider a word has fair chance to be meaningful
[00:10:34] oh
[00:10:45] Dereckson: what about numbers?
[00:10:46] (CR) jenkins-bot: [V: -1] hiera override to skip base icinga for test/decom hosts [puppet] - https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn)
[00:10:48] RECOVERY - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.33 port 9042
[00:11:05] like "error on line 3" might be significant in a phab search
[00:11:15] "error line" doesn't mean much
[00:11:45] line numbers are ephemeral values
[00:12:10] sort of but it still matters for specific phrase searches
[00:12:11] But if they had been more stable, yeah, it would have been useful to be able to search them.
[00:12:27] hello. can i add more stuff to swat? it seems there is one spot left ;)
[00:12:28] (CR) Eevans: [C: 1] "Ready!" [puppet] - https://gerrit.wikimedia.org/r/327260 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:12:33] (PS2) Eevans: enable instance restbase1016-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/327260 (https://phabricator.wikimedia.org/T151086)
[00:12:51] MatmaRex: we didn't start, still the train window, Krenair is going to swat
[00:13:06] I'm doing the last thing for the train right now
[00:14:28] twentyafterfour: by the way, when I see an error thrown at some location described by a line number error in the logs, I usually check the class and method name to report and use class::method in the title, it's more meaningful
[00:15:24] (and these bugs are the ones where I noticed line numbers on the MediaWiki code base are REALLY ephemeral)
[00:17:25] ok SWAT can commence when this scap sync-dir finishes
[00:17:30] Krenair: ^
[00:17:40] k
[00:17:42] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.6/extensions/Echo: deploy 327393 refs T153261 (duration: 00m 47s)
[00:17:47] and that's done
[00:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:58] T153261: Icons in Echo emails broken - https://phabricator.wikimedia.org/T153261
[00:19:24] huh
[00:20:01] Krenair, can I add a late-breaking one that only affects Labs? https://gerrit.wikimedia.org/r/#/c/327377/
[00:20:05] It will be the 8th.
[00:20:12] either you or MatmaRex, but not both
[00:20:20] oh, labs
[00:20:27] I can do that after
[00:20:46] Cool, thanks.
[00:22:46] !log krenair@tin Synchronized php-1.29.0-wmf.6/extensions/VisualEditor/modules/ve-mw/dm/ve.dm.MWWikitextSurfaceFragment.js: https://gerrit.wikimedia.org/r/#/c/327374/ (duration: 00m 39s)
[00:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:49] matt_flaschen, ooh, that has something potentially useful to something that broke last year
[00:23:55] anyway I need to focus
[00:24:31] Cool. I thought it might be generally useful, so I made it an array.
[00:24:39] I'm sick of running scripts and seeing a bunch of non-relevant errors.
[00:26:27] okay well this worked
[00:28:00] !log krenair@tin Synchronized php-1.29.0-wmf.6/extensions/VisualEditor/modules/ve-mw/ui/ve.ui.MWWikitextDataTransferHandlerFactory.js: https://gerrit.wikimedia.org/r/#/c/327390/, https://gerrit.wikimedia.org/r/#/c/327392/ (duration: 00m 39s)
[00:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:19] (i added my patch for SWAT, https://gerrit.wikimedia.org/r/#/c/327400/ in UploadWizard for wmf.6)
[00:30:25] I'm not certain this fixed anything
[00:32:10] maybe my testing is wrong
[00:33:57] oh, maybe
[00:34:10] umm
[00:34:12] maybe one of the problems.
[00:34:46] nope, actually, it's just smarter than I am
[00:35:55] https://github.com/wikimedia/mediawiki/commits/wmf/1.29.0-wmf.5
[00:35:55] wtf
[00:36:13] thcipriani, did we seriously revert back to manual submodule bumps?
[00:36:26] (for .5)
[00:37:08] I saw something about gerrit submodule subscriptions not working anymore?
[00:37:18] ^
[00:37:25] It worked for .6
[00:37:25] Gerrit is still broken
[00:37:28] Just not .5
[00:37:30] evidently gerrit is more strict about urls than it once was (badly repeating what ostriches told me), so for wmf.5 we have to do manual bumps.
[00:37:38] Heck it worked for VE on .6
[00:37:39] wmf.6 worked for me earlier today
[00:37:47] wmf6 should work now
[00:37:54] okay, no way in hell we're going to meet the end of the window
[00:37:55] wmf5 is borked ohwell
[00:38:18] Krenair: sorry for delaying things. Should be fine to continue past the time window
[00:38:42] jenkins was backed up earlier too, not helping things
[00:39:14] the next window is phabricator and I'm not sure I want to continue with phab deployment because it's a bit risky
[00:39:38] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:39:39] (search index changes not tested well enough for me to be 100% comfortable)
[00:46:37] https://gerrit.wikimedia.org/r/327403 is going through jenkins
[00:47:05] every time this breaks and we have to do this again, it's another unnecessary 5-10 minutes on the process
[00:49:21] (CR) Dzahn: [C: 2] enable instance restbase1016-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/327260 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:52:04] yurik: ugh, you want an i18n change in swat?
[00:52:14] ostriches https://gerrit.wikimedia.org/r/#/c/327406/
[00:53:17] Krenair, sigh, yep
[00:54:08] ostriches could you v+2 https://gerrit.wikimedia.org/r/#/c/326947/ please?
[00:54:20] ah, it's James_F who cuts in line
[00:55:28] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /home 41273 MB (3% inode=98%)
[01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T0100). Please do the needful.
[01:00:06] thcipriani ostriches auto submodule updates will not work in mw core without https://gerrit.wikimedia.org/r/#/c/327406/
[01:00:18] ok
[01:00:22] also wmf .5 submodules are fixed in https://gerrit.wikimedia.org/r/#/c/326947/1
[01:01:11] (Restored) Dereckson: (bug 41712) he.wiki images size configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/31580 (owner: Dereckson)
[01:03:28] mutante: ty!
[01:03:28] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 2484 MB (6% inode=83%): /home 41262 MB (3% inode=98%)
[01:03:38] !log krenair@tin Synchronized php-1.29.0-wmf.5/extensions/VisualEditor/modules/ve-mw/dm/ve.dm.MWWikitextSurfaceFragment.js: https://gerrit.wikimedia.org/r/#/c/327375/ (duration: 00m 40s)
[01:03:38] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:40] (CR) Dereckson: [C: -1] "Need to be rebased and one parameter compared with CS." [mediawiki-config] - https://gerrit.wikimedia.org/r/31580 (owner: Dereckson)
[01:04:53] urandom: welcome
[01:05:14] yeah thanks mutante !
[01:05:31] works
[01:06:38] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%)
[01:06:39] !log krenair@tin Synchronized php-1.29.0-wmf.5/extensions/VisualEditor/modules/ve-mw/ui/ve.ui.MWWikitextDataTransferHandlerFactory.js: https://gerrit.wikimedia.org/r/#/c/327389/, https://gerrit.wikimedia.org/r/#/c/327395/ (duration: 00m 39s)
[01:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:04] works
[01:09:05] yurik, doing yours
[01:09:12] Krenair, 327409
[01:09:17] could you include that too
[01:09:20] same ext
[01:09:37] (already cherrypicked)
[01:10:08] yurik, no, no room
[01:10:28] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:10:41] k
[01:10:42] next possible time is European Mid-day SWAT in 12-13 hours
[01:10:50] kk
[01:11:13] after yurik is MatmaRex
[01:15:49] yurik, well scap pull finished on mwdebug1002
[01:16:01] except not quite
[01:16:08] ?
[01:16:23] because it says 'Finished rsync common (duration: 00m 45s)', and then doesn't end
[01:16:55] Krenair, i18n needs a rebuild i think - https://www.mediawiki.org/wiki/Help:Extension:Kartographer
[01:16:55] oh, I see what it's doing
[01:17:22] krenair 30140 0.0 0.5 78652 22884 pts/0 S+ 01:14 0:00 | \_ /usr/bin/python /usr/bin/scap pull
[01:17:22] krenair 30289 0.0 0.0 4336 720 pts/0 S+ 01:15 0:00 | \_ /bin/sh -c sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuil
[01:17:22] root 30290 0.0 0.0 40540 3228 pts/0 S+ 01:15 0:00 | \_ sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild --no-
[01:17:22] mwdeploy 30291 0.2 0.5 296304 23424 pts/0 Sl+ 01:15 0:00 | \_ /usr/bin/python /usr/bin/scap cdb-rebuild --no-pro
[01:17:25] mwdeploy 30294 79.6 1.1 101256 44908 pts/0 R+ 01:15 1:09 | \_ /usr/bin/python /usr/bin/scap cdb-rebuild --no
[01:18:07] ah, so it is rebuilding :)
[01:18:17] yep
[01:18:24] "cdb-rebuild --no" is confusing :)
[01:18:44] okay done yurik
[01:18:57] it goes off the edge of the screen
[01:19:44] Krenair, i think it hasn't rebuilt - take a look at https://www.mediawiki.org/wiki/Help:Extension:Kartographer -- the maps should have links at the bottom, but instead show i18n msg keys
[01:19:47] hm, it's still not got it
[01:19:50] twentyafterfour, around?
[01:20:01] Krenair: here
[01:20:20] twentyafterfour, when you do scap pull, does that handle i18n updates needed?
[01:20:35] I don't think so
[01:20:44] needs separate sync-l10n
[01:20:47] is there any way to test this with the debug machines?
[01:20:56] I'm not sure
[01:21:00] or do we just have to do a full scap and cross our fingers?
[01:21:31] I had to do `scap sync-l10n` earlier
[01:21:44] Operations, Traffic, Wikimedia-Apache-configuration: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2875043 (matmarex)
[01:21:48] on tin or the debug machine?
[01:21:54] on tin
[01:22:02] Operations, Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#2875030 (matmarex)
[01:22:09] had to build it also..took a long time
[01:22:15] what's the issue? /me reads scrollback
[01:22:26] doing a change requiring i18n changes in swat
[01:22:46] want to test it before pushing to user-reachable machines
[01:22:51] hmm if it's running cdb-rebuild then that should do it
[01:23:00] that's what we thought
[01:23:14] !log krenair@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 01m 14s)
[01:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:36] well
[01:23:42] that didn't fix it
[01:24:27] maybe needs `scap l10n-update` as well
[01:24:36] which takes a long time
[01:24:42] :-/
[01:28:11] (PS2) TTO: Removing 'technician' user group from tr.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/326354 (https://phabricator.wikimedia.org/T152911) (owner: MarcoAurelio)
[01:29:02] (PS3) TTO: Removing 'technican' user group from tr.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/326354 (https://phabricator.wikimedia.org/T152911) (owner: MarcoAurelio)
[01:31:19] (CR) Dzahn: [] "Hey Faidon, do you agree the apt.wm.org specific parts should be in the aptrepo role? and that we need nginx in both roles but minus the a" [puppet] - https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[01:31:55] (CR) Dzahn: [] "but i wasn't 100% sure if aptrepo module or aptrepo role class maybe" [puppet] - https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[01:34:36] Operations, Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#2875030 (Betacommand) I would say this is functioning as intended all /wiki urls are in the short url form. Short urls do not handle any URL params. index.php is the corre...
[01:34:44] (PS1) TTO: Enable per-page language choice on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/327413 (https://phabricator.wikimedia.org/T153209)
[01:35:30] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused
[01:35:31] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1834.195392 Seconds
[01:35:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1836.446182 Seconds
[01:35:40] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1839.244417 Seconds
[01:36:30] (PS1) TTO: Enable per-page language choice on wikis with Translate [mediawiki-config] - https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209)
[01:37:06] (CR) TTO: [C: -1] "Wait until I4a12c4f83ca8cb4ca5fc3f97e5fd0f3edc7f3721 has been merged and tested." [mediawiki-config] - https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209) (owner: TTO)
[01:37:30] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.644418 Seconds
[01:37:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:37:40] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 1.873203 Seconds
[01:38:30] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:39:25] ok
[01:39:35] that did the trick?
[01:40:11] nope
[01:40:36] in fact, we may have a problem
[01:40:59] Dec 15 01:40:39 mw1177: #012Warning: Invalid parameter for message "timedmedia-in-job-queue": a:1:{i:0;C:7:"Message":227:{a:8:{s:9:"interface";b:1;s:8:"language";b:0;s:3:"key";s:18:"timedmedia-seconds";s:9:"keysToTry";a:1:{i:0;s:18:"timedmedia-seconds";}s:10:"parameters";a:1:{i:0;i:0;}s:6:"format";s:5:"parse";s:11:"useDatabase";b:1
[01:40:59] ;s:5:"title";N;}}} in /srv/mediawiki/php-1.29.0-wmf.6/includes/Message.php on line 1158
[01:41:19] lots of that getting spammed in hhvm.log
[01:43:16] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused eevans Bootstrapping
[01:43:19] not sure if it's related
[01:43:39] Operations, Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#2875091 (Writ_Keeper) Fair enough, I suppose--I was surprised that redirect=no worked with the short form at all--but then, does that make the fact that it *does* work a b...
[01:45:23] possibly caused by https://gerrit.wikimedia.org/r/#/c/321456/2
[01:45:53] Krenair: I merged the patch to fix that
[01:46:14] Krenair: https://gerrit.wikimedia.org/r/#/c/325790/
[01:46:17] (Abandoned) TTO: Enable per-page language choice on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/327413 (https://phabricator.wikimedia.org/T153209) (owner: TTO)
[01:47:05] (CR) TTO: [] "No need for that; this has already been functional on testwiki for some months." [mediawiki-config] - https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209) (owner: TTO)
[01:47:27] yurik, how many people are going to see if this commit goes to normal users with broken l10n?
[01:48:13] like a lot?
[01:48:19] all of hewiki and cawiki
[01:48:30] * yurik checks wikivoyage
[01:48:39] Krenair, it's not that big yet
[01:49:12] it's at the bottom of the map, very few people would notice i think
[01:49:44] !log krenair@tin Started scap: https://gerrit.wikimedia.org/r/#/c/327274/
[01:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:53:12] this just went past in exception.log: Cannot access the database: Can't connect to MySQL server on '10.64.16.144' (4) (10.64.16.144) {"exception_id":"92c5c8ecaaff3a408dd9c9e3"}
[01:53:27] that's db1049, s5-master
[01:53:52] thing is, it was running in the context of enwii
[01:53:53] enwiki
[01:54:28] ah, the trace includes wikidata stuff, and wikidatawiki is on s5
[01:59:35] !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap"; owner is "krenair"; reason is "https://gerrit.wikimedia.org/r/#/c/327274/" (duration: 00m 00s)
[01:59:49] ostriches, what are you up to
[01:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:00] Stuff
[02:00:01] :)
[02:00:26] I was pruning old cdb cache files
[02:01:50] Testing stufffff
[02:03:03] Operations, Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#2875030 (PrimeHunter) >>! In T153275#2875073, @Betacommand wrote: > Short urls do not handle any URL params. It nearly always works fine, e.g.: https://en.wikipedia.org/w...
[02:03:35] Operations, ops-eqiad, netops: asw-a2-eqiad PEM 0 not powered - https://phabricator.wikimedia.org/T153273#2874808 (Cmjohnson) A fuse blew on the PDU...xy phase 2 is down. I am stopping at the store tomorrow to get a SC20 fuse to replace it.
[02:04:50] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:04:50] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:06:31] (PS1) Chad: scap clean: provide l10n-only option for pruning stuff [mediawiki-config] - https://gerrit.wikimedia.org/r/327421
[02:09:18] Operations, ops-codfw, Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2875128 (fgiunchedi) a:fgiunchedi>Papaul @papaul I tried rebooting prometheus2003 today for a test and since it wasn't coming back I checked the mgmt which also doesn't seem to a...
[02:16:40] !log krenair@tin Finished scap: https://gerrit.wikimedia.org/r/#/c/327274/ (duration: 26m 56s)
[02:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:57] yurik, so
[02:17:01] I think it hasn't worked
[02:17:11] fun
[02:17:50] I don't really have any ideas, haven't looked into the l10n cache before and have no intention of doing it at this time
[02:18:24] Krenair, worry not, will solve it tomorrow morning
[02:18:33] maybe the automated overnight translatewiki update will solve it?
[02:18:40] nah, that's broken at the moment
[02:18:51]
[02:19:10] maybe next week should be dedicated to stabilization :))))))
[02:19:38] so what shall I do with this patch?
[02:21:42] yurik?
[02:22:40] Krenair, what are the options
[02:22:48] we could leave it as is, or revert
[02:22:54] yes
[02:23:04] ok, let's revert, and go home :)
[02:23:16] i will try to solve it in our own hour tomorrow?
[02:23:55] i think it will be safer
[02:25:19] 02:24:45 sync-dir failed: Failed to acquire lock "/var/lock/scap"; owner is "mwdeploy"; reason is "(no message)"
[02:25:20] what.
[02:25:48] oh FFS it's the broken l10nupdate cron
[02:26:13] Operations, Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#2875030 (Unready) >>! In T153275#2875117, @PrimeHunter wrote: > It nearly always works fine, e.g.: > https://en.wikipedia.org/wiki/Example?action=history > https://en.wiki...
[02:27:53] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.5) (duration: 06m 32s)
[02:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:19] wait it actually succeeded today? pff
[02:28:35] I guess the scap fix got packaged
[02:29:02] Krenair, have you reverted yet or not?
[02:29:16] the revert is deploying now
[02:29:27] !log krenair@tin Synchronized php-1.29.0-wmf.6/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/327422/ (duration: 01m 01s)
[02:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:53] MatmaRex, still around?
[02:30:08] idle 01:22
[02:30:29] yup, scap 3.4.2-1 went out today. Had the sync-l10n fix.
[02:31:24] Krenair, thanks for persevering! :) I guess we should try tomorrow morning. I'll be up at europe swat, and possibly try to figure it out during that time, or schedule an hour right after that
[02:32:11] matt_flaschen, let's do your labs change
[02:33:22] (CR) Alex Monk: [C: 2] Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 (owner: Mattflaschen)
[02:33:58] (Merged) jenkins-bot: Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 (owner: Mattflaschen)
[02:34:57] Thanks
[02:35:20] And thanks for SWATting today.
[02:35:58] yeah, thanks, today was rough
[02:36:35] * yurik gives Krenair a deployment cookie with chocolate
[02:38:21] !log krenair@tin Synchronized dblists: https://gerrit.wikimedia.org/r/327377 - labs only change (duration: 00m 42s)
[02:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:47] Thanks
[02:38:58] Krenair: yeah
[02:39:20] i can verify if you're still willing to deploy it :P
[02:39:37] !log krenair@tin Synchronized tests/dblistTest.php: https://gerrit.wikimedia.org/r/327377 (duration: 00m 40s)
[02:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:47] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/327377 - labs only change (duration: 00m 39s)
[02:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:48] MatmaRex, it's going through jenkins now
[02:44:31] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:44:52] so hhvm on mw1201 seems unresponsive
[02:45:05] but I've forgotten how to make it reboot on the machines without /sbin/restart hhvm
[02:45:13] restart*
[02:48:36] MatmaRex, ugh, so something else is deploying now
[02:48:50] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:48:59] l10nupdate
[02:49:00] * MatmaRex eyes ops
[02:49:05] oh. heh
[02:49:22] Krenair: why not /usr/sbin/service hhvm restart? (the puppetized automatic restart does the same)
[02:49:37] mutante, I don't think there's a sudo rule allowing that?
[02:49:52] oh, let me do it
[02:50:30] !log mw1201 - restarted hhvm
[02:50:33] Operations, scap: Trying to scap while l10nupdate is syncing shows unhelpful error - https://phabricator.wikimedia.org/T153278#2875159 (Krenair)
[02:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:42] hmmm
[02:51:24] root@mw1201:~# /etc/init.d/hhvm start
[02:51:24] /etc/init.d/hhvm: 10: /etc/default/hhvm: Syntax error: Unterminated quoted string
[02:51:27] wut
[02:52:29] there is no 10th line?
[02:52:36] !log mw1201 add missing " in /etc/default/hhvm
[02:52:40] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.028 second response time
[02:52:40] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 69746 bytes in 0.133 second response time
[02:52:47] (PS1) MaxSem: Add a daily discovery stats cron [puppet] - https://gerrit.wikimedia.org/r/327424 (https://phabricator.wikimedia.org/T153272)
[02:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:02] 5 RUN_AS_USER="www-data"
[02:53:02] 6 RUN_AS_GROUP="www-data"
[02:53:08] Krenair: the last " was missing
[02:53:14] let's see if puppet removes that
[02:53:16] ah, that was there when I looked
[02:53:22] guess I looked after you fixed it
[02:53:45] (PS2) MaxSem: Add a daily discovery stats cron [puppet] - https://gerrit.wikimedia.org/r/327424 (https://phabricator.wikimedia.org/T153272)
[02:53:52] mutante, it's not the only one - mw1202 has the same
[02:53:58] presumably more
[02:54:00] puppet re-breaks that.. wow
[02:54:16] must have been a recent merge then
[02:54:34] modules/hhvm/templates/hhvm.default.systemd.erb:RUN_AS_GROUP="<%= @group %>
[02:54:41] yea
[02:55:15] nope
[02:55:16] https://gerrit.wikimedia.org/r/#/c/281885/
[02:56:13] but it only affects systemd
[02:56:18] so could be after jessie upgrades
[02:56:36] strange how that did not pop up earlier though?
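Editor's note on the "Unterminated quoted string" error above: a rendered defaults file can be checked for an unbalanced quote before it reaches a host. The following is a minimal sketch in Python, assuming the file is a flat list of `KEY="value"` lines like the two quoted in the log. The function name `find_unterminated_quotes` and the sample text are illustrative and are not part of puppet, scap, or HHVM.

```python
import shlex


def find_unterminated_quotes(text):
    """Return (line_number, line) pairs whose quoting shlex cannot parse.

    Assumption: every assignment fits on one line. A value that is
    legitimately quoted across several lines would be reported as well.
    """
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            # shlex.split raises ValueError when a quote is never closed.
            shlex.split(line, comments=True)
        except ValueError:
            bad.append((number, line))
    return bad


# Hypothetical rendered output with the closing quote missing on line 2.
rendered = 'RUN_AS_USER="www-data"\nRUN_AS_GROUP="www-data\n'
print(find_unterminated_quotes(rendered))
```

A check of this kind only looks at quoting. It says nothing about whether the service would start for other reasons.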
[02:56:40] yeah
[02:57:29] (CR) Dzahn: hhvm: add systemd/jessie support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto)
[02:59:05] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 11m 28s)
[02:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:22] (PS1) Dzahn: hhvm: add missing " in hhvm.default.systemd.erb [puppet] - https://gerrit.wikimedia.org/r/327426
[03:03:06] (PS2) Dzahn: hhvm: add missing " in hhvm.default.systemd.erb [puppet] - https://gerrit.wikimedia.org/r/327426
[03:04:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Dec 15 03:04:56 UTC 2016 (duration 5m 51s)
[03:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:44] !log krenair@tin Synchronized php-1.29.0-wmf.6/extensions/UploadWizard/resources/transports/mw.FormDataTransport.js: https://gerrit.wikimedia.org/r/#/c/327400/ (duration: 00m 39s)
[03:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:10] MatmaRex, ^
[03:07:10] Krenair: yay. thanks
[03:08:24] !log labtestnet2001 is out of disk again - nova-api.log as before
[03:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:56] works as expected. good night
[03:10:40] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[03:12:59] !log labtest2001 - gzip'ed /var/log/upstart/nova-api.log (T153279)
[03:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:15] T153279: labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279
[03:13:16] Operations, Labs, Labs-Infrastructure: labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279#2875202 (Dzahn)
[03:13:30] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[03:17:50] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[04:00:20] (CR) Yurik: [C: 1] "seems ready to go, pending @gehel's +2 & depl :)" [puppet] - https://gerrit.wikimedia.org/r/327424 (https://phabricator.wikimedia.org/T153272) (owner: MaxSem)
[04:09:40] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:12:30] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:15:30] twentyafterfour, Krenair: for future reference, cdb-rebuild only turns the json back into CDBs. Something else on tin, either a full scap or l10nupdate, needs to prepare updated l10n json dumps. sync-l10n does not build new json files. It's basically a sync-dir of the existing json cache files + a json->cdb build.
[04:22:48] Operations, Discovery, Labs, Interactive-Sprint, Maps (Maps-data): PostgreSQL query planner bug on labsdb1006 - https://phabricator.wikimedia.org/T145599#2875258 (Yurik)
[04:23:22] Operations, Discovery, Epic, Maps (Maps-data): Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2875263 (Yurik)
[04:31:16] Operations, Discovery, Maps (Maps-data): Improve automation around Maps servers - https://phabricator.wikimedia.org/T138017#2875294 (Yurik)
[04:32:30] Operations, Discovery, Maps (Maps-data): Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2875303 (Yurik)
[04:33:07] Operations, Discovery, Interactive-Sprint, Maps (Maps-data): Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2875319 (Yurik)
[04:37:05] Operations, Discovery, Traffic, Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2875353 (Yurik)
[04:37:40] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:40:12] Operations, Discovery, Traffic, Maps (Kartographer): Clarify caching to enable direct Wikidata Query Service access by - https://phabricator.wikimedia.org/T146832#2875409 (Yurik)
[04:41:30] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:00:09] (PS1) Legoktm: beta: Set $wgLinterStatsdSampleFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/327438
[05:00:11] (PS1) Legoktm: beta: Remove duplicate entry for wikidata in $wgLogo [mediawiki-config] - https://gerrit.wikimedia.org/r/327439
[05:44:30] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 36 probes of 409 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:47:10] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 28 probes of 404 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[05:49:30] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 409 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:51:20] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1009.80 Read Requests/Sec=2536.80 Write Requests/Sec=10.10 KBytes Read/Sec=30952.40 KBytes_Written/Sec=3407.60
[05:52:10] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 1 probes of 404 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[06:03:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.50 Read Requests/Sec=1.10 Write Requests/Sec=22.20 KBytes Read/Sec=4.40 KBytes_Written/Sec=849.20
[06:29:17] (PS7) Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643)
[06:47:40] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:02:45] stat1002's homes are still a bit overloaded
[07:02:51] trying to reduce their space
[07:08:04] !log Deploy alter table db1049 (master) dewiki.revision - T148967
[07:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:19] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[07:15:40] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[07:16:25] Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2875607 (Marostegui) No worries - I have extended the downtime for the lag checks until Monday. However if it finishes before th...
[07:27:00] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: binlog truncated in the middle of event: consider out of disk space on master: the first event db2029-bin.002103 at 876447380, the last event read from db2029-bin.002103 at 876447380, the last byte read from db2029-bin.002103 at 87644739
[07:27:30] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.34 port 9042
[07:28:00] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:35:40] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:38:40] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:39:30] RECOVERY - Disk space on stat1002 is OK: DISK OK
[07:40:52] !log moved some home files on stat1002 to the data-tank partition to free some space
[07:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:44] (CR) Ema: [C: 2] varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) (owner: Ema)
[07:47:46] (PS3) Dzahn: hhvm: add missing " in hhvm.default.systemd.erb [puppet] - https://gerrit.wikimedia.org/r/327426
[07:52:55] (PS1) Marostegui: db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/327459 (https://phabricator.wikimedia.org/T150644)
[08:03:40] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[08:07:40] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[08:17:57] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/327459 (https://phabricator.wikimedia.org/T150644) (owner: Marostegui)
[08:18:40] (Merged) jenkins-bot: db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/327459 (https://phabricator.wikimedia.org/T150644) (owner: Marostegui)
[08:19:45] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1071 - T150644 (duration: 00m 40s)
[08:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:58] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[08:20:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1071 - T150644 (duration: 00m 39s)
[08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:06] (CR) Giuseppe Lavagetto: [C: -1] "So, very briefly:" [puppet] - https://gerrit.wikimedia.org/r/327426 (owner: Dzahn)
[08:23:54] (CR) Giuseppe Lavagetto: [C: -1] "Also the reason for not actually starting is that there was an instance already running, from what I can understand by the logs, that was " [puppet] - https://gerrit.wikimedia.org/r/327426 (owner: Dzahn)
[08:24:38] !log Deploy alter table wikidatawiki.revision db1071 - T150644
[08:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:40] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:18] Operations, Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#2875661 (Writ_Keeper) @Unready: this bug actually applies to all URL parameters, not just redirect, given a page name with a question mark in it. For example: https://en....
[08:39:40] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:53:40] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:00:48] (03PS3) 10Gehel: Add a daily discovery stats cron [puppet] - 10https://gerrit.wikimedia.org/r/327424 (https://phabricator.wikimedia.org/T153272) (owner: 10MaxSem) [09:01:51] (03CR) 10Gehel: [C: 032] Add a daily discovery stats cron [puppet] - 10https://gerrit.wikimedia.org/r/327424 (https://phabricator.wikimedia.org/T153272) (owner: 10MaxSem) [09:07:40] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:10:36] (03PS1) 10Giuseppe Lavagetto: mediawiki: add TLS termination to imagescalers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327460 [09:10:38] (03PS1) 10Giuseppe Lavagetto: mediawiki: add TLS support to API in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327461 [09:10:40] (03PS1) 10Giuseppe Lavagetto: mediawiki: add TLS termination to all appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327462 [09:16:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1071 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327463 [09:17:23] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add TLS termination to imagescalers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327460 (owner: 10Giuseppe Lavagetto) [09:18:26] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1071 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327463 (owner: 10Marostegui) [09:19:03] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327463 (owner: 10Marostegui) [09:20:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1071 - T150644 (duration: 00m 40s) 
[09:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:22] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [09:26:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327464 (https://phabricator.wikimedia.org/T150644) [09:28:28] (03PS2) 10Giuseppe Lavagetto: mediawiki: add TLS support to API in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327461 [09:28:30] (03PS2) 10Giuseppe Lavagetto: mediawiki: add TLS termination to all appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327462 [09:28:32] (03PS1) 10Giuseppe Lavagetto: tlsproxy::localssl: add dependency between sslcerts and nginx [puppet] - 10https://gerrit.wikimedia.org/r/327465 [09:29:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327464 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [09:29:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327464 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [09:30:57] PROBLEM - Check systemd state on mw2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:30:57] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nginx] [09:31:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 - T150644 (duration: 00m 39s) [09:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:27] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [09:32:18] (03PS2) 10Elukey: Add the prometheus-apache-exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) [09:33:39] !log Deploy alter table wikidatawiki.revision db1070 - T150644 [09:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:37] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:35:57] PROBLEM - Check systemd state on mw2148 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:36:07] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [09:36:14] hello [09:38:02] jynus: good morning! did you get any clue about S3 surge of traffic ? [09:38:51] jynus: the last I heard is that the mw deployment was unrelated and there is nothing obvious standing out in Grafana (no job queue surge, change-prop looks ok, parsoid as well etc) [09:38:56] so it is kind of a mystery :( [09:39:38] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:40:47] PROBLEM - Check systemd state on mw2149 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:40:57] PROBLEM - puppet last run on mw2149 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nginx] [09:41:15] hashar, this is the new working theory: https://phabricator.wikimedia.org/T153184#2875757 [09:41:57] PROBLEM - Check systemd state on mw2089 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:41:57] PROBLEM - Check systemd state on mw2151 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:42:06] I think it would also fit the other weekend spike [09:42:07] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [09:42:07] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [09:43:38] jynus: so it is "just" about /populateLocalAndGlobalIds.php running for a while and taking too much [09:43:48] sounds good. Maybe we can throttle it / batch it somehow [09:43:52] sounds sane and safe [09:45:27] <_joe_> the puppet errors in codfw will go away [09:45:29] it is ok if it is temporary [09:45:43] the problem if it was group0 related [09:46:13] is that when deployed, it would overload s3 and create an outage [09:46:28] we can handle the current load if it finishes soon [09:47:04] but it is causing some slowdown in response time [09:47:07] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [09:47:07] PROBLEM - Check systemd state on mw2087 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:47:32] apparently, the job started at the same time than the deploy [09:47:57] we need to enforce !log and Deployment mention more strongly [09:48:10] to avoid unnecessary investigations [09:51:57] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [09:52:07] PROBLEM - Check systemd state on mw2150 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:52:30] (03PS2) 10TTO: Enable per-page language choice on wikis with Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209) [09:57:46] (03PS1) 10Jcrespo: mariadb: Adjust mariadb otrs backups to increase max_allowed_packet [puppet] - 10https://gerrit.wikimedia.org/r/327468 [09:58:51] RECOVERY - Check systemd state on mw2088 is OK: OK - running: The system is fully operational [09:59:00] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:59:08] (03CR) 10Jcrespo: [C: 032] mariadb: Adjust mariadb otrs backups to increase max_allowed_packet [puppet] - 10https://gerrit.wikimedia.org/r/327468 (owner: 10Jcrespo) [09:59:50] PROBLEM - Nginx local proxy to apache on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 443: Connection refused [10:00:00] PROBLEM - Nginx local proxy to apache on mw2149 is CRITICAL: connect to address 10.192.32.37 and port 443: Connection refused [10:00:20] PROBLEM - Nginx local proxy to apache on mw2150 is CRITICAL: connect to address 10.192.32.38 and port 443: Connection refused [10:00:30] PROBLEM - Nginx local proxy to apache on mw2087 is CRITICAL: connect to address 10.192.16.60 and port 443: Connection refused [10:00:30] PROBLEM - Nginx local proxy to apache on mw2151 is CRITICAL: connect to address 10.192.32.39 and port 443: Connection refused [10:01:00] PROBLEM - Nginx local proxy to apache on 
mw2089 is CRITICAL: connect to address 10.192.16.62 and port 443: Connection refused [10:01:41] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:02:10] !log elastic@eqiad: T152092 - reindex done (took 2.5 days) [10:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:25] T152092: Activate BM25 on all but wikis with spaceless languages - https://phabricator.wikimedia.org/T152092 [10:03:10] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [10:03:50] RECOVERY - Nginx local proxy to apache on mw2148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.365 second response time [10:04:00] RECOVERY - Check systemd state on mw2148 is OK: OK - running: The system is fully operational [10:05:21] jynus: in theory we could get mwscript to ask for a reason and log that together with user. Similar to what we manually do via !log here [10:07:51] RECOVERY - Check systemd state on mw2149 is OK: OK - running: The system is fully operational [10:08:01] RECOVERY - Nginx local proxy to apache on mw2149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.361 second response time [10:08:01] RECOVERY - puppet last run on mw2149 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:09:00] RECOVERY - Nginx local proxy to apache on mw2089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.196 second response time [10:09:10] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [10:09:10] RECOVERY - Check systemd state on mw2089 is OK: OK - running: The system is fully operational [10:09:30] RECOVERY - Nginx local proxy to apache on mw2151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.360 second response time [10:10:00] RECOVERY - Check systemd state on mw2151 is OK: OK - 
running: The system is fully operational [10:10:10] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:14:10] RECOVERY - Check systemd state on mw2087 is OK: OK - running: The system is fully operational [10:14:30] RECOVERY - Nginx local proxy to apache on mw2087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.209 second response time [10:15:01] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:19:33] !log stopping slave on labsdb1001 - s1 to run alter table T151029 [10:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:46] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029 [10:20:01] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:20:10] RECOVERY - Check systemd state on mw2150 is OK: OK - running: The system is fully operational [10:20:20] RECOVERY - Nginx local proxy to apache on mw2150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.189 second response time [10:21:05] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1070 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327472 [10:22:42] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1070 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327472 (owner: 10Marostegui) [10:23:21] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327472 (owner: 10Marostegui) [10:24:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 - T150644 (duration: 00m 47s) [10:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:49] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [10:29:42] 
(03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/327465 (owner: 10Giuseppe Lavagetto) [10:37:59] (03PS3) 10DCausse: [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 [10:38:47] (03PS4) 10DCausse: [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 (https://phabricator.wikimedia.org/T152895) [11:02:10] jouncebot: next [11:02:10] In 1 hour(s) and 57 minute(s): cxserver Scap3 migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1300) [11:07:40] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:07:40] 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10fdans) [11:09:58] (03PS2) 10Giuseppe Lavagetto: tlsproxy::localssl: add dependency between sslcerts and nginx [puppet] - 10https://gerrit.wikimedia.org/r/327465 [11:15:17] 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875970 (10fdans) [11:16:32] !log mobrovac@tin Starting deploy [changeprop/deploy@9eab965]: Deploying fix for T153215 [11:16:33] 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10fdans) [11:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:45] T153215: Memory leak in Change-Prop - https://phabricator.wikimedia.org/T153215 [11:17:27] !log mobrovac@tin Finished deploy [changeprop/deploy@9eab965]: Deploying fix for T153215 (duration: 00m 54s) [11:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:24] 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco 
Dans - https://phabricator.wikimedia.org/T153303#2875979 (10elukey) p:05Triage>03Normal a:03elukey [11:19:41] 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10elukey) I confirm that Francisco is a new Analytics member, request valid from my team's perspective. [11:28:33] (03CR) 10Elukey: [C: 04-1] "Still need some tweaks for the labs ferm rules" [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [11:29:15] <_joe_> ema: I'm merging https://gerrit.wikimedia.org/r/327465 [11:29:26] <_joe_> ema: worse can happen, is puppet failing to apply [11:29:47] _joe_: I've tried the change on my labs instance and it didn't fail so we should be fine [11:30:09] anyways yeah, worst case scenario we revert :) [11:30:56] (03CR) 10Giuseppe Lavagetto: [C: 032] tlsproxy::localssl: add dependency between sslcerts and nginx [puppet] - 10https://gerrit.wikimedia.org/r/327465 (owner: 10Giuseppe Lavagetto) [11:33:04] <_joe_> ema: runs well on the mw hosts [11:33:40] _joe_: sweet, let me try on pinkunicorn too [11:34:15] _joe_: all good [11:34:24] <_joe_> cool [11:34:36] <_joe_> let me see now if this fixes my race condition [11:34:57] (03PS3) 10Giuseppe Lavagetto: mediawiki: add TLS support to API in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327461 [11:35:10] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: add TLS support to API in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327461 (owner: 10Giuseppe Lavagetto) [11:40:10] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [11:41:59] wat ^ ? [11:42:10] _joe_: akosiaris: known ^? 
[11:42:17] <_joe_> mobrovac: nope I am looking right now [11:42:50] <_joe_> mobrovac: as I expected Could not evaluate: Cannot allocate memory - fork(2) [11:43:11] thank you zotero [11:43:20] <_joe_> zotero just restarted [11:43:26] <_joe_> so yeah, we know :P [11:43:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The IP allocation is wrong. The rows should be public1-b-codfw, public1-c-eqiad. Also, if we don't have a good service name (I see Jaime a" (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [11:43:57] hm [11:44:05] I restarted it like 1-2 days ago [11:44:23] https://tools.wmflabs.org/sal/log/AVj38xIxHQCSeVEJe5Tx [11:45:11] so something is making it barf... it's not the usual memory leaking [11:56:24] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-by-ganglia-cluster?var-datasource=eqiad%20prometheus%2Fops&var-cluster=sca [11:56:32] <_joe_> this is sca1* [11:56:59] <_joe_> and the same pattern on sca2* [11:57:05] <_joe_> which doesn't make sense at all [11:57:18] <_joe_> since it's supposed to be inactive apart from monitoring [11:57:25] <_joe_> so is monitoring causing this? [12:00:06] hmm [12:01:06] or just firefox memory leaking [12:01:26] the pattern is quite funny [12:03:18] why do the 2 boxes have such a phase difference ? 
[12:03:49] at least in codfw [12:04:01] it's very predictable [12:04:16] in eqiad the restarts are very close to each other [12:04:35] can't wait to kill that software [12:04:44] !log mobrovac@tin Starting deploy [trending-edits/deploy@200a709]: Fix for T153122 [12:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:56] T153122: Investigate delay growth in trending service - https://phabricator.wikimedia.org/T153122 [12:05:08] !log mobrovac@tin Finished deploy [trending-edits/deploy@200a709]: Fix for T153122 (duration: 00m 24s) [12:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:36] (03CR) 10Alexandros Kosiaris: [C: 032] k8s::apiserver: Remove unused master_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/326452 (owner: 10Alexandros Kosiaris) [12:05:43] (03PS4) 10Alexandros Kosiaris: k8s::apiserver: Remove unused master_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/326452 [12:06:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Tested in toollabs, clearly a noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/326452 (owner: 10Alexandros Kosiaris) [12:08:11] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:08:41] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:11:41] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:26:17] (03PS6) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [12:28:00] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2876127 (10hashar) Slightly optimized based on what is in the git tree instead of the local checkout: ``` git ls-tree -z --name-only -r HEAD|xargs -0... [12:30:44] (03PS3) 10Mobrovac: CXServer: Use Scap3 to deploy the config [puppet] - 10https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634) [12:33:40] (03PS3) 10Giuseppe Lavagetto: mediawiki: add TLS termination to all appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327462 [12:33:52] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: add TLS termination to all appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327462 (owner: 10Giuseppe Lavagetto) [12:34:28] <_joe_> Amir1: merging your change as well [12:35:01] kart_ kart__ please go over https://gerrit.wikimedia.org/r/#/c/321861/6/scap/templates/config.yaml.j2 and make sure the config is all there [12:35:02] (03PS1) 10Yuvipanda: labs: Set nosuid on scratch [puppet] - 10https://gerrit.wikimedia.org/r/327483 [12:35:07] _joe_: hey, I have some old patches in puppet, which one? [12:35:12] (03PS3) 10Alexandros Kosiaris: Rework network::subnets [puppet] - 10https://gerrit.wikimedia.org/r/313650 [12:35:16] . 
[12:35:19] <_joe_> Amir1: sorry [12:35:23] <_joe_> I meant akosiaris [12:35:24] <_joe_> :P [12:35:54] (03PS2) 10Yuvipanda: labs: Set nosuid on scratch [puppet] - 10https://gerrit.wikimedia.org/r/327483 [12:35:54] _joe_: ah yes, thanks [12:36:20] akosiaris: can you document the process for building updated debs when you upgrade k8s to 1.5.x? [12:36:41] Okay. I just put this here. http://mashable.com/2016/12/14/penguin-wikipedia-photo-caption-battle/?utm_cid=mash-com-fb-main-link#RduOJrCAvaqr [12:36:58] yuvipanda: er, sure... it's just dpkg-buildpackage [12:37:05] but I will [12:37:11] probably in repo [12:37:22] akosiaris: sure, but there's also need to apply patches, test, etc [12:37:31] apply patches, run tests, etc [12:37:50] ah yes [12:38:19] akosiaris: alongside pre-reqs for building (docker I presume, and maybe internet access?) etc [12:38:19] so [12:38:22] (03CR) 10Yuvipanda: [C: 032] labs: Set nosuid on scratch [puppet] - 10https://gerrit.wikimedia.org/r/327483 (owner: 10Yuvipanda) [12:38:41] hmm indeed there are a few [12:39:12] ok will do. thankfully most is automated anyway. using the builder role in labs, just cloning the repo and upgrading [12:39:42] akosiaris: the kubebuilder or debbuilder? [12:40:03] debbuilder [12:40:10] right [12:40:11] ok [12:46:07] 06Operations, 10Ops-Access-Requests: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876156 (10elukey) [12:47:13] 06Operations, 10Ops-Access-Requests, 10Analytics: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10elukey) [12:52:24] (03PS1) 10ArielGlenn: update rsyncd exclusion/inclusion list for rsync mirrors [puppet] - 10https://gerrit.wikimedia.org/r/327486 [13:00:04] mobrovac and akosiaris: Dear anthropoid, the time has come. Please deploy cxserver Scap3 migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1300).
[13:00:04] kart_: A patch you scheduled for cxserver Scap3 migration is about to be deployed. Please be available during the process. [13:00:32] kart_ kart__ ping? [13:00:40] mobrovac: yes. Here. [13:00:46] akosiaris: around? [13:01:02] i need 5 mins guys [13:01:04] then we can go [13:01:12] mobrovac: okay. no issue. [13:01:37] yes I am around [13:01:54] ready whenever you are [13:04:47] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [13:09:27] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:44] 06Operations, 10Ops-Access-Requests, 10Analytics: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10Krenair) What's the rationale for this including deployment access? [13:10:49] 06Operations, 10Ops-Access-Requests, 10Analytics: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876223 (10elukey) >>! In T153303#2876221, @Krenair wrote: > What's the rationale for this including deployment access? My mistake, I was about to... [13:11:01] 06Operations, 10Ops-Access-Requests, 10Analytics: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876224 (10elukey) [13:11:39] (03CR) 10Alexandros Kosiaris: [C: 031] aptrepo: add Docker's apt repo to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/327241 (owner: 10Faidon Liambotis) [13:12:34] akosiaris: kart__: go? [13:12:43] mobrovac: I'm here :) [13:12:57] mobrovac: which patch should go first? 
[13:13:39] puppet [13:14:09] jouncebot: now [13:14:09] For the next 0 hour(s) and 45 minute(s): cxserver Scap3 migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1300) [13:14:30] akosiaris: puppet patch for you. [13:15:21] akosiaris: to be on the safe side, i will disable puppet in eqiad and we'll deploy in codfw first [13:15:27] ok [13:15:34] merging and doing eqiad [13:15:35] (03PS1) 10Giuseppe Lavagetto: mediawiki: add https endpoints for all web clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327493 (https://phabricator.wikimedia.org/T153042) [13:15:36] er [13:15:37] codfw [13:15:57] <_joe_> akosiaris: when you have time, a sanity check on ^^ will be welcome [13:16:17] _joe_: im not that crazy yet xD [13:16:39] <_joe_> I'll re-review it when I'b back later anyways [13:16:48] !log disable puppet on scb nodes for scap3 migration for cxserver [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:06] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327494 [13:17:39] (03CR) 10Alexandros Kosiaris: [C: 032] CXServer: Use Scap3 to deploy the config [puppet] - 10https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634) (owner: 10Mobrovac) [13:17:45] (03PS4) 10Alexandros Kosiaris: CXServer: Use Scap3 to deploy the config [puppet] - 10https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634) (owner: 10Mobrovac) [13:17:47] kart_ [13:17:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] CXServer: Use Scap3 to deploy the config [puppet] - 10https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634) (owner: 10Mobrovac) [13:18:10] mobrovac: yes. [13:18:23] kart__ i will merge the other change and then mangle a bit with the deploy config on tin, so please don't do anything :) [13:18:41] btw kart__ do you have access to scb in codfw? [13:19:41] mobrovac: sure. 
I'm not touching anything. [13:19:51] mobrovac: not sure about codfw access. [13:20:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add hhvm_exporter role and class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [13:20:26] kart__ can you access scb2001.codfw.wmnet? [13:21:07] !log running puppet on scb200X boxes for cxserver scap3 migration [13:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:11] mobrovac: I can log in to scb2001, let me check. [13:23:18] mobrovac: is that the same one? [13:23:30] yes [13:23:42] everything is looking good puppet-wise on the scb200X boxes [13:23:47] mobrovac: ok. I've access. [13:23:50] akosiaris: nice. [13:23:54] ok [13:23:56] mobrovac: I'll proceed with the eqiad ones [13:24:01] akosiaris: no no [13:24:02] not yet [13:24:03] (03PS2) 10ArielGlenn: update rsyncd exclusion/inclusion list for rsync mirrors [puppet] - 10https://gerrit.wikimedia.org/r/327486 [13:24:09] * akosiaris halting [13:24:21] want to do the scap3 change in the deploy repo first ? [13:24:21] i have to actually do the deploy first and we have to verify cxserver there [13:24:25] ok [13:24:25] yup [13:24:37] !log mobrovac@tin Starting deploy [cxserver/deploy@cf286d3]: (no message) [13:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:28] akosiaris: kart__ we have a problem [13:26:37] ? [13:26:49] mobrovac: ? [13:27:21] investigating ...
[13:27:21] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=8080): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:27:28] config file problem [13:28:01] (03PS1) 10Ema: package_builder: run shell inside the chroot on build failures [puppet] - 10https://gerrit.wikimedia.org/r/327497 [13:28:19] akosiaris: sigh, the jwt secret token contains a " [13:28:25] damn [13:28:30] lol [13:28:31] ok, will try to find a solution [13:29:18] mobrovac: ah [13:30:06] mobrovac: also in scap/vars.yaml -> log_file: /tmp/citoid.log :/ [13:30:15] mobrovac: please also fix that. [13:30:45] (03CR) 10Alexandros Kosiaris: [C: 031] Add the prometheus-apache-exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [13:31:04] kart__: that's not important, that'a overwritten in prod [13:31:10] let's focus here [13:32:20] maybe we can reissue that secret ? [13:32:31] IIRC it's for mediawiki => cxserver communication [13:32:41] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:32:48] kart__: would that make sense ^ ? [13:33:55] akosiaris: yes. [13:34:16] akosiaris: it is fetch from mw config. [13:34:40] and from private puppet repo - both should match. 
[13:34:47] !log mobrovac@tin Finished deploy [cxserver/deploy@cf286d3]: (no message) (duration: 10m 10s) [13:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:29] unless mobrovac has a better idea, then we can just change it a bit and work with that [13:36:15] akosiaris: lemme try something real quick, 5 mins [13:36:20] ok [13:37:31] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:39:29] !log mobrovac@tin Starting deploy [cxserver/deploy@cf286d3]: (no message) [13:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:02] !log mobrovac@tin Finished deploy [cxserver/deploy@cf286d3]: (no message) (duration: 00m 33s) [13:40:05] ok akosiaris kart__ found a fix [13:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:21] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [13:40:30] kart__ please go to scb2001 and try to issue some reqs to cxserver to see if all is good [13:40:47] addshore: your wmf branch RevisionSlider change will need a backport I guess ( https://gerrit.wikimedia.org/r/#/c/327475/ ) it fails qunit [13:40:52] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2876279 (10elukey) Adding my 2 cents: I personally don't like the Ganglia way of metrics visualization, because it is difficult imho to compare trends (same metric... [13:41:17] hashar: sure! [13:41:38] mobrovac: nice [13:42:06] mobrovac: okay [13:43:26] hashar: added! [13:43:56] neat [13:44:13] kart__ ok to continue? have you checked? [13:45:14] mobrovac: checking further. [13:47:51] mobrovac: to be sure, Curl requests are fine or what exactly should I check to sure? 
[13:49:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [13:49:16] kart__ yes, requests to the service that would be made by the extension [13:49:42] kart__ have you done that? can we proceed? [13:49:49] fwiw, service_checker says ok on scb2001 [13:50:03] which should be doing quite a big of URL checking due to the spec [13:50:12] a bit* [13:50:25] (03CR) 10ArielGlenn: [C: 032] update rsyncd exclusion/inclusion list for rsync mirrors [puppet] - 10https://gerrit.wikimedia.org/r/327486 (owner: 10ArielGlenn) [13:50:31] I suggest we proceed, unless something comes up [13:50:41] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:50:55] mobrovac: okay. one more test. [13:50:59] mobrovac: go ahead :) [13:51:17] zeljkof: I have CR+2 a few changes. I am heading out for a coffee, be back in roughly 10 mins [13:51:20] ok akosiaris let's proceed to eqiad [13:51:30] ok doing so now [13:51:34] hashar: ok, see you for swat then [13:51:42] let me know once puppet has run there [13:52:25] mobrovac: done [13:52:32] kk [13:52:37] * mobrovac doing a full deploy [13:52:46] 06Operations, 10Ops-Access-Requests, 10Analytics: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2876304 (10Krenair) okay [13:53:28] !log mobrovac@tin Starting deploy [cxserver/deploy@430c858]: Full deploy to switch CXServer to Scap3 config deploys T147634 [13:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:39] T147634: Enable Scap3 config deploys for CXServer - https://phabricator.wikimedia.org/T147634 [13:54:20] !log mobrovac@tin Finished deploy [cxserver/deploy@430c858]: Full deploy to switch CXServer to Scap3 config deploys T147634 (duration: 00m 52s) [13:54:34] ok we are done kart__ akosiaris ^ [13:54:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:37] :) [13:55:07] cool. Thanks mobrovac and akosiaris! [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1400). [14:01:04] Present [14:01:07] o/ [14:01:13] o/ [14:01:26] \o [14:01:34] o/ [14:01:35] \o/ [14:01:45] o/ [14:01:47] I have CR+2 a couple changes ages ago [14:02:05] ci was busy, probably still is [14:02:17] CI is quite overloaded for some reason though (few jobs take 20 minutes + lot of changes added) [14:02:42] Update extensions/Kartographer from branch 'wmf/1.29.0-wmf.6' [14:02:43] landed [14:02:51] o/ [14:03:14] and the Flow one for wmf.5 as well [14:04:38] yes, flow and kartographer commits are merged [14:05:09] hashar: what's the plan, should I deploy one, you the other? [14:05:15] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:05:16] then see what next? [14:06:03] zeljkof: can you update both [14:06:06] and push both to mwdebug? [14:06:17] they are on different branches [14:06:18] hashar: sure, on it [14:07:17] hashar: looks like yurik is not around, should I wait with his patch until he arrives? [14:08:32] wow, yeh, CI is super backlogged [14:08:58] 42 mins for the core change thats about to be merged so far... [14:08:58] zeljkof: I am not worrying too much about that one [14:08:59] matt_flaschen: can you test 327402 at mwdebug1002, once it is there? [14:09:32] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:09:53] hashar: should I just deploy it? [14:10:24] zeljkof: we can try to reproduce. 
Not sure how though [14:10:35] I will trust yurik for this one [14:10:42] zeljkof, yeah, in dry run. [14:10:53] (03CR) 10Elukey: [C: 031] "The change is awesome and I am using it atm in labs (deployment-copper). The only weird issue is the fact that the new C10shell file is no" [puppet] - 10https://gerrit.wikimedia.org/r/327497 (owner: 10Ema) [14:10:55] hashar: ok, so deploying? [14:11:05] matt_flaschen: great, will ping you in a few minutes, when it is there [14:11:10] zeljkof: yeah [14:11:17] hashar: ok [14:13:09] sorry i'm late, anyone swating? [14:13:21] the RevisionSlider changes of addshore will land soonish [14:13:27] hashar: great! [14:13:42] dcausse: will get to the cirrus search last [14:13:50] hashar: sure [14:13:58] (03PS2) 10Hashar: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327494 (owner: 10Jdrewniak) [14:14:07] (03CR) 10Hashar: [C: 032] "For SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327494 (owner: 10Jdrewniak) [14:14:18] jan_drewniak: portal change is in the CI pipeline [14:14:43] hashar: sounds good [14:15:14] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327494 (owner: 10Jdrewniak) [14:16:44] jan_drewniak: portal change is on mwdebug1002 for testing [14:16:59] yurik: hashar and I are [14:17:07] your commit is next, will ping you in a few minutes [14:17:20] thx!!!! [14:17:27] matt_flaschen: please test, your commit is at mwdebug1002 [14:18:07] hashar: looks good! [14:18:11] yurik: can you test your commit at mwdebug1002, once it is there? 
[14:18:19] yep [14:18:41] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:18:58] pushing the portals change [14:19:44] !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 58s) [14:19:49] (03CR) 10Gehel: [C: 031] "This looks like a good solution to the need of strong alignment between elasticsearch version and plugins version." [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:34] !log hashar@tin Synchronized portals: (no message) (duration: 00m 50s) [14:20:42] zeljkof, it's not working, but doesn't look related to my chnage: P4623 [14:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:46] jan_drewniak: it on production :)} [14:20:48] Seems like maybe memcached isn't installed. [14:20:50] (03PS1) 10Ema: varnish cachestats.py: make key_prefix configurable [puppet] - 10https://gerrit.wikimedia.org/r/327504 (https://phabricator.wikimedia.org/T151643) [14:21:07] matt_flaschen: should I push on production? [14:21:18] *to* production [14:21:39] (03PS2) 10DCausse: [cirrus] enable BM25 on all but wikis with spaceless languages [step 3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) [14:21:51] hashar: yippie, thanks [14:21:57] addshore: RevisionSlider changes are on mwdebug1002 [14:22:02] ack, checking [14:22:23] yurik: your commit is at mwdebug1002, please test [14:22:38] zeljkof, yeah. 
[14:22:51] matt_flaschen: there is no memcached extension on mwdebug [14:22:57] matt_flaschen: ok, deploying [14:22:59] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327505 [14:23:08] zeljkof, all good [14:23:12] matt_flaschen: mwscript is hardcoded to use php5 which is Zend. And we don't included the PHP extensions required for mediawiki [14:23:12] (03CR) 10Marostegui: [C: 04-2] "Wait until the lag is gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327505 (owner: 10Marostegui) [14:23:27] yurik: great, deploying in a minute [14:23:32] thx [14:23:33] hashar: mwdebug1002 = mw1099 right? [14:23:36] matt_flaschen: gotta be checked on a work server on which you would scap pull the change (eg: terbium) [14:23:38] addshore: yes [14:23:49] addshore: though I think the browser extensions have all been updated [14:24:01] ahh, i should really update mine then! [14:24:55] hashar, re "And we don't included the PHP extensions required for mediawiki", why? [14:25:03] matt_flaschen: cause prod is on hhvm now [14:25:04] hmm, hashar it doesnt appear to be working, give me a few more mins [14:25:24] hashar, prod also uses php5 for mwscript right? [14:25:36] matt_flaschen: but mwscript still uses php5 :/ Then it is used solely on the work/deployment servers. So those servers have the Zend extensions added. But rest of the fleet do not [14:26:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) [14:26:19] (03PS1) 10Ladsgroup: Make fawiki in beta use xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327507 (https://phabricator.wikimedia.org/T139110) [14:26:31] hashar, well, this one should, otherwise this workflow is impossible. I'll file a bug. In the meantime, I was going to do a dry run first on terbium anyway, so it's fine. 
[14:26:35] (03CR) 10Marostegui: [C: 04-2] "Wait for the SWAT to finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [14:27:02] addshore: wmf.6 does have "Load bar arrow on left for RTL languages" [14:27:14] !log zfilipin@tin Synchronized php-1.29.0-wmf.5/extensions/Flow: SWAT: [[gerrit:327402|FlowFixInconsistentBoards: Dont output non-critical error info (T148057)]] (duration: 00m 56s) [14:27:24] matt_flaschen: deployed, please test [14:27:24] bah, im testing it on a wiki that is currently on wmf5... [14:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:26] T148057: Fix user talk pages already in inconsistent state due to to T138310 - https://phabricator.wikimedia.org/T148057 [14:27:47] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [14:28:09] addshore: hahah :} [14:28:25] I promise one day we will have a single branch [14:28:38] 07Puppet, 10Deployment-Systems, 05Mediawiki SWAT Deployments: mwdebug1002 should have PHP extensions - https://phabricator.wikimedia.org/T153316#2876393 (10Mattflaschen-WMF) [14:28:45] hashar: all checked, its good to go! [14:28:53] !log zfilipin@tin Synchronized php-1.29.0-wmf.6/extensions/Kartographer: SWAT: [[gerrit:327423|Fix fullscreen map not closing properly (T153100)]] (duration: 00m 40s) [14:29:03] dcausse: so you get a few patches. Is there an order to land them or are they independent changes ? 
[14:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:06] T153100: Fullscreen map not closing properly - https://phabricator.wikimedia.org/T153100 [14:29:07] yurik: deployed to production, please test [14:29:21] T153316 [14:29:22] T153316: mwdebug1002 should have PHP extensions - https://phabricator.wikimedia.org/T153316 [14:29:36] hashar: the order the deployment wiki page should be fine [14:29:44] addshore: that is being pushed [14:29:57] zeljkof, all good ,thx [14:30:04] hashar: I have deployed Flow and Kartographer, what's next? all good? [14:30:07] !log hashar@tin Synchronized php-1.29.0-wmf.6/extensions/RevisionSlider: (no message) (duration: 00m 40s) [14:30:16] yurik: \o/ [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:19] zeljkof: RevisionSlider and portals are done [14:30:28] what is left is all the CirrusSearch patches I guess [14:30:36] is dcausse deploying that? [14:30:46] I can if you want [14:30:55] fine with me :D [14:31:00] ok :) [14:31:06] ty hashar [14:31:07] +2 :} [14:31:15] ok [14:31:22] hashar: should dcausse deploy his patches? [14:31:35] zeljkof: why not ? :} [14:31:38] was +2 for that? 
:) [14:31:46] excellent, so my job is done here [14:31:46] (03PS5) 10Hashar: [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 (https://phabricator.wikimedia.org/T152895) (owner: 10DCausse) [14:31:48] (03PS3) 10Hashar: [cirrus] enable BM25 on all but wikis with spaceless languages [step 3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [14:31:50] ^^^ rebased the changes in mwconfig [14:31:53] dcausse: good luck [14:32:07] the CirrusSearch one ( https://gerrit.wikimedia.org/r/#/c/327467/ ) is in the pipe [14:32:24] ok [14:32:37] dcausse: you can take over from now :D [14:32:42] I am sticking around for support [14:32:44] ok, swating [14:32:48] sure thanks [14:33:00] zeljkof: so it is all covered :} \O/ [14:33:11] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:33:18] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 (https://phabricator.wikimedia.org/T152895) (owner: 10DCausse) [14:33:35] hashar: nice work, well, team work ;) [14:33:43] yeah that works well [14:33:51] (03Merged) 10jenkins-bot: [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 (https://phabricator.wikimedia.org/T152895) (owner: 10DCausse) [14:34:07] dcausse: note the extension code is not pushed yet [14:34:14] in case that has an impact on the config change [14:34:21] nope it's independant [14:34:27] \O/ [14:36:46] (03PS1) 10Yuvipanda: labs: Set noexec and nodev for scratch [puppet] - 10https://gerrit.wikimedia.org/r/327508 [14:37:36] (03CR) 10Rush: [C: 031] labs: Set noexec and nodev for scratch [puppet] - 10https://gerrit.wikimedia.org/r/327508 (owner: 10Yuvipanda) [14:38:44] !log Stop replication db2033 (x1) for maintenance - T151552 [14:38:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:57] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [14:40:03] zeljkof, works, except there's an issue I noticed on officewiki (pretty certain it's not caused by this patch, and that is not even on this branch) [14:40:52] matt_flaschen: great, please report the problem [14:41:08] zeljkof, yeah, I'm going to look into that one myself. [14:41:31] even better :) [14:41:49] zeljkof: I know we already hit the 8 patches limit, but please check this if possible: https://gerrit.wikimedia.org/r/#/c/327507/ [14:42:12] hashar, zeljkof: just to be sure: scap sync-file myfile "log" right? [14:42:26] dcausse: yes [14:42:29] ok [14:42:35] dcausse: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment [14:42:38] docs ftw! [14:43:09] hashar: Amir1 has one more patch, we are at 8 already... [14:43:26] it's in beta [14:43:30] just needs +2 [14:44:17] Amir1: for the change that are only on beta, we just CR+2 them and rebase the prod server [14:44:23] no need to add them to swat :D [14:44:29] (03PS2) 10Hashar: Make fawiki in beta use xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327507 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [14:44:37] (03CR) 10Hashar: [C: 032] Make fawiki in beta use xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327507 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [14:44:40] Amir1: done :} [14:44:59] hashar: thanks! [14:45:05] Do you want me to rebase tin? 
[14:45:07] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: T152895 - [cirrus] Reduce regex/default timeouts (duration: 00m 40s) [14:45:10] Amir1: I will [14:45:13] (03PS2) 10Yuvipanda: labs: Set noexec and nodev for scratch [puppet] - 10https://gerrit.wikimedia.org/r/327508 [14:45:15] kk [14:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:21] T152895: insource (regex) search on en.wp's Special:Search does not work - https://phabricator.wikimedia.org/T152895 [14:45:25] dcausse: the CirrusSearch patch for wmf.6 has landed ( https://gerrit.wikimedia.org/r/#/c/327467/ ) [14:45:27] (03Merged) 10jenkins-bot: Make fawiki in beta use xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327507 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [14:45:52] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2876469 (10jcrespo) I am with elukey in which a good stacked graph can be more valuable in terms of expressiveness than multiple servers graphs. E.g. despite havin... 
[14:45:58] hashar: ok, will config with mw config, will do the extension at the end [14:46:18] (03PS1) 10Ema: varnish cachestats.py: strip final newline [puppet] - 10https://gerrit.wikimedia.org/r/327511 (https://phabricator.wikimedia.org/T151643) [14:46:20] (03PS4) 10DCausse: [cirrus] enable BM25 on all but wikis with spaceless languages [step 3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) [14:47:39] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [14:47:42] (03CR) 10Alexandros Kosiaris: [C: 031] docker: cleanup dockerproject's apt repository [puppet] - 10https://gerrit.wikimedia.org/r/327242 (owner: 10Faidon Liambotis) [14:48:17] (03Merged) 10jenkins-bot: [cirrus] enable BM25 on all but wikis with spaceless languages [step 3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [14:50:16] zeljkof, hashar, I need to depl graphoid service (scap3) - a tiny oneliner fix. I could do it in parallel if you are ok with it, or i could wait until you are done. 
[14:51:05] (03CR) 10Alexandros Kosiaris: [C: 031] docker: cleanup the custom apt repository stanzas [puppet] - 10https://gerrit.wikimedia.org/r/327243 (owner: 10Faidon Liambotis) [14:51:12] !log dcausse@tin Synchronized tests/cirrusTest.php: 1/2 T152092: [cirrus] enable BM25 on all but wikis with spaceless languages (duration: 00m 40s) [14:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:26] T152092: Activate BM25 on all but wikis with spaceless languages - https://phabricator.wikimedia.org/T152092 [14:51:43] yurik: there is another deployment after swat, I think dcausse is deploying his patches now, probably the best to ask him [14:52:04] (03CR) 10Alexandros Kosiaris: [C: 032] aptrepo: pin ElasticSearch version [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [14:52:08] (03PS2) 10Alexandros Kosiaris: aptrepo: pin ElasticSearch version [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [14:52:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] aptrepo: pin ElasticSearch version [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [14:52:21] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: 2/2 T152092: [cirrus] enable BM25 on all but wikis with spaceless languages (duration: 00m 39s) [14:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:41] (03PS1) 10Ema: varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) [14:54:51] (03CR) 10Giuseppe Lavagetto: [C: 031] docker: cleanup dockerproject's apt repository [puppet] - 10https://gerrit.wikimedia.org/r/327242 (owner: 10Faidon Liambotis) [14:55:11] dcausse: all good ? 
[14:55:21] hashar: yup, doing the extension now [14:55:58] hashar: scap sync-dir php-1.29.0-wmf.6/extensions/CirrusSearch "log", is that right? [14:56:10] yeah [14:56:14] ok doing [14:56:29] eventually pass --dont-break-the-cluster [14:57:19] (03CR) 10Giuseppe Lavagetto: [C: 031] "There is a typo in the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/327243 (owner: 10Faidon Liambotis) [14:58:25] :) [14:59:01] !log dcausse@tin Synchronized php-1.29.0-wmf.6/extensions/CirrusSearch: T153051: Do not return the current wikis when detecting query languages (duration: 00m 52s) [14:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:15] T153051: Search page duplicates result - https://phabricator.wikimedia.org/T153051 [14:59:28] !log eu swat done [14:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:55] yurik: I'm done [15:01:05] ok, deploing [15:02:40] !log yurik@tin Starting deploy [graphoid/deploy@5e3f8ff]: (no message) [15:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:17] !log yurik@tin Finished deploy [graphoid/deploy@5e3f8ff]: (no message) (duration: 00m 36s) [15:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] (03PS3) 10Yuvipanda: labs: Set noexec and nodev for scratch [puppet] - 10https://gerrit.wikimedia.org/r/327508 [15:04:08] dcausse: yurik kudos :} [15:04:13] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. 
It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2704482 (10Krenair) Perhaps you would be able to help with T153319 [15:04:24] done, thx [15:04:25] matt_flaschen: I think you can do your Flow dry run script that fix some talk pages [15:04:30] !log European SWAT complete [15:04:32] (03CR) 10Giuseppe Lavagetto: [] "just out of curiosity, why labs ferm rules are important? I think we don't include base::firewall in labs anyways." [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [15:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:00] _joe_ I stole the exporter template from Filippo, so I will not pretend to know the answer :D [15:07:31] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:35] hashar, thanks. I will investigate officewiki (see above) quickly first (it's the same script), hopefully fix that, then probably still run the dry run everywhere. 
[15:07:44] (03CR) 10Yuvipanda: [V: 032 C: 032] labs: Set noexec and nodev for scratch [puppet] - 10https://gerrit.wikimedia.org/r/327508 (owner: 10Yuvipanda) [15:08:24] (03CR) 10Marostegui: [C: 031] "swat is finished" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [15:08:29] (03PS2) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) [15:08:55] <_joe_> elukey: well check by yourself which classes are included by default in labs [15:10:19] will do it [15:10:32] but now I don't know the answer, this is what I wanted to say :) [15:11:05] the last code version does not have any labs reference [15:11:08] (03PS1) 10Ema: varnishxcps: subscribe to cachestats.py [puppet] - 10https://gerrit.wikimedia.org/r/327516 (https://phabricator.wikimedia.org/T151643) [15:11:13] it is only a hiera call [15:11:25] (as it was suggested in the hhvm-exporter code review) [15:11:44] <_joe_> but well, just see my comment :) [15:13:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [15:14:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327506 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [15:15:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 - T150644 (duration: 00m 39s) [15:15:44] (03PS2) 10Ema: varnishxcps: subscribe to cachestats.py [puppet] - 10https://gerrit.wikimedia.org/r/327516 (https://phabricator.wikimedia.org/T151643) [15:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:47] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [15:15:51] (03CR) 10Ema: [V: 032 C: 032] varnishxcps: 
subscribe to cachestats.py [puppet] - 10https://gerrit.wikimedia.org/r/327516 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:16:26] !log Deploy alter table wikidatawiki.revision on db1092 - T150644 [15:16:31] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:22] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [15:17:27] (03PS2) 10Giuseppe Lavagetto: mediawiki: add https endpoints for all web clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327493 (https://phabricator.wikimedia.org/T153042) [15:17:31] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 697893 msg: ocg_render_job_queue 3131 msg (=3000 critical) [15:17:41] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 697998 msg: ocg_render_job_queue 3179 msg (=3000 critical) [15:17:42] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 698009 msg: ocg_render_job_queue 3184 msg (=3000 critical) [15:17:52] matt_flaschen: sure! if you need assistance let me (us) know [15:19:29] <_joe_> meh [15:20:31] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:21:21] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [15:22:16] (03PS2) 10Faidon Liambotis: aptrepo: add Docker's apt repo to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/327241 [15:22:26] (03CR) 10Faidon Liambotis: [V: 032 C: 032] aptrepo: add Docker's apt repo to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/327241 (owner: 10Faidon Liambotis) [15:23:52] 06Operations, 07Puppet, 10Deployment-Systems, 06Release-Engineering-Team, 05Mediawiki SWAT Deployments: mwdebug1002 should have PHP extensions - https://phabricator.wikimedia.org/T153316#2876601 (10hashar) From {T146286} , we had the puppet class `::mediawiki::packages::php5` added to `role::deployment::... [15:23:56] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/4895/lvs2003.codfw.wmnet/ looks good." [puppet] - 10https://gerrit.wikimedia.org/r/327493 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [15:24:03] (03PS3) 10Giuseppe Lavagetto: mediawiki: add https endpoints for all web clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327493 (https://phabricator.wikimedia.org/T153042) [15:25:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: add https endpoints for all web clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327493 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [15:26:08] (03PS2) 10Faidon Liambotis: docker: cleanup dockerproject's apt repository [puppet] - 10https://gerrit.wikimedia.org/r/327242 [15:26:10] 06Operations, 06Labs, 10Labs-Infrastructure: labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279#2875187 (10AlexMonk-WMF) So this file has a lot of `/usr/lib/python2.7/dist-packages/nova/context.py:181: DeprecationWarning: Using function/method 'oslo_utils.ti... 
[15:26:10] (03PS2) 10Faidon Liambotis: docker: cleanup the custom apt repository stanzas [puppet] - 10https://gerrit.wikimedia.org/r/327243 [15:26:15] grrrit-wm: force-restart [15:26:19] Re-connecting to Gerrit and IRC. [15:27:00] re-connected to Gerrit and IRC. [15:27:06] (03CR) 10Faidon Liambotis: [V: 032 C: 032] docker: cleanup dockerproject's apt repository [puppet] - 10https://gerrit.wikimedia.org/r/327242 (owner: 10Faidon Liambotis) [15:28:43] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM, 15User-Joe: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2876627 (10hashar) T146286 was to get the PHP5 packages installed on the maintenance servers. This task... [15:33:35] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,cluster=imagescaler,dc=codfw [15:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:21] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,cluster=api_appserver,dc=codfw [15:36:26] (03CR) 10Faidon Liambotis: [C: 031] icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/321642 (https://phabricator.wikimedia.org/T149913) (owner: 10Volans) [15:36:31] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:36:34] (03PS1) 10ArielGlenn: generate lists of completed dumps for rsync inclusion on dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/327519 (https://phabricator.wikimedia.org/T152954) [15:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:31] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:37:36] (03Abandoned) 10Faidon Liambotis: docker: apt repo before installing package [puppet] - 10https://gerrit.wikimedia.org/r/321485 
(owner: 10Dduvall) [15:37:39] (03PS1) 10Eevans: enable instance restbase1017-a.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327520 (https://phabricator.wikimedia.org/T151086) [15:38:20] (03CR) 10Eevans: [C: 031] "Ready." [puppet] - 10https://gerrit.wikimedia.org/r/327520 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:39:41] urandom: want me to review --^ ? [15:39:57] elukey: sure, if you want [15:40:50] urandom: IP look good, I can merge if you are ready to bootstrap [15:40:56] elukey: yup! [15:41:30] elukey: oh, can you run puppet on that node after you do? [15:42:08] elukey: puppet will have been erroring up until now, so my account doens't exist [15:42:12] (03PS1) 10Yuvipanda: labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 [15:42:21] elukey: it'll start succeeding after this changeset [15:42:24] (03PS1) 10Jcrespo: mariadb: Apply new TLS config to client-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/327523 [15:42:43] elukey: a chicken-and-egg problem relating to the multi-instance stuff [15:42:44] (03CR) 10jenkins-bot: [V: 04-1] labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 (owner: 10Yuvipanda) [15:42:56] <_joe_> uhm grrr [15:43:11] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:43:12] <_joe_> I hate our nagios abstractions for puppet [15:43:21] urandom: mmmm I can't ssh, checking from puppetmaster1001 [15:43:26] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,cluster=appserver,dc=codfw [15:43:27] (03PS2) 10Yuvipanda: labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 [15:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:16] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.21 and port 443: Connection refused [15:44:18] (03PS2) 10Jcrespo: mariadb: Apply new TLS config to client-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/327523 [15:44:29] ACKNOWLEDGEMENT - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.22 and port 443: Connection refused Giuseppe Lavagetto new service being deployed [15:44:34] ACKNOWLEDGEMENT - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.1 and port 443: Connection refused Giuseppe Lavagetto new service being deployed [15:44:35] <_joe_> did it get to page? 
meh [15:44:36] <_joe_> sorry [15:44:38] ACKNOWLEDGEMENT - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.21 and port 443: Connection refused Giuseppe Lavagetto new service being deployed [15:44:50] :-) [15:44:55] (03CR) 10Mobrovac: [] enable instance restbase1017-a.codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327520 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:45:15] <_joe_> now I need to check the logic of our code [15:45:16] and that is why I disable in somecases rather than ack/downtime [15:45:30] <_joe_> jynus: this was completely unexpected [15:45:52] urandom: 1017 and 1018 are still in super early stage, namely puppet keys not accepted on puppet master [15:45:59] I am not complaining, just justifying that in some cases disable is the way to go [15:46:02] so I guess brand new hw [15:46:41] elukey: auh ok, i thought they were more or less ready; no worries, i think godog is planning to take them up this morning [15:46:50] elukey: thanks though! [15:47:11] elukey: and yeah, brand new hw [15:47:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327526 [15:47:51] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/327504 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:48:08] urandom: running puppet on 1017 now [15:48:12] cc godog --^ [15:48:45] <_joe_> akosiaris: ok I'm not sure what the heck is going on [15:49:04] urandom: ah snap it says invalid secret cassandra/services/restbase1017/restbase1017.kst, so not easy as I though [15:49:29] elukey: yeah, that's because it needs this changeset merged [15:49:43] it's trying to use the 'default' instance, for which no secrets have been generated [15:50:04] that was the error i was expecting, actually [15:50:28] hashar, I'm officially cancelling my window. 
I did figure out the officewiki thing (at least the current state, not so much the 'why'). [15:50:39] so super godog already added the crt etcc on the private repo [15:50:49] (03CR) 10ArielGlenn: [C: 032] generate lists of completed dumps for rsync inclusion on dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/327519 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [15:50:49] let's merge the code review urandom [15:51:01] (03CR) 10Elukey: [C: 032] enable instance restbase1017-a.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327520 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:51:06] (03PS2) 10Elukey: enable instance restbase1017-a.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327520 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:51:44] matt_flaschen: at least you figured out the issue :} [15:51:56] matt_flaschen: or to be more precise that there is an issue of some sort [15:51:59] grrrit-wm: force-restart [15:52:55] marktraceur ^^ [15:53:06] + i doint think it reconnects on ssh. [15:53:13] also urandom, just noticed - not codfw right? (wrong commit msg?) [15:53:21] You doint? [15:53:33] oh? [15:53:37] Yes but it should as it did that before i deployed your change. [15:53:42] ha [15:53:46] elukey: yeah, my bad. [15:53:47] paladox: Well, prove it, show me a test [15:53:50] Ok [15:53:54] I have a test bot [15:54:04] paladox: I believe when I tested it, it worked. And I'm sure Zppix concurs. [15:54:04] that dosent have the change. [15:54:05] urandom: good good, my heart stopped for a second then I realized [15:54:06] :D [15:54:10] elukey: has puppet been re-run since the merge? 
[15:54:18] * urandom still can't login [15:54:20] still need to merg [15:54:25] auh, ok [15:54:26] I was double checking :) [15:54:41] mistook the log msg [15:55:04] (03CR) 10Marostegui: [] "test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327526 (owner: 10Marostegui) [15:55:12] apergos: ready to merge? [15:55:29] as soon as I get jenkins v+1 [15:55:44] well it is already +2+2 no? [15:55:50] what? [15:55:57] modules/snapshot/templates/cron/list-last-good-dumps.sh.erb [15:55:59] paladox: Looks like it's working to me [15:56:05] I did puppet-merge that one [15:56:19] is it saying it did not? [15:56:25] paladox: But I also see grrrit-wm1 which is concerning [15:56:26] apergos: nope I can see it now [15:56:32] Oh it works now but when you did it, it kept changing nicks. and yeh. [15:56:32] that's troubling [15:56:36] yes please merge it then [15:56:52] paladox: I empower you to investigate the duplicate IRC connection and fix it :) [15:56:54] maybe I made a typo in the response and didn't pay attention whatsoever (weirder things have happened) [15:56:56] Ok [15:56:58] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Apply new TLS config to client-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [15:57:20] marktraceur i noticed that before too, i wrote to upstream and haven't had a reply. [15:57:33] paladox: I doubt it's an upstream issue. Oh, look, it died. [15:57:41] So I guess the IRC connection starts but doesn't stick around [15:57:57] lol [15:58:01] marktraceur ^^ [15:58:07] Ugh [15:58:11] but... hm... I ran puppet and it processed changes, after that puppet-merge [15:58:14] so how did it... [15:58:23] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327526 (owner: 10Marostegui) [15:58:30] this seems like a bug in disconnect or something. 
[15:59:00] apergos: so puppet merge didn't give to me the "OK" right after the merge on 1001, but for all the others yes [15:59:02] (03CR) 10Volans: [C: 031] "Looks already ok, I've added some minor comment inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:59:04] but no errors [15:59:05] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327526 (owner: 10Marostegui) [15:59:19] well it probably had already merged it, in spite of what it showed you [15:59:30] otherwise I could not have had the changed in puppet afterwards on the host [15:59:38] and tested it with a remote client and had the new results... [16:00:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 - T150644 (duration: 00m 40s) [16:00:37] grrrit-wm2: nick [16:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:40] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [16:00:42] (03CR) 10ArielGlenn: [C: 032] some pylint of script that produces list of lst good dumps [puppet] - 10https://gerrit.wikimedia.org/r/326413 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [16:00:55] paladox: I'll take a look in a few minutes [16:01:06] Ok thanks, /me is also looking too [16:01:18] apergos: https://phabricator.wikimedia.org/P4625 - this is 1001 and 1002 [16:01:34] (03PS1) 10Giuseppe Lavagetto: lvs: attempt to fix the icinga check for appservers [puppet] - 10https://gerrit.wikimedia.org/r/327528 [16:01:52] urandom: puppet running [16:01:56] no apparently it is "Access Denied: Restricted Paste", elukey... 
[16:02:12] elukey: kk [16:02:18] apergos: I put admins, mmm [16:02:27] not an admin on phab I don't think [16:02:49] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] lvs: attempt to fix the icinga check for appservers [puppet] - 10https://gerrit.wikimedia.org/r/327528 (owner: 10Giuseppe Lavagetto) [16:03:07] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/327511 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:03:16] grrrit-wm: force-restart [16:03:16] apergos: now it should be good [16:04:20] grrrit-wm: nick [16:04:30] nope, elukey, still denied [16:06:45] PROBLEM - cassandra-a CQL 10.64.32.130:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.130 and port 9042: Connection refused [16:07:05] PROBLEM - cassandra-a SSL 10.64.32.130:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:07:09] how did it work out wm10. [16:07:15] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:07:55] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:08:01] this is me sorry --^ [16:08:02] sigh [16:08:12] urandom: it complains about the same key again [16:08:15] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.129, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:08:25] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.137 second response time [16:08:45] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:09:01] (03PS2) 10Ema: varnish cachestats.py: make key_prefix configurable [puppet] - 10https://gerrit.wikimedia.org/r/327504 (https://phabricator.wikimedia.org/T151643) [16:09:07] urandom: no now it works, bah [16:09:09] silencing [16:09:55] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:11:15] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:11:16] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active [16:12:45] RECOVERY - Check systemd state on restbase1017 is OK: OK - running: The system is fully operational [16:12:49] urandom: solved :) [16:12:55] are you on it? [16:12:59] elukey: yup! [16:13:04] elukey: thank you! [16:13:05] RECOVERY - cassandra-a SSL 10.64.32.130:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-a valid until 2017-12-13 00:15:53 +0000 (expires in 362 days) [16:13:31] urandom: yw! 
[16:13:55] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:14:58] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2876837 (10hashar) The mariadb package is broken: ``` # apt-get insta... [16:20:10] (03PS1) 10Giuseppe Lavagetto: lvs: further fix appservers checks [puppet] - 10https://gerrit.wikimedia.org/r/327531 [16:20:17] (03CR) 10ArielGlenn: [C: 032] still more pylint of script that produces list of last good dumps [puppet] - 10https://gerrit.wikimedia.org/r/326416 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [16:21:11] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] lvs: further fix appservers checks [puppet] - 10https://gerrit.wikimedia.org/r/327531 (owner: 10Giuseppe Lavagetto) [16:21:24] (03PS2) 10Giuseppe Lavagetto: lvs: further fix appservers checks [puppet] - 10https://gerrit.wikimedia.org/r/327531 [16:21:33] (03PS3) 10ArielGlenn: list last n good dumps: add (unimplemented) option for dumping for rsyncers [puppet] - 10https://gerrit.wikimedia.org/r/326417 (https://phabricator.wikimedia.org/T152954) [16:21:38] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] lvs: further fix appservers checks [puppet] - 10https://gerrit.wikimedia.org/r/327531 (owner: 10Giuseppe Lavagetto) [16:22:31] (03PS1) 10Jcrespo: Add colors for all interactive clients and a smarter prompt [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/327533 [16:22:35] !log shutting down elasticsearch on elastic2006 for IO benchmarking - T153083 [16:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:49] T153083: Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083 [16:22:52] !log gehel@puppetmaster1001 conftool action : set/pooled=no; 
selector: name=elastic2006.codfw.wmnet [16:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:19] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 15474 bytes in 0.203 second response time [16:24:19] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [16:25:20] (03PS2) 10Jcrespo: mariadb: New mariadb::packages_client: Add colors a smarter prompt [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/327533 [16:26:49] (03PS4) 10ArielGlenn: list last n good dumps: add (unimplemented) option for dumping for rsyncers [puppet] - 10https://gerrit.wikimedia.org/r/326417 (https://phabricator.wikimedia.org/T152954) [16:27:06] (03PS3) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [16:28:03] (03PS3) 10Jcrespo: mariadb: New mariadb::packages_client: Colors & a smarter prompt [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/327533 [16:28:24] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [16:28:27] (03PS4) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [16:29:13] (03CR) 10ArielGlenn: [C: 032] list last n good dumps: add (unimplemented) option for dumping for rsyncers [puppet] - 10https://gerrit.wikimedia.org/r/326417 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [16:29:34] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [16:30:18] (03PS7) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [16:30:27] (03PS4) 10Elukey: [WIP] Yandex ClickHouse puppetization [puppet] - 10https://gerrit.wikimedia.org/r/325797 
(https://phabricator.wikimedia.org/T150343) [16:31:36] (03PS1) 10Jcrespo: Move role::mariadb::client to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/327536 (https://phabricator.wikimedia.org/T150850) [16:31:53] (03CR) 10ArielGlenn: [C: 032] list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [16:34:15] (03PS5) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [16:36:01] !log cleaning up unused kartotherian marker metrics from graphite - T150254 [16:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:14] T150254: cleanup marker metrics published by Kartotherian - https://phabricator.wikimedia.org/T150254 [16:36:33] (03PS6) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [16:37:44] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:43:58] (03PS5) 10Filippo Giunchedi: Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) [16:45:38] (03PS1) 10Yuvipanda: Re-add docker.gpg [puppet] - 10https://gerrit.wikimedia.org/r/327537 [16:46:44] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 710944 msg: ocg_render_job_queue 499 msg [16:46:51] (03CR) 10Marostegui: [C: 031] mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [16:46:54] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 710952 msg: ocg_render_job_queue 469 msg [16:46:54] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 710952 msg: ocg_render_job_queue 469 msg [16:47:04] (03CR) 10Filippo Giunchedi: [] Add hhvm_exporter role and class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323079 
(https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [16:47:45] 06Operations, 06Labs, 10Labs-Infrastructure: labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279#2877061 (10Dzahn) Yes, checked on labnet1001 and it also occurs there. A _lot_ of the exact deprecation warning as above. Here the log is even 13GB, it just didn'... [16:47:47] (03CR) 10Yuvipanda: [V: 032 C: 032] "Should hopefully be able to revert later in the day." [puppet] - 10https://gerrit.wikimedia.org/r/327537 (owner: 10Yuvipanda) [16:47:53] (03CR) 10Jcrespo: [C: 032] mariadb: New mariadb::packages_client: Colors & a smarter prompt [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/327533 (owner: 10Jcrespo) [16:48:05] 06Operations, 06Labs, 10Labs-Infrastructure: labnet/ labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279#2877063 (10Dzahn) [16:48:08] (03PS7) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [16:49:24] 06Operations, 06Labs, 10Labs-Infrastructure: labnet/ labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279#2875187 (10Dzahn) Looks like this is where the oslo_utils get imported: modules/openstack/files/liberty/nova/virt-libvirt-driver:from oslo_utils import t... [16:50:24] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.132 second response time [16:52:20] 06Operations, 06Labs, 10Labs-Infrastructure: labnet/ labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279#2877085 (10AlexMonk-WMF) yeah but virt-libvirt-driver doesn't use strtime, does it? this file (nova/contex.py) would come from the nova package [16:58:12] <_joe_> incoming... 
[16:58:12] (03PS1) 10Giuseppe Lavagetto: imagescalers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327538 [16:58:14] (03PS1) 10Giuseppe Lavagetto: appservers: add TLS termination to canaries in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327539 [16:58:16] (03PS1) 10Giuseppe Lavagetto: api: add TLS termination to the canaries in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327540 [16:58:18] (03PS1) 10Giuseppe Lavagetto: api: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327541 [16:58:20] (03PS1) 10Giuseppe Lavagetto: appservers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327542 [16:58:23] (03PS1) 10Giuseppe Lavagetto: conftool-data: add nginx as a service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327543 [16:58:25] (03PS1) 10Giuseppe Lavagetto: lvs: enable https endpoints for mw in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327544 [16:58:59] (03PS8) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1700). Please do the needful. 
[17:03:05] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [17:07:18] (03PS9) 10Jcrespo: mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 [17:09:09] (03CR) 10Marostegui: [C: 031] mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [17:09:46] (03CR) 10Jcrespo: [C: 032] mariadb: Tune mariadb clients only install [puppet] - 10https://gerrit.wikimedia.org/r/327523 (owner: 10Jcrespo) [17:09:54] (03PS2) 10Giuseppe Lavagetto: imagescalers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327538 [17:11:12] yuvipanda, can I merge? [17:11:30] jynus: whoops, yes you can [17:14:44] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:14:54] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:03] <_joe_> jynus: ^^ you I guess? [17:15:15] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:34] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:44] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:44] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:47] arg [17:15:54] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:16:23] (03PS1) 10Jcrespo: Revert "mariadb: Tune mariadb clients only install" [puppet] - 10https://gerrit.wikimedia.org/r/327546 [17:16:24] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:16:44] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:16:44] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:16:53] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "mariadb: Tune mariadb clients only install" [puppet] - 10https://gerrit.wikimedia.org/r/327546 (owner: 10Jcrespo) [17:17:24] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:17:44] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:17:44] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:17:44] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:17:44] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:04] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:24] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:24] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:18:44] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:18:54] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:14] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:15] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:24] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Tune mariadb clients only install"" [puppet] - 10https://gerrit.wikimedia.org/r/327548 [17:19:24] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:44] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:55] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:20:14] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:20:34] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:32:48] 06Operations, 05Prometheus-metrics-monitoring: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408#2877276 (10fgiunchedi) 05duplicate>03Open a:03fgiunchedi [17:34:49] (03CR) 10Giuseppe Lavagetto: [C: 032] imagescalers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327538 (owner: 10Giuseppe Lavagetto) [17:34:56] (03PS19) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [17:34:57] (03PS19) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [17:35:00] (03PS17) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:35:01] (03PS1) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 [17:35:04] (03PS3) 10Giuseppe Lavagetto: imagescalers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327538 [17:35:06] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] imagescalers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327538 (owner: 10Giuseppe Lavagetto) [17:36:54] (03PS2) 10Jcrespo: Revert "Revert "mariadb: Tune mariadb clients only install"" [puppet] - 10https://gerrit.wikimedia.org/r/327548 [17:37:45] (03PS3) 10Jcrespo: Revert "Revert "mariadb: Tune mariadb clients only install"" [puppet] - 10https://gerrit.wikimedia.org/r/327548 [17:38:51] (03CR) 10jenkins-bot: [V: 04-1] cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:40:12] 06Operations, 10Ops-Access-Requests, 06Analytics-Kanban: Requesting access to Analytics 
production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2877322 (10Nuria) [17:40:52] (03CR) 10Jcrespo: [C: 032] Revert "Revert "mariadb: Tune mariadb clients only install"" [puppet] - 10https://gerrit.wikimedia.org/r/327548 (owner: 10Jcrespo) [17:41:39] (03PS18) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:41:41] (03PS2) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 [17:42:44] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:42:46] (03CR) 10jenkins-bot: [V: 04-1] cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:42:49] (03CR) 10Dzahn: [C: 031] "re "repo moving fast" the patch is over a year old (nothing should have to wait that long, yes)" [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [17:42:54] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:43:34] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:43:44] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:44:14] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:44:44] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:44:44] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:44:54] RECOVERY - puppet last run on db2059 
is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:45:18] (03PS19) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:45:20] (03PS3) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 [17:45:24] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:45:34] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:44] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:45:59] (03CR) 10BryanDavis: [] "Based on T127792#2695552 the tables have been created. Is it time to actually flip this switch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [17:46:04] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:46:21] (03CR) 10jenkins-bot: [V: 04-1] cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:46:24] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:46:24] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:46:34] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:46:44] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:46:44] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, 
last run 52 seconds ago with 0 failures [17:46:45] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:47:14] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:47:24] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:47:44] (03PS20) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:47:44] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:47:46] (03PS4) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 [17:47:49] 06Operations, 10Traffic: String query string in varnish upload - https://phabricator.wikimedia.org/T153336#2877339 (10fgiunchedi) [17:47:54] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:47:54] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:48:03] 06Operations, 10Traffic: Strip query string in varnish upload - https://phabricator.wikimedia.org/T153336#2877354 (10fgiunchedi) [17:48:05] you know you're on the right track with a simple verifiable improvement when you reach PS20 :P [17:48:14] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:48:14] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:48:34] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:48:36] "It is just a matter of " [17:48:45] PROBLEM - puppet last run on mw1294 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:36] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2877357 (10mark) Approved. [17:49:44] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:50:20] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2877358 (10mark) Approved. [17:50:27] * mafk just found https://meta.wikimedia.org/wiki/Requests_for_comment/Move_the_WMF_and_Servers_to_Iceland [17:51:08] Do we care about earthquakes for the servers? [17:51:23] yes [17:51:37] Are they in earthquake zones now? [17:51:41] when we picked ulsfo they had a map detailing earthquake risk [17:51:41] Ignoring ULSFO [17:51:44] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:07] mafk: RESOLVED DUPE [17:52:13] xD [17:52:37] It seems a bit far-fetched [17:52:38] well, on the Svalbard Island there's a big old seed bunker [17:52:38] "If the WMF and the servers of Wikipedia are moved to Iceland now, this would make the adoption of the constitution more probable" [17:52:47] yep ^ [17:52:50] We'll only adopt this IF wikipedia moves to us [17:52:51] Wat. [17:52:56] that makes no sense to me, respectfully [17:52:59] people will complain about it being slow real quick [17:53:01] Someone should ask avar [17:53:07] Are we moving caching too? 
[17:53:28] well they don't seem to realize the difference between main dc and caching center yet [17:53:31] comparing it to SF [17:53:34] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:53:50] I don't think we should move to favor adoption of a Constitution [17:53:50] mutante: Remember, it'd be cheaper to run Wikipedia on AWS [17:53:54] As that person posted [17:54:00] sounds more like "move esams to Iceland" ? [17:54:27] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2877371 (10faidon) >>! In T144431#2848675, @GWicke wrote: > You are right that we finally need more clarity on our longer-term architectural direction and prioriti... [17:54:37] I thought esams is located somewhere where earthquakes don't happen much? [17:54:43] Reedy: they did the math ?:o [17:54:49] paladox: no, it'll just get flooded [17:54:50] (03PS1) 10BBlack: cache_upload: strip all query params [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) [17:54:54] LOL [17:55:02] mutante: No, it was someone looking at labs... I'll have to try and find the post [17:55:04] that's correct :) [17:55:05] in Iceland won't they freeze? [17:55:06] esams is in The Netherlands iirc [17:55:15] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2877375 (10RobH) a:05mark>03RobH [17:55:16] paladox: cold = save money on air con ? [17:55:20] Oh, that's just below us. 
[17:55:22] lol [17:55:26] It's a few years ago [17:55:29] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2877378 (10RobH) a:05mark>03RobH [17:55:32] Servers make a lot of heat too [17:55:34] (03PS2) 10Giuseppe Lavagetto: appservers: add TLS termination to canaries in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327539 [17:55:46] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] appservers: add TLS termination to canaries in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327539 (owner: 10Giuseppe Lavagetto) [17:55:58] You could reuse the heat to make money by generating power from the heat. [17:55:59] mutante: evo switch ams (AMS being IATA for EHAM, ie Schiphol) [17:56:01] (03PS2) 10Giuseppe Lavagetto: api: add TLS termination to the canaries in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327540 [17:56:06] mafk even [17:56:08] then move them to the Prince Edward Island in Canada :P [17:56:09] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] api: add TLS termination to the canaries in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327540 (owner: 10Giuseppe Lavagetto) [17:56:15] no more air-con issues [17:57:03] Move them to Ireland, apparently you pay little tax there, lol [17:57:15] too bad we don't have profits [17:57:20] as a non-profit [17:57:41] paladox: Liechtenstein or Switzerland better [17:57:54] I though Switzerland is expensive? [17:58:00] though = thought [17:58:01] it's safe [17:58:15] which part of Switzerland, the mountains? [17:58:28] 06Operations, 07Puppet, 10Deployment-Systems, 06Release-Engineering-Team, 05Mediawiki SWAT Deployments: mwdebug1002 should have PHP extensions - https://phabricator.wikimedia.org/T153316#2877386 (10Mattflaschen-WMF) >>! In T153316#2876601, @hashar wrote: > If ones want to test a change via mwscript, I th... 
[17:58:42] all that said, the "safe harbor" thing Iceland wants to do is cool [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1800). [18:00:33] paladox: https://www.deltalis.com/ [18:00:53] oh [18:01:09] what about mt everas? [18:02:12] paladox: probably doesn't have the good connectity to backbone [18:02:20] LOL. [18:03:00] (03Abandoned) 10Jcrespo: Move role::mariadb::client to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/327536 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [18:03:02] (03PS1) 10Filippo Giunchedi: hieradata: set realserver_ips for role prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/327554 (https://phabricator.wikimedia.org/T148408) [18:03:04] (03PS1) 10Filippo Giunchedi: site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/327555 (https://phabricator.wikimedia.org/T148408) [18:04:04] (03CR) 10jenkins-bot: [V: 04-1] site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/327555 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [18:04:24] (03CR) 10Filippo Giunchedi: [C: 031] cache_upload: strip all query params [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) (owner: 10BBlack) [18:05:26] (03PS1) 10Jcrespo: mariadb: Move role::mariadb::client to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/327556 (https://phabricator.wikimedia.org/T150850) [18:05:37] (03PS2) 10Jcrespo: mariadb: Move role::mariadb::client to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/327556 (https://phabricator.wikimedia.org/T150850) [18:07:00] (03PS6) 10Dzahn: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 
10Merlijn van Deen) [18:09:30] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2877405 (10jcrespo) I have enabled TLS on neodymium and sarin, but because the mysql clients there are not using OpenSSL, clients will fail with: ``` ERROR 2026 (HY000): SSL connection err... [18:11:05] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2877415 (10kaldari) 05Open>03Resolved a:03kaldari @Marostegui: The script is finished. Feel free to reinstate the lag checks. [18:11:11] (03PS7) 10Dzahn: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [18:13:50] deltalis <3 [18:13:59] but must be very expensive [18:14:18] !log arlolra@tin Starting deploy [parsoid/deploy@0df8628]: Updating Parsoid to 6719e240 [18:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:56] (03PS2) 10Giuseppe Lavagetto: api: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327541 [18:17:26] (03CR) 10Jcrespo: [C: 032] mariadb: Move role::mariadb::client to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/327556 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [18:20:43] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:20:48] PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CRITICAL slave_io_state could not connect [18:21:11] that looks bad [18:21:23] =[ [18:21:58] RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes [18:22:14] !log arlolra@tin Finished deploy [parsoid/deploy@0df8628]: Updating Parsoid to 6719e240 (duration: 07m 56s) [18:22:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:05] !log Updated Parsoid to 6719e240 (T96555) [18:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:18] T96555: Bug tokenizing commented - https://phabricator.wikimedia.org/T96555 [18:32:42] (03CR) 10Giuseppe Lavagetto: [C: 032] api: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327541 (owner: 10Giuseppe Lavagetto) [18:32:49] (03PS3) 10Giuseppe Lavagetto: api: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327541 [18:33:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] api: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327541 (owner: 10Giuseppe Lavagetto) [18:34:18] (03PS2) 10Filippo Giunchedi: site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/327555 (https://phabricator.wikimedia.org/T148408) [18:38:01] arlolra, can parsoid execute ApiQueryContributors::execute mediawiki api calls? [18:38:36] RECOVERY - Juniper alarms on asw-a-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [18:40:18] (03PS1) 10Eevans: enable instance restbase1017-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327560 (https://phabricator.wikimedia.org/T151086) [18:40:56] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2877633 (10Papaul) The IDRAC extension was loose from the main board. i open open the server to realize that there were no screws attaching the IDRAC extension to the main - board . since... [18:41:19] (03CR) 10Eevans: [C: 04-1] "Not yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/327560 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [18:42:29] (03PS2) 10Giuseppe Lavagetto: appservers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327542 [18:42:52] (03CR) 10Filippo Giunchedi: [C: 032] "> just out of curiosity, why labs ferm rules are important? I think" [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [18:42:58] (03PS3) 10Filippo Giunchedi: Add the prometheus-apache-exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [18:44:15] jynus: it does not include prop=contributors, if that's what you're asking [18:44:27] why [18:45:32] well, we just had a server outage seconds after one of your deploys [18:45:43] I want to know if it is related [18:45:51] <_joe_> ouch [18:46:02] <_joe_> I brainfarted and submitted the change [18:46:23] <_joe_> elukey: I just submitted https://gerrit.wikimedia.org/r/327240 by error [18:46:28] <_joe_> I wanted to comment :/ [18:46:40] <_joe_> well, it's not going to do any harm for now, right? 
[18:46:43] _joe_, revert, deploy, revert, continue the conversation [18:46:49] easy [18:47:03] <_joe_> jynus: no need to revert, in fact, that's what I am saying [18:47:30] <_joe_> but yeah elukey is not here, heh [18:47:34] _joe_: no that's fine it won't do anything [18:47:40] I was going to merge it anyways [18:47:52] <_joe_> ok cool [18:47:56] jynus: ok [18:48:00] thanks [18:48:12] (03CR) 10Giuseppe Lavagetto: [C: 032] appservers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327542 (owner: 10Giuseppe Lavagetto) [18:48:17] arlolra, you can see the effect here: https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1065&from=1481824079215&to=1481827679215 [18:48:19] (03PS3) 10Giuseppe Lavagetto: appservers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327542 [18:48:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] appservers: add TLS termination in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327542 (owner: 10Giuseppe Lavagetto) [18:48:32] arlolra, do you think it could be related to your deploy? [18:49:22] this is what i deployed [18:49:22] https://www.mediawiki.org/wiki/Parsoid/Deployments#Thursday.2C_December_15.2C_2016_around_10:15_am_PT:_6719e240_to_be_deployed [18:49:29] nothing changes api calls [18:50:08] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2877690 (10Marostegui) Thanks for the heads up - I have now removed the downtimes. [18:50:20] hmm, well,
ext tags now use the preprocessor [18:50:46] so those would be additional calls [18:51:21] ok, that would explain it [18:51:25] (03PS6) 10Filippo Giunchedi: Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) [18:51:30] not a problem now [18:51:50] but it saturated one of the 2 mysql enwiki api servers [18:52:12] you have the calls here: https://tendril.wikimedia.org/report/slow_queries?host=db1065&hours=1 [18:52:55] the main issue seemed the deployment itself, maybe something could be done about that process? [18:57:16] the change would be to using action parse though [18:57:29] _joe_: FYI I'm merging https://gerrit.wikimedia.org/r/#/c/323079 which also isn't going to do anything on the appservers [18:57:38] <_joe_> cool [18:57:43] and, in fact the parsoid batching api [18:57:51] jynus: i still don't see how it's related [18:58:04] (03PS1) 10Cmjohnson: Adding dns entries for wdqs1003 [dns] - 10https://gerrit.wikimedia.org/r/327562 [18:58:44] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [18:59:46] it is ok, I will file a new bug on phabricator [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1900). [19:00:04] gehel: Dear anthropoid, the time has come. Please deploy Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T1900). [19:00:07] (03CR) 10Yuvipanda: [] "This is great! I left a couple minor nits." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [19:01:30] * tto is present and correct for SWAT [19:01:53] I can SWAT. huh, jouncebot didn't ping folks with patches: James_F ping [19:02:01] * James_F waves. [19:02:11] tto: hello [19:02:34] jynus: ok, thanks [19:02:59] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for wdqs1003 [dns] - 10https://gerrit.wikimedia.org/r/327562 (owner: 10Cmjohnson) [19:03:29] (03PS7) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [19:03:53] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:18] (03PS3) 10Thcipriani: Enable per-page language choice on wikis with Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209) (owner: 10TTO) [19:04:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209) (owner: 10TTO) [19:05:33] (03Merged) 10jenkins-bot: Enable per-page language choice on wikis with Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327416 (https://phabricator.wikimedia.org/T153209) (owner: 10TTO) [19:06:31] 06Operations, 10ops-eqiad, 10netops: asw-a2-eqiad PEM 0 not powered - https://phabricator.wikimedia.org/T153273#2877751 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Replaced both fuses on the B side on that phase. Both sides are powered. The alarm cleared cmjohnson@asw-a-eqiad> show chassis alarms... 
[19:06:34] tto: your change is live on mwdebug1002, check please [19:07:19] will do [19:08:56] thcipriani, it seems I don't have sysop rights at any of the affected wikis [19:09:24] I can see that Special:PageLanguage now exists (it shows a permission error instead of a "no such special page" error), but I can't actually confirm whether it works [19:09:39] (03PS1) 10Filippo Giunchedi: prometheus: add job definition for apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327565 (https://phabricator.wikimedia.org/T147316) [19:10:36] tto: hrm, I don't have those rights either, but it sounds like the affected change has happened, so I can push live if you're fine with it. [19:11:14] thcipriani, yes, I'm confident that it works [19:12:15] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add job definition for apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327565 (https://phabricator.wikimedia.org/T147316) (owner: 10Filippo Giunchedi) [19:12:20] ok, going live [19:13:29] 06Operations, 10ops-eqiad: Rack and setup wdqs1003 - https://phabricator.wikimedia.org/T153349#2877783 (10Cmjohnson) [19:13:33] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2877798 (10RobH) [19:14:12] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:327416|Enable per-page language choice on wikis with Translate]] T153209 (duration: 00m 42s) [19:14:19] ^ tto live everywhere [19:14:25] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2799514 (10RobH) 05Open>03Resolved With the creation/processing of the setup task T153350, this... 
[19:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:27] T153209: Actually enable Special:PageLanguage on all Translate wikis - https://phabricator.wikimedia.org/T153209 [19:14:30] (03PS2) 10Filippo Giunchedi: prometheus: add job definition for apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327565 (https://phabricator.wikimedia.org/T147316) [19:14:49] thcipriani: Excellent! Thanks [19:15:54] James_F: https://gerrit.wikimedia.org/r/#/c/327429/ live on mwdebug1002, check please [19:16:05] Checking. [19:16:50] (03PS1) 10Jcrespo: mariadb-backups: Fix x1 backups [puppet] - 10https://gerrit.wikimedia.org/r/327568 (https://phabricator.wikimedia.org/T151999) [19:17:47] thcipriani: Yup, LGTM. [19:17:53] James_F: ok, going live [19:18:35] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2877826 (10RobH) [19:20:01] !log thcipriani@tin Synchronized php-1.29.0-wmf.6/extensions/VisualEditor/modules/ve-mw: SWAT: [[gerrit:327429|Resolve URLs in show preview against correct base]] T153277 (duration: 00m 40s) [19:20:08] ^ James_F live everywhere [19:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:13] T153277: Links in show preview resolve to wrong URL in NWE - https://phabricator.wikimedia.org/T153277 [19:20:17] Thank you. [19:20:34] * thcipriani waits on jenkins a while [19:20:41] Don't we all? 
:-) [19:20:51] :) [19:22:05] (03PS2) 10Giuseppe Lavagetto: conftool-data: add nginx as a service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327543 [19:23:18] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Fix x1 backups [puppet] - 10https://gerrit.wikimedia.org/r/327568 (https://phabricator.wikimedia.org/T151999) (owner: 10Jcrespo) [19:24:40] (03PS3) 10Giuseppe Lavagetto: conftool-data: add nginx as a service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327543 [19:26:26] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] conftool-data: add nginx as a service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327543 (owner: 10Giuseppe Lavagetto) [19:27:29] (03PS1) 10RobH: setting contint2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/327570 [19:28:05] (03CR) 10RobH: [C: 032] setting contint2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/327570 (owner: 10RobH) [19:30:30] James_F: https://gerrit.wikimedia.org/r/#/c/327428 is on mwdebug1002, check please [19:30:41] Kk. [19:31:00] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=imagescaler,dc=eqiad,service=nginx [19:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:56] thcipriani: Yup, LGTM. [19:31:58] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:32:29] James_F: ok, going live [19:34:42] !log thcipriani@tin Synchronized php-1.29.0-wmf.6/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.js: SWAT [[gerrit:327428|Properly clear this.section when switching from VE]] T153276 (duration: 00m 39s) [19:34:54] ^ James_F live everywhere [19:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:56] T153276: Switching to NWE from VE section edit results in corrupted edits - https://phabricator.wikimedia.org/T153276 [19:35:06] thcipriani: Thank you! 
[19:36:45] (03PS1) 10RobH: install module update for contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/327572 [19:37:13] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2877910 (10RobH) [19:37:57] (03CR) 10Aaron Schulz: [C: 031] cache_upload: strip all query params [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) (owner: 10BBlack) [19:39:39] (03CR) 10RobH: [C: 032] install module update for contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/327572 (owner: 10RobH) [19:39:50] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes:weight=20; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw11.* [19:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:34] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes:weight=20; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw12[0-3].* [19:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:05] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes:weight=25; selector: cluster=api_appserver,dc=eqiad,service=apache2,name=mw12[7-9].* [19:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:39] thcipriani: would you ping me when SWAT deploy is done? I'd like to squeeze in a mobileapps deploy. [19:46:58] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=appserver,dc=eqiad,service=nginx [19:47:00] bearND: SWAT is complete [19:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:22] (03PS1) 10RobH: setting contint2001 ipv6 entries [dns] - 10https://gerrit.wikimedia.org/r/327573 [19:47:29] thcipriani: What's this puppet logging I see here? 
[19:47:41] (03CR) 10RobH: [C: 032] setting contint2001 ipv6 entries [dns] - 10https://gerrit.wikimedia.org/r/327573 (owner: 10RobH) [19:48:20] bearND: not sure, looks like folks changing conftool appserver pooling/config [19:49:33] thcipriani: ok, I guess I'll just go ahead. Thanks! [19:51:17] !log bsitzmann@tin Starting deploy [mobileapps/deploy@c9b7386]: Scap: Add BetaCluster deployment configuration [19:51:18] !log oblivian@puppetmaster1001 conftool action : set/weight=30; selector: cluster=appserver,dc=eqiad,service=nginx,name=mw12[67].* [19:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:40] !log bsitzmann@tin Finished deploy [mobileapps/deploy@c9b7386]: Scap: Add BetaCluster deployment configuration (duration: 01m 22s) [19:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:38] (03PS1) 10ArielGlenn: fix up checkpoint file sorting [dumps] - 10https://gerrit.wikimedia.org/r/327575 [19:57:59] (03CR) 10ArielGlenn: [C: 032] fix up checkpoint file sorting [dumps] - 10https://gerrit.wikimedia.org/r/327575 (owner: 10ArielGlenn) [19:58:35] !log ariel@tin Starting deploy [dumps/dumps@2c675c4]: (no message) [19:58:37] !log ariel@tin Finished deploy [dumps/dumps@2c675c4]: (no message) (duration: 00m 02s) [19:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:23] (03PS2) 10Giuseppe Lavagetto: lvs: enable https endpoints for mw in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327544 [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T2000). Please do the needful. 
[20:00:59] (03PS1) 10RobH: set contint2001 to wrong partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/327576 [20:01:47] (03CR) 10RobH: [C: 032] set contint2001 to wrong partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/327576 (owner: 10RobH) [20:02:04] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: enable https endpoints for mw in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327544 (owner: 10Giuseppe Lavagetto) [20:03:09] (03PS3) 10Giuseppe Lavagetto: lvs: enable https endpoints for mw in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327544 [20:03:14] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] lvs: enable https endpoints for mw in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/327544 (owner: 10Giuseppe Lavagetto) [20:05:24] (03CR) 10Dzahn: [] add tendril[12]001, v4 and v6 IPs (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [20:06:27] (03PS2) 10Dzahn: add dbmon[12]001, v4 and v6 IPs [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) [20:06:58] 06Operations, 10ops-codfw: update label/racktables visible label for contint2001/WMF6404 - https://phabricator.wikimedia.org/T153355#2878007 (10RobH) [20:07:10] (03CR) 10Dzahn: [C: 04-1] add dbmon[12]001, v4 and v6 IPs [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [20:07:22] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2877798 (10RobH) [20:09:08] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [20:09:52] (03CR) 10Dzahn: [C: 04-1] add dbmon[12]001, v4 and v6 IPs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [20:11:09] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2878046 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1073.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/2... [20:13:16] (03CR) 10Rush: [C: 031] "my eyeball compare says this ok" [puppet] - 10https://gerrit.wikimedia.org/r/327522 (owner: 10Yuvipanda) [20:13:23] <_joe_> !log restarting pybal low-traffic in eqiad to pick up new TLS endpoints for appservers, T153042 [20:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:34] T153042: Enable TLS termination on the MediaWiki clusters - https://phabricator.wikimedia.org/T153042 [20:14:23] (03CR) 10Andrew Bogott: [C: 031] "Yep, I've confirmed that all of those projects are marked as deleted during the 2016 project purge (except for megachron which must've bee" [puppet] - 10https://gerrit.wikimedia.org/r/327522 (owner: 10Yuvipanda) [20:15:25] (03PS3) 10Dzahn: add dbmon[12]001, v4 and v6 IPs [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) [20:17:30] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2878051 (10Joe) [20:17:32] 06Operations, 10Traffic, 07HHVM, 13Patch-For-Review, and 2 others: Enable TLS termination on the MediaWiki clusters - https://phabricator.wikimedia.org/T153042#2878050 (10Joe) 05Open>03Resolved [20:25:57] (03PS4) 10Dzahn: introduce dbmonitor, add 
dbmonitor[12]001, v4 and v6 [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) [20:28:21] (03PS5) 10Dzahn: introduce dbmonitor, add dbmonitor[12]001, v4 and v6 [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) [20:29:18] Dereckson, will you need all of two hours after the train? [20:29:27] cc: twentyafterfour [20:30:09] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2878062 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1073.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['db1073.eqiad.wmnet']) ``` [20:30:10] (03CR) 10Dzahn: [] "alright, so "dbmonitor" as a new type of server name, and i corrected the rows as pointed out by Alex, better?" [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [20:32:16] (03PS1) 10Cmjohnson: Adding dhcpd entry for wdqs1003 [puppet] - 10https://gerrit.wikimedia.org/r/327578 [20:33:59] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entry for wdqs1003 [puppet] - 10https://gerrit.wikimedia.org/r/327578 (owner: 10Cmjohnson) [20:34:49] 06Operations, 10ops-eqiad: Rack and setup wdqs1003 - https://phabricator.wikimedia.org/T153349#2878079 (10Cmjohnson) [20:36:06] yurik: what do you need to deploy and when ideally? [20:36:50] Dereckson, there are a few fixes that i would like to deploy for interactive, and it might require a scap [20:37:08] so i'm uneasy about doing it during swat (it failed yesterday for some unknown reason) [20:37:24] yurik: scap on MediaWiki code? 
[20:37:42] Dereckson, i18n rebuild [20:37:54] so yes, full scap :( [20:38:08] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:38:10] (03PS8) 10Dzahn: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [20:38:12] i could avoid it and just fix the metrics stuff [20:38:24] if time is scarce [20:39:23] !log stopping db1052's mysql for maintenance and clone to db1073 [20:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:40] Plan that during the SWAT. Even if I finish early, the scap time will eat swat window. A full scap is acceptable for i18n purpose. If you want to test code part before the full scap, what about divide in two parts (at deployment), code first, then i18n if working? [20:39:48] (03CR) 10Dzahn: [C: 032] "tested with apache-fast-test (tin -> mwdebug1001)" [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [20:41:14] 06Operations, 10MediaWiki-API, 10Monitoring, 10Parsoid, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2878093 (10Krinkle) [20:42:38] Dereckson, yeah, i guess i will have to do that, thx. Poke me when done? I'm not sure how fast the train will run today [20:42:52] * Dereckson nods. [20:45:06] (03PS4) 10Dzahn: hhvm::admin: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304476 (owner: 10Muehlenhoff) [20:46:24] jouncebot: refresh [20:46:25] jouncebot: next [20:46:27] I refreshed my knowledge about deployments. [20:46:27] In 1 hour(s) and 13 minute(s): Create arbcom-cs.wikipedia.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T2200) [20:46:58] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:47:00] (03PS3) 10Krinkle: tests: Clean up PHPUnit tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325054 [20:47:52] yurik: by the way, group2 is still at 1.29.0-wmf.5 (train hasn't been done yet) [20:48:39] i guess its running a bit late... hope everything checks out [20:48:58] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:49:09] (03CR) 10Dzahn: "already had +2 from joe, but wasn't submitted, done now" [puppet] - 10https://gerrit.wikimedia.org/r/304476 (owner: 10Muehlenhoff) [20:49:10] 06Operations, 06Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2878115 (10Gehel) benchmark with the current server configuration (on elastic2006) is generated with `bonnie++ -d /var/lib/elasticsearch/bonnie/ -n 1024 -m elastic2006_ba... [20:49:11] (03PS8) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [20:51:04] (03CR) 10Andrew Bogott: [C: 032] Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [20:51:36] (03CR) 10Dzahn: "ferm restart without problems on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/304476 (owner: 10Muehlenhoff) [20:52:00] (03PS3) 10Yuvipanda: labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 [20:52:30] (03CR) 10Yuvipanda: [V: 032 C: 032] labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 (owner: 10Yuvipanda) [20:53:21] (03CR) 10Gilles: [] cache_upload: strip all query params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) (owner: 10BBlack) [20:54:51] (03CR) 
10Dzahn: [] "@bblack should we still wait a while? how are the chances to sneak in this new backend" [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [20:55:57] (03CR) 10Dzahn: "sorry for the ridiculous waiting time this patch had, we should do better here" [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [20:57:18] (03CR) 10Dzahn: [] "this was hoping that a Google Code-In person picks it up.. but that did not really happen.." [puppet] - 10https://gerrit.wikimedia.org/r/324033 (https://phabricator.wikimedia.org/T127797) (owner: 10Dzahn) [20:57:56] (03Abandoned) 10Dzahn: puppet-lint.rc: make exception for "no docs" obsolete :) [puppet] - 10https://gerrit.wikimedia.org/r/324033 (https://phabricator.wikimedia.org/T127797) (owner: 10Dzahn) [21:00:23] (03CR) 10Dzahn: [] "@Alex what do you think, now that neon is gone this can be done, right?" [puppet] - 10https://gerrit.wikimedia.org/r/318442 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [21:07:34] !log Deploying 1.29.0-wmf.6 to all wikis once https://gerrit.wikimedia.org/r/#/c/327582/1 merges [21:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:28] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2878168 (10RobH) [21:14:52] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2878171 (10RobH) [21:14:54] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2877798 (10RobH) 05Open>03Resolved Handed off to cont-int for use via parent task. 
[21:20:59] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:25:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4434869 keys, up 45 days 13 hours - replication_delay is 602 [21:34:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4422896 keys, up 45 days 13 hours - replication_delay is 0 [21:38:44] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2878240 (10RobH) Host selenium/WMF4724 has been allocated for mediawiki log host use in eqiad to replace fluorine. Setup will be done via sub-task. [21:40:45] 06Operations, 10hardware-requests: setup/install selenium/WMF4724 - https://phabricator.wikimedia.org/T153361#2878252 (10RobH) [21:41:03] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2866197 (10RobH) 05Open>03Resolved [21:41:05] 06Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2878268 (10RobH) [21:44:26] (03PS1) 10Jdrewniak: Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327589 [21:44:37] (03CR) 10jenkins-bot: [V: 04-1] Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327589 (owner: 10Jdrewniak) [21:47:23] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327590 [21:47:34] (03Abandoned) 10Jdrewniak: Bumping portal to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327589 (owner: 10Jdrewniak) [21:48:59] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:51:24] 06Operations, 
10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2878341 (10hashar) Thank you ! [21:54:51] (03PS1) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [21:55:07] (03CR) 10jenkins-bot: [V: 04-1] icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [21:58:29] MaxSem, here's an example -- https://grafana-admin.wikimedia.org/dashboard/db/interactive-team-kpi?panelId=25&fullscreen&edit [21:59:00] example of what? [21:59:12] (03PS2) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [21:59:59] (03CR) 10jenkins-bot: [V: 04-1] icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [22:00:05] Dereckson: Dear anthropoid, the time has come. Please deploy Create arbcom-cs.wikipedia.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T2200). 
[22:00:21] jouncebot: train still going on [22:01:53] (03CR) 10Dzahn: [] "well, the puppet-tox-jessie fail it shows now is because the 2 .py files get checked now and weren't before, because they get identified a" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [22:07:22] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.6/extensions/MobileFrontend/includes/api/ApiWebappManifest.php: fix T153250 (duration: 00m 41s) [22:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:38] T153250: Warning: No such file or directory in /srv/mediawiki/php-1.29.0-wmf.6/extensions/MobileFrontend/includes/api/ApiWebappManifest.php on line 31 - https://phabricator.wikimedia.org/T153250 [22:07:54] jouncebot: now [22:07:54] For the next 1 hour(s) and 52 minute(s): Create arbcom-cs.wikipedia.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T2200) [22:08:10] (03PS1) 10Hashar: contint: provision the secondary CI master [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) [22:08:13] It doesnt seem like jouncebot cares Dereckson [22:09:26] (03CR) 10Papaul: [C: 031] introduce dbmonitor, add dbmonitor[12]001, v4 and v6 [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [22:09:34] Dereckson: yes train is still going I'm sorry it took the past 70 minutes to get one patch through jenkins [22:09:39] (03CR) 10Hashar: [] "Basically copy pasted contint1001. That will get us Jenkins master, make it a slave of itself and enable firewall/backup etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [22:09:41] (03PS1) 10Dzahn: installserver/CI: give shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327595 (https://phabricator.wikimedia.org/T148494) [22:10:05] almost done though [22:10:23] (03CR) 10Paladox: [C: 031] installserver/CI: give shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327595 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [22:10:44] maybe grrrit-wm will kick jenkins to produce faster tests. [22:10:46] (03PS1) 1020after4: all wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327596 [22:10:48] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327596 (owner: 1020after4) [22:11:30] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327596 (owner: 1020after4) [22:12:59] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.6 [22:13:06] (03PS3) 10Filippo Giunchedi: prometheus: add job definition for apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327565 (https://phabricator.wikimedia.org/T147316) [22:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:13] (03PS1) 10Filippo Giunchedi: prometheus: extend recording rules for ops [puppet] - 10https://gerrit.wikimedia.org/r/327607 [22:13:15] (03PS1) 10Filippo Giunchedi: prometheus: add more metrics to be collected globally [puppet] - 10https://gerrit.wikimedia.org/r/327608 [22:14:02] (03CR) 10Dzahn: [C: 04-1] contint: provision the secondary CI master (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [22:14:09] Wasnt all wikis supposed to be yesterday [22:14:29] Zppix: no [22:14:35] (03PS1) 1020after4: group2 
wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327615 [22:14:37] (03CR) 1020after4: [C: 032] group2 wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327615 (owner: 1020after4) [22:14:46] (03CR) 10Dereckson: [] "We need to sync with Krenair to ensure we don't break wikitech static." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [22:14:53] !log rollback due to huge jump in error rate [22:14:57] Zppix they are done over a 3 day period. [22:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:09] first day is mediawiki / test wiki. [22:15:19] I know i thought it went monday tues wednesday [22:15:22] Anything exciting error wise twentyafterfour? [22:15:25] (03Merged) 10jenkins-bot: group2 wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327615 (owner: 1020after4) [22:15:41] oh [22:15:42] Reedy: Warning: Invalid parameter for message "api-help-permissions-granted-to": [22:15:53] Reedy: it was all roses and sunflowers in all seriousness though idk [22:16:21] twentyafterfour: can i have the file to which error was in? [22:16:28] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 wikis to 1.29.0-wmf.5 [22:16:31] I may be able to fix it [22:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:53] "api-help-permissions-granted-to": "{{PLURAL:$1}}Cấp cho: $2", [22:16:56] That can't be right [22:17:15] https://github.com/wikimedia/mediawiki/commits/3280e72c80c284c9761c63e4e6907a14d037c07a/includes/api/ApiMain.php [22:17:19] Mediawiki doesnt support acccents in the code itself does it? 
[22:17:22] Try ^^ which defines that [22:17:40] Invalid parameter for message "timedmedia-in-job-queue" [22:17:48] LOL [22:17:56] that's caused by https://github.com/wikimedia/mediawiki/commit/4e6810e4a2c1d821d8d108c7974ac16917561764 [22:17:58] probably [22:18:08] Welp it looks like we fucked up guys no more adult -- nevermind [22:18:38] :D [22:18:54] I was gonna say the whole thing then realised this is publicly logged [22:18:58] (03CR) 10Alex Monk: [] "Does the export process (see puppet) handle Flow pages properly? And do we need to make any changes to the import script? I can supply a c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [22:19:01] (03PS3) 10Dzahn: Tools: Install php5-readline [puppet] - 10https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: 10Tim Landscheidt) [22:19:14] 06Operations, 10hardware-requests: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878500 (10RobH) [22:19:15] lol [22:19:21] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878252 (10RobH) [22:19:29] (03CR) 10Dzahn: [C: 031] "http://packages.ubuntu.com/search?suite=trusty&section=all&arch=any&keywords=php5-readline&searchon=names" [puppet] - 10https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: 10Tim Landscheidt) [22:19:32] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add job definition for apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327565 (https://phabricator.wikimedia.org/T147316) (owner: 10Filippo Giunchedi) [22:19:44] twentyafterfour: I'll look to see if I can throw a commit to fix it so we don't derail the train.
[22:19:46] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2878517 (10Gilles) [22:19:48] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Traffic, and 4 others: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2878516 (10Gilles) 05Open>03Resolved [22:20:12] Interestingly, it's complaining in en [22:20:21] Really? [22:20:30] Reedy most likely caused by https://github.com/wikimedia/mediawiki/commit/4e6810e4a2c1d821d8d108c7974ac16917561764 [22:20:39] I copyedited those en files last month [22:20:44] So it must be recent [22:20:47] Where this $this->msg( 'api-help-permissions-granted-to' ) is defined it changes [22:20:47] (03CR) 10Dzahn: [C: 032] Tools: Install php5-readline [puppet] - 10https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: 10Tim Landscheidt) [22:20:54] (03PS4) 10Dzahn: Tools: Install php5-readline [puppet] - 10https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: 10Tim Landscheidt) [22:21:01] from ->params( $this->getLanguage()->commaList( $groups ) ) to ->params( Message::listParam( $groups ) ) [22:21:21] https://www.youtube.com/watch?v=RMR5zf1J1Hs [22:21:48] Invalid parameter for message "api-help-permissions-granted-to": a:3:{i:0;s:3:"all";i:1;s:4:"user";i:2;s:3:"bot";} [22:22:09] thanks twentyafterfour for the swat [22:22:28] twentyafterfour This video is not available. [22:22:55] Ozzy Osbourne. [22:23:11] twentyafterfour: https://www.mediawiki.org/w/api.php [22:23:14] Granted to: [INVALID] [22:24:19] 06Operations, 06Performance-Team: Upgrade Grafana to 4.0.2 - https://phabricator.wikimedia.org/T152473#2878554 (10Gilles) 05Open>03Resolved That's done now, right?
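[editor's note] The payload in the 22:21:48 warning, a:3:{i:0;s:3:"all";i:1;s:4:"user";i:2;s:3:"bot";}, is PHP serialize() output. Decoding it shows that the rejected parameter was the raw list of group names, which fits the later suspicion that the list parameter was not being formatted. The Python sketch below is only an illustration, not part of MediaWiki: parse_php_string_list is a hypothetical helper, and it assumes a flat array of ASCII strings with integer keys.

```python
import re
from typing import List

# Matches one element of the form  i:<index>;s:<length>:"  (up to the opening quote).
_ITEM_HEADER = re.compile(r'i:(\d+);s:(\d+):"')


def parse_php_string_list(serialized: str) -> List[str]:
    """Decode a flat PHP-serialized array of strings.

    Hypothetical helper for illustration only. PHP's s:<n> counts bytes,
    so this sketch is only safe for ASCII values such as group names.
    """
    outer = re.fullmatch(r'a:(\d+):\{(.*)\}', serialized, flags=re.DOTALL)
    if outer is None:
        raise ValueError("not a PHP-serialized array")
    expected_count, body = int(outer.group(1)), outer.group(2)

    items = []
    pos = 0
    while pos < len(body):
        header = _ITEM_HEADER.match(body, pos)
        if header is None:
            raise ValueError("unsupported element at offset %d" % pos)
        length = int(header.group(2))
        start = header.end()
        end = start + length
        # Each string value must be closed by  ";
        if body[end:end + 2] != '";':
            raise ValueError("malformed string at offset %d" % start)
        items.append(body[start:end])
        pos = end + 2

    if len(items) != expected_count:
        raise ValueError("declared %d items, found %d" % (expected_count, len(items)))
    return items


if __name__ == "__main__":
    logged = 'a:3:{i:0;s:3:"all";i:1;s:4:"user";i:2;s:3:"bot";}'
    print(parse_php_string_list(logged))
```

Run against the logged payload, the helper yields the three group names all, user and bot, i.e. the unformatted $groups array rather than a rendered comma list.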
[22:24:22] 06Operations, 10ops-codfw: update label/racktables visible label for contint2001/WMF6404 - https://phabricator.wikimedia.org/T153355#2878556 (10Papaul) 05Open>03Resolved Complete [22:24:24] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: setup/install contint2001/WMF6404 - https://phabricator.wikimedia.org/T153350#2878558 (10Papaul) [22:24:27] (03PS2) 10BBlack: cache_upload: strip all query params [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) [22:25:00] (03CR) 10BBlack: [] cache_upload: strip all query params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) (owner: 10BBlack) [22:25:24] (03PS3) 10BBlack: cache_upload: strip all query params [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) [22:25:40] (03PS1) 10RobH: setting mwlog1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/327648 [22:25:51] (03CR) 10BBlack: [V: 032 C: 032] cache_upload: strip all query params [puppet] - 10https://gerrit.wikimedia.org/r/327553 (https://phabricator.wikimedia.org/T153336) (owner: 10BBlack) [22:26:07] Reedy: I don't get it.. that string is in includes/api/i18n/qqq.json [22:26:10] (03CR) 10RobH: [C: 032] setting mwlog1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/327648 (owner: 10RobH) [22:26:14] why's it invalid? :-/ [22:26:21] it looks ok [22:26:28] It works locally too [22:26:35] (03PS2) 10RobH: setting mwlog1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/327648 [22:26:38] There's obviously something weird going on with some WMF groups [22:27:24] we've had to rebuild and resync l10n cache a few times this week... [22:27:47] I don't know if it's a 1l0n cache issue [22:28:05] can we try rebuilding .wmf6 on mediawiki.org ? 
[22:28:24] (03PS1) 10Hashar: zuul: manage service status from hiera [puppet] - 10https://gerrit.wikimedia.org/r/327649 (https://phabricator.wikimedia.org/T150771) [22:28:26] (03PS1) 10Hashar: contint: add a disabled zuul server on contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) [22:30:00] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: extend recording rules for ops [puppet] - 10https://gerrit.wikimedia.org/r/327607 (owner: 10Filippo Giunchedi) [22:30:05] (03PS2) 10Filippo Giunchedi: prometheus: extend recording rules for ops [puppet] - 10https://gerrit.wikimedia.org/r/327607 [22:30:29] twentyafterfour: reproducible in eval.php [22:30:30] var_dump( wfMessage( 'api-help-permissions-granted-to' )->numParams( 3 )->params( Message::listParam( ['a', 'b', 'c'] ) )->parse() ); [22:30:47] hmmm [22:31:03] (03CR) 10Hashar: [] "The idea is to entirely prevent starting Zuul on contint2001. Maybe we will want to do the same for jenkins as well. I am hoping that mask" [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [22:31:06] Message::listParam() is to blame [22:31:07] wait there are only 2 parameters in the source [22:31:32] twentyafterfour: no [22:31:37] It's not what you think [22:31:41] It's not telling it how many parameters there are [22:31:52] It's saying, this parameter (in this case, the first one) is a numerical parameter [22:32:16] (03PS2) 10Filippo Giunchedi: prometheus: add more metrics to be collected globally [puppet] - 10https://gerrit.wikimedia.org/r/327608 [22:32:20] var_dump( Message::listParam( ['a', 'b', 'c'] ) ); is fine [22:32:44] ooh [22:32:48] $warning = 'Invalid list type for message "' . $this->getKey() . '": ' .
[22:32:48] htmlspecialchars( serialize( $param ) ); [22:32:51] $param is undefined [22:33:04] (possibly unrelated) [22:33:27] but it's in a function called formatListParam [22:33:30] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add more metrics to be collected globally [puppet] - 10https://gerrit.wikimedia.org/r/327608 (owner: 10Filippo Giunchedi) [22:34:47] Reedy: where are you seeing this? /me can't find any of this with ack-grep of wmf.6 [22:35:19] I'm looking at the Message.php source code [22:35:24] I just created https://gerrit.wikimedia.org/r/#/c/327652/ as a .6 revert [22:35:37] Left comments on https://gerrit.wikimedia.org/r/321404 [22:36:03] twentyafterfour: https://github.com/wikimedia/mediawiki/blame/master/includes/Message.php#L1307-L1314 [22:36:36] Reedy: thanks [22:37:03] Ugh [22:37:08] * twentyafterfour still doesn't know the l10n code very well [22:37:09] Is that revert enough? [22:37:31] Nope [22:37:41] kaldari: is renaming still disabled? How much work does the script still have to perform? thanks [22:38:07] didn't MatmaRex do something about this? [22:38:09] TabbyCat: Script is done. Renaming should be back on in about 2 hours [22:38:33] jouncebot: now [22:38:33] For the next 1 hour(s) and 21 minute(s): Create arbcom-cs.wikipedia.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161215T2200) [22:38:39] kaldari: thanks, is the script complete on all wikis then or do we have to re-suspend? [22:38:58] (03PS1) 10Kaldari: Revert "Temporarily disable centralauth-rename right" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327657 [22:39:37] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878626 (10RobH) [22:40:04] Krenair: ? [22:40:11] read up [22:40:29] arbcom_cs [22:40:40] did my dns patch get merged already?
[22:40:51] yes [22:40:55] Krenair: i don't see anything up that sounds like i know anything about [22:41:14] MatmaRex, you didn't do some Message::listParam-related fix [22:41:24] https://gerrit.wikimedia.org/r/#/c/325790/ [22:41:47] huh [22:41:47] Krenair: not that i remember [22:42:03] TabbyCat: Yes, the script is complete for all wikis [22:42:06] ah, that might be the commit I remember. not sure why I thought it was Bartosz's [22:42:13] kaldari: awesome <3 [22:42:21] twentyafterfour: CR+2'd a backported fix for this [22:43:05] reedy: sweet [22:43:24] so the revert can be abandoned then? [22:43:46] oh already done [22:44:21] (03PS5) 10Kaldari: mediawiki: Add cron job for PageAssessments maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) [22:44:49] Yeah, Gergo pinged me, so I blocked it and then looked at what he was mentioning :) [22:45:19] ostriches: does anything in particualr need to happen for a gerrit repo to be mirrored to github and/or diffusion? I'm looking at operations/software/hhvm_exporter [22:45:32] godog: yeah [22:45:37] godog: create an empty repo on github [22:45:38] push a commit [22:45:50] Yeah, empty repo on github, it'll pick up replication from there automatically [22:45:50] that should've been done as part of the repository creation [22:45:57] Yeah well, nobody does. 
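[editor's note] The step described at 22:45:37 and 22:45:50 (create an empty repository on GitHub so that Gerrit replication can pick it up) can be sketched as a single REST call. The Python sketch below only builds the request and never sends it. Two things in it are assumptions, not statements from the channel: the mirror name is derived by replacing "/" with "-", inferred from the replication target wikimedia/operations-debs-prometheus-redis-exporter quoted at 22:46:18, and HYPOTHETICAL_TOKEN is a placeholder credential. The helper names mirror_name and build_create_repo_request are likewise made up for illustration.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def mirror_name(gerrit_project: str) -> str:
    """Map a Gerrit project path to a GitHub repository name.

    Assumption: slashes become dashes, as the replication log line for
    operations-debs-prometheus-redis-exporter suggests.
    """
    return gerrit_project.replace("/", "-")


def build_create_repo_request(
    gerrit_project: str,
    org: str = "wikimedia",
    token: str = "HYPOTHETICAL_TOKEN",  # placeholder, not a real credential
) -> urllib.request.Request:
    """Build (but do not send) a request that creates an empty org repository."""
    payload = {
        "name": mirror_name(gerrit_project),
        # Keep the repository empty so replication can push the full history.
        "auto_init": False,
    }
    return urllib.request.Request(
        "%s/orgs/%s/repos" % (GITHUB_API, org),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": "Bearer %s" % token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_repo_request("operations/software/hhvm_exporter")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Sending the request with urllib.request.urlopen would require a token allowed to create repositories in the organization; the sketch stops before that so it can be inspected safely.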
[22:46:08] what a shit repo setup process [22:46:10] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2878663 (10Eevans) [22:46:18] 3fb44d34 22:46:53.131 22:45:53.131 (retry 1599) [bbfd97ba] push git@github.com:wikimedia/operations-debs-prometheus-redis-exporter [22:46:31] (03PS1) 10RobH: mwlog1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/327660 [22:46:47] that's another repo but I'm gues it is equally failing [22:46:57] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2878663 (10Nuria) Approved on my end. [22:46:58] mmm [22:47:22] (03CR) 10RobH: [C: 032] mwlog1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/327660 (owner: 10RobH) [22:48:23] geez latency to bast2001 just went to shit... this is going to be fun [22:48:25] (03CR) 10Hashar: [] "Thanks! next patch shuffle the hiera configuration around." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [22:48:27] Created on github. [22:48:33] 64 bytes from bast1001.wikimedia.org (208.80.154.149): icmp_seq=73 ttl=53 time=1083 ms [22:48:36] (03PS2) 10Hashar: contint: provision the secondary CI master [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) [22:48:46] try another bastion? [22:48:56] twentyafterfour: stop sharing linux isos [22:48:58] Reedy: Easier than pushing something, just kick off a replication job for a repo :) [22:49:01] lol [22:49:11] I think it's my connection not the bastion [22:49:52] nearly merged [22:49:53] ostriches: thanks! is it supposed to happen automatically or I missed a step? 
[22:50:06] No, it doesn't happen automatically [22:50:14] Used to, but the plugin broke and I never rewrote it [22:50:25] +I lost the original source code [22:50:33] unzip github.jar [22:50:56] Well, there is no more jar file, it was on $gerrit_host-2 [22:51:09] (03CR) 10Hashar: [] "Puppet compiler fails because it lacks facts for contint2001:" [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [22:51:11] heh [22:51:16] lost all the things [22:51:21] respawn $gerrit_host-2 [22:51:52] anyways, thanks ostriches Reedy Krenair [22:52:15] yw [22:53:22] php55 tests are slow [22:53:57] lovely 64 bytes from ord37s03-in-f4.1e100.net (172.217.6.100): icmp_seq=13 ttl=54 time=3518 ms [22:54:08] https://developer.github.com/v3/repos/#create [22:54:28] twentyafterfour who is that from? [22:54:46] that's me pinging www.google.com [22:54:58] Oh [22:55:07] 10 packets transmitted, 8 received, 20% packet loss, time 9021ms [22:55:43] now I'm afraid to try scapping because my connection won't be reliable enough [22:56:55] 22:56:50 SSSSSSSSSSSSS.............................................. 
12449 / 13267 ( 93%) [22:57:51] twentyafterfour: mosh + tmux gives good results in such circumstances [22:58:51] I'll deploy that fix [22:58:54] See if it fixes mw.org [23:00:37] twentyafterfour mine's 59 packets transmitted, 59 received, 0% packet loss, time 58073ms [23:00:55] paladox: that's how it should look [23:01:04] oh [23:01:05] packet loss is very bad [23:01:07] !log reedy@tin Synchronized php-1.29.0-wmf.6/includes/Message.php: Fix buggy parameter handling in Message::params (duration: 00m 42s) [23:01:10] yep [23:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:39] twentyafterfour: That's fixed it [23:01:54] twentyafterfour though there is a difference between my ISP and yours (it does depend on speeds) [23:02:06] No [23:02:10] Latency does not depend on speed [23:02:26] oh [23:02:29] yeah latency and packet loss are more important than throughput [23:02:41] especially for an ssh session [23:02:48] oh, mine is on the fast path. [23:03:09] it also depends on who is online too? (using same router). [23:03:18] also depends on the router too [23:03:32] It depends how much traffic is going down the same connections [23:03:45] And distance between you and the remote [23:03:51] yep [23:04:02] my nearest cabinet is up the street. [23:04:08] Doesn't matter [23:04:13] It's still backhauled to Telehouse in London [23:04:24] Before it gets to the public internet [23:04:43] oh. [23:04:44] * Reedy looks at logstash [23:04:49] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:05:00] but isn't it the distance from your cabinet to your house? (FTTC) [23:05:08] That affects your speed [23:05:13] !log update change-prop to 61cfbb2a [23:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:34] oh [23:05:40] what about latency?
[23:05:41] Wouldn't have any noticeable affect on latency [23:05:47] Unless your cabinet is congested [23:05:54] But then, it's back to how saturated the links are [23:05:59] Oh [23:06:38] !log ppchelko@tin Starting deploy [changeprop/deploy@61cfbb2]: (no message) [23:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:43] !log ppchelko@tin Finished deploy [changeprop/deploy@61cfbb2]: (no message) (duration: 01m 04s) [23:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:27] I'm seeing a load of Rejected set() for commonswiki:page:6:002d6e6540417fcf201cd893be3b32d5cdc63a25 due to snapshot lag. [23:09:32] Look like they're known [23:09:45] how bad is it with the train? [23:10:46] yurik: I think we're good again... Just need to push group2 back [23:11:00] thx :) [23:11:33] twentyafterfour: Want me to push group2 back again? [23:11:47] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2878834 (10Deskana) [23:11:53] (03PS1) 10Andrew Bogott: Keystone hooks: monkeypatch keystone to change project id to project name [puppet] - 10https://gerrit.wikimedia.org/r/327664 (https://phabricator.wikimedia.org/T150091) [23:12:10] Reedy: I can do it now I think [23:12:16] Cool :) [23:12:18] (03Abandoned) 10Andrew Bogott: Keystone hook: Change project id to == project name [puppet] - 10https://gerrit.wikimedia.org/r/324928 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [23:18:04] (03PS1) 1020after4: all wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327665 [23:18:06] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327665 (owner: 1020after4) [23:18:46] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327665 (owner: 1020after4) 
[23:19:50] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.6 [23:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:24] (PS1) Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) [23:22:29] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:26:06] !log 1.29.0-wmf.6: done [23:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:39] (PS1) Jcrespo: Repool db1073 after maintenance as enwiki extra API node [mediawiki-config] - https://gerrit.wikimedia.org/r/327668 (https://phabricator.wikimedia.org/T149728) [23:27:26] (PS2) Jcrespo: Repool db1073 after maintenance as enwiki extra API node [mediawiki-config] - https://gerrit.wikimedia.org/r/327668 (https://phabricator.wikimedia.org/T149728) [23:27:49] Urbanecm: ping? [23:28:03] twentyafterfour: you're done? [23:28:12] (PS3) Jcrespo: Repool db1073 after maintenance as enwiki extra API node [mediawiki-config] - https://gerrit.wikimedia.org/r/327668 (https://phabricator.wikimedia.org/T149728) [23:28:36] Dereckson: yes [23:29:05] !log Start arbcom-cs.wikipedia.org creation [23:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:25] sorry it took so long :-/ [23:32:49] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:34:57] (PS8) Dereckson: Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio) [23:39:09] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:39:46] (PS9) Dereckson: Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio) [23:40:34] (CR) Dereckson: [] "PS9: +small" [mediawiki-config] - https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio) [23:41:00] (CR) Dereckson: [C: 2] Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio) [23:41:21] oh, I forgot small.dblists sorry [23:41:35] the most important thing is that it be private [23:41:39] (Merged) jenkins-bot: Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio) [23:43:20] You know what we missed [23:43:27] wikiversions [23:43:27] wikiversions [23:43:30] I'm writing it [23:43:39] Operations, Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2878918 (fgiunchedi) Thanks @elukey @jcrespo for chiming in! I agree with some of the points especially building dashboards for each use case. In particular the... 
[23:45:39] (PS1) Dereckson: Add arbcom_cswiki to wikiversions [mediawiki-config] - https://gerrit.wikimedia.org/r/327669 (https://phabricator.wikimedia.org/T151731) [23:46:02] (CR) Dereckson: [C: 2] Add arbcom_cswiki to wikiversions [mediawiki-config] - https://gerrit.wikimedia.org/r/327669 (https://phabricator.wikimedia.org/T151731) (owner: Dereckson) [23:46:33] (Merged) jenkins-bot: Add arbcom_cswiki to wikiversions [mediawiki-config] - https://gerrit.wikimedia.org/r/327669 (https://phabricator.wikimedia.org/T151731) (owner: Dereckson) [23:46:49] (CR) Jcrespo: [C: 2] Bump parser cache purging batch wait time [puppet] - https://gerrit.wikimedia.org/r/323764 (https://phabricator.wikimedia.org/T150124) (owner: Aaron Schulz) [23:46:54] (PS2) Jcrespo: Bump parser cache purging batch wait time [puppet] - https://gerrit.wikimedia.org/r/323764 (https://phabricator.wikimedia.org/T150124) (owner: Aaron Schulz) [23:47:31] (PS5) Dzahn: Tools: Install php5-readline [puppet] - https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: Tim Landscheidt) [23:49:29] So, with `mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki cs wikipedia arbcom_cswiki arbcom-cs.wikipedia.org` [23:49:50] we'll have the same as ec.wikimedia: a need to adjust the whitelist with the future main page name. [23:50:29] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:54:27] Krenair: yet another successful run for addWiki.php [23:54:40] is it up? 
[23:54:55] (CR) Dzahn: "Notice: /Stage[main]/Toollabs::Exec_environ/Package[php5-readline]/ensure: ensure changed 'purged' to 'latest'" [puppet] - https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: Tim Landscheidt) [23:55:05] TabbyCat: nope, but database is created [23:55:09] arbcom-cs redirects to Incubator :) [23:55:13] I've still to sync the configuration [23:55:19] ah ktnx [23:56:24] (PS2) Dzahn: remove wikimediacommons.eu [dns] - https://gerrit.wikimedia.org/r/327281 (https://phabricator.wikimedia.org/T137105) [23:56:33] !log decryption key for enWP Arbitration Committee election inserted at apx 21:45 UTC [23:56:43] (CR) Dzahn: [C: 2] "croslof agreed (T137105#2878862)" [dns] - https://gerrit.wikimedia.org/r/327281 (https://phabricator.wikimedia.org/T137105) (owner: Dzahn) [23:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:22] (PS3) Jcrespo: Bump parser cache purging batch wait time [puppet] - https://gerrit.wikimedia.org/r/323764 (https://phabricator.wikimedia.org/T150124) (owner: Aaron Schulz) [23:58:40] !log dereckson@tin Synchronized dblists: (no message) (duration: 00m 40s) [23:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:14] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [23:59:24] https://arbcom-cs.wikipedia.org/ works on mwdebug1002 [23:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:29] with the Urbanecm cute logo [23:59:49] Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878958 (RobH) [23:59:55] Dereckson, yay