[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T0000).
[00:00:06] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:10] (Merged) jenkins-bot: Moving Sentry to CommonSettings/extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/409750 (owner: Chad)
[00:00:24] (CR) jenkins-bot: Moving Sentry to CommonSettings/extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/409750 (owner: Chad)
[00:01:02] Hey.
[00:03:01] I got you
[00:03:05] I'm already sync'ing something
[00:03:21] Operations, Phabricator: Upload php7.1 to apt.wm.org - https://phabricator.wikimedia.org/T160714#3108551 (Krinkle) Don't mind me, but it seems like Debian does have packages for php7.1 and php7.2, albeit in newer release channels, but perhaps we prefer backporting those over using a different third party...
[00:03:34] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: cleanup Sentry inclusion for labs, should be no-op (duration: 00m 56s)
[00:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:05] Operations, Phabricator: Upload php7.1 to apt.wm.org - https://phabricator.wikimedia.org/T160714#3965698 (mmodell) I think those packages are the same - sury.org is owned by the debian maintainer for php - so those are semi-official backports, as far as I can tell.
[00:04:39] !log demon@tin Synchronized wmf-config/: cleanup Sentry inclusion for labs, should be no-op (duration: 00m 57s)
[00:04:48] bd808: Ok, sentry refactor is live. I don't see it in production (good), can you have a look at beta?
[00:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:57] (PS3) Ayounsi: LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101
[00:05:12] no_justification: I .. don't know where to look. that's tgr's toy
[00:05:22] Oh, thought it was yours for some reason
[00:05:26] I guess I can look at Special:Version
[00:05:29] (CR) jerkins-bot: [V: -1] LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101 (owner: Ayounsi)
[00:06:11] no_justification: its on https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Version
[00:06:13] so probably ok
[00:06:15] (PS4) Ayounsi: LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101
[00:06:18] Ok, good enough for me
[00:07:51] no_justification: Can you SWAT my VE back-port?
[00:07:55] bd808: you can do something like $('body').click(function() {throw new Exception();})
[00:08:01] (Whenever. ;-))
[00:08:04] and see if it gets logged
[00:08:26] which it doesn't, although I haven't checked for a long time if it's still working
[00:08:43] anyway, can be figured out later, the patch looked good
[00:08:48] James_F: Waiting on CI....
[00:09:42] Operations, Phabricator: Upload php7.1 to apt.wm.org - https://phabricator.wikimedia.org/T160714#3965701 (mmodell) Confirmed: all of the news entries on https://tracker.debian.org/pkg/php7.1 are by Ondřej Surý.
[00:09:44] (PS1) Chad: Moving EmailAuth to CommonSettings/extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/410081
[00:12:52] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/9938/" [puppet] - https://gerrit.wikimedia.org/r/399101 (owner: Ayounsi)
[00:13:23] no_justification we could do https://gerrit.googlesource.com/gerrit/+/cbd4106f3c2b03c9e1d38e3f85f1dc4ef0d64e89/rules.pl
[00:14:08] that would at least unblock us
[00:14:09] I've tried to avoid us having any prolog rules for yearssssss
[00:14:11] :p
[00:14:14] oh
[00:14:37] no_justification i did file a task upstream but not sure how long that will take or even if they will accept that
[00:14:55] (CR) Jalexander: [C: 1] "Good from a T&S standpoint to release as it currently stands. We'll need to do some testing on the production cluster as well before relea" [mediawiki-config] - https://gerrit.wikimedia.org/r/409445 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio)
[00:16:41] !log demon@tin Synchronized php-1.31.0-wmf.20/extensions/VisualEditor/modules/ve-mw/ui/pages/: T187112 (duration: 00m 56s)
[00:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:55] T187112: Inserting a new redirect or displaytitle crashes (using unsupported parameter of insertMeta) - https://phabricator.wikimedia.org/T187112
[00:17:11] no_justification i guess i could create a change that adds it? (prolog). that's if you doin't object :)
[00:17:14] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3965714 (mmodell) Although PHP 7.1 was declined in T1...
[00:18:30] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/9937/mendelevium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/409462 (owner: Dzahn)
[00:22:00] James_F: You're live everywhere
[00:22:15] no_justification: Thanks!
[00:22:27] no_justification https://gerrit.wikimedia.org/r/#/c/410084/
[00:32:14] no_justification actually maybe we could add a config that confures the default place it looks for reviewers.config
[00:33:13] Actually, it'd be nice if we had a jenkins job for refs/meta/* stuff
[00:33:18] I started one, but I don't have time for it
[00:33:23] That'd help the submit part
[00:35:03] no_justification oh, you mean like if it detects file reviewers.config then submit
[00:35:12] otherwise it fails if it detects any other files?
[00:39:54] I guess. Feels hacky though
[00:40:02] Ideally we could configure this to not require project ownership
[00:40:34] Granted, https://gerrit.wikimedia.org/r/#/settings/projects is similar, although it doesn't add you as a reviewer, just e-mails?
[00:40:41] (it'd be nice to extend that...)
[00:43:44] yep
[00:44:01] mutante: Guess swat is over :p
[00:44:44] tgr: Mind having a peek at https://gerrit.wikimedia.org/r/#/c/410081/ (another one of yours)
[00:46:57] no_justification: the "make it do something testable" block should definitely stay beta-specific
[00:47:23] more generally, ask the Security team if they want it?
[00:48:02] (CR) Dzahn: [C: -1] "Httpd::Mod_conf[headers] is already declared" [puppet] - https://gerrit.wikimedia.org/r/409480 (owner: Dzahn)
[00:48:45] (PS4) Dzahn: Gerrit: Set cookie path to / [puppet] - https://gerrit.wikimedia.org/r/409216 (owner: Chad)
[00:48:48] tgr: I assume you'd done that before you deployed it ;-)
[00:48:51] no_justification: ok...
[00:49:46] (CR) Dzahn: [C: 2] Gerrit: Set cookie path to / [puppet] - https://gerrit.wikimedia.org/r/409216 (owner: Chad)
[00:50:21] (PS6) Ayounsi: Bird-lg [puppet] - https://gerrit.wikimedia.org/r/390330
[00:50:25] (PS3) Dzahn: Gerrit: Set change.disablePrivateChanges to true [puppet] - https://gerrit.wikimedia.org/r/409052 (owner: Paladox)
[00:51:22] mutante: Could bundle https://gerrit.wikimedia.org/r/#/c/406139/, it's a no-op (since we don't do auto-gc)
[00:51:28] Since we're doing the service bounce
[00:51:37] (CR) Dzahn: [C: 2] Gerrit: Set change.disablePrivateChanges to true [puppet] - https://gerrit.wikimedia.org/r/409052 (owner: Paladox)
[00:51:39] no_justification: this was written as an emergency option during the OurMine spree but eventually wasn't used
[00:51:42] ok, already bundling the "no private" change thing
[00:51:47] tgr: Ah ok
[00:51:48] as it was "safe, but needs restart"
[00:51:49] not sure if they still care about it
[00:51:52] :)
[00:52:25] (CR) Ayounsi: "Addressing Alex's comments, and adapting the CR for the new netmon role structure." [puppet] - https://gerrit.wikimedia.org/r/390330 (owner: Ayounsi)
[00:53:09] I think it might also make sense in a wider context as a poor man's 2FA option but I don't have the bandwidth to pretty it up for that; I won't object if someone undeploys it
[00:53:31] (PS3) Dzahn: Gerrit: Set gc.aggressive = true [puppet] - https://gerrit.wikimedia.org/r/406139 (owner: Chad)
[00:54:05] (CR) Dzahn: [C: 2] Gerrit: Set gc.aggressive = true [puppet] - https://gerrit.wikimedia.org/r/406139 (owner: Chad)
[00:54:25] tgr: I'm also ok with undeploying!
[00:54:26] :p
[00:54:56] (CR) Dzahn: [C: 2] Gerrit: Proxy gitiles through gerrit.wikimedia.org/g/ [puppet] - https://gerrit.wikimedia.org/r/409211 (https://phabricator.wikimedia.org/T184116) (owner: Chad)
[00:54:58] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/9940/" [puppet] - https://gerrit.wikimedia.org/r/390330 (owner: Ayounsi)
[00:55:01] (PS5) Dzahn: Gerrit: Proxy gitiles through gerrit.wikimedia.org/g/ [puppet] - https://gerrit.wikimedia.org/r/409211 (https://phabricator.wikimedia.org/T184116) (owner: Chad)
[00:55:33] :)
[00:55:58] And let's save the baseUrl bit until this all lands and is ok
[00:56:03] (let's not point to busted urls, hehe)
[00:56:16] (CR) Dzahn: [C: 2] Gerrit: Set gerrit.baseUrl in gitiles.config [puppet] - https://gerrit.wikimedia.org/r/409385 (owner: Paladox)
[00:56:19] (PS6) Dzahn: Gerrit: Set gerrit.baseUrl in gitiles.config [puppet] - https://gerrit.wikimedia.org/r/409385 (owner: Paladox)
[00:56:29] mutante: ^^^^^ What I said
[00:56:31] Don't merge the last one
[00:56:46] ok
[00:57:09] no_justification that wont break going to /plugins/gitiles/
[00:57:47] I know, but it'd break the links in gerrit if the proxy part doesn't work
[00:57:57] (gitiles) links would break if /g/ is busted
[00:57:58] ready to do service restaret
[00:58:00] :)
[00:58:04] unless you want to do it
[00:58:14] no_justification it works :)
[00:58:18] no_justification https://gerrit.wikimedia.org/g/
[00:58:21] I know, I tested it too
[00:58:23] But I'm paranoid :)
[00:58:25] So yeah
[00:58:26] hhee
[00:58:27] though it shows me as logged out on there.
[00:58:35] but im logged in
[00:58:43] requires restarting gerrit
[00:58:48] sounds like the cookie path
[00:58:59] did we restart gerrit? :)
[00:59:09] no
[00:59:14] A-ha!
[00:59:16] I did
[00:59:18] We kept our logins
[00:59:22] Go us
[00:59:25] heh :)
[01:00:22] Logged our, logged in, now I got a cookie at /
[01:00:36] And using /g/ respects the login state now
[01:00:37] Yay!
[01:00:48] nice
[01:00:58] now the config?
[01:01:44] We can. Doesn't need a service restart since it's plugin config
[01:01:53] I was gonna go afk now though since it's 5:00 :p
[01:02:18] well, i applied it :)
[01:02:33] checks it still works
[01:03:06] https://gerrit.wikimedia.org/g/
[01:03:41] (gitiles) links still busted tho?
[01:03:50] Er, pointing to old place
[01:03:59] Same with the urls in /g/
[01:04:14] no_justification huh?
[01:04:46] I want urls to remain at /g/
[01:04:49] That was the point :p
[01:04:52] Not /r/plugins/*
[01:04:53] yep
[01:05:17] no_justification restart gerrit?
[01:05:31] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[01:05:55] Bleh, I'll figure it out later
[01:06:32] contint2001 is ok, fwiw
[01:06:36] that's just delayed
[01:06:55] nothing is broken that worked before, we can continue later, right
[01:07:46] At least the pretty urls work with the auth now
[01:07:57] no_justification i think this will need a restart for the url thing to take affect
[01:08:06] ok :)
[01:08:15] Bleh. Not a plugin reload?
[01:08:40] no_justification aha
[01:08:43] i think i know why
[01:09:02] i know the fix
[01:10:31] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:11:07] no_justification mutante https://gerrit.wikimedia.org/r/#/c/410090/
[01:11:31] (Draft1) Paladox: Gerrit: Fix erb syntax in gitiles.config [puppet] - https://gerrit.wikimedia.org/r/410090
[01:11:33] (Draft2) Paladox: Gerrit: Fix erb syntax in gitiles.config [puppet] - https://gerrit.wikimedia.org/r/410090
[01:11:58] (Draft1) Paladox: Gerrit: Remove unused file [puppet] - https://gerrit.wikimedia.org/r/410094
[01:12:13] Derp
[01:12:28] (PS2) Paladox: Gerrit: Remove unused file [puppet] - https://gerrit.wikimedia.org/r/410094
[01:13:01] That'll do it. Two files fighting plus bad erb
[01:15:05] well, yea. the actual change was only:
[01:15:06] + baseUrl =
[01:15:16] (PS3) Paladox: Gerrit: Fix erb syntax in gitiles.config [puppet] - https://gerrit.wikimedia.org/r/410090
[01:15:18] (PS3) Paladox: Gerrit: Remove unused file [puppet] - https://gerrit.wikimedia.org/r/410094
[01:15:56] (CR) Dzahn: [C: 2] Gerrit: Fix erb syntax in gitiles.config [puppet] - https://gerrit.wikimedia.org/r/410090 (owner: Paladox)
[01:15:59] Operations, MediaWiki-JobQueue, Wikidata, Performance-Team (Radar), and 3 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3965803 (Krinkle)
[01:16:12] Operations, MediaWiki-JobQueue, Wikidata, Performance-Team (Radar), and 3 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584606 (Krinkle)
[01:16:23] Operations, DBA, Availability (Multiple-active-datacenters), Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3965805 (Krinkle)
[01:18:16] (PS4) Dzahn: Gerrit: Remove unused gitiles config file [puppet] - https://gerrit.wikimedia.org/r/410094 (owner: Paladox)
[01:18:50] (CR) Dzahn: [C: 2] Gerrit: Remove unused gitiles config file [puppet] - https://gerrit.wikimedia.org/r/410094 (owner: Paladox)
[01:20:43] Plugin "reviewers" failed to load
[01:21:33] just once. cant repro
[01:21:47] fix applied. not sure i see a difference. but nothing broken either
[01:37:53] Operations, Wikimedia-SVG-rendering, Upstream: PNG thumbnail preview of SVG misses some text - https://phabricator.wikimedia.org/T123106#3965823 (Perhelion)
[01:38:16] no_justification delete gitiles.config and then gerrit needs to be restarted for it to take affect. andd let it be recreated since there were two files. and now one of them has been removed.
[01:38:29] see https://gerrit.git.wmflabs.org/r/#/c/58/
[02:12:01] Is Gerrit being weird about staying logged in? And/or is it possible I have some dumb cookie right now?
[02:34:00] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.20) (duration: 05m 29s)
[02:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:09] Ivy: We made a change to the cookie path to support the new repo browser, it *may* have logged you out
[02:41:21] Worst case: try clearing your cookies for gerrit.wikimedia.org and login fresh. There's a chance you still have two cookies
[03:02:37] I could log in with incognito mode. Will try clearing cookies later.
[03:09:20] PROBLEM - Hadoop DataNode on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[03:11:29] paladox: Mostlyyyyyy there. Gerrit's (gitiles) links are using /g/, but links inside gitiles default back to /r/plugins/gitiles/
[03:13:21] no_justification: ah thenks
[03:13:32] Guess we need to do some more fixing in gitiles :)
[03:13:51] Ivy I was also hit by that
[03:14:05] Removing the cookie for *.wikimedia.org fixed it
[03:15:12] gitiles has a servletPath variable
[03:15:17] Guess that needs setting
[03:15:21] gitiles.servletPath
[03:18:05] gitiles.config [gitiles] section should be picked up by the plugin, right?
[03:18:06] Hmm
[03:18:17] Maybe
[03:18:40] Either way, gitiles has a way to configure this :)
[03:19:05] Heh
[03:19:59] I mean honestly I'd just run gitiles standalone, except we've got a few private repos :\
[03:20:06] And replicating just to remove them seems wasteful
[03:20:45] Oh
[03:21:33] It's all good, we've almost got this :)
[03:21:49] (and tbh, the /r/plugins/gitiles/ links will always work, I just want to remain consistent)
[03:21:51] I guess servletPath comes from gitiles
[03:21:57] Yep
[03:22:21] As I carnt find it in plugins/gitiles
[03:22:23] https://gerrit.googlesource.com/plugins/gitiles/+/master/src/main/java/com/googlesource/gerrit/plugins/gitiles/GitilesWeblinks.java
[03:23:38] https://github.com/google/gitiles/search?utf8=✓&q=servletPath&type=
[03:23:50] I wonder how we would override that ^^
[03:26:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 839.23 seconds
[03:29:28] * paladox goes for the night as it’s 3:29am :)
[03:36:30] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[03:36:32] ACKNOWLEDGEMENT - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T187146
[03:36:35] Operations, ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3965895 (ops-monitoring-bot)
[03:56:01] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.30 seconds
[04:21:54] (PS1) BBlack: cp50xx macaddrs [puppet] - https://gerrit.wikimedia.org/r/410102 (https://phabricator.wikimedia.org/T156027)
[04:22:23] (CR) BBlack: [C: 2] cp50xx macaddrs [puppet] - https://gerrit.wikimedia.org/r/410102 (https://phabricator.wikimedia.org/T156027) (owner: BBlack)
[04:31:42] so apparently I'm unable to sign into gerrit?
[04:32:15] Zackary: Try clearing your cookies for gerrit.wikimedia.org. We made an adjustment to the cookie path. Most users it was transparent for, but a few have had issues
[04:34:45] ah that for chrome, tho it seems that lastpass and gerrit don't want to play nice on firefox
[04:35:43] which seems to be a lastpass issue, since its not working on other sites either
[04:38:17] (PS2) TerraCodes: Move flaggedrevs to NS_MAIN on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404620 (https://phabricator.wikimedia.org/T148603)
[04:44:59] Zackary: me too.
[04:45:33] kart_: Gerrit or lastpass generally?
[04:45:39] If the former, see my advice above ^^^
[04:45:42] Zackary: normal password seems not working.
[04:46:00] no_justification: OK
[04:47:23] no_justification: thanks.
[04:48:48] no_justification: lastpass stopped autofilling for me on gerrit about 2? updates of gerrit ago. Its a really pain in the butt
[04:50:12] the upstream folks seem to have put uatocomplete="off" on all of the form fields because they want me to feel stabby about them
[04:50:34] Well *that* sucks
[04:50:43] (1password doesn't seem to respect that :p)
[04:59:20] there was a security issue at one point in the past with lastpass autofill. I don't remember the details or when, but I remember at the time I manually turned off autofill and left it off since
[05:00:15] (probably that it's hard for the browser integration to not be fooled by a bad site into autofilling your info into places it shouldn't)
[05:00:22] In any case: I sent a quick note to wikitech-l to warn folks of some oddities on login re:gerrit
[05:05:15] (CR) Krinkle: Add mcrouter module and mcrouter_wancache profile and enable on beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[05:06:24] quick question, since we moved from phab to gitlies, does that mean that phab is no longer going to be upto date with changes?
[05:07:24] It is, for now
[05:07:45] In T187149 I'd argue we shouldn't use it anymore
[05:07:45] T187149: Delete all Phabricator git repos that haven't been referenced / aren't used. - https://phabricator.wikimedia.org/T187149
[05:09:32] welp, that sucks
[05:09:32] since phab is a lot more usable (atleast for me) than gerrit, especially with the new UI and gitlies
[05:10:56] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp50(0[12345789]|1[12]).eqsin.wmnet
[05:11:01] I think Phabricator's Diffusion is a pretty crappy repo browser I'm afraid :(
[05:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:34] I find myself using it...basically never. There's a reason everyone uses github links around here....
[05:12:18] legoktm: Oh, you asked about stable urls. The gerrit.wm.o/g/* is stable now
[05:12:27] <3
[05:12:29] I'm still working on some linking to be consistent, but the /g/ won't go away
[05:12:49] I already filed tasks to migrate codesearch over :)
[05:12:54] I saw :)
[05:13:03] (PS1) BBlack: eqsin: add caches to node lists [puppet] - https://gerrit.wikimedia.org/r/410106
[05:13:07] The /r/plugins/gitiles/* is also an ok URL and won't disappear
[05:13:20] But /g/ is gonna be the ideal one to use :)(
[05:13:29] no_justification: huh, how would one go about using gitiles then?
[05:13:29] because I can't find my way around it nor I can I even get to it via gerrit
[05:13:33] (CR) jerkins-bot: [V: -1] eqsin: add caches to node lists [puppet] - https://gerrit.wikimedia.org/r/410106 (owner: BBlack)
[05:13:41] Zackary: gerrit.wikimedia.org/g/
[05:13:58] (or gerrit.wikimedia.org/r/plugins/gitiles/, but that's ugly)
[05:14:00] hehe :)
[05:14:27] (PS2) BBlack: eqsin: add caches to node lists [puppet] - https://gerrit.wikimedia.org/r/410106
[05:14:46] All of the (gitiles) links are to gitiles :)
[05:14:59] (CR) BBlack: [C: 2] eqsin: add caches to node lists [puppet] - https://gerrit.wikimedia.org/r/410106 (owner: BBlack)
[05:15:12] ye, in the old UI, but there's no (gitiles) link in the new UI
[05:15:45] Ah, yes they don't say (gitiles)
[05:15:50] But most sha1s will link to it
[05:15:55] (takes some getting used to)
[05:16:21] eg: https://gerrit.wikimedia.org/r/c/62430/ - cb564dc
[05:18:38] ah thanks
[05:18:49] (ye, its going to take some getting used to)
[05:19:35] Yeah, polygerrit + gitiles is /different/
[05:19:42] But I think it's on-the-whole better
[05:19:56] And gerrit 2.15 (we just moved to 2.14) polishes polygerrit's UI a lot
[05:24:16] I wonder if gitiles is going to be included in the polishing
[05:34:56] no_justification: huh
[05:35:02] does https://gerrit.wikimedia.org/g work for you?
[05:35:23] /g/ does. I can fix that!
[05:37:48] ah, thanks
[05:38:10] also I'm guessing https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master is the file listing, but it also shows a commit at the top?
[05:39:20] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5011.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5011.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5012.eqsin.wmnet are marked down but pooled
[05:39:23] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp5008.eqsin.wmnet are marked down but pooled: uploadlb_80: Servers cp5003.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled: uploadlb6_80: Servers cp5001.eqsin.wmnet, cp5003.eqsin.wmnet are marked down but pooled: textlb_80: Servers cp5012.eqsin.wmnet, cp5011.eqsin.wmnet are marked down but pooled: uploadlb6_4
[05:39:23] eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5011.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet are marked down but pooled
[05:39:23] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_80: Servers cp5001.eqsin.wmnet, cp5005.eqsin.wmnet are marked down but pooled: uploadlb_80: Servers cp5001.eqsin.wmnet, cp5002.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5002.eqsin.wmnet, cp5005.eqsin.wmnet are marked down but pooled
[05:39:36] ^ ignore the above
[05:39:39] sorry!
[05:41:20] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy
[05:41:20] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy
[05:41:47] Do you know T187153? https://phabricator.wikimedia.org/T187153
[05:41:48] T187153: BadMethodCallException when viewing details or examine of Abuselog of Abusefilter 131 on zh.wikipedia - https://phabricator.wikimedia.org/T187153
[05:42:20] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy
[05:43:00] This is a production issue.
[05:54:13] razesoldier: Added a stacktrace to it
[05:57:43] Thanks
[05:59:51] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:00:30] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:00:51] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[create_user-replication@nihal-v4],Exec[create_user-puppetdb@nihal-v4]
[06:02:00] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:02:24] is there a way to manually delete a dashboard and reset myself to the default?
[06:02:40] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:02:50] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:03:11] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:03:30] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:03:31] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:03:31] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:03:31] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:10] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:20] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:31] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:07:45] (PS1) BBlack: eqsin ipv6 corrections [dns] - https://gerrit.wikimedia.org/r/410108
[06:08:15] (CR) BBlack: [C: 2] eqsin ipv6 corrections [dns] - https://gerrit.wikimedia.org/r/410108 (owner: BBlack)
[06:11:47] (PS37) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221
[06:12:25] (CR) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[06:13:50] (PS1) Krinkle: [WIP] extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/410109
[06:13:53] (PS1) Krinkle: [WIP] multiversion: Remove support for MW_LANG env override [mediawiki-config] - https://gerrit.wikimedia.org/r/410110
[06:13:58] no_justification: ^ I think you'll like this
[06:14:50] PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:50] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:51] PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:51] PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:51] PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:51] PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:51] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:52] PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:52] PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:53] PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:25:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:26:50] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[06:28:11] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:28:30] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:28:31] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:28:31] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:28:38] !log reload haproxy on dbproxy1005
[06:28:40] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:10] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:29:20] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:29:40] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:29:50] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0
[06:29:51] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:30] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:30:51] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:32:00] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:32:40] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:32:50] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:33:12] the 5xx alerts appear to be real (the rest above with cp50xx and puppetfails is just noise)
[06:33:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:34:22] better esams+text link, last 3h: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=text&var-status_type=5&from=now-3h&to=now
[06:35:35] ACKNOWLEDGEMENT - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues.
[06:35:35] ACKNOWLEDGEMENT - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:35] ACKNOWLEDGEMENT - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:35] ACKNOWLEDGEMENT - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:35] ACKNOWLEDGEMENT - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:35] ACKNOWLEDGEMENT - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:35] ACKNOWLEDGEMENT - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:36] ACKNOWLEDGEMENT - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:36] ACKNOWLEDGEMENT - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. [06:35:37] ACKNOWLEDGEMENT - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin still being configured, mgmt network unreachable, known issues. 
[06:44:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:02:56] !log Deploy schema change on s5 db2089 db2084 db2075 db2039 db2059 - T187089 [07:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:09] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [07:09:03] (03PS1) 10BBlack: wgSquidServersNoPurge: add eqsin, remove dead IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410113 (https://phabricator.wikimedia.org/T156027) [07:13:32] 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#3966180 (10BBlack) [07:15:04] 10Operations, 10ops-eqsin, 10netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (10BBlack) [07:28:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db2089:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410114 [07:33:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db2089:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410114 (owner: 10Marostegui) [07:35:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db2089:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410114 (owner: 10Marostegui) [07:37:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db2089:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410114 (owner: 10Marostegui) [07:41:41] (03CR) 10Alexandros Kosiaris: [C: 031] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (owner: 10Ayounsi) [07:50:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Hm I see a few crucial resources missing, namely" [puppet] - 10https://gerrit.wikimedia.org/r/409462 (owner: 10Dzahn) [07:59:10] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool 
db2089:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410118 [08:01:19] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db2089:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410118 (owner: 10Marostegui) [08:02:49] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db2089:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410118 (owner: 10Marostegui) [08:03:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db2089:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410118 (owner: 10Marostegui) [08:09:29] !log installing exim security updates on trusty hosts [08:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:44] (03PS1) 10Marostegui: db-codfw.php: Depool db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410119 [08:30:45] (03PS2) 10Alexandros Kosiaris: Specify the ops-staff-group correctly [puppet] - 10https://gerrit.wikimedia.org/r/409871 [08:30:51] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify the ops-staff-group correctly [puppet] - 10https://gerrit.wikimedia.org/r/409871 (owner: 10Alexandros Kosiaris) [08:31:01] RECOVERY - Hadoop DataNode on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [08:32:43] (03CR) 10Jcrespo: "I am not going to argue against this deploy, instead, if queries start to pile up, or x1 starts to have performance problem I will directl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: 10Gergő Tisza) [08:32:55] !log installing wavpack security updates [08:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:01] PROBLEM - Hadoop DataNode on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [08:34:16] I am checking --^ [08:37:17] !log tin.eqiad.wmnet: removing live hack in 
/srv/mediawiki-staging/scap/plugins/clean.py | T187160 [08:37:28] marostegui: I have just dropped the hack [08:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:29] T187160: /srv/mediawiki-staging/scap/plugins/clean.py had a live hack - https://phabricator.wikimedia.org/T187160 [08:37:34] hashar: <3 <3 <3 [08:37:46] hashar: Thanks a lot :) [08:37:55] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410119 (owner: 10Marostegui) [08:39:31] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410119 (owner: 10Marostegui) [08:39:42] (03CR) 10jenkins-bot: db-codfw.php: Depool db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410119 (owner: 10Marostegui) [08:41:44] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2084:3315 (duration: 00m 56s) [08:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:39] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3966288 (10elukey) [08:45:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089, db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410121 (https://phabricator.wikimedia.org/T162807) [08:49:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089, db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410121 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:49:16] 10Operations, 10Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3966301 (10jcrespo) @Superyetkin As per Ladsgroup, it seems the CSS you added here is not correct https://tr.wikipedia.org/w/index.php?title=MediaWiki%3AVector.css&type=revision&diff=18969658&oldid=18388018 Pl... 
[08:50:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089, db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410121 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:50:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089, db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410121 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:51:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089, db1099 - T162807 (duration: 00m 55s) [08:52:08] !log Stop replication in sync on db1089 and db1099:3311 - T162807 [08:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:11] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:02] (03PS1) 10Marostegui: db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410123 (https://phabricator.wikimedia.org/T162807) [08:58:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410123 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:00:12] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410123 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:00:23] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410123 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:01:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 - T162807 (duration: 00m 55s) [09:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:39] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:05:57] (03PS2) 10Giuseppe Lavagetto: 
conftool::scripts: convert to using select instead of find [puppet] - 10https://gerrit.wikimedia.org/r/409850 [09:06:50] (03PS15) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [09:07:07] (03PS1) 10Marostegui: db-codfw.php: Repool db2084:3311,depool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410124 [09:07:20] (03PS2) 10Filippo Giunchedi: Whitelist new Thumbor-Request-Date header in Swift [puppet] - 10https://gerrit.wikimedia.org/r/409942 (https://phabricator.wikimedia.org/T186594) (owner: 10Gilles) [09:08:20] (03CR) 10Filippo Giunchedi: [C: 031] Whitelist new Thumbor-Request-Date header in Swift [puppet] - 10https://gerrit.wikimedia.org/r/409942 (https://phabricator.wikimedia.org/T186594) (owner: 10Gilles) [09:08:23] (03CR) 10Filippo Giunchedi: [C: 032] Whitelist new Thumbor-Request-Date header in Swift [puppet] - 10https://gerrit.wikimedia.org/r/409942 (https://phabricator.wikimedia.org/T186594) (owner: 10Gilles) [09:08:50] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10MoritzMuehlenhoff) Added to cn=ops and cn=wmf LDAP groups. 
[09:09:03] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3966318 (10MoritzMuehlenhoff) [09:09:40] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10MoritzMuehlenhoff) [09:09:45] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2084:3311,depool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410124 (owner: 10Marostegui) [09:11:22] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2084:3311,depool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410124 (owner: 10Marostegui) [09:11:33] (03CR) 10jenkins-bot: db-codfw.php: Repool db2084:3311,depool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410124 (owner: 10Marostegui) [09:12:20] PROBLEM - Host analytics1062 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:35] elukey: ^ [09:12:40] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2084:3315, depool db2075 (duration: 00m 55s) [09:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:21] (03PS1) 10Filippo Giunchedi: swift: add conftool client to proxy [puppet] - 10https://gerrit.wikimedia.org/r/410125 [09:14:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410126 (https://phabricator.wikimedia.org/T162807) [09:15:57] marostegui: ack thanks [09:17:06] <_joe_> paladox: around by any chance? 
[09:17:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410126 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:18:47] (03PS3) 10Giuseppe Lavagetto: conftool::scripts: convert to using select instead of find [puppet] - 10https://gerrit.wikimedia.org/r/409850 [09:19:22] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410126 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:19:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410126 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:20:00] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/9943/ms-fe1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/410125 (owner: 10Filippo Giunchedi) [09:20:08] <_joe_> hey godog [09:20:13] <_joe_> merge-sniping me again [09:20:24] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool::scripts: convert to using select instead of find [puppet] - 10https://gerrit.wikimedia.org/r/409850 (owner: 10Giuseppe Lavagetto) [09:20:29] !log Stop replication in sync on db1089 and db1065 - T162807 [09:20:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T162807 (duration: 00m 55s) [09:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:44] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:20:49] haha _joe_ I have a very complex script that watches irc and merge-snipes [09:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:12] (03PS2) 10Filippo Giunchedi: swift: add conftool client to proxy [puppet] - 10https://gerrit.wikimedia.org/r/410125 [09:21:51] (03PS16) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 
10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [09:22:14] !log disabling puppet on all eqiad databases [09:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:41] _joe_: merged your change too btw [09:22:49] !log powercycle analytics1062 - not reachable via ssh, frozen via serial console [09:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:16] <_joe_> godog: ouch, I was waiting a sec [09:23:19] <_joe_> for a verification [09:23:28] (03CR) 10Gergő Tisza: "If those are my options I'd prefer if you'd argue :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: 10Gergő Tisza) [09:23:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410127 [09:23:49] ah, heh too late unless you want to revert [09:24:11] <_joe_> heh indeed, I'm gonna revert [09:24:27] (03PS1) 10Giuseppe Lavagetto: Revert "conftool::scripts: convert to using select instead of find" [puppet] - 10https://gerrit.wikimedia.org/r/410128 [09:24:50] RECOVERY - Host analytics1062 is UP: PING WARNING - Packet loss = 61%, RTA = 0.35 ms [09:25:05] (03PS2) 10Giuseppe Lavagetto: Revert "conftool::scripts: convert to using select instead of find" [puppet] - 10https://gerrit.wikimedia.org/r/410128 [09:25:15] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "conftool::scripts: convert to using select instead of find" [puppet] - 10https://gerrit.wikimedia.org/r/410128 (owner: 10Giuseppe Lavagetto) [09:25:41] (03PS3) 10Giuseppe Lavagetto: Revert "conftool::scripts: convert to using select instead of find" [puppet] - 10https://gerrit.wikimedia.org/r/410128 [09:25:47] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Revert "conftool::scripts: convert to using select instead of find" [puppet] - 10https://gerrit.wikimedia.org/r/410128 (owner: 10Giuseppe Lavagetto) [09:25:54] (03CR) 10Marostegui: 
[C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410127 (owner: 10Marostegui) [09:28:05] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410127 (owner: 10Marostegui) [09:28:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410127 (owner: 10Marostegui) [09:29:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T162807 (duration: 00m 54s) [09:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:22] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:30:10] !log Stop replication in sync on db1089 and dbstore1002 - T162807 [09:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:35] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#3966345 (10elukey) [09:32:43] !log Stop mysql on db2075 for mysql and kernel upgrade [09:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:55] apparently, at least one host gets blocked on "Loading facts" on puppet for a long time [09:38:26] normally that takes 1 second unless it is the first time it executes [09:38:43] it is taking at least cumin timout minutes now [09:39:23] or disable-puppet timout, whatever happens first [09:39:27] *disable-puppet timeout ;) [09:39:33] cumin default is no timeout [09:39:39] wait indefinitely [09:40:13] I will file that as a ticket, as it seems important [09:40:25] meanwhile I will continue my deploy [09:41:50] but how does puppet used to work, it happens every time? [09:42:55] maybe it only gets stuck when run interactively? 
[09:43:22] jynus: once you finish your deploy let me know so I can try to run puppet a few times there and look at puppetmasters logs [09:43:37] maybe it is just that host [09:44:18] (03PS1) 10Marostegui: db-codfw.php: Repool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410130 [09:44:55] it takes 3 seconds on es1018, so it is an es1019-only issue [09:46:25] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410130 (owner: 10Marostegui) [09:47:06] (03PS17) 10Jcrespo: mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) [09:47:40] (03CR) 10Jcrespo: [C: 032] mariadb: Redo mariadb::backup class into role/profile style [puppet] - 10https://gerrit.wikimedia.org/r/409008 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [09:47:45] let's go! [09:47:50] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410130 (owner: 10Marostegui) [09:47:51] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:48:00] (03CR) 10jenkins-bot: db-codfw.php: Repool db2075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410130 (owner: 10Marostegui) [09:48:17] (03PS1) 10Jcrespo: Revert "mariadb: Redo mariadb::backup class into role/profile style" [puppet] - 10https://gerrit.wikimedia.org/r/410131 [09:48:30] 10Operations, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove cloud-admin rights from YuviPanda - https://phabricator.wikimedia.org/T186289#3966371 (10MoritzMuehlenhoff) Yuvi's shell access was removed via https://gerrit.wikimedia.org/r/407577 and I've also just removed... 
[09:48:47] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: Redo mariadb::backup class into role/profile style" [puppet] - 10https://gerrit.wikimedia.org/r/410131 (owner: 10Jcrespo) [09:49:21] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2075, depool db2038 and db2059 (duration: 00m 55s) [09:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:07] it seems to be a noop on the first host I tested [09:51:48] will you test on one host per "service" as in: 1 eqiad, 1 codfw, 1 misc, 1 dbstore, 1 pc, 1 es? [09:52:01] and sanitarium ofc [09:52:03] more or less [09:52:10] you want to split? [09:52:13] I have not disabled puppet on codfw [09:52:16] ah ok ) [09:52:20] so I am manually testing on codfw [09:52:22] let me do a noop on sanitarium [09:52:24] some selected hosts [09:52:30] wait [09:52:33] ok [09:52:51] the plan is to let the whole of codfw apply it naturally in the next 30 minutes [09:52:58] ah fine :-) [09:52:58] !log filippo@neodymium conftool action : set/pooled=no; selector: name=ms-fe2005.codfw.wmnet [09:52:59] and then also do hosts on eqiad [09:53:07] oki, just let me know when you need me :) [09:53:09] especially sanitarium [09:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] and others that are not duplicated [09:53:19] yeah, sanitarium and labs.. 
[09:53:26] and then slowly enable it on eqiad [09:53:32] in theory this is a noop [09:53:43] mostly worried about ferm disabling the network [09:53:52] even if for a few seconds [09:54:04] (by mistake, I mean, not intended) [09:54:31] in fact, on partitioned hosts, this may be applying a ferm change [09:58:44] the idea is that we used to open a lot of ports, now we only open the ones we are listening on [09:59:22] (03CR) 10Addshore: Add federation-related configs for clients (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409622 (https://phabricator.wikimedia.org/T186955) (owner: 10Ladsgroup) [10:00:59] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3966387 (10elukey) >>! In T182832#3965043, @mmodell wro... [10:04:34] (03CR) 10WMDE-leszek: [C: 04-1] "This is not the prettiest as it left some settings in, that would now be redundant, which is not the best thing to do." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/409622 (https://phabricator.wikimedia.org/T186955) (owner: 10Ladsgroup) [10:08:15] !log roll-restart ms-fe in codfw/eqiad after applying https://gerrit.wikimedia.org/r/c/409942/ [10:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:09] (03PS3) 10MarcoAurelio: Log accessing private abusefilter details [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409445 (https://phabricator.wikimedia.org/T160357) [10:11:30] (03PS4) 10MarcoAurelio: Log accessing private abusefilter details [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409445 (https://phabricator.wikimedia.org/T160357) [10:19:12] (03PS1) 10Jcrespo: mariadb: Depool db1099 for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410135 (https://phabricator.wikimedia.org/T184697) [10:20:37] (03PS2) 10Volans: Query: allow to extract random subset of hosts [software/cumin] - 10https://gerrit.wikimedia.org/r/409980 (https://phabricator.wikimedia.org/T186818) [10:21:57] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1099 for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410135 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [10:22:07] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1099 for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410135 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [10:23:37] (03Merged) 10jenkins-bot: mariadb: Depool db1099 for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410135 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [10:24:38] (03PS2) 10Ema: icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) [10:24:40] (03PS2) 10Ema: pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) [10:26:13] (03PS1) 10Gehel: maps: 
icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) [10:26:37] (03CR) 10jerkins-bot: [V: 04-1] maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) (owner: 10Gehel) [10:27:02] (03CR) 10jenkins-bot: mariadb: Depool db1099 for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410135 (https://phabricator.wikimedia.org/T184697) (owner: 10Jcrespo) [10:27:50] (03CR) 10Gehel: [C: 04-1] "Do not merge: There is currently an issue with tile generation metrics (T187082)" [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) (owner: 10Gehel) [10:28:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1099 (duration: 00m 54s) [10:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:43] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3835986 (10MoritzMuehlenhoff) We can import the PHP 7.1... 
[10:31:46] (03PS2) 10Gehel: maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) [10:31:56] (03PS2) 10Filippo Giunchedi: Avoid default 60s nginx proxy timeouts for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/409954 (https://phabricator.wikimedia.org/T185466) (owner: 10Gilles) [10:33:07] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3966461 (10Gehel) [10:33:38] (03CR) 10Filippo Giunchedi: [C: 032] Avoid default 60s nginx proxy timeouts for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/409954 (https://phabricator.wikimedia.org/T185466) (owner: 10Gilles) [10:33:46] (03PS1) 10Jcrespo: mariadb: Repool db1099, improve formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410137 [10:35:51] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3966467 (10MoritzMuehlenhoff) Still, PHP 7.1 should be... [10:37:54] (03PS5) 10Filippo Giunchedi: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:44:54] _joe_ hi, around for what? 
:) [10:46:23] (03CR) 10Filippo Giunchedi: [C: 032] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:46:33] <_joe_> paladox: i wanted to try to fix the gerrit login issue without asking people to wipe their cookies [10:46:45] Ah [10:46:52] <_joe_> and I wondered if you had some test instance in labs i can play with :P [10:47:02] _joe_ yep i do [10:47:06] gerrit-test3 [10:47:10] also, if the new login page is a custom one on our side... maybe we could fix it there [10:47:20] gerrit-test3.git.eqiad.wmflabs [10:47:28] volans yep the gerrit login page is custom [10:48:05] <_joe_> paladox: oh indeed [10:48:13] <_joe_> yeah lemme do that [10:48:18] it's just like prod [10:48:29] (03Merged) 10jenkins-bot: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:48:37] ok [10:48:40] (03CR) 10jenkins-bot: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (https://phabricator.wikimedia.org/T186534) (owner: 10Muehlenhoff) [10:48:59] <_joe_> where is our custom login page? :P Which repo? [10:49:07] _joe_ puppet [10:49:11] i will find it [10:49:14] <_joe_> ook! 
[10:50:09] _joe_ https://github.com/wikimedia/puppet/blob/production/modules/gerrit/files/etc/GerritSite.css and https://github.com/wikimedia/puppet/blob/production/modules/gerrit/files/static/gerritLogin.cache.js [10:51:21] !log filippo@tin Synchronized wmf-config/ProductionServices.php: depool poolcounter1002 for disk replacement (duration: 00m 56s) [10:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:24] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3966530 (10fgiunchedi) a:05fgiunchedi>03Cmjohnson Machine isn't in service now, @Cmjohnson all yours [10:53:50] <_joe_> so, I would think that adding a directive to remove the cookie 'GerritAccount' in renderErrorMessage() [10:53:59] <_joe_> would do the trick [10:54:23] <_joe_> it's also the correct thing to do as it seems gerrit is unable to log you in if you have that cookie set to an incorrect value [10:55:13] yep [11:00:39] <_joe_> paladox: nevermind, I just tried hand-crafting my cookie and it was reset [11:00:48] ok [11:00:54] <_joe_> so I'm not sure what's really going on there, and why it doesn't get reset [11:01:07] <_joe_> I'll do some tests [11:01:17] 10Operations, 10Beta-Cluster-Infrastructure: Remove video scaler instances from deployment-prep - https://phabricator.wikimedia.org/T187063#3966561 (10MoritzMuehlenhoff) Should not have any impact on the production scalers, to be extra sure I only shut them down for now and if no one complains in the next days... 
[11:01:36] ok [11:01:51] _joe_: I had 2 GerritAccount cookies after trying to login and it failed [11:03:56] 10Operations, 10Ops-Access-Requests: Access request: #mediawiki_security for Quiddity - https://phabricator.wikimedia.org/T187108#3966564 (10fgiunchedi) p:05Triage>03Normal [11:04:03] _joe_ seems we could have done flushing the web sessions [11:04:09] which would have made the cookie inactive [11:04:13] i found this https://gerrit-review.googlesource.com/#/c/gerrit/+/15763/ [11:04:41] <_joe_> I don't think that's what we're seeing [11:12:29] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Increase storage space for Wikidata Query Service - https://phabricator.wikimedia.org/T186526#3966584 (10faidon) [11:15:06] 10Operations, 10Ops-Access-Requests: Access request: #mediawiki_security for Quiddity - https://phabricator.wikimedia.org/T187108#3964766 (10fgiunchedi) LGTM, I've granted access. @Quiddity please test! [11:17:56] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#3966613 (10fgiunchedi) p:05Triage>03Normal I'm +1 on logging output from cron scripts, at least stdout whereas stder... [11:18:12] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3966615 (10Paladox) >>! In T182832#3966447, @MoritzMueh... [11:20:01] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#3966616 (10fgiunchedi) Other things to consider: use `chronic` (from `devscripts` package) or something to the same effe... 
[11:23:13] 10Operations, 10Beta-Cluster-Infrastructure: Remove video scaler instances from deployment-prep - https://phabricator.wikimedia.org/T187063#3966620 (10fgiunchedi) p:05Triage>03Normal [11:23:50] 10Operations, 10Puppet: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101#3966622 (10fgiunchedi) p:05Triage>03Normal [11:23:56] 10Operations, 10Discovery, 10Discovery-Search, 10Wikidata, and 3 others: Setup a WDQS test cluster on real hardware - https://phabricator.wikimedia.org/T186713#3966625 (10faidon) [11:24:29] 10Operations, 10Puppet: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101#3964579 (10fgiunchedi) I agree with the general sentiment/idea, though implementation wise we should be going through icinga alerts or phab tasks and not emails. [11:25:17] 10Operations, 10ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3966642 (10fgiunchedi) p:05Triage>03Normal [11:25:44] 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#3966643 (10fgiunchedi) p:05Triage>03Normal [11:25:52] 10Operations, 10ops-eqsin, 10netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966644 (10fgiunchedi) p:05Triage>03Normal [11:26:00] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3966645 (10fgiunchedi) p:05Triage>03Normal [11:26:09] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#3966646 (10fgiunchedi) p:05Triage>03Normal [11:29:16] (03PS2) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [11:34:08] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9946/krypton.eqiad.wmnet/" [puppet] - 
10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [11:36:12] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1099, improve formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410137 (owner: 10Jcrespo) [11:37:47] !log Stop MySQL on db2059 and db2038 for kernel upgrade [11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:53] (03Merged) 10jenkins-bot: mariadb: Repool db1099, improve formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410137 (owner: 10Jcrespo) [11:39:04] (03CR) 10jenkins-bot: mariadb: Repool db1099, improve formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410137 (owner: 10Jcrespo) [11:41:24] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1099 (duration: 00m 56s) [11:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:20] (03PS1) 10Marostegui: Revert "db-codfw.php: Repool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410150 [11:42:24] (03PS2) 10Marostegui: Revert "db-codfw.php: Repool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410150 [11:43:12] no_justification: heh I think maybe we should file a bug report so that gitiles is shown as it is confusing [11:44:16] 10Operations, 10Puppet: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101#3966714 (10MarcoAurelio) icinga alerts or phab tasks works for me as well [11:46:51] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Repool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410150 (owner: 10Marostegui) [11:47:45] !log reenabling puppet on all eqiad databases [11:47:52] \o/ [11:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:19] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Repool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410150 (owner: 10Marostegui) [11:48:29] 
(03CR) 10jenkins-bot: Revert "db-codfw.php: Repool db2075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410150 (owner: 10Marostegui) [11:50:01] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Rpool db2038 and db2059 (duration: 00m 55s) [11:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:15] (03PS1) 10Marostegui: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410152 [11:52:41] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410152 (owner: 10Marostegui) [11:55:53] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410152 (owner: 10Marostegui) [11:56:42] !log Deploy schema change on db2066 - T187089 T185128 T153182 [11:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:58] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [11:56:58] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [11:56:58] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [11:57:03] (03CR) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410152 (owner: 10Marostegui) [11:57:09] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2066 (duration: 00m 55s) [11:57:13] (03PS3) 10Arturo Borrero Gonzalez: apt: apt-upgrade: add switch for the node name output [puppet] - 10https://gerrit.wikimedia.org/r/409323 (https://phabricator.wikimedia.org/T181647) [11:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:27] (03PS19) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [11:57:51] RECOVERY 
- puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:57:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [11:59:49] (03CR) 10Muehlenhoff: [C: 04-1] [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:00:04] matthiasmullie: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Multimedia deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:04:02] paravoid: I plan to merge https://gerrit.wikimedia.org/r/409323 unless you have some further concern [12:05:04] depending on how pedantic you want me to be :P [12:05:23] I have another follow-up patch which is more important :-) [12:05:55] it's fine, just note that logging module can do string formatting itself and it's better to do that at that layer because strings do not get formatted unless they're about to be printed [12:06:19] but there's a slight problem (unless something has changed in very new pythons), that it only supports old-style string formatting (%s etc.) [12:06:36] and not the new braces format [12:06:42] I think you can hack it around but it's not pretty [12:06:49] volans may have more accurate/uptodate information :) [12:07:12] ok, lets merge and then have a later iteration with fine-tuning? [12:07:15] !log mlitn@tin Synchronized php-1.31.0-wmf.20/extensions/3D/modules/ext.3d.js: Fix 3D badge (duration: 00m 56s) [12:07:16] what paravoid said :) [12:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:38] logging.disable? 
[12:08:07] I'm deep into annual planning and can't review thoroughly now I'm afraid :( [12:08:15] (03PS4) 10Matthias Mullie: Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403680 (https://phabricator.wikimedia.org/T184728) [12:08:49] volans: I'm reusing a function that logs things inside. But in that code path I want it disabled [12:09:15] the calculate_upgrades() function [12:10:18] couldn't be achieved using debug logging for this? [12:10:31] just wondering, didn't had look at the code thoroughly, just opened right now [12:10:47] I don't think so, since the inner logging is at INFO level [12:11:01] (03PS3) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [12:11:35] what I mean is to move the inner logging to debug, and use debug globally when you need that inner debug info, and use info when you don't [12:12:04] but it might not cover all cases. Technically you could add a log level, but is not really exposed properly by python logging module unfortunately [12:12:10] (03PS1) 10Muehlenhoff: Remove deployment-tmh01 and deployment-videoscaler01 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/410158 (https://phabricator.wikimedia.org/T187063) [12:12:11] and could be an overkill in this case [12:12:20] well, then is the same hack but the other way around, right? 
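Editor's note on the logging exchange above: a minimal Python sketch of the two points raised, namely letting the logging module do %-style formatting lazily and silencing INFO-level output from a reused function with logging.disable. The body of calculate_upgrades(), the logger name, and the quiet() and demo() helpers are all hypothetical stand-ins, not the code from the patch under review.

```python
import contextlib
import io
import logging

logger = logging.getLogger("apt_upgrade_sketch")  # hypothetical logger name


def calculate_upgrades(packages):
    """Hypothetical stand-in for the function discussed; it logs at INFO."""
    for name in packages:
        # Passing the argument separately (old %-style) defers string
        # interpolation until a handler actually emits the record.
        logger.info("upgrade candidate: %s", name)
    return sorted(packages)


@contextlib.contextmanager
def quiet(level=logging.INFO):
    """Suppress records at `level` and below, process-wide, inside the block.

    Assumption: nothing else in the process relies on logging.disable, since
    the previous setting is reset to NOTSET rather than restored.
    """
    logging.disable(level)
    try:
        yield
    finally:
        logging.disable(logging.NOTSET)


def demo():
    """Capture what gets logged with and without the quiet() wrapper."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    try:
        calculate_upgrades(["b"])      # logged
        with quiet():
            calculate_upgrades(["a"])  # suppressed by logging.disable
        calculate_upgrades(["c"])      # logged again
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

The alternative suggested in the channel, moving the inner messages to DEBUG and choosing the global level per code path, avoids the process-wide switch at the cost of changing what the reused function reports by default.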
[12:13:39] volans: I'm interested in review for the next in the patch series (not yet uploaded) [12:13:41] (03CR) 10Matthias Mullie: [C: 032] Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403680 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [12:13:43] * arturo merging this one [12:14:58] I can't sign in to gerrit [12:15:11] check wikitech-l email [12:15:15] <_joe_> arturo: see wikitech-l, you have to clean your GerritAccount cookie [12:15:21] ok [12:16:14] (03Merged) 10jenkins-bot: Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403680 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [12:17:26] (03CR) 10jenkins-bot: Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403680 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [12:17:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrade: add switch for the node name output [puppet] - 10https://gerrit.wikimedia.org/r/409323 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:19:05] (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrade: add package exclusion by reading a file [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) [12:19:08] volans: ^^^ [12:19:40] arturo: I'll try to have a look after lunch [12:19:40] (03CR) 10jerkins-bot: [V: 04-1] apt: apt-upgrade: add package exclusion by reading a file [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:19:50] !log mlitn@tin Synchronized wmf-config/CommonSettings.php: Enable STL uploads on Commons (duration: 00m 55s) [12:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:52] !log mlitn@tin Synchronized wmf-config/InitialiseSettings.php: Enable STL uploads on Commons (duration: 00m 55s) [12:21:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:05] (03PS4) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [12:22:33] (03PS2) 10Arturo Borrero Gonzalez: apt: apt-upgrade: add package exclusion by reading a file [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) [12:23:56] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:26:36] "Enable STL uploads on Commons" w0000t [12:28:06] stl? [12:28:41] 3d models [12:29:23] whose excited for the debate over if firearms are realistically useful for an educational purpose! [12:29:49] <_joe_> ahah [12:32:20] bawolff: :) [12:34:00] (03PS20) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [12:35:05] (03CR) 10Hashar: [C: 031] "Should be good." [puppet] - 10https://gerrit.wikimedia.org/r/410158 (https://phabricator.wikimedia.org/T187063) (owner: 10Muehlenhoff) [12:36:07] (03PS2) 10Matthias Mullie: Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409928 (https://phabricator.wikimedia.org/T184728) [12:40:50] (03PS5) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [12:41:51] bawolff: hmm. the JS keeps in a loading state for me for the model... [12:42:05] worked the first time round.. weird. 
[12:43:35] (03CR) 10Matthias Mullie: [C: 032] Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409928 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [12:43:43] (03PS6) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [12:46:16] (03PS21) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [12:46:32] (03PS7) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [12:46:42] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:47:52] (03Merged) 10jenkins-bot: Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409928 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [12:48:02] (03CR) 10jenkins-bot: Enable 3D on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409928 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [12:50:09] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9951/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [12:50:32] (03PS22) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [12:50:58] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:51:14] (03CR) 10Addshore: [C: 031] Enable the visual diff beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409091 (owner: 10Jforrester) [12:51:58] !log mlitn@tin Synchronized wmf-config/CommonSettings.php: Enable STL uploads on Commons (duration: 00m 56s) 
[12:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:08] (03PS23) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [12:53:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:55:41] (03PS2) 10Muehlenhoff: Remove deployment-tmh01 and deployment-videoscaler01 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/410158 (https://phabricator.wikimedia.org/T187063) [12:56:40] (03PS24) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [12:57:10] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [13:01:08] (03PS25) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [13:06:08] (03CR) 10Muehlenhoff: [C: 032] Remove deployment-tmh01 and deployment-videoscaler01 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/410158 (https://phabricator.wikimedia.org/T187063) (owner: 10Muehlenhoff) [13:07:29] (03PS4) 10Muehlenhoff: Add vgutierrez shell account [puppet] - 10https://gerrit.wikimedia.org/r/409844 (https://phabricator.wikimedia.org/T187035) (owner: 10Vgutierrez) [13:08:05] (03CR) 10Muehlenhoff: [C: 032] Add vgutierrez shell account [puppet] - 10https://gerrit.wikimedia.org/r/409844 (https://phabricator.wikimedia.org/T187035) (owner: 10Vgutierrez) [13:08:27] 10Operations, 10ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3815107 (10BBlack) These are fully up and functional, but I think they're still missing asset tags in the DNS. 
[13:09:13] 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (10BBlack) These are fully up and functional, but I think they're still missing asset tags in the DNS. [13:10:59] 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3794022 (10BBlack) Missing asset tags in the DNS. Also, 2/12 are in various borked states: T187157 T187158 [13:11:22] 10Operations, 10ops-eqsin, 10netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (10BBlack) [13:11:24] 10Operations, 10ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#3966180 (10BBlack) [13:11:27] 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3966957 (10BBlack) [13:12:36] 10Operations, 10ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3793986 (10BBlack) dns5001 done other than asset tags. dns5002 borked in sub-task: T186902 [13:12:45] 10Operations, 10ops-eqsin: dns5002 mgmt console unreachable - https://phabricator.wikimedia.org/T186902#3958682 (10BBlack) [13:12:47] 10Operations, 10ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3966971 (10BBlack) [13:13:38] 10Operations, 10ops-eqsin, 10netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966974 (10BBlack) Switch config says xe-1/0/4, but that port appears to be missing (no SFP?) [13:14:42] 10Operations, 10MediaWiki-JobQueue, 10Wikidata, 10Performance-Team (Radar), and 3 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3966996 (10Ladsgroup) 05Open>03Resolved The jobqueue size has been reduced to 1.6M and will go down more once we enable "lua fine graine... 
[13:14:56] 10Operations, 10ops-eqsin, 10Traffic, 10netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3967012 (10BBlack) [13:15:05] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#3967013 (10BBlack) [13:15:31] 10Operations, 10ops-eqsin, 10Traffic: dns5002 mgmt console unreachable - https://phabricator.wikimedia.org/T186902#3967014 (10BBlack) [13:15:34] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3967015 (10BBlack) [13:15:47] 10Operations, 10ops-eqsin, 10Traffic: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3967016 (10BBlack) [13:16:08] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3967017 (10BBlack) [13:16:18] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3967018 (10BBlack) [13:16:34] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3967022 (10BBlack) [13:16:46] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3967023 (10BBlack) [13:17:00] 10Operations, 10ops-eqsin, 10DC-Ops, 10Traffic: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3967024 (10BBlack) [13:17:33] (03PS1) 10Volans: Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) [13:19:22] * volans forgot to merge a fix [13:20:00] (03CR) 10jerkins-bot: [V: 04-1] Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [13:23:16] 10Operations, 10ops-eqsin, 10Traffic: rack/setup 
scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10BBlack) >>! In T181569#3901023, @ayounsi wrote: > The other atlas don't seem to be connected to a scs so I can't compare. Do we even have a use for scs on atlas, if it's not use... [13:25:32] (03PS1) 10Vgutierrez: Add vgutierrez to ops group [puppet] - 10https://gerrit.wikimedia.org/r/410168 (https://phabricator.wikimedia.org/T187035) [13:27:09] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3967086 (10BBlack) [13:27:12] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3967087 (10BBlack) [13:29:26] (03PS3) 10Arturo Borrero Gonzalez: apt: apt-upgrade: add package exclusion by reading a file [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) [13:30:14] (03PS2) 10Volans: Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) [13:35:02] (03CR) 10BBlack: [C: 032] Add vgutierrez to ops group [puppet] - 10https://gerrit.wikimedia.org/r/410168 (https://phabricator.wikimedia.org/T187035) (owner: 10Vgutierrez) [13:36:02] (03PS4) 10Rush: apt: apt-upgrade: add package exclusion by reading a file [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:36:21] (03CR) 10Rush: [C: 031] "Let's try this, seems good. 
I noted one errant comment on IRC but otherwise cool" [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:36:31] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:36:40] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:36:50] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:36:57] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrade: add package exclusion by reading a file [puppet] - 10https://gerrit.wikimedia.org/r/410159 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:37:10] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:37:11] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:37:30] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:38:00] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:42:20] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:48:08] (03PS3) 10Ema: icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) [13:48:09] (03PS3) 10Ema: pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) [13:50:04] !log Deploy schema change on dbstore2001 - T187089 T185128 T153182 [13:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:17] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [13:50:17] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [13:50:18] 
T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [13:51:53] !log Reboot db2066 to pick up new kernel [13:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:10] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:55:11] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:55:20] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [13:55:22] hello stat1005 [13:55:30] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [13:55:40] RECOVERY - Disk space on stat1005 is OK: DISK OK [13:55:40] RECOVERY - DPKG on stat1005 is OK: All packages OK [13:55:45] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#3967299 (10fgiunchedi) [13:55:50] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [13:56:13] jouncebot, next [13:56:14] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1400) [13:58:00] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:58:14] (03PS10) 10Muehlenhoff: Add support for selective automatic restarts of stateless services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) [13:58:22] (03PS11) 10Muehlenhoff: Add support for selective automatic restarts of stateless services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 8 patches) deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1400). [14:00:04] kart_ and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] I'm here :) [14:00:14] me too. [14:00:16] I can SWAT today [14:00:42] zeljkof, that's great! [14:00:44] Wonder why jouncebot didn't ping amire80? [14:00:47] kart_ does your patch need more than a few minutes to test? [14:01:00] zeljkof: nope. just simple test. [14:01:08] Jayprakash12345, don't you want a IRC cloak? ;) [14:01:16] kart_: maybe because he left irc-nickname as his irc nickname :) [14:01:36] kart_: ok, I'll ping you when the patch is at mwdebug [14:01:37] :) [14:01:39] oh yeah. [14:01:45] zeljkof: OK! [14:02:23] kart_: can you test amir's patch? if not, it will not be deployed, I don't see him in the channel [14:03:02] zeljkof: he will join in a few minutes. [14:03:14] kart_: ok, he's next, has 5-10 minutes [14:03:21] Urbanecm: I want, But How I can get? [14:03:27] zeljkof: OK. [14:03:46] hallo hallo hallo [14:03:50] Jayprakash12345, fill https://docs.google.com/forms/d/e/1FAIpQLSc9f95s72cfLD-pTZrTxCD-Kyb9umm5XVz_JOWtazADYkjpvA/viewform#start=openform [14:04:00] aharoni: please stand by, your patch is important to us :) [14:04:05] \o/ [14:04:05] you are next, after kart_ [14:04:20] WMF's IRC group contact will grant you cloak or contact you in case of problems [14:04:32] aharoni: you did not format your name/irc in the calendar :) aharoni (irc-nickname) [14:04:40] (you won't be contacted if cloak will be granted) [14:05:22] um [14:05:24] 10Operations, 10Code-Stewardship-Reviews, 10Services: zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#3967379 (10akosiaris) [14:05:25] I'm stupid [14:05:39] what calendar? and what is a cloak? 
[14:05:51] https://meta.wikimedia.org/wiki/IRC/Cloaks [14:06:29] aharoni: this is the calendar https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1200 [14:06:31] aharoni, something that "hides" your IP address at IRC, shortly... [14:06:48] aharoni: cloak is for Jayprakash12345 [14:07:32] Official docs seem to have rather high requirements to get a cloak [14:08:29] Well Jayprakash12345 has lots of edits on hindi projects,s o he meets the requirements [14:08:38] oh, got it [14:08:40] thanks [14:08:54] unrelated, but this is scary (from fatalmonitor) [14:08:54] aharoni: You can also join the cool kids and get a cloak [14:08:56] 416 data error in /srv/mediawiki/php-1.31.0-wmf.20/extensions/Graph/includes/ApiGraph.php on line 125 [14:09:53] (03PS2) 10Zfilipin: Add sitename for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408032 (https://phabricator.wikimedia.org/T184521) (owner: 10Amire80) [14:10:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408032 (https://phabricator.wikimedia.org/T184521) (owner: 10Amire80) [14:10:57] zeljkof it seems the beta cluster got that too [14:11:08] zeljkof https://integration.wikimedia.org/ci/job/beta-scap-eqiad/195292/ [14:11:08] (03CR) 10Gehel: [C: 04-1] "Minor naming issues (comment inline). Otherwise LGTM." (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [14:11:32] Urbanecm: Thanks. [14:11:42] Jayprakash12345, you're welcome [14:11:50] paladox: what's the problem? [14:12:00] in the middle of something else, well, swat [14:12:20] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Tue 2018-02-13 14:12:16 UTC. 
[14:12:31] (03Merged) 10jenkins-bot: Add sitename for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408032 (https://phabricator.wikimedia.org/T184521) (owner: 10Amire80) [14:12:46] (03CR) 10jenkins-bot: Add sitename for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408032 (https://phabricator.wikimedia.org/T184521) (owner: 10Amire80) [14:12:53] zeljkof permission errors [14:13:02] zeljkof 14:06:54 rsync: mkstemp "/srv/mediawiki/php-master/cache/gitinfo/.info-extensions-3D.json.WBJ3La" failed: Permission denied (13) [14:13:11] paladox: is there a task about it? [14:13:16] nope [14:13:22] it just started happening [14:13:27] could you please create one [14:13:36] ok [14:13:53] thanks [14:14:10] there were some permission problems yesterday [14:14:33] done https://phabricator.wikimedia.org/T187195 [14:14:35] zeljkof ^^ [14:14:47] thanks [14:14:56] your welcome [14:15:33] aharoni: your patch is deployed to mwdebug1002, please test and let me know if I can deploy it to low earth orbit [14:15:46] aharoni: oh, wait, will be there in a minute, will ping you [14:16:10] (03PS1) 10Arturo Borrero Gonzalez: WIP: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [14:16:35] (03CR) 10jerkins-bot: [V: 04-1] WIP: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [14:16:58] aharoni: ok, its at mwdebug1002, please test and let me know [14:18:03] (03PS1) 10Filippo Giunchedi: site: decom graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/410178 (https://phabricator.wikimedia.org/T187190) [14:18:13] Urbanecm: please stand by, your patch is important to us :) I will ping you in a few minutes when your patch is at mwdebug [14:18:27] still working on aharoni and kart_'s patches [14:18:54] * kart_ is waiting.. 
[14:19:11] zeljkof, I'm still here and will be for next 1,25 hours [14:19:13] (03PS2) 10Filippo Giunchedi: site: decom graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/410178 (https://phabricator.wikimedia.org/T187190) [14:19:27] checking [14:20:07] zeljkof: tested, works! [14:20:49] aharoni: ok, deploying [14:20:57] kart_: your patch is at mwdebug1002 [14:21:57] OK testing.. [14:22:20] aharoni, kart_, Urbanecm: you should work on getting deployment privileges, if you don't already have them [14:22:24] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:408032|Add sitename for sdwiki (T184521)]] (duration: 00m 57s) [14:22:32] swat process is likely to change [14:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:36] T184521: [[MediaWiki:Pagetitle-view-mainpage/sd]] i18n issue: Sitename is not configured - https://phabricator.wikimedia.org/T184521 [14:22:37] just fyi [14:22:41] zeljkof, change how, if you have time? :) [14:22:54] aharoni: deployed, please check at production [14:23:08] (03CR) 10Filippo Giunchedi: [C: 032] site: decom graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/410178 (https://phabricator.wikimedia.org/T187190) (owner: 10Filippo Giunchedi) [14:23:27] Urbanecm: it's likely that it will be more self service, and less done by #releng, but still not written in stone [14:23:36] zeljkof: tested, works. Thank you! [14:23:39] zeljkof, ok, thanks [14:23:54] aharoni: thanks for deploying with #releng! ;) [14:24:37] zeljkof: We are good. [14:24:44] zeljkof: go ahead, please! [14:24:45] kart_: ok, deplying [14:25:10] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [14:25:34] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#3967466 (10fgiunchedi) a:05fgiunchedi>03None Machine is spare in puppet, good to be decom whenever. 
[14:25:44] !log zfilipin@tin Synchronized php-1.31.0-wmf.20/extensions/ContentTranslation/extension.json: SWAT: [[gerrit:410105|Add ext.cx.widgets.overlay dependency to template editor (T187119)]] (duration: 00m 55s) [14:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:57] T187119: [wmf.20-regression] Uncaught TypeError: $targetTemplate.first(...).cxoverlay is not a function - https://phabricator.wikimedia.org/T187119 [14:25:57] kart_: deployed, please check production [14:26:10] Urbanecm: you are next, reviewing and merging your patche3s [14:26:16] zeljkof, ack [14:27:22] zeljkof: hmm. Let me clear cache and try again. [14:27:31] seems not working. trying again. [14:28:59] kart_: it worked at mwdebug, but does not work at production? [14:29:18] (03CR) 10Dzahn: "that the status page gets removed is expected and intended by joe,the Perl module though i dont know why yet" [puppet] - 10https://gerrit.wikimedia.org/r/409462 (owner: 10Dzahn) [14:29:38] zeljkof: yep [14:29:58] kart_: revert? or waiting a bit more for cache magic to happen? ;) [14:30:03] kart_, zeljkof - I'll try to test. [14:30:13] aharoni: OK [14:30:15] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409625 (https://phabricator.wikimedia.org/T185865) (owner: 10Urbanecm) [14:30:17] (03PS1) 10Jcrespo: [WIP]Orchestrate the source of the database backups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/410180 (https://phabricator.wikimedia.org/T184696) [14:30:19] zeljkof: wait for a while.. [14:30:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP]Orchestrate the source of the database backups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/410180 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [14:31:33] kart_: can I help? [14:32:06] Nikerabbit: can you test CX in production? 
[14:32:22] (03Merged) 10jenkins-bot: Change logos for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409625 (https://phabricator.wikimedia.org/T185865) (owner: 10Urbanecm) [14:32:22] It seems worked in mwdebug, but not after deploy. [14:32:36] (03CR) 10jenkins-bot: Change logos for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409625 (https://phabricator.wikimedia.org/T185865) (owner: 10Urbanecm) [14:32:37] kart_: is there a page where the error always happens? [14:32:38] (03CR) 10Ottomata: "One thought, general +1 though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:32:56] Nikerabbit: take any templare, for example infobox. [14:33:01] template* [14:33:22] (03CR) 10Elukey: role::kafka::analytics::burrow: move to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:33:44] Nikerabbit, kart_ - I'm trying to translate https://en.wikipedia.org/wiki/Kartoffelsalat_%E2%80%93_Nicht_fragen! to Hebrew, with mwdebug1002 enabled, and I get "TypeError: $targetTemplate.first(...).cxoverlay is not a function" [14:33:52] when I click the template [14:34:07] (probably happens in every article with a template) [14:34:22] Urbanecm: 409625 is at mwdebug1002 [14:34:24] OK. So, fix didn't worked. Wondering why not in Production.. [14:34:29] zeljkof, will test [14:34:40] kart_: probably caching, I can try removing the file from cache [14:34:48] should I do that? [14:34:56] zeljkof: yes. [14:34:57] it works with debug=true, which seems to indicate resoure loader module cache is the issue [14:34:58] (03PS3) 10Volans: Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) [14:35:13] Nikerabbit: what is solution for that? full SCAP? 
[14:35:17] zeljkof, working, please deploy [14:35:24] (03CR) 10Volans: "Thanks for the review, replies inline, all fixed." (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [14:35:48] kart_, aharoni, Nikerabbit: this is all I know about cache purges, you think it would work? :) https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges [14:36:01] hashar: ^ [14:36:28] I am pretty sure Roan or Krinkle had figured some fix for this some time in the back [14:36:42] something like touching the files and syncing again [14:37:12] zeljkof: try this ^ [14:37:16] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:409625|Change logos for sdwiki (T185865)]] (duration: 00m 55s) [14:37:26] but we only touched extension.json, hmm [14:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:30] T185865: Change the logo of Sindhi Wikipedia - https://phabricator.wikimedia.org/T185865 [14:37:30] Nikerabbit: yes. That happened few times. [14:37:34] (03PS2) 10Gehel: maps: Icinga alert when OSM replication lags [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) [14:37:42] I wonder if we also touch tools/ext.cx.tools.template.editor.js and sync it [14:38:09] (03CR) 10jerkins-bot: [V: 04-1] maps: Icinga alert when OSM replication lags [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [14:38:41] do we have time for such an experiment? [14:38:45] Urbanecm: deployed and purged cache, please test [14:38:59] zeljkof: that too. touch tools/ext.cx.tools.template.editor.js and extension.js and sync-file again. That should fix. If not, we have to ask experts.. [14:39:12] Nikerabbit: we should. Other patches are config patches. 
[14:39:31] (03PS3) 10Gehel: maps: Icinga alert when OSM replication lags [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) [14:39:33] kart_, Nikerabbit: I'm a reluctant to do that, since it's not in the docs, are you sure it will not break stuff? [14:39:58] hashar, thcipriani|afk: around to help with a caching question? [14:40:07] Not sure, zeljkof. [14:40:24] godog: we have a caching problem/question [14:40:26] (03PS8) 10Elukey: role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) [14:40:34] zeljkof, will test [14:40:41] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3967535 (10BBlack) Well, the above solution would still leave your with your short-age problem, if you didn't also zero out the Age. Do we want to... [14:40:45] (03CR) 10Gehel: [C: 031] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [14:40:54] zeljkof: syncing a file that is not changed is not able to break anything in my imagination :) [14:40:55] zeljkof: shoot [14:40:59] zeljkof, working, we can go to the next patch :D [14:41:09] godog: https://gerrit.wikimedia.org/r/#/c/410105/ is deployed, works at mwdebug1002, does not work at prod [14:41:34] works with debug=true [14:41:44] so probably caching, right? 
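The debug=true comparison mentioned above can be reproduced by rewriting the query string of a load.php URL and fetching both variants. This is only a minimal sketch: the helper name `with_debug` and the example.org URL are hypothetical, and it assumes the only difference needed between the two requests is the `debug` parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug(url: str, enabled: bool = True) -> str:
    """Return `url` with its `debug` query parameter set to true or false.

    Hypothetical helper for illustration; it only rewrites the query string
    and does not contact any server.
    """
    parts = urlsplit(url)
    # Drop any existing debug parameter, keep everything else in order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "debug"
    ]
    query.append(("debug", "true" if enabled else "false"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL, shaped like a ResourceLoader request.
example = "https://example.org/w/load.php?debug=false&modules=ext.example&skin=vector"
debug_url = with_debug(example)
```

If the response for the debug variant contains the fix and the non-debug variant does not, that points at a cached module bundle rather than at the deployed code.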
[14:42:09] I'd say so, if in prod works with debug=true [14:42:10] swat deploy docs have only this regarding cache purges https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges [14:42:29] hm, there is also https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge [14:43:16] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9953/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:43:16] I think it's probably this: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS [14:43:42] also what file/url are we looking at ? [14:43:48] not sure how to clear that manually, I've been able to update l10n cache manually if there's a RL problem. [14:44:03] thcipriani|afk: thanks! I guess we can just wait a bit more then [14:45:19] (03CR) 10Filippo Giunchedi: icinga: add check_established_connections plugin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) (owner: 10Ema) [14:45:21] https://www.irccloud.com/pastebin/Pt4HSxxF/ [14:45:33] godog: ^ [14:45:56] (copy/paste failure on my side) [14:46:32] ack, thanks zeljkof [14:47:14] I guess my question was how to reproduce the problem, i.e. exactly what url doesn't show up as it should [14:47:41] thcipriani|afk: looks like the deployment was 20 minutes ago :( "kart_: deployed, please check production" [14:48:03] kart_, aharoni, Nikerabbit: please see godog's question ^ [14:48:31] kart_, aharoni, Nikerabbit: still does not work (it's been 20 minutes, cache should be gone) [14:48:37] zeljkof, what's the state? ;) [14:48:37] godog: are you familiar with the Content Translation extension? [14:48:41] according to https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS [14:48:51] yeah, seems like for some reason RL is not picking up the change even after it should have already. 
I don't know the logic by which it makes the decision to pick up new code. [14:48:57] Urbanecm: sorry, trouble with https://gerrit.wikimedia.org/r/#/c/410105/ [14:49:05] godog: there's no easy way to send a URL, because they are per user [14:49:06] aharoni: no I'm not [14:49:10] godog: it's hard to give a certain url because rl modules can be store din localstorage [14:49:21] but I am looking at https://fi.wikipedia.org/w/load.php?debug=false&lang=fi&modules=ext.cx.feedback%2Cpageselector%2Cprogressbar%2Ctools%7Cext.cx.tools.card%2Ccategories%2Cdictionary%2Cformatter%2Cgallery%2Cimages%2Cinstructions%2Clink%2Clinter%2Cmanager%2Cmt%2Cmtabuse%2Cpoem%2Creference%2Ctemplate%7Cext.cx.tools.mt.card%7Cext.cx.tools.template.card%2Ceditor%7Cmw.cx.widgets.TemplateParamOp [14:49:24] Urbanecm: I'll continue with your patches :) [14:49:27] tionWidget&skin=vector&version=07neehu for example [14:49:30] zeljkof, great! [14:49:41] (03PS2) 10Zfilipin: Update logo for urwikibooks, add hd logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406801 (https://phabricator.wikimedia.org/T185977) (owner: 10Urbanecm) [14:49:50] ugh the url broke into two lines [14:50:22] but I am not sure whether the dependency should just be included in that url in modules, or would it appear in the request body regardless [14:51:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406801 (https://phabricator.wikimedia.org/T185977) (owner: 10Urbanecm) [14:52:45] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406801 (https://phabricator.wikimedia.org/T185977) (owner: 10Urbanecm) [14:52:48] (03Merged) 10jenkins-bot: Update logo for urwikibooks, add hd logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406801 (https://phabricator.wikimedia.org/T185977) (owner: 10Urbanecm) [14:53:56] Nikerabbit: yeah in the url above changing to debug=true I still can't see overlay, if I'm doing it right [14:54:10] kart_, aharoni, 
Nikerabbit, godog: with 5 minutes left, revert? [14:54:32] or are we likely to have a solution soon? [14:54:44] or, if it's broken anyway, just leave it as is? :) [14:54:59] (it was broken before, and this does not fix it, that is) [14:55:13] (03CR) 10Elukey: [C: 032] role::kafka::analytics::burrow: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/409912 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:55:19] zeljkof: leave it. I hope that some cache magic should happen :) [14:55:50] Nikerabbit: is that OK? [14:55:54] zeljkof: up to you, might as well leave it [14:55:54] Urbanecm: 406801 is at mwdebug, please test, that will probably be the last commit for today [14:56:57] zeljkof, works, deploy please [14:57:21] kart_, Nikerabbit, godog: leaving it as-is is fine with me, if you think that is the way to go [14:57:25] Urbanecm: deploying [14:58:32] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:406801|Update logo for urwikibooks, add hd logo (T185977)]] (duration: 00m 54s) [14:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:46] T185977: Update logo for Urdu Wikibooks - https://phabricator.wikimedia.org/T185977 [14:59:02] +1 for leave it [14:59:17] hopefully some scap/timeout will clear it [14:59:21] (03CR) 10jenkins-bot: Update logo for urwikibooks, add hd logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406801 (https://phabricator.wikimedia.org/T185977) (owner: 10Urbanecm) [14:59:27] zeljkof: Thanks! 
[14:59:36] kart_, Nikerabbit, godog: ok, leaving 410105 as-is [14:59:39] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:406801|Update logo for urwikibooks, add hd logo (T185977)]] (duration: 00m 55s) [14:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:17] Urbanecm: 406801 is deployed, cache purged [15:00:36] zeljkof, thanks [15:00:37] sorry, ran out of time, please reschedule other commits for another swat [15:00:48] I've noted that other patches were skipped and added them for tomorrow swat [15:00:59] Urbanecm: please do [15:01:07] !log EU SWAT finished [15:01:07] I did it already ;) [15:01:14] in that case, thanks :) [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] 10Operations, 10Scap: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076#3967607 (10fgiunchedi) >>! In T187076#3963935, @demon wrote: > This usually happens for one of two reasons > # A root user has come along and stolen o... [15:01:51] (03PS4) 10Ema: icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) [15:01:53] (03PS4) 10Ema: pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) [15:02:11] <_joe_> ema: should I dare to look? [15:02:34] _joe_: I'd spare you the pain till pcc is happy [15:02:57] <_joe_> ok! [15:03:04] <_joe_> we really need a portable pcc [15:03:12] <_joe_> that can work on our home dir [15:03:22] <_joe_> I'll do that once I've been cloned twice [15:03:39] <_joe_> the first clone is already allocated 100% [15:03:41] <_joe_> of course [15:04:18] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[burrow-eqiad] [15:05:10] 10Operations: Create HA setup for DNS recursion - https://phabricator.wikimedia.org/T79058#3967618 (10BBlack) 05Open>03Resolved We have redundant, HA recursors at all sites now via LVS. Next stages of this effort are anycast-related in T186550 [15:06:21] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3967622 (10BBlack) a:05BBlack>03RobH These have been spared-out for a while now and we're fine on the new ones, please kill. [15:06:59] grr [15:07:14] (03PS5) 10Ema: pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) [15:09:24] 10Operations, 10Traffic, 10Patch-For-Review: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#3967627 (10BBlack) [15:09:27] 10Operations, 10Traffic, 10Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#3967625 (10BBlack) 05Open>03declined We did some interesting refactoring here that helped with some other related issues, but the main goal here... [15:10:58] 10Operations, 10Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#3967629 (10BBlack) 05Open>03Resolved AFAIK with the resolution of T162035 we haven't had further reports. [15:12:22] (03PS1) 10Marostegui: db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410184 [15:13:17] _joe_: mmh, perhaps the selector there is not equivalent to type(blah) =~ Type[whatever]? 
[15:13:20] > Error while evaluating a Function Call, Unexpected data in service configuration [15:13:30] https://puppet-compiler.wmflabs.org/compiler02/9955/lvs1001.wikimedia.org/change.lvs1001.wikimedia.org.err [15:14:55] <_joe_> ema: no, they work correctly in my tests [15:15:19] <_joe_> ema: give me 30 minutes, I'm finishing something else [15:15:25] sure [15:15:35] <_joe_> ema: actually I have a meeting after that [15:15:39] <_joe_> :( [15:17:31] _joe_: thanks for the help so far :) [15:17:54] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410184 (owner: 10Marostegui) [15:18:59] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Increase storage space for Wikidata Query Service - https://phabricator.wikimedia.org/T186526#3967646 (10RobH) a:05Gehel>03RobH I'll request quotation from the vendor for adding SSDS. I'll make a sub-task in #procurement, s... [15:19:31] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410184 (owner: 10Marostegui) [15:19:42] (03CR) 10jenkins-bot: db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410184 (owner: 10Marostegui) [15:20:41] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2066 (duration: 00m 55s) [15:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:15] (03PS1) 10Elukey: burrow: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/410186 (https://phabricator.wikimedia.org/T180442) [15:26:30] (03PS4) 10BBlack: URL Path Normalization: refactor, add to cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407488 (https://phabricator.wikimedia.org/T127387) [15:27:07] (03PS2) 10Elukey: burrow: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/410186 (https://phabricator.wikimedia.org/T180442) [15:27:46] jouncebot: next [15:27:46] In 1 hour(s) and 32 minute(s): Puppet 
SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1700) [15:30:37] !log deploying changes to URL-encoding normalization on caches - https://gerrit.wikimedia.org/r/407488 [15:30:41] (03CR) 10BBlack: [C: 032] URL Path Normalization: refactor, add to cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407488 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [15:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:01] (03PS3) 10Elukey: burrow: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/410186 (https://phabricator.wikimedia.org/T180442) [15:31:25] 10Operations, 10Ops-Access-Requests: Access request: #mediawiki_security for Quiddity - https://phabricator.wikimedia.org/T187108#3967687 (10RobH) 05Open>03Resolved a:03RobH @fgiunchedi: You added them in the fashion we add to the #wikimedia-channels, which is via the channel flags, security is controlle... [15:31:49] (03CR) 10Filippo Giunchedi: maps: Icinga alert when OSM replication lags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [15:32:02] 10Operations, 10Ops-Access-Requests: Access request: #mediawiki_security for Quiddity - https://phabricator.wikimedia.org/T187108#3967691 (10RobH) The fact the one channel differs from the rest annoys me, eventually (pending the discussion at the off site), I plan to rehaul it to match the rest. [15:34:11] (03CR) 10Elukey: [C: 032] burrow: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/410186 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [15:34:17] (03PS4) 10Elukey: burrow: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/410186 (https://phabricator.wikimedia.org/T180442) [15:34:26] 10Operations, 10Ops-Access-Requests: Access request: #mediawiki_security for Quiddity - https://phabricator.wikimedia.org/T187108#3967705 (10fgiunchedi) >>! 
In T187108#3967687, @RobH wrote: > @fgiunchedi: You added them in the fashion we add to the #wikimedia-channels, which is via the channel flags, security... [15:35:02] (03PS5) 10Ema: icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) [15:35:04] (03PS6) 10Ema: pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) [15:35:14] !log Deploy schema change on s5 codfw master (db2052), this will generate lag on codfw - T187089 T185128 T153182 [15:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:30] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [15:35:30] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [15:35:30] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [15:37:21] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3967736 (10mmodell) >>! In T182832#3966467, @MoritzMueh... 
[15:37:35] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3967741 (10Papaul) @chasemp labtestnet2002:eth0 = ge-1/0/13 labtestvirt2003:eth2 = ge-1/0/14 [15:38:49] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3967748 (10Paladox) @mmodell though buster is a few yea... [15:39:18] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:41:58] (03CR) 10Jgreen: [C: 031] admin: revoke access for shrlak [puppet] - 10https://gerrit.wikimedia.org/r/409832 (https://phabricator.wikimedia.org/T186614) (owner: 10Filippo Giunchedi) [15:45:06] _joe_: oh, $service['ip'][$::site] is a struct, not an array! [15:45:36] (03PS3) 10Filippo Giunchedi: admin: revoke access for shrlak [puppet] - 10https://gerrit.wikimedia.org/r/409832 (https://phabricator.wikimedia.org/T186614) [15:45:45] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3966288 (10Cmjohnson) megacli Enclosure Device ID: 32 Slot Number: 3 Drive's position: DiskGroup: 4, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 3 WWN: 5000c50080a382b5 Sequence Numb... [15:46:07] <_joe_> ema: Type[Hash] might match that [15:46:15] ema: clearly it's time to spend 5 minutes quickly refactoring the LVS hieradata :) [15:46:53] (03CR) 10Filippo Giunchedi: [C: 032] admin: revoke access for shrlak [puppet] - 10https://gerrit.wikimedia.org/r/409832 (https://phabricator.wikimedia.org/T186614) (owner: 10Filippo Giunchedi) [15:47:05] <_joe_> bblack: I was about to say [15:47:12] <_joe_> we should only use the latter format [15:47:34] yeah [15:50:56] which is the latter format? 
[15:51:00] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3967804 (10Cmjohnson) Ticket created with Dell You have successfully submitted request SR960779440. [15:51:03] bblack: I actually think there's nothing to refactor :) [15:51:18] <_joe_> bblack: we have ips in the format [15:51:29] <_joe_> eqiad: '1.2.3.4' [15:51:46] ema: I was just trolling. You end up staring at it for a week and then giving up :) [15:51:56] <_joe_> and eqiad: { ip4: '1.2.3.4', ip6: .... } [15:52:06] right [15:52:10] <_joe_> bblack: should we create a t-shirt? [15:52:22] <_joe_> "I refactored lvs::configuration" [15:52:31] so keep in mind, if contemplating that data, that there's a time dimension that matters, too [15:52:50] we occasionally during transitions have to configure > 1xv4+1xv6 per-site [15:53:05] <_joe_> yes [15:53:07] so in the general case, it's at least an array of v4 + an array of v6 [15:53:15] I've been trolled successfully! [15:54:13] we can solve it with a single array of v10 though [15:54:23] I'm not sure if we use the hash keys meaningfully (e.g. "uploadlb6") [15:54:25] <_joe_> ahahahah bblack we're both mean [15:54:42] maybe as diagnostic labels somewhere? [15:54:51] (probably not) [15:55:18] and arguably, it should be up to data consumers to sort out v4-vs-v6, making it just an array of IPs per site [15:55:32] but there are quite a few complex consumers... 
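The two per-site formats described above (a bare string, or a hash of label to address) can be collapsed into one list, leaving consumers to sort out v4 versus v6 themselves. A minimal sketch, assuming those are the only two shapes; the function names and the sample addresses (taken from documentation ranges) are hypothetical and this is not the actual Puppet or erb code.

```python
import ipaddress
from typing import Dict, List, Tuple, Union

# Either a single address, or a mapping of label -> address
# (labels such as "ip4"/"ip6" are illustrative only).
SiteIPs = Union[str, Dict[str, str]]


def normalize_site_ips(site_ips: SiteIPs) -> List[str]:
    """Flatten one site's service IP entry into a plain list of addresses."""
    if isinstance(site_ips, str):
        return [site_ips]
    return list(site_ips.values())


def split_by_family(ips: List[str]) -> Tuple[List[str], List[str]]:
    """Split a flat list of addresses into (IPv4 list, IPv6 list)."""
    v4 = [ip for ip in ips if ipaddress.ip_address(ip).version == 4]
    v6 = [ip for ip in ips if ipaddress.ip_address(ip).version == 6]
    return v4, v6


single = normalize_site_ips("192.0.2.1")
dual = normalize_site_ips({"ip4": "192.0.2.1", "ip6": "2001:db8::1"})
v4_addrs, v6_addrs = split_by_family(dual)
```

Returning lists for both families also covers the transition case raised in the discussion, where a site temporarily carries more than one address per family.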
[15:56:47] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:56:52] so we just need an array of complex numbers, ipv4 + ipv6i [15:57:02] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/9960/" [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) (owner: 10Ema) [15:57:05] * volans hides [15:57:17] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:57:28] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:57:38] elukey: ^^^ no more IOPS? [15:57:39] _joe_: IMHO it's enough to check if type is String, that's what the erb template does after all... [15:57:47] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:57:55] _joe_: CR updated, it looks good according to pcc [15:57:58] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:58:00] volans: somebody is hitting stat1005 with a huge script [15:58:07] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:58:43] note also that we have redundant definitions of flatten_ips() in multiple erb templates to deal with service IP structures. [15:59:10] (03CR) 10Imarlier: [C: 031] webperf: Introduce 'templates' in test fixture and use for mwload [puppet] - 10https://gerrit.wikimedia.org/r/404046 (owner: 10Krinkle) [16:01:47] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1018 - https://phabricator.wikimedia.org/T186988#3967860 (10Cmjohnson) A ticket has been created with HP our case was successfully submitted. 
Please note your Case ID: 5327018094 for future reference [16:01:58] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [16:04:13] 10Operations, 10ops-eqiad: Add label to kafka1023 - https://phabricator.wikimedia.org/T186895#3967869 (10Cmjohnson) 05Open>03Resolved [16:04:15] 10Operations, 10ops-eqiad: Missing servers in racktables - https://phabricator.wikimedia.org/T186814#3967870 (10Cmjohnson) [16:04:58] 10Operations, 10ops-eqiad: Missing servers in racktables - https://phabricator.wikimedia.org/T186814#3956080 (10Cmjohnson) 05Open>03Resolved This has been fixed. [16:06:54] (03PS6) 10Ema: icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) [16:08:19] (03CR) 10Ema: icinga: add check_established_connections plugin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) (owner: 10Ema) [16:08:52] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#3966345 (10Cmjohnson) his will require troubleshooting, moving DIMMs around and waiting to see if the error returns. Do you see anything else in logs besides the idrac log? Dell will requ... 
[16:09:10] (03CR) 10Gehel: maps: Icinga alert when OSM replication lags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [16:09:55] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Emails sent to Wikidata mailing list are not received - https://phabricator.wikimedia.org/T187163#3967897 (10herron) [16:10:03] 10Operations, 10ops-eqiad: OfflineUncorrectableSector on mw1256 sda - https://phabricator.wikimedia.org/T186535#3967898 (10Cmjohnson) The disk has been replaced and needs a re-install [16:13:00] 10Operations, 10ops-eqiad, 10Release-Engineering-Team (Watching / External): tin has a failing hdd - https://phabricator.wikimedia.org/T174449#3562401 (10Cmjohnson) This server is on the 5+ years old server list and needs to be replaced with either a new server or on-site spare. Please make a h/w request an... [16:14:18] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [16:14:42] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3967930 (10Cmjohnson) @fgiunchedi This sever needs to be replaced and decommissioned. It is on the 5+ years old list. Please create a h/w... [16:15:07] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [16:16:12] 10Operations, 10ops-eqsin, 10Traffic, 10netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3967940 (10ayounsi) I also double-check the connection tracking spreadsheet and that should be the proper port. Switch says > error: device xe-1/0/4 not found And the tran... 
[16:17:05] !log replacing disk poolcounte1002 [16:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:50] (03PS1) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410197 (https://phabricator.wikimedia.org/T166733) [16:18:32] (03CR) 10Filippo Giunchedi: maps: Icinga alert when OSM replication lags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [16:20:07] (03CR) 10Anomie: [C: 032] "Config change, already discussed with various people" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410197 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:20:47] (03PS4) 10Gehel: maps: Icinga alert when OSM replication lags [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) [16:21:48] 10Operations, 10Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3967957 (10Superyetkin) 05Open>03Resolved a:03Superyetkin Thanks, removing the redundant code solved the issue. I fully agree with jcrespo in that there may be other wikis still using similar code block... [16:22:20] cmjohnson1: re T186534 it isn't clear to me if you are replacing the disk or not? 
[16:22:20] T186534: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534 [16:22:39] (03CR) 10Gehel: maps: Icinga alert when OSM replication lags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [16:22:41] (03Merged) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410197 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:22:44] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3967961 (10chasemp) Any luck where @robh or @Cmjohnson ? [16:22:50] (03CR) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410197 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [16:23:08] PROBLEM - Host poolcounter1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:51] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Setting wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 0 (duration: 00m 56s) [16:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:27] PROBLEM - Host poolcounter1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:28] ah nevermind, looks like it! [16:25:36] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3967969 (10Cmjohnson) One will got A4 but we have to move or remove rdb1003. I don't know who owns that. The other will go in a 10G rack in row D. I am working on the network refresh... 
[16:25:37] that's me [16:25:37] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [16:25:47] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [16:25:48] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:26:07] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [16:26:27] godog: disk replaced [16:26:27] RECOVERY - DPKG on stat1005 is OK: All packages OK [16:26:57] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:22] cmjohnson1: awesome, thanks! [16:27:31] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3967974 (10Cmjohnson) The disk was replaced for you to either try and add it back or re-install. Historically, there have been issues tryin... [16:27:48] godog: YW see my msg about the disk and about the age of the server. Thx [16:27:59] (03PS1) 10Gilles: Update Thumbor header names [puppet] - 10https://gerrit.wikimedia.org/r/410199 (https://phabricator.wikimedia.org/T187159) [16:28:17] RECOVERY - Host poolcounter1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:28:51] cmjohnson1: ok! 
yeah I'll put in a task to decom [16:29:37] RECOVERY - Host poolcounter1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [16:30:03] well looks like it booted into pxe and reinstalling [16:30:18] so yeah that "fixes" the problem [16:30:18] PROBLEM - Poolcounter connection on poolcounter1002 is CRITICAL: connect to address 10.64.16.152 and port 7531: Connection refused [16:30:18] PROBLEM - Check whether ferm is active by checking the default input chain on poolcounter1002 is CRITICAL: Return code of 255 is out of bounds [16:30:27] PROBLEM - Check size of conntrack table on poolcounter1002 is CRITICAL: Return code of 255 is out of bounds [16:30:37] PROBLEM - DPKG on poolcounter1002 is CRITICAL: Return code of 255 is out of bounds [16:30:47] PROBLEM - poolcounter on poolcounter1002 is CRITICAL: Return code of 255 is out of bounds [16:30:47] PROBLEM - dhclient process on poolcounter1002 is CRITICAL: Return code of 255 is out of bounds [16:31:09] 10Operations, 10Scap: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076#3967985 (10demon) >>! In T187076#3967607, @fgiunchedi wrote: > scap IMO should be warning or refuse to continue if the user has a busted umask to avoi... [16:31:59] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3967989 (10chasemp) >>! In T183937#3967969, @Cmjohnson wrote: > One will got A4 but we have to move or remove rdb1003. I don't know who owns that. The other will go in a 10G rack in ro... 
[16:33:13] (03PS1) 10Chad: scap clean: Only use rmtree on directories, otherwise use remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410200 [16:33:15] (03CR) 10Chad: [C: 032] scap clean: Only use rmtree on directories, otherwise use remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410200 (owner: 10Chad) [16:35:18] (03Merged) 10jenkins-bot: scap clean: Only use rmtree on directories, otherwise use remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410200 (owner: 10Chad) [16:36:04] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3968007 (10Niedzielski) > PUPPETEER_SKIP_CHROMIUM_DOWNLOAD Scratch that. This now must be set in the production environment to the developmen... [16:36:34] !log demon@tin Synchronized scap/plugins/clean.py: no-op, consistency (duration: 00m 55s) [16:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:57] (03CR) 10jenkins-bot: scap clean: Only use rmtree on directories, otherwise use remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410200 (owner: 10Chad) [16:37:18] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3968009 (10Niedzielski) [16:37:26] 10Operations, 10ops-eqiad: OfflineUncorrectableSector on mw1256 sda - https://phabricator.wikimedia.org/T186535#3968010 (10fgiunchedi) I tried to access `mw1256.mgmt.eqiad.wmnet` at `10.65.2.106` though that ends up being `lead`'s console: ``` Debian GNU/Linux 8 lead ttyS1 lead login: ``` [16:38:49] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3968012 (10Cmjohnson) @chasemp yes this is for another task..sorry I do not have an update for why your labvirts are not working. Maybe @robh has made some progress. 
[16:42:14] 10Operations, 10ops-eqsin, 10Traffic: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3968028 (10ayounsi) It's a nice to have, just in case, but clearly not a blocker. [16:42:38] (03PS1) 10Odder: Update logos for the Urdu Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410201 (https://phabricator.wikimedia.org/T187209) [16:44:18] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Tue 2018-02-13 16:44:16 UTC. [16:44:55] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3968050 (10chasemp) >>! In T183937#3968012, @Cmjohnson wrote: > @chasemp yes this is for another task..sorry I do not have an update for > why your labvirts are not working. Maybe @ro... [16:45:07] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:50:46] 10Operations, 10Scap: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076#3968110 (10fgiunchedi) >>! In T187076#3967985, @demon wrote: >>>! In T187076#3967607, @fgiunchedi wrote: >> scap IMO should be warning or refuse to co... [16:54:46] (03CR) 10Filippo Giunchedi: [C: 031] maps: Icinga alert when OSM replication lags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [16:56:26] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3478159 (10Halfak) It looks like this is done. Is that right? [17:00:05] godog, moritzm, and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1700). 
[17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:09] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968174 (10MoritzMuehlenhoff) >>! In T182832#3967736, @... [17:02:32] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968177 (10elukey) >>! In T182832#3968174, @MoritzMuehl... [17:03:20] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: connect to address 10.64.0.197 and port 9005: Connection refused [17:03:47] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: all (tags: ['dc=eqiad', 'cluster=ores', 'service=ores']) [17:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:07] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: all (tags: ['dc=codfw', 'cluster=ores', 'service=ores']) [17:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:16] <_joe_> akosiaris: *wow* [17:05:29] PROBLEM - Host mw1259 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:37] hmmm [17:05:48] <_joe_> mw1259 is chris in the dc [17:06:49] PROBLEM - Host mw1259.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:40] RECOVERY - Host mw1259 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:10:49] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3128: Connection refused [17:11:29] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [17:11:38] ^ looking (cp4031) [17:11:49] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4031 
is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time [17:11:59] RECOVERY - Host mw1259.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [17:12:35] !log stat1001 going down for rack relocation [17:12:43] hmm cp4031 is a cron-scheduled backend restart. usually icinga doesn't catch those (and they depool anyways) [17:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:29] !log sorry snapshot1001 is going down for rack relocation [17:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:53] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3968243 (10Jdlrobson) [17:16:09] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:17:50] PROBLEM - Host snapshot1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:20:00] (03PS1) 10Ayounsi: Diffscan: Add eqsin v4 range [puppet] - 10https://gerrit.wikimedia.org/r/410210 [17:20:15] !log demon@tin Synchronized README: forcing git config sync, setting core.sharedRepository=group, T187076 (duration: 01m 12s) [17:20:21] thcipriani: Went ahead and set it for the base mediawiki-config repo ^. Scap prep needs a fix still I s'pose [17:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:28] T187076: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076 [17:20:40] 10Operations, 10Scap: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076#3963501 (10thcipriani) FWIW, because we keep the masters in sync as part of scap, so unfortunately (maybe :)) naos was fixed as soon as the sync happe... 
[17:21:41] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3968299 (10ovasileva) [17:22:38] (03PS1) 10Alexandros Kosiaris: Revert "Disable notification for role::ores" [puppet] - 10https://gerrit.wikimedia.org/r/410211 [17:22:51] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Disable notification for role::ores" [puppet] - 10https://gerrit.wikimedia.org/r/410211 (owner: 10Alexandros Kosiaris) [17:22:59] RECOVERY - Host snapshot1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [17:23:39] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:25:50] (03PS1) 10Elukey: role::webserver_misc_apps: add kafka burrow lag monitoring for main [puppet] - 10https://gerrit.wikimedia.org/r/410212 (https://phabricator.wikimedia.org/T180442) [17:29:21] (03Abandoned) 10Niedzielski: WIP: Hygiene: remove pdfrender and electron-render services [puppet] - 10https://gerrit.wikimedia.org/r/409952 (https://phabricator.wikimedia.org/T186748) (owner: 10Niedzielski) [17:31:21] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Emails sent to Wikidata mailing list are not received - https://phabricator.wikimedia.org/T187163#3968339 (10herron) 05Open>03Resolved a:03herron Hi @Lea_Lacroix_WMDE, it seems that the messages you've recently sent were tagged with a low spam score (bet... 
[17:32:36] (03PS2) 10Elukey: role::webserver_misc_apps: add kafka burrow lag monitoring for main [puppet] - 10https://gerrit.wikimedia.org/r/410212 (https://phabricator.wikimedia.org/T180442) [17:34:56] (03PS1) 10Chad: Gerrit: Force expire the old /r login cookie [puppet] - 10https://gerrit.wikimedia.org/r/410214 [17:35:46] (03CR) 10Ayounsi: [C: 032] Diffscan: Add eqsin v4 range [puppet] - 10https://gerrit.wikimedia.org/r/410210 (owner: 10Ayounsi) [17:35:46] PROBLEM - HHVM rendering on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:52] (03PS2) 10Ayounsi: Diffscan: Add eqsin v4 range [puppet] - 10https://gerrit.wikimedia.org/r/410210 [17:36:34] (03PS2) 10Chad: Gerrit: Force expire the old /r login cookie [puppet] - 10https://gerrit.wikimedia.org/r/410214 [17:37:26] (03PS3) 10Elukey: role::webserver_misc_apps: add kafka burrow lag monitoring for main [puppet] - 10https://gerrit.wikimedia.org/r/410212 (https://phabricator.wikimedia.org/T180442) [17:38:06] PROBLEM - HHVM processes on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:38:57] RECOVERY - HHVM processes on mw1256 is OK: PROCS OK: 6 processes with command name hhvm [17:39:46] (03CR) 10Paladox: [C: 031] Gerrit: Force expire the old /r login cookie [puppet] - 10https://gerrit.wikimedia.org/r/410214 (owner: 10Chad) [17:40:32] (03CR) 10Elukey: [C: 032] role::webserver_misc_apps: add kafka burrow lag monitoring for main [puppet] - 10https://gerrit.wikimedia.org/r/410212 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [17:40:37] PROBLEM - Apache HTTP on mw1256 is CRITICAL: connect to address 10.64.48.91 and port 80: Connection refused [17:40:37] PROBLEM - Nginx local proxy to apache on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time [17:40:40] (03PS4) 10Elukey: role::webserver_misc_apps: add kafka burrow lag monitoring for main [puppet] - 10https://gerrit.wikimedia.org/r/410212 (https://phabricator.wikimedia.org/T180442) [17:44:27] mw1256 is me [17:44:34] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 75552 bytes in 0.253 second response time [17:44:44] RECOVERY - Apache HTTP on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.040 second response time [17:44:44] RECOVERY - Nginx local proxy to apache on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [17:45:24] PROBLEM - Check systemd state on krypton is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:47:46] this is me --^ [17:48:47] (03PS1) 10Elukey: burrow: parametrize configuration file names [puppet] - 10https://gerrit.wikimedia.org/r/410219 (https://phabricator.wikimedia.org/T180442) [17:49:44] (03PS2) 10Arturo Borrero Gonzalez: WIP: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [17:52:34] RECOVERY - Disk space on stat1005 is OK: DISK OK [17:54:24] !log repool mw1256 after disk swap - T186535 [17:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:37] T186535: OfflineUncorrectableSector on mw1256 sda - https://phabricator.wikimedia.org/T186535 [18:00:05] cscott, arlolra, subbu, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:15] !log Preparing to cut new MediaWiki branch wmf/1.31.0-wmf.21 - report deployment blockers for this branch in phabricator: T183960 [18:00:15] Nothing for ORES [18:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:27] T183960: 1.31.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T183960 [18:00:39] twentyafterfour: re: wmf.21....basically everything from wmf.20 is fixed and in respective masters. 
[18:00:41] nothing for parsoid [18:00:49] The remaining two issues weren't really "blockers" or serious [18:01:03] (03CR) 10Elukey: [C: 032] burrow: parametrize configuration file names [puppet] - 10https://gerrit.wikimedia.org/r/410219 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [18:04:38] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational [18:05:49] (03Abandoned) 10Gehel: Updates to enable short URLs for transliteration for crhwiki production [puppet] - 10https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: 10Gehel) [18:06:06] (03PS1) 10Giuseppe Lavagetto: Increase test coverage [software/conftool] - 10https://gerrit.wikimedia.org/r/410224 [18:06:08] (03PS1) 10Giuseppe Lavagetto: Add simple actions to be exercised only on the basic types. [software/conftool] - 10https://gerrit.wikimedia.org/r/410225 [18:06:10] (03PS1) 10Giuseppe Lavagetto: Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 [18:07:20] (03CR) 10jerkins-bot: [V: 04-1] Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 (owner: 10Giuseppe Lavagetto) [18:07:22] (03CR) 10jerkins-bot: [V: 04-1] Add simple actions to be exercised only on the basic types. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/410225 (owner: 10Giuseppe Lavagetto) [18:07:24] (03CR) 10jerkins-bot: [V: 04-1] Increase test coverage [software/conftool] - 10https://gerrit.wikimedia.org/r/410224 (owner: 10Giuseppe Lavagetto) [18:09:16] 10Operations, 10Maps-Sprint, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3968439 (10Gehel) [18:17:52] !log mholloway-shell@tin Started deploy [mobileapps/deploy@e488cee]: Update mobileapps to 5851dfc [18:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:51] mdholloway: Actually you're deploying e488cee ;-) [18:20:20] I find the "what people say they're deploying" vs "what scap says you're deploying" discrepancy kinda fun :) [18:20:24] (I trust the latter) [18:22:21] "cute things scap says" [18:23:20] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@e488cee]: Update mobileapps to 5851dfc (duration: 05m 28s) [18:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:10] no_justification: fair enough [18:24:36] Not blaming you <3 [18:25:08] !log Analytics Hadoop cluster upgrade to Java 8 about to start - complete cluster shutdown is needed - T166248 [18:25:18] jynus: You should try `scap say` sometime :p [18:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:21] T166248: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248 [18:29:08] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:31:08] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:31:09] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:31:19] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:31:38] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:31:48] PROBLEM - 
dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:31:48] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:40:25] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968525 (10Dzahn) I suppose we don't bother with the cr... [18:41:35] no_justification: we haven't been super-consistent internally about whether we reference the code repo commit or deploy repo commit. i've been consciously doing the former on the theory that it's slightly faster to see at a glance what's deployed if the commit hash refers to the code repo, rather than having to look at the full deploy repo commit message and cross-reference. [18:41:43] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968527 (10mmodell) Yeah, I could try to rush the updat... [18:41:53] no_justification: but i suppose if there's an assumption it's a reference to the code repo, that assumption goes out the window :/ [18:42:19] *to the deploy repo, that is [18:42:59] It's a fair assumption for your team to make, sure! 
And honestly, any log message is better than no log message :) [18:43:14] (03PS1) 10Jcrespo: mariadb: Depool db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410231 [18:43:35] (Also, the idea of a "deploy" repo is kinda redundant and unnecessary for a lot of things) [18:43:50] (03PS2) 10Elukey: Force JAVA_HOME to openjdk-8's jre for all Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/408251 (https://phabricator.wikimedia.org/T166248) [18:43:59] mdholloway: I'm not knocking your !log message, I mostly find the discrepancy to be humorous :) [18:44:11] [also, your explanation clears up why it happens so frequently!] [18:44:36] (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrade: logging messages go to stdout and stderr [puppet] - 10https://gerrit.wikimedia.org/r/410232 (https://phabricator.wikimedia.org/T181647) [18:44:46] how is the mediawiki queue, is the runway clear? [18:45:08] jynus: Ask twentyafterfour, but train doesn't start officially for another 1h15m [18:45:28] ok, I will do some sneaky deploy over here [18:45:32] (he's ATC this week :)) [18:45:33] on codfw only [18:45:34] (03PS2) 10Arturo Borrero Gonzalez: apt: apt-upgrade: logging messages go to stdout and stderr [puppet] - 10https://gerrit.wikimedia.org/r/410232 (https://phabricator.wikimedia.org/T181647) [18:46:13] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrade: logging messages go to stdout and stderr [puppet] - 10https://gerrit.wikimedia.org/r/410232 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [18:46:19] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [18:46:20] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410231 (owner: 10Jcrespo) [18:47:18] (03CR) 10Elukey: [C: 032] Force JAVA_HOME to openjdk-8's jre for all Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/408251 
(https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [18:47:22] (03PS3) 10Elukey: Force JAVA_HOME to openjdk-8's jre for all Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/408251 (https://phabricator.wikimedia.org/T166248) [18:47:57] (03Merged) 10jenkins-bot: mariadb: Depool db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410231 (owner: 10Jcrespo) [18:48:08] (03CR) 10jenkins-bot: mariadb: Depool db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410231 (owner: 10Jcrespo) [18:53:00] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2042 (duration: 01m 58s) [18:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:02:13] (03PS2) 10Jcrespo: [WIP]Orchestrate the source of the database backups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/410180 (https://phabricator.wikimedia.org/T184696) [19:02:15] (03PS1) 10Jcrespo: mariadb: Move db2042 socket to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410233 (https://phabricator.wikimedia.org/T148507) [19:02:26] (03PS2) 10Jcrespo: mariadb: Move db2042 socket to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410233 (https://phabricator.wikimedia.org/T148507) [19:02:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP]Orchestrate the source of the database backups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/410180 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [19:02:51] (03CR) 10Gergő Tisza: [C: 04-2] "Blocked on T187226." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: 10Gergő Tisza) [19:03:36] !log upgrade and restart db2042 [19:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:08] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [19:07:58] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2042 socket to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410233 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [19:08:00] (03PS3) 10Arturo Borrero Gonzalez: WIP: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [19:14:50] no_justification: are there known problems with scap prep and recursive submodules? [19:15:05] Um, not that I'm aware of. What's up? [19:15:26] it's erroring on me and it seems like it failed the same way last time ... it isn't initializing the visual editor sub-submodules [19:15:46] but if I run the git command manually, of course it works fine [19:15:59] yet inside scap prep it returns error code 1 and skips the sub-submodules [19:16:27] * twentyafterfour needs to add better error logging to scap prep [19:17:31] Ummmm, so I dropped the "branch sub-submodules" stuff from make-wmf-branch [19:17:37] That....might be wonky [19:18:59] well the submodule update --init --recursive fixes things but scap prep doesn't finish so I think some things remain undone [19:19:32] no_justification: I'll straighten it out and submit a patch to scap prep to improve failure handling [19:22:33] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968786 (10Dzahn) I didn't even mean to imply you have... 
[19:23:09] (03PS1) 10Jcrespo: mariadb: Repool db2042 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 [19:24:10] (03CR) 10Jcrespo: "Not sure if we should repool it or move it elsewhere (e.g. misc)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 (owner: 10Jcrespo) [19:25:59] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [19:26:16] this is me, downtime expired --^ [19:26:52] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968808 (10greg) >>! In T182832#3968527, @mmodell wrote... [19:32:08] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue [19:33:28] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 44 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark] [19:34:58] fixing --^ [19:36:45] 10Operations, 10Code-Stewardship-Reviews, 10Services: zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#3967379 (10danstillman) Zotero dev here. I'm not clear on when all the above was written, but a few clarifications. - translation-server currently runs on F... [19:38:28] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:40:08] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:40:24] yes puppet I know [19:40:27] gimme a break :D [19:41:26] 10Operations, 10Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3968879 (10Ladsgroup) I don't think we have anything else that is used wiki-wide in large scale. Unless my query is wrong. ``` ladsgroup@tin:~$ mwgrep "WikimediaBadges" bhwiki MediaWiki:Gadget-Liv... [19:44:08] no_justification: ok, scap prep fail was due to my gitconfig [19:44:31] Ah! [19:45:08] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:49:55] 10Operations, 10Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3968906 (10jcrespo) ``` curl 'https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.cite.styles%7Cext.echo.badgeicons%7Cext.echo.styles.badge%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArt... [19:50:09] (03PS1) 10Smalyshev: Set SPARQL endpoint for category search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410242 [19:54:22] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968921 (10Nuria) As of me checking today referral groups are mostly unchanged from the graphs i pasted above. That is, safari sessions ap... [19:56:11] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3968930 (10mmodell) >>! In T182832#3968808, @greg wrote... 
[19:56:36] (03PS1) 10Elukey: profile::java::analytics: deploy only java 8 [puppet] - 10https://gerrit.wikimedia.org/r/410244 (https://phabricator.wikimedia.org/T166248) [19:56:42] ottomata: --^ :) [19:56:57] spark-shell on stat1004 for some reason still prefers java 7 [19:57:02] so I want to nuke it :D [19:59:59] nuke it?! [20:00:07] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180213T2000). [20:00:07] No GERRIT patches in the queue for this window AFAICS. [20:01:08] (03CR) 10Ottomata: [C: 031] "Yeehaw" [puppet] - 10https://gerrit.wikimedia.org/r/410244 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [20:02:12] (03CR) 10EBernhardson: [C: 031] Set SPARQL endpoint for category search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410242 (owner: 10Smalyshev) [20:05:32] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968983 (10Tgr) If nothing seems more broken than before that's good enough for me :) The main goal was to prevent Edge and Safari from se... 
[20:05:56] !log MediaWiki Train 1.31.0-wmf.21 branched, prepped and patched | Changelog uploaded to https://www.mediawiki.org/wiki/MediaWiki_1.31/wmf.21/Changelog | Blockers: T183960 [20:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:12] T183960: 1.31.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T183960 [20:07:34] (03CR) 10Elukey: [C: 032] profile::java::analytics: deploy only java 8 [puppet] - 10https://gerrit.wikimedia.org/r/410244 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [20:11:49] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3969008 (10Niedzielski) [20:11:58] !log Currently there are no blockers listed on T183960 and the train is leaving the station. [20:12:03] !log twentyafterfour@tin Started scap: T183960 Build l10n cache & Deploy wmf/1.31.0-wmf.21 to test wikis [20:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:12] T183960: 1.31.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T183960 [20:12:21] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3954012 (10Niedzielski) [20:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] (03PS1) 10Elukey: statistics::packages: deploy only java 8 [puppet] - 10https://gerrit.wikimedia.org/r/410250 (https://phabricator.wikimedia.org/T166248) [20:16:31] ottomata: --^ mind to? 
[20:17:05] (03CR) 10Ottomata: [C: 031] statistics::packages: deploy only java 8 [puppet] - 10https://gerrit.wikimedia.org/r/410250 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [20:17:21] !log andrew@tin Started deploy [horizon/deploy@c355366]: updated static content collection process [20:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:41] (03CR) 10Elukey: [C: 032] statistics::packages: deploy only java 8 [puppet] - 10https://gerrit.wikimedia.org/r/410250 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [20:18:37] !log andrew@tin Finished deploy [horizon/deploy@c355366]: updated static content collection process (duration: 01m 17s) [20:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:48] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [20:21:58] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [20:21:59] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [20:22:02] (03PS1) 10Ppchelko: Added page-delete, page-undelete and page-properties-change to EventStreams. 
[puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) [20:22:08] RECOVERY - Disk space on stat1005 is OK: DISK OK [20:22:28] RECOVERY - DPKG on stat1005 is OK: All packages OK [20:22:28] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [20:23:36] (03PS1) 10Eevans: cassandra: enable component/cassandra33 where applicable [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) [20:24:01] 10Operations, 10Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3969059 (10Ladsgroup) One tip, add &debug=true so it doesn't minify the result and seeing results will be easier: ``` curl 'https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.cite.styles%7Cext... [20:24:08] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:24:58] 10Operations, 10Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3969062 (10jcrespo) ok [20:26:52] !log upgrading labsdb1010 database - proxies will complain for some time [20:26:57] (03CR) 10Ottomata: "Thanks Petr! Is there anything in page-properties-change that could be privacy sensitive?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [20:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:28] dbproxy1010 and dbproxy1011 should now complain [20:30:13] (03PS2) 10Ppchelko: Added page-delete, page-undelete and page-properties-change to EventStreams. 
[puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) [20:30:18] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [20:30:38] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [20:32:17] (03CR) 10Ppchelko: "We also have page-create and page-move that I think we could expose as well. The page-restrictions-change, revision-visibility-change and " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [20:33:18] and they should recover soon, too [20:33:39] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [20:34:18] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [20:34:26] (03CR) 10Ppchelko: "As for page-properties-change and privacy - I'm not really sure, extensions can put pretty much anything in there. 
We might reconsider exp" [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [20:35:08] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [20:35:30] (03PS2) 10Eevans: cassandra: enable component/cassandra33 where applicable [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) [20:35:37] (03PS2) 10Niedzielski: New: add chromium_render service [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) [20:36:35] we are now serving 40989 database queries per second with a single server as routine [20:37:34] (03CR) 10Ppchelko: "All the page properties are stored in the page_props table, so I guess the same privacy concerns could be applied here as it's applied in " [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [20:39:39] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 62.91, 41.72, 33.10 [20:40:03] (03CR) 10Eevans: [C: 031] "This should be a no-op, adding a repository containing cassandra-3.11.0-wmf5 to a set of machines that already have it installed." 
[puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [20:41:15] !log andrew@tin Started deploy [horizon/deploy@c355366]: updated static content collection process [20:41:23] (03CR) 10Ottomata: "Let's def add page-create and page-move if we are doing this" [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [20:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:35] !log andrew@tin Finished deploy [horizon/deploy@c355366]: updated static content collection process (duration: 00m 21s) [20:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:04] !log twentyafterfour@tin Finished scap: T183960 Build l10n cache & Deploy wmf/1.31.0-wmf.21 to test wikis (duration: 31m 01s) [20:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:18] T183960: 1.31.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T183960 [20:43:48] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:46:19] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Tue 2018-02-13 20:46:16 UTC. [20:50:36] (03PS3) 10Ppchelko: Added page-related events to EventStreams. [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) [21:01:43] (03CR) 10Ottomata: "milimetric checked, looks like page_properties is avail in labsdb replicas. let's use it" [puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [21:01:54] (03CR) 10Smalyshev: [C: 031] Added page-related events to EventStreams. 
[puppet] - 10https://gerrit.wikimedia.org/r/410251 (https://phabricator.wikimedia.org/T187241) (owner: 10Ppchelko) [21:03:28] (03PS1) 10MaxSem: Deploy GlobalPreferences in Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410267 (https://phabricator.wikimedia.org/T184668) [21:03:49] no_justification: is gerrit login broken? It doesn't work for me. [21:04:14] AaronSchulz: did you see the email about removing cookies? =] [21:04:30] I've got a patch for this... [21:04:43] But yes: clear cookie in meantime [21:06:38] !log andrew@tin Started deploy [horizon/deploy@c355366]: another try with static content [21:06:41] !log andrew@tin Finished deploy [horizon/deploy@c355366]: another try with static content (duration: 00m 03s) [21:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:59] !log andrew@tin Started deploy [horizon/deploy@c355366]: another try with static content [21:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:48] !log andrew@tin Finished deploy [horizon/deploy@c355366]: another try with static content (duration: 00m 49s) [21:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:42] (03PS1) 1020after4: group0 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410277 [21:12:44] (03CR) 1020after4: [C: 032] group0 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410277 (owner: 1020after4) [21:13:48] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:14:27] (03Merged) 10jenkins-bot: group0 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410277 (owner: 1020after4) [21:15:46] (03PS2) 10Smalyshev: Set SPARQL endpoint for category search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410242 
(https://phabricator.wikimedia.org/T184840) [21:17:40] (03CR) 10jenkins-bot: group0 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410277 (owner: 1020after4) [21:19:34] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.31.0-wmf.21 [21:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:40] 10Operations, 10Puppet: puppetdb4: systemd config review - https://phabricator.wikimedia.org/T187257#3969415 (10herron) p:05Triage>03Normal [21:37:45] 10Operations, 10Puppet: puppetdb4: use postgres db backend in puppet-compiler - https://phabricator.wikimedia.org/T187258#3969426 (10herron) p:05Triage>03Normal [21:37:54] 10Operations, 10Puppet: puppetdb4: upgrade puppetdbquery module - https://phabricator.wikimedia.org/T187259#3969437 (10herron) p:05Triage>03Normal [21:39:55] (03CR) 10Dzahn: [C: 032] cassandra: enable component/cassandra33 where applicable [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [21:40:01] (03PS2) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [21:40:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [21:41:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3969463 (10Cmjohnson) [21:42:01] (03CR) 10Dzahn: [C: 032] "the repo (sources.list) has been added but no software upgrade or any change" [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [21:43:35] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - 
https://phabricator.wikimedia.org/T187035#3969465 (10Dzahn) added to "WMF-NDA" in Phabricator https://phabricator.wikimedia.org/project/members/61/ This should let you see all the private tickets. [21:49:03] (03PS1) 10Cmjohnson: Adding mgmt and prodcution dns for db1115 [dns] - 10https://gerrit.wikimedia.org/r/410341 (https://phabricator.wikimedia.org/T185788) [21:52:22] (03CR) 10Cmjohnson: [C: 032] Adding mgmt and prodcution dns for db1115 [dns] - 10https://gerrit.wikimedia.org/r/410341 (https://phabricator.wikimedia.org/T185788) (owner: 10Cmjohnson) [21:56:00] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3969490 (10Cmjohnson) [21:57:40] 10Operations, 10Puppet: puppetdb4: systemd config review - https://phabricator.wikimedia.org/T187257#3969491 (10herron) Current: ``` [Unit] Description="puppetDB centralized storage daemon" [Service] User=puppetdb Group=puppetdb Environment=CONFIG=/etc/puppetdb/conf.d ExecStartPre=/bin/bash -c "test -e /var/l... [21:58:55] (03PS1) 10Eevans: cassandra: actually enable component/cassandra311 (not component/cassandra33) [puppet] - 10https://gerrit.wikimedia.org/r/410342 (https://phabricator.wikimedia.org/T186619) [21:59:56] mutante: thanks for the merge! [22:00:20] mutante: but umm... i did a stupid typo :) [22:00:27] or brain-o [22:00:42] should have been cassandra311 not cassandra33 [22:01:06] even after watching apt complain i could understand why on earth it wouldn't work [22:01:14] couldn't [22:01:24] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969497 (10Dzahn) subscribed you to both ops mailing lists (others like wikitech-l are optional and self-service) https://lists.wikimedia.org/mailman/listi... 
[22:02:41] urandom: heh, ok, should we fix that really quick [22:04:57] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969513 (10Dzahn) please take a look at https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars and check that you can see the "Ops Maintena... [22:05:13] mutante: if you don't mind :) [22:05:28] though it's not hurting anything, other than my pride [22:06:11] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969514 (10Dzahn) @Robh could you do one more Racktables user? thanks! [22:06:38] edits the topic branch too :) [22:07:10] (03CR) 10Dzahn: [C: 032] cassandra: actually enable component/cassandra311 (not component/cassandra33) [puppet] - 10https://gerrit.wikimedia.org/r/410342 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [22:08:46] (03CR) 10Legoktm: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [22:09:23] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3969518 (10Dzahn) [22:11:28] mutante: \o/ [22:11:32] mutante: thank you! [22:13:07] urandom: welcome :) it was applied on restbase2001 [22:18:13] (03PS26) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [22:18:18] mark, paravoid ahem.. boldly moved this wiki and edited name: https://office.wikimedia.org/wiki/Tech_program_proposals/Program_Zero:_Reliability,_Performance_and_Maintenance [22:18:23] fel free to disagree [22:18:26] *feel [22:21:44] mutante: hi, is the purgeOldLogIPData.php script still running? 
:) [22:23:09] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (10Dzahn) [22:23:11] 10Operations, 10LDAP-Access-Requests: ldap/ops membership for vgutierrez - https://phabricator.wikimedia.org/T187055#3969540 (10Dzahn) 05Open>03Resolved a:03Dzahn was already done in T187035#3966316 [22:29:56] jouncebot: now [22:29:56] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [22:30:02] jouncebot: next [22:30:03] In 1 hour(s) and 29 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T0000) [22:30:55] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: icinga ACK shows as CRIT when delivered via SMS - https://phabricator.wikimedia.org/T185862#3969551 (10Dzahn) before (there is "Subject: " in each message and the same content is repeated twice. the actual status as in "PROBLEM" or "ACK" is not show... [22:33:17] Hauskatze: checking now [22:33:53] yea, looks like it's working on enwiki [22:34:13] 312200 [22:34:22] it's counting up [22:34:30] 312400 [22:34:36] hmm [22:34:51] until 18 M.... [22:34:54] and yea, that's from our log we wanted [22:35:02] heh, that will take a while [22:35:09] seems to match what j-ynus said on ticket, right [22:35:23] it won't end by the time the script restarts itself [22:35:33] at 01:15 utc [22:35:37] that probably explains the whole fail [22:36:03] maybe in part, the script failed because $this->requireExtension ('AbuseFilter') was wrong [22:36:03] maybe just run once a week? [22:36:46] with that data not being immediately available to users, I'd suggest optimizing the script, and killing the cron until that happens [22:37:02] then run the script in a screen on terbium, only on enwiki [22:37:11] and after that, re-enable the cron [22:37:18] not sure if that's a good idea or not [22:37:37] or keep running the script excluding enwiki? 
[22:39:04] or we can just change the cron command [22:39:07] to run only on en [22:39:19] always prefer the things that dont include manual commands [22:39:49] or not run it every day [22:40:23] or .. it can be 2 commands. one for "en" and one for all others and they run at different times [22:42:48] enwiki is certainly the one that could give us more 'problems' as it's the one with the more entries to clear [22:43:20] not sure how to proceed [22:43:41] once the feature is enabled, the data should be regularly cleaned, as is rc_ip on CheckUser [22:45:24] (03PS4) 10Paladox: racktables: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409478 [22:45:26] (03PS1) 10Dzahn: mediawiki: reduce frequency of purge_abusefilter to weekly [puppet] - 10https://gerrit.wikimedia.org/r/410349 (https://phabricator.wikimedia.org/T187078) [22:47:06] (03CR) 10MarcoAurelio: [C: 031] "Looks like the most sensible solution for now." [puppet] - 10https://gerrit.wikimedia.org/r/410349 (https://phabricator.wikimedia.org/T187078) (owner: 10Dzahn) [22:47:55] ok :) thx [22:48:09] at least we can let it finish once.. right [22:48:23] (03CR) 10MarcoAurelio: [C: 031] "It won't kill the already-running script, right?" [puppet] - 10https://gerrit.wikimedia.org/r/410349 (https://phabricator.wikimedia.org/T187078) (owner: 10Dzahn) [22:49:00] (03CR) 10Dzahn: [C: 032] "no, it should not affect that. only the next start time" [puppet] - 10https://gerrit.wikimedia.org/r/410349 (https://phabricator.wikimedia.org/T187078) (owner: 10Dzahn) [22:50:54] heh - just received a message with "X-custom-header: INDONESIA DARKNET" [22:56:12] (03PS1) 10Ladsgroup: Enable xkill on top wikis that use x aspect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410353 (https://phabricator.wikimedia.org/T187265) [22:56:33] Krinkle: Could you have a look at https://gerrit.wikimedia.org/r/c/410214/? [23:00:03] no_justification: link to cookie path change? 
[23:00:14] My main GerritAccount cookie is also set with /r as path, and seems to work fine? [23:01:10] Yeah, it does for some people [23:01:11] Not everyone [23:01:24] The default setting includes the path to the install (so /r) [23:01:48] https://gerrit.wikimedia.org/r/c/409216/ was my change [23:02:17] Idea behind 410214 is to force the /r cookie to expire [23:04:54] (03PS6) 10Dzahn: icinga: add notification type to SMS content and other improvements [puppet] - 10https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) [23:07:11] (03CR) 10Dzahn: "here are screenshots that show the difference https://phabricator.wikimedia.org/T185862#3969551" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) (owner: 10Dzahn) [23:13:01] Amir1: What is 'x' and what is 'xkill'? [23:13:44] x is the entity usage aspect that says 'track every change to the given item/property' [23:17:43] in theory, client means I'm using all data of that entity, but in reality it's just bad lua that wants to get e.g. English Label but loads all of the item [23:17:43] Amir1: Ah, right. So if a wiki page uses entity 'x' of a wikidata item, it's like catch-all? [23:17:43] yup [23:17:43] and xkill? [23:17:43] causing this job queue mess because if someone changed zh alias in P373 it would trigger 1M updates in ruwiki [23:17:43] Yeah. I also noticed on wikispecies recent changes (first time I ever looked at that) - many change propagation updates from wikidata for properties that didn't affect those pages. [23:17:43] xkill makes a "ghost" table in lua when people access all of an item and actually tracks what is being used [23:17:43] But I didn't see any obvious cause for it. [23:17:43] I think the only thing it used were interwiki links. 
[23:17:43] Krinkle: that's probably bad lua and xkill should fix that (once enabled there, which is not yet) [23:17:44] Amir1: Is that project or feature tracked or mentioned somewhere in Phab or mw.org that I can refer to? [23:17:44] yeah, let me get it for you [23:17:44] Krinkle: Tracking ticket: https://phabricator.wikimedia.org/T172914 [23:17:44] my ticket should be a subtask of the subtask [23:17:48] Ah, it isn't right now [23:18:00] No parent task on https://phabricator.wikimedia.org/T187265 [23:18:18] my bad, fixed now [23:18:43] Thx! [23:19:03] Krinkle: overall you can see a drop in x usages; every place it got enabled saw a drop in the job queue (it's 1.6M now \o/) [23:19:06] some graphs [23:19:22] Yeah! [23:19:25] 10Operations, 10MediaWiki-JobQueue, 10Wikidata, 10Performance-Team (Radar), and 3 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3969716 (10Krinkle) >>! In T173710#3966996, @Ladsgroup wrote: > The jobqueue size [..] will go down more once we enable "lua fine grained us... [23:19:26] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-14d&to=now [23:19:37] (guess which day we enabled xkill on ruwiki) [23:20:12] https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage?refresh=5m&orgId=1 [23:20:39] (03CR) 10Krinkle: [C: 031] "Works. Verified :)" [puppet] - 10https://gerrit.wikimedia.org/r/410214 (owner: 10Chad) [23:20:41] (usage aspects for wikis that have it enabled: [23:20:41] https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage-project?orgId=1&var-project=arwiki&var-project=cawiki&var-project=cswiki&var-project=dewiki&var-project=elwiki&var-project=eswiki&var-project=fawiki&var-project=hewiki&var-project=huwiki&var-project=iawiki&var-project=jawiki&var-project=kowiki&var-project=ptwiki&var-project=rowiki&var-project=ruwiki&var-project=trwiki&var-project=ukwiki&var-project=viwiki&var-project=wikidatawiki) [23:20:58] Amir1: Feb 2/Feb3 I suppose? 
[23:21:23] (03PS3) 10Krinkle: Gerrit: Force expire the old /r login cookie [puppet] - 10https://gerrit.wikimedia.org/r/410214 (https://phabricator.wikimedia.org/T187269) (owner: 10Chad) [23:21:46] Krinkle: Very close, 30 of January: https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage-project?orgId=1&var-project=ruwiki [23:21:56] look when all usage started to drop [23:22:38] it takes some time to refresh the usages and so being really useful and affect job queue /rc table [23:24:09] one big problem was that if we had really bad lua (which we had) that iterates over all languages in description or all statements. It could blow up the database (like what happened with cawiki) but that got handled now by this https://phabricator.wikimedia.org/T185693 [23:26:09] it means if number of languages that are being tracked is bigger than a threshold (=10) it will smash them into one big general "all labels" aspect which can increase the rc table/job queue problem but I looked at the data and I think I found the proper balance. 
This is very important for statements (the current number is 73, even with such a large number it cuts the wbc_entity_usage table to less than half in cawiki) [23:26:09] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 21.04, 22.57, 23.87 [23:26:21] sorry, got so excited [23:34:30] (03PS1) 10Ladsgroup: Add uploader user group and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) [23:36:33] (03CR) 10jerkins-bot: [V: 04-1] Add uploader user group and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [23:39:17] (03PS2) 10Ladsgroup: Add uploader user group to mznwiki and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) [23:41:17] (03CR) 10Samwilson: [C: 031] "Looks correct to me (but I'm not totally familiar with how this all works)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410267 (https://phabricator.wikimedia.org/T184668) (owner: 10MaxSem)
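The collapsing rule described at 23:26:09 can be sketched roughly as follows. This is a minimal Python illustration, not the actual Wikibase code: the function name `collapse_usages`, the `(aspect, modifier)` tuples, and the `limits` mapping are all hypothetical. Only the thresholds (10 for languages, 73 for statements) come from the discussion above.

```python
from collections import defaultdict


def collapse_usages(usages, limits):
    """Collapse fine-grained usages into a general aspect past a threshold.

    usages: iterable of (aspect, modifier) pairs, e.g. ("L", "en").
            A modifier of None means the general aspect is already tracked.
    limits: hypothetical mapping of aspect -> maximum number of distinct
            modifiers that may be tracked individually.
    Returns a sorted list of keys such as "L.en", or a bare "L" when the
    aspect was collapsed into its general form.
    """
    by_aspect = defaultdict(set)
    for aspect, modifier in usages:
        by_aspect[aspect].add(modifier)

    result = []
    for aspect, modifiers in by_aspect.items():
        limit = limits.get(aspect)
        over_limit = limit is not None and len(modifiers) > limit
        if None in modifiers or over_limit:
            # Too many modifiers (or already general): track the whole aspect.
            result.append(aspect)
        else:
            result.extend(f"{aspect}.{m}" for m in modifiers)
    return sorted(result)


# Thresholds taken from the chat; the aspect letters are illustrative.
LIMITS = {"L": 10, "C": 73}
```

With these limits, a page that reads labels in eleven languages would be tracked under the single general label aspect, while a page that reads ten or fewer keeps one entry per language. That is the trade-off mentioned in the chat: fewer usage rows, at the cost of more change notifications for the collapsed pages.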