[00:02:16] James_F: live on mw1099 [00:02:18] Dereckson: here's the patch on the CN wmf_deploy branch... https://gerrit.wikimedia.org/r/#/c/304420 [00:02:34] Dereckson: Thanks. [00:02:51] AndyRussG: ack [00:03:16] MatmaRex: no, I figured out how: sync-dir php-1.28.0-wmf.14/ [00:03:35] ostriches, are you sure it worked? [00:03:44] Dereckson: LGTM. [00:03:46] ack [00:04:36] Krenair: I'm sure the change went through and gerrit restarted. [00:04:41] beyond that... [00:05:05] We're ready to sync. [00:07:53] AndyRussG: please add the change to https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0August.C2.A011 [00:08:09] Dereckson: the submodule update has merged automagically into into core BTW, sha 8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6 [00:08:13] Dereckson: u bet! [00:12:20] AndyRussG: live on mw1099 [00:12:31] Dereckson: wooo one sec :) [00:13:29] Dereckson: please ping me when my patch is deployed, i'm afk for a bit. [00:13:32] RoanKattouw: I scap sync-dir 7 minutes ago, still no output, is that expected? [00:13:35] ostriches, it doesn't appear to have solved the problem with nfs :/ [00:13:46] I've the nice scap logo [00:13:54] then I guess it's busy to copy [00:13:58] No output AT ALL? [00:14:02] That's not normal AFAIK [00:14:04] only the logo [00:14:46] ah [00:14:48] 00:14:42 Started sync-masters [00:14:54] Krenair: ugh file a bug. I'll look later [00:14:59] ok [00:15:02] 00:14:53 Finished sync-masters (duration: 00m 10s) [00:15:08] this is dumb... [00:15:45] !log dereckson@tin Synchronized php-1.28.0-wmf.14/: VE: Fix TextState#getChangeTransaction bug (T141573) ; Echo: Revert "Hack around browser bug in IE breaking badge alignment in Monobook" ([[gerrit:304415]]) ; Core: Revert CSS fix ([[gerrit:304412]], T142750) (duration: 08m 58s) [00:15:48] T142750: Style for unpatrolled symbol ! 
is missing on watchlist in 1.28.0-wmf.14 - https://phabricator.wikimedia.org/T142750 [00:15:48] T141573: Typing a duplicate character after a link results in the wrong change being written to the model - https://phabricator.wikimedia.org/T141573 [00:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:20] in fatalmonitor: 1 Undefined variable: wmfMasterDatacenter in /srv/mediawiki/wmf-config/db-eqiad.php on line 235 [00:16:41] Dereckson: looks good! Don't see anything bad in logstash, either :) [00:16:44] (not related with changes from this SWAT) [00:16:55] MatmaRex: live [00:17:28] AndyRussG: I quickly look for wmfMasterDatacenter and I sync yours [00:17:51] Have we lost datacenter? [00:17:53] jk---thx!!! [00:18:08] no rush, eh? :) [00:19:11] ok, created https://phabricator.wikimedia.org/T142787 [00:20:59] !log dereckson@tin Synchronized php-1.28.0-wmf.14/extensions/CentralNotice: CentralNotice deployment [[gerrit:304420]] (duration: 00m 49s) [00:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:15] yurik: I'm done [00:21:23] Dereckson, thx! [00:21:36] Dereckson: mhm [00:22:00] MatmaRex: yes? [00:22:10] Dereckson: okay, looks good, thaks :) [00:22:19] You're welcome. [00:24:29] AndyRussG: https://gerrit.wikimedia.org/r/#/q/8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6 [00:24:38] where exactly is this commit? [00:24:48] (00:08:09 < AndyRussG> Dereckson: the submodule update has merged automagically into into core BTW, sha 8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6) [00:26:57] Dereckson: yeah it seems now I just +2 a patch on the wmf_deploy branch of CentralNotice, and the submodule update lands in core... bypassing gerrit [00:27:39] AndyRussG: you're welcome to submit a change for core wmf/1.28.0-wmf.14 too, so we've a branch more updated for submodules [00:27:43] hmm... scap3 seems to have failed a rollback... 
weird [00:27:46] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.128, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [00:28:03] https://github.com/wikimedia/mediawiki/commit/8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6 [00:28:26] ^ on it, not production [00:29:50] Reedy: AndyRussG: ok, I'm cherry-picking that [00:29:55] Dereckson: it already went into that branch! [00:29:58] No need :) [00:29:59] Yeah [00:30:05] It tracks the branch automagically [00:30:09] Mmmm wasn't me, rather a ghost in the machine [00:30:15] It didn't used to be so pro-active [00:30:19] it's done to make life easier [00:30:26] I guess the upgrade helped you ;) [00:30:28] git pull in core [00:30:35] update the CN submodule [00:30:41] AndyRussG: there is a tiny need: to make wmf/1.28.0-wmf.14 branch more conform with code actually deployed on the servers [00:31:20] hmmm so what should I do? [00:31:29] nothing, I'm cherry-picking it [00:31:35] Cherry picking what? [00:31:50] Dereckson: hmmm you can't just pull on that branch? [00:31:58] Reedy: 8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6 in php-1.28.0-wmf.14 to avoid modified: extensions/CentralNotice (new commits) [00:32:00] Sorry if my question is silly [00:32:10] Dereckson: No [00:32:14] git pull [00:32:24] it'll bring it in, and rebase security patches ontop [00:32:59] git submodule update extensions/CentralNotice [00:33:56] Reedy: currently, on the cluster, CentralNotice is already at 575f4ae4f7319fb4415b029e45c6fbc4df5d5b05 [00:34:41] Good? [00:34:52] Cherry picking something ontop of security patches isn't gonna help [00:34:53] but if someone query submodules on php-1.28.0-wmf.14, it will get 86bcb9374a24337b6a9a3d9c77f205b9480695de, won't it? 
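The question above (does the branch record the same CentralNotice commit that is actually checked out on the cluster?) can be checked mechanically. A minimal sketch, assuming you already have one line of `git ls-tree <branch> extensions/CentralNotice` output and the SHA checked out in the submodule; the helper names `parse_gitlink` and `submodule_in_sync` are hypothetical and not part of scap or any WMF tooling:

```python
from typing import Optional


def parse_gitlink(ls_tree_line: str) -> Optional[str]:
    """Return the recorded SHA if the `git ls-tree` line is a submodule entry.

    `git ls-tree` prints "<mode> <type> <object>\t<path>"; a submodule
    (gitlink) entry has mode 160000 and type "commit".
    """
    meta, _, _path = ls_tree_line.partition("\t")
    parts = meta.split()
    if len(parts) == 3 and parts[0] == "160000" and parts[1] == "commit":
        return parts[2]
    return None


def submodule_in_sync(ls_tree_line: str, checked_out_sha: str) -> bool:
    """True when the branch records exactly the SHA that is checked out."""
    recorded = parse_gitlink(ls_tree_line)
    return recorded is not None and recorded == checked_out_sha
```

If the function returns False, the branch's recorded pointer and the deployed checkout have diverged, which is the situation being asked about.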
[00:35:12] Look at https://github.com/wikimedia/mediawiki/commit/8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6 [00:35:15] oh I meant to submit to gerrit, not for deploy (of course I'll rebase) [00:35:21] -Subproject commit 86bcb9374a24337b6a9a3d9c77f205b9480695de [00:35:26] +Subproject commit 575f4ae4f7319fb4415b029e45c6fbc4df5d5b05 [00:35:31] Nothing needs submitting to gerrit [00:35:38] yes but tthat's core, not wmf/1.28.0-wmf.14 [00:35:54] Yes it is [00:35:59] CN doesn't have a .14 branch [00:36:03] oh [00:36:03] it has a special wmf_deploy branch [00:36:29] Reedy: I've noticed only now 8f09b9f2 was in php-1.28.0-wmf.14, not master [00:37:04] * AndyRussG apologizes for unruly CN [00:37:38] [dereckson@tin php-1.28.0-wmf.14 (wmf/1.28.0-wmf.14 *<>)]$ git log HEAD..origin/wmf/1.28.0-wmf.14 [00:37:41] commit 8f09b9f2a7e580c49e87b5203cb673b93a7f7ea6 [00:37:44] indeed :) [00:37:55] RECOVERY - kartotherian endpoints health on maps-test2001 is OK: All endpoints are healthy [00:38:10] So all is fine for our evening SWAT. [00:39:30] Dereckson: thx much [00:39:41] Reedy: thx also much :) [00:40:43] You're welcome. [00:41:56] yurik: deployment stalking you, looks like kartotherian's ports never started accepting tcp connections [00:42:24] thcipriani|afk, how do you mean? it seems to work fine [00:42:58] eh, I was just watching scap deploy-log -v in /srv/deployment/kartotherian/deploy [00:43:19] saw a bunch of 00:40:10 [maps1001.eqiad.wmnet] Port 6533 not up. Waiting 3.00s [00:44:03] thcipriani|afk, seems like something is weird going on: i just scaped again, and the 3 new servers didn't work ok (which is not a big deal, they are not in prod yet) [00:44:09] maps100* [00:44:20] but because of them, scap3 failed full deploy (not canary) [00:44:54] and offered to rollback, i agreed, but i suspect that it is still running the new version [00:45:03] (those that succeeded the deployment) [00:46:10] should be easy to tell on the targets. 
The symlink to /srv/deployment/kartotherian/deploy should point to /srv/deployment/kartotherian/deploy-cache/revs/[sha1 of deployed commit] [00:46:55] thcipriani, i am looking at the targets - all of them have "git log -1" as 10ffa99 [00:47:06] which is the latest, same as on tin [00:47:19] i was looking at /srv/deployment/kartotherian/deploy [00:47:50] in other words, when it rolls back, it goes boom :) [00:48:23] but yeah, looking at the logs maps1001, maps1003, maps1004 failed to accept tcp connections on port 6533 so that's why the deployment failed. [00:48:43] logs say things like: [maps1003.eqiad.wmnet] No rollback necessary. Skipping [00:48:57] I'll file a task for that. [00:49:03] thcipriani, already did [00:49:09] thank you. [00:49:23] thcipriani, https://phabricator.wikimedia.org/T142792 [00:49:43] could it be because it was my first deploy for those servers? [00:51:32] it's possible. The skip rollback message should only happen when there is no deploy in progress. (keeps a file around called .in-progress that should only be removed at the end of deployment) [00:51:48] !log deployed kartotherian & tilerator. maps100[134].eqiad are still down (non production) [00:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:52:26] yurik: I wonder if it is because it called service restart on those machines. Has the service ever been started? [00:52:30] thcipriani, but if i say "y" to rollback, shouldn't it also rollback those servers that deployed ok? [00:52:48] yes it should. [00:52:50] thcipriani, no, there was no kartotherian service on those machines [00:52:58] thcipriani, well, yes, but it didn't :) [00:53:22] all of production is now on the new version (which is ok), even though i told it to rollback [00:53:22] indeed. [00:54:01] yeah, I'll dig into this. Unclear why it didn't behave as expected in this instance. [00:54:11] thx :) [00:54:21] and thanks for keeping an eye on it!! 
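The two checks described above (which revision the deploy symlink points at, and whether the service port accepts TCP connections) can be sketched roughly as follows. This is an illustration only, not scap3's actual implementation; `deployed_rev`, `port_is_up` and `wait_for_port` are hypothetical names, and the retry count and delay are assumptions modelled on the "Port 6533 not up. Waiting 3.00s" log line:

```python
import os
import socket
import time


def deployed_rev(symlink_path: str) -> str:
    """Return the last path component the deploy symlink resolves to.

    Assumes the layout mentioned in the log: the symlink points at
    .../deploy-cache/revs/<sha1 of deployed commit>.
    """
    return os.path.basename(os.path.realpath(symlink_path))


def port_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=5, delay=3.0, sleep=time.sleep):
    """Poll the port up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if port_is_up(host, port):
            return True
        if attempt < attempts:
            print(f"[{host}] Port {port} not up. Waiting {delay:.2f}s")
            sleep(delay)
    return False
```

On a target one might call `deployed_rev("/srv/deployment/kartotherian/deploy")` and `wait_for_port("localhost", 6533)`; a host where the service was never started would be expected to fail the second check.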
[00:54:56] releng folks: always stalking deployments :) [00:56:03] 06Operations, 06Discovery, 06Maps, 10Maps-data, and 2 others: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2546530 (10Yurik) I deployed the new services version, including the `maps100[134].eqiad`, but couldn't start it because i had no account for cassandra, so couldn't... [00:58:50] urllib.error.HTTPError: HTTP Error 403: Bad Behavior [00:58:51] oops :/ [01:09:34] ohhh [01:09:39] I was hitting uk.wikimedia.org [01:09:46] But it redirects to https://wikimedia.org.uk [01:09:53] which was what was returning 403, I think [01:33:27] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [01:39:59] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Enable access to Wikipedia Tulu (tcywiki) on labs replicas - https://phabricator.wikimedia.org/T142223#2546632 (10AlexMonk-WMF) Note that this can be done by any member of the 'ops' group in puppet, it does not need to wait for my maintain-replicas rewrit... [01:40:49] where is grrrit-wm [01:40:56] greg-g, per RoanKattouw doing a hotfix to disable self-mentions, since it's causing problems with templates. [01:41:53] * [grrrit-wm] idle 01:59:38, signon: Thu Aug 11 22:47:37 [01:42:00] * Krenair kicks grrrit-wm [01:46:14] SOY CHARITWO Y TUVE SEXO CON MARCO AURELIO 💝💝💝 [01:46:17] LALALALALALALA [01:46:21] HAHAHAHAHHAHAHA [01:46:33] This kid needs a new hobby [01:58:57] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [02:03:43] (03CR) 10Alex Monk: "I ran this under my own user too. Check out u2170__meta_p on labsdb1003" [software] - 10https://gerrit.wikimedia.org/r/304425 (owner: 10Alex Monk) [02:04:54] !log mattflaschen@tin Synchronized php-1.28.0-wmf.14/extensions/Echo: Revert self-mentions pending further investigation and discussion, due to accidental self-mentions. 
(duration: 01m 04s) [02:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:07:27] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures [02:11:42] Disable of self-mentions confirmed in production. [02:19:42] matt_flaschen, hey, did you see https://phabricator.wikimedia.org/T142790 ? [02:22:18] (03CR) 10Ladsgroup: [C: 031] "I don't know the background here but looking at style and stuff, except some nitpicking (mostly long lines here and there) it looks good." [software] - 10https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: 10Alex Monk) [02:26:33] Krenair, didn't see that duplicate specifically, but I disabled it just now. [02:26:51] yeah that's why I mentioned it :) [02:35:07] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:39:43] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.14) (duration: 18m 02s) [02:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:00] Someone removed the flags from wmfgc on this channel. I'll have them added back, but hopefully they'll check to see why it has those rights next time. 
[02:45:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Aug 12 02:45:40 UTC 2016 (duration 5m 57s) [02:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:54:16] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - Received genError(5) error-status at error-index 1 [02:56:17] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [03:17:17] (PS1) Yuvipanda: labs: Add labvirtkvm collector [puppet] - https://gerrit.wikimedia.org/r/304429 [03:18:21] (PS2) Yuvipanda: labs: Add labvirtkvm collector [puppet] - https://gerrit.wikimedia.org/r/304429 [03:19:34] (CR) Yuvipanda: [C: 2 V: 2] labs: Add labvirtkvm collector [puppet] - https://gerrit.wikimedia.org/r/304429 (owner: Yuvipanda) [03:36:56] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [03:50:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [03:56:35] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [03:59:05] do we allow UDP on labs instances? Is there way to use mosh? 
https://mosh.org/ [04:00:37] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [04:32:26] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [04:33:17] (PS1) Yuvipanda: lab: Sort libvirtkvm metrics by uuid [puppet] - https://gerrit.wikimedia.org/r/304431 [04:33:35] (PS2) Yuvipanda: lab: Sort libvirtkvm metrics by uuid [puppet] - https://gerrit.wikimedia.org/r/304431 [04:33:44] (CR) Yuvipanda: [C: 2 V: 2] lab: Sort libvirtkvm metrics by uuid [puppet] - https://gerrit.wikimedia.org/r/304431 (owner: Yuvipanda) [04:36:43] uh oh, that icinga alert was me. [04:37:54] sabya_ works on tools, but not elsewhere, because mosh doesn't have agent forwarding nor proxycommand support [04:38:25] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [04:38:30] ^ is me [04:38:34] I'm monitoring [04:48:49] (PS1) Yuvipanda: labs: Don't collect network stats by default [puppet] - https://gerrit.wikimedia.org/r/304434 [04:48:50] (PS1) Yuvipanda: labs: Disable libvirtkvm collector for now [puppet] - https://gerrit.wikimedia.org/r/304435 [04:49:42] (CR) jenkins-bot: [V: -1] labs: Don't collect network stats by default [puppet] - https://gerrit.wikimedia.org/r/304434 (owner: Yuvipanda) [04:49:45] (PS2) Yuvipanda: labs: Disable libvirtkvm collector for now [puppet] - https://gerrit.wikimedia.org/r/304435 [04:49:55] Operations, Performance-Team, Services, Availability, Wikimedia-Multiple-active-datacenters: Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#2546805 (aaron) [04:50:14] Operations, Performance-Team, Availability, Wikimedia-Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - 
https://phabricator.wikimedia.org/T134809#2546806 (aaron) [04:51:37] (CR) Yuvipanda: [C: 2 V: 2] labs: Disable libvirtkvm collector for now [puppet] - https://gerrit.wikimedia.org/r/304435 (owner: Yuvipanda) [04:52:08] I've disabled the collector for now, I'll try again tomorrow [04:54:06] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [04:54:40] ttyl [04:54:58] Operations, Traffic, Availability, Wikimedia-Multiple-active-datacenters: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#2546816 (aaron) [05:12:23] Operations, Patch-For-Review, Performance, Wikimedia-Multiple-active-datacenters, codfw-rollout: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#2546819 (aaron) [05:21:29] Operations, media-storage, Wikimedia-Multiple-active-datacenters: Look into enabling HTTPS for Swift traffic - https://phabricator.wikimedia.org/T127455#2546841 (aaron) [05:22:02] Operations, media-storage, Wikimedia-Multiple-active-datacenters: Look into enabling HTTPS for Swift traffic - https://phabricator.wikimedia.org/T127455#2044209 (aaron) [06:09:17] (PS4) Giuseppe Lavagetto: postgresql: support SSL connections/replication [puppet] - https://gerrit.wikimedia.org/r/303800 [06:10:28] (CR) jenkins-bot: [V: -1] postgresql: support SSL connections/replication [puppet] - https://gerrit.wikimedia.org/r/303800 (owner: Giuseppe Lavagetto) [06:26:25] <_joe_> wut? 
[06:27:11] <_joe_> yuvipanda: do not v+2 patches that you didn't flake8 before [06:32:15] (PS1) Giuseppe Lavagetto: diamond: flake8 fixes for libvirtkvm [puppet] - https://gerrit.wikimedia.org/r/304443 [06:32:43] <_joe_> I personally don't agree with autobanning trolls [06:34:28] (CR) Giuseppe Lavagetto: [C: 2] diamond: flake8 fixes for libvirtkvm [puppet] - https://gerrit.wikimedia.org/r/304443 (owner: Giuseppe Lavagetto) [07:14:06] Operations, DBA, Performance-Team, Traffic, and 2 others: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2546887 (jcrespo) [07:20:39] (Abandoned) Jcrespo: Remove db role [puppet] - https://gerrit.wikimedia.org/r/302429 (owner: Jcrespo) [07:22:15] (PS5) Giuseppe Lavagetto: postgresql: support SSL connections/replication [puppet] - https://gerrit.wikimedia.org/r/303800 [07:23:35] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [07:32:07] (CR) Giuseppe Lavagetto: [C: 2] postgresql: support SSL connections/replication [puppet] - https://gerrit.wikimedia.org/r/303800 (owner: Giuseppe Lavagetto) [07:37:36] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [07:40:42] (PS11) Giuseppe Lavagetto: puppetmaster: add role for puppetdb [puppet] - https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) [07:44:13] (PS2) Alexandros Kosiaris: Bump heap limits for Parsoid from 600 mb -> 800 mb [puppet] - https://gerrit.wikimedia.org/r/304253 (owner: Subramanya Sastry) [07:44:17] (CR) Alexandros Kosiaris: [C: 2 V: 2] Bump heap limits for Parsoid from 600 mb -> 800 mb [puppet] - https://gerrit.wikimedia.org/r/304253 (owner: 
Subramanya Sastry) [07:45:20] <_joe_> Platonides: I think you're causing more annoyance than good with this autokick, unless there is some serious reason for it, IMHO [07:48:36] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1047 is CRITICAL: Connection refused [07:52:13] <_joe_> ema: ^^ that you? [07:52:22] _joe_: nope [07:52:28] I'll take a look [07:52:32] <_joe_> thanks [07:56:05] Operations, Ops-Access-Requests: Access for platonides to chanops - https://phabricator.wikimedia.org/T142668#2542616 (Joe) >>! In T142668#2544901, @Dzahn wrote: > What's the problem here really? Everybody was cool with adding platonides and he got added within 24 hours or something and it's resolved.... [07:56:25] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp1047 is OK: HTTP OK: HTTP/1.1 200 OK - 176 bytes in 0.005 second response time [08:00:13] (CR) Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/3694/" [puppet] - https://gerrit.wikimedia.org/r/303181 (owner: Muehlenhoff) [08:00:26] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster: add role for puppetdb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) (owner: Giuseppe Lavagetto) [08:00:39] (PS12) Giuseppe Lavagetto: puppetmaster: add role for puppetdb [puppet] - https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) [08:01:52] Operations, Traffic: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547002 (ema) [08:02:03] _joe_: ^ [08:02:32] ema: thanks! I was trying to see if there was an issue with maps, but could not find it... [08:02:48] well, there are a lot of issues with maps :P but not this one... [08:02:48] (CR) Alexandros Kosiaris: "yeah, seems like strontium's gone forever. 
may it RIP" [dns] - https://gerrit.wikimedia.org/r/302757 (owner: Dzahn) [08:02:52] gehel: :) [08:03:03] (Abandoned) Alexandros Kosiaris: strontium: add IPv6 AAAA and reverse record [dns] - https://gerrit.wikimedia.org/r/302757 (owner: Dzahn) [08:03:08] gehel: yeah that's varnishd crashing in a interesting way [08:03:27] Operations, Traffic: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547014 (ema) p:Triage>Normal [08:04:24] gehel: and systemd thinking it's running just fine I guess? [08:04:49] interesting ... [08:05:14] <_joe_> ema: uhm, systemd thinking it's running is strange [08:11:58] Operations, Traffic: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547042 (ema) [08:13:07] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: puppet fail [08:14:43] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2547045 (MoritzMuehlenhoff) >>! In T141634#2540707, @DarTar wrote: > Sounds good to me, let me know if you need anythin... [08:15:23] Operations, Traffic: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547046 (ema) Full panic log on cp1047:~ema/varnishd-backend-crash.log. 
[08:25:29] !log install postgres security updates on maps clusters [08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:29:57] (PS1) Alexandros Kosiaris: postgres: Fix specs, add jessie spec [puppet] - https://gerrit.wikimedia.org/r/304447 [08:33:17] (CR) Alexandros Kosiaris: [C: 2] postgres: Fix specs, add jessie spec [puppet] - https://gerrit.wikimedia.org/r/304447 (owner: Alexandros Kosiaris) [08:36:06] !log roll restart parsoid to apply https://phabricator.wikimedia.org/rOPUPb32dd25950f1499a79e74fc811c050291b9ec6b8 [08:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:44:18] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 9 failures [08:45:28] Operations, Architecture, DBA: Architecture decision to solve the need larger serves (for better capacity and consolidation) vs. more, smaller servers (for high availability) - https://phabricator.wikimedia.org/T124681#2547076 (jcrespo) Open>Resolved a:jcrespo I think naturally this has l... [08:47:05] (PS1) Alexandros Kosiaris: postgres: Add the LSB pgversion selector [puppet] - https://gerrit.wikimedia.org/r/304448 [08:50:56] (CR) Giuseppe Lavagetto: [C: 1] "LGTM but we really should have this in a single place and not spread all over the postgresql classes." [puppet] - https://gerrit.wikimedia.org/r/304448 (owner: Alexandros Kosiaris) [08:52:55] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[08:53:10] (CR) DCausse: [C: 1] Enable Language ID for Russian, Japanese, Portuguese Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/304328 (https://phabricator.wikimedia.org/T142413) (owner: Tjones) [08:53:25] (PS1) Giuseppe Lavagetto: puppetmaster::puppetdb: minor fixes [puppet] - https://gerrit.wikimedia.org/r/304449 [08:54:28] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster::puppetdb: minor fixes [puppet] - https://gerrit.wikimedia.org/r/304449 (owner: Giuseppe Lavagetto) [08:56:09] (PS2) Giuseppe Lavagetto: postgres: Add the LSB pgversion selector [puppet] - https://gerrit.wikimedia.org/r/304448 (owner: Alexandros Kosiaris) [08:56:26] (CR) Giuseppe Lavagetto: [C: 2 V: 2] postgres: Add the LSB pgversion selector [puppet] - https://gerrit.wikimedia.org/r/304448 (owner: Alexandros Kosiaris) [08:57:17] <_joe_> akosiaris: I'm merging all those changes, FYI [08:58:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [09:05:26] (PS1) Giuseppe Lavagetto: puppetmaster::puppetdb: add dhparam, fix typo in postgresql ssl.conf [puppet] - https://gerrit.wikimedia.org/r/304451 [09:07:56] Operations: mw2086 & mw2087 do not respond to IPMI commands - https://phabricator.wikimedia.org/T142726#2544453 (MoritzMuehlenhoff) It seems to be enabled for least a range of servers, though? Luca has been reimaging several other mw* systems successfully with the wmf-reimage (i.e. it worked there). [09:11:50] Operations: mw2086 & mw2087 do not respond to IPMI commands - https://phabricator.wikimedia.org/T142726#2544453 (Joe) Most of the mw* systems have IPMI enabled and always had; I was unaware of this "security flaw" in IPMI and I honestly didn't see any discussions since I joined 2 years ago about it. We shou... 
[09:12:54] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster::puppetdb: add dhparam, fix typo in postgresql ssl.conf [puppet] - https://gerrit.wikimedia.org/r/304451 (owner: Giuseppe Lavagetto) [09:26:38] Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2547161 (MoritzMuehlenhoff) [09:29:21] Operations: Optional expiry date for user accounts - https://phabricator.wikimedia.org/T142816#2547165 (MoritzMuehlenhoff) [09:32:00] !log restarting Druid java daemons on druid100[123] for openjdk upgrades [09:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:36:52] Operations: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2547185 (MoritzMuehlenhoff) [09:38:06] !log dropping aft tables from db1052 T59185 [09:38:08] T59185: Archive and drop all article feedback related tables from all wikis - https://phabricator.wikimedia.org/T59185 [09:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:09] Operations, LDAP: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2547205 (Krenair) [09:46:40] (PS1) Giuseppe Lavagetto: postgresql: fix ssl.conf syntax [puppet] - https://gerrit.wikimedia.org/r/304455 [09:46:42] (PS1) Giuseppe Lavagetto: postgresql::server: fix service name on jessie. 
[puppet] - 10https://gerrit.wikimedia.org/r/304456 [09:47:01] <_joe_> akosiaris: I'd like a review of the second one at least [09:51:29] (03PS1) 10Giuseppe Lavagetto: puppetmaster::puppetdb: fix users definitions [puppet] - 10https://gerrit.wikimedia.org/r/304457 [09:51:49] (03CR) 10Giuseppe Lavagetto: [C: 032] postgresql: fix ssl.conf syntax [puppet] - 10https://gerrit.wikimedia.org/r/304455 (owner: 10Giuseppe Lavagetto) [09:52:42] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::puppetdb: fix users definitions [puppet] - 10https://gerrit.wikimedia.org/r/304457 (owner: 10Giuseppe Lavagetto) [09:53:12] 06Operations: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#2547240 (10MoritzMuehlenhoff) [09:54:55] 06Operations, 10Traffic: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547256 (10ema) [09:56:16] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Puppet has 4 failures [10:00:37] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:06:42] !log dropping aft tables from all enwiki hosts after archiving its contents T59185 [10:06:44] T59185: Archive and drop all article feedback related tables from all wikis - https://phabricator.wikimedia.org/T59185 [10:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:48] Krenair or anyone involved with wmf-config, what do you think of 302223 ? [10:09:21] is that a bug # or gerrit change? 
[10:09:28] it'll be gerrit [10:09:32] number is too high for phab [10:09:38] sorry [10:09:39] https://gerrit.wikimedia.org/r/302223 [10:10:16] we should order that in binary or in unicode collation, but one of the 2 [10:10:30] (03CR) 10Alex Monk: [C: 031] Sort s3.dblist in lexicographical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302223 (owner: 10Jcrespo) [10:10:33] it is such a tiny difference, but it is bothering me a lot [10:10:45] you should be asleep legoktm :) [10:10:57] shhhh [10:11:06] more than a review, I would like a discussion about which of the 2 :-) [10:11:42] the list should be sorted like you're making it [10:11:54] so, unicode? [10:12:59] the difference in this case is what, the underscores? [10:13:04] 06Operations, 10Traffic: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547296 (10ema) This error is happening only on maps nodes. We never noticed because usually varnishd starts a new child process if one dies. However, yeste... [10:13:09] Krenair, yes [10:13:19] which way have you sorted it in this version? [10:13:21] (03PS1) 10Giuseppe Lavagetto: puppetmaster::puppetdb: further fixes to users definitions [puppet] - 10https://gerrit.wikimedia.org/r/304461 [10:13:28] <_joe_> git /win 80 [10:13:33] the current patch is in unicode [10:13:48] what's the diff like the other way? [10:13:50] if I sort it in binary (LC=C) [10:13:58] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::puppetdb: further fixes to users definitions [puppet] - 10https://gerrit.wikimedia.org/r/304461 (owner: 10Giuseppe Lavagetto) [10:14:00] still +4, -4? 
[10:14:00] (it is explained on the comments) [10:14:08] there is only 1 missplaces [10:14:32] See commend @Aug 1 11:13 AM [10:14:33] it's probably currently in binary [10:14:35] *comment [10:14:38] based on your comment [10:14:40] unicode makes more sense to me IMO [10:15:04] so, to be clear [10:15:13] nothing on mediawiki depends on the order, right? [10:15:18] no [10:15:33] I seriously hope not [10:16:02] so I would prefer unicode (it is what most people will have as the default for grep on linux and for mysql) [10:16:43] 06Operations, 06WMF-Legal, 06WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2547298 (10Qgil) As opposed to new/recent hires, managers are supposed to either have been around for a while or have a properly verifiable credentials. And if not, then the first thi... [10:17:47] if nobody complains, I will schedule it as a small change for next week [10:18:00] yeah this really isn't a big deal [10:18:08] I would expect to find 'zh_yuewiki' above 'zha', not below 'zhw' [10:18:11] as you can tell by the placement of azbwiki we're not *that* careful about the ordering of these :) [10:18:31] it probably needs rebasing, I think there will be a new wiki added [10:18:36] but that's probably due to years of file system sorting [10:18:44] valhallasw`cloud, is that unicode or binary? [10:19:26] by code point [10:19:31] I think? [10:19:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] postgresql::server: fix service name on jessie. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/304456 (owner: 10Giuseppe Lavagetto) [10:19:51] _ comes after Z but before a. Hrm. [10:20:00] 06Operations: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#2547240 (10AlexMonk-WMF) I thought all our LDAP groups and all our production server groups were entirely separate? 
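The ordering question discussed above can be illustrated with a short sketch. Plain code point ("binary") order places `_` (U+005F) after `Z` and before `a`, so `zh_yuewiki` sorts ahead of `zhawiki`. The second function is only a rough stand-in for locale collations that treat punctuation as ignorable; it is an assumption for illustration, not the actual `en` collation used by `sort`, `grep`, or MySQL, and both function names are hypothetical:

```python
def codepoint_sorted(names):
    """Sort by Unicode code point, as `LC_ALL=C sort` would for ASCII input."""
    return sorted(names)


def punctuation_ignoring_sorted(names):
    """Approximate a collation that skips underscores when comparing.

    The original name is kept as a tie-breaker so the order is deterministic.
    """
    return sorted(names, key=lambda name: (name.replace("_", ""), name))


names = ["zhwiki", "zh_yuewiki", "zhawiki"]
print(codepoint_sorted(names))
print(punctuation_ignoring_sorted(names))
```

Whichever rule is chosen, generating the file with one fixed, explicitly stated sort key would give the reliable ordering asked for in the discussion.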
[10:20:09] Operations, LDAP: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#2547301 (AlexMonk-WMF) [10:21:23] in any case, if I create a file '_blah', I expect it to float on top in a file manager [10:21:32] valhallasw`cloud my grep uses unicode collation (LANG=en), but please comment on the ticket [10:21:55] what I want is a reliable ordering, whatever it is [10:24:28] mysql (for database names) and WMF servers use "en" language, so there is that point for making it unicode [10:32:26] Operations, LDAP: Synchronise groups defined in data.yaml to LDAP - https://phabricator.wikimedia.org/T142821#2547312 (MoritzMuehlenhoff) [10:33:00] (CR) Merlijn van Deen: "I would expect '_q' to sort before 'a', rather than between 'p' and 'r'. Because this is an ascii-lowercase file, just sorting as binary s" [mediawiki-config] - https://gerrit.wikimedia.org/r/302223 (owner: Jcrespo) [10:36:27] (CR) Jcrespo: "You understand that "sort" would sort in unicode collation too on all WMF servers, right?
It would only sort in binary if we force it by o" [mediawiki-config] - https://gerrit.wikimedia.org/r/302223 (owner: Jcrespo) [10:36:31] Operations: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2547328 (MoritzMuehlenhoff) [10:36:38] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2547341 (akosiaris) p:High>Normal [10:39:53] (PS1) Alexandros Kosiaris: puppetdb: gc-interval is in implicitly in minutes [puppet] - https://gerrit.wikimedia.org/r/304462 [10:41:50] (CR) Alexandros Kosiaris: [C: 2] puppetdb: gc-interval is in implicitly in minutes [puppet] - https://gerrit.wikimedia.org/r/304462 (owner: Alexandros Kosiaris) [10:42:11] Operations, LDAP: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#2547356 (MoritzMuehlenhoff) >>! In T142819#2547299, @AlexMonk-WMF wrote: > I thought all our LDAP groups and all our production server groups were entirely separate? Mostly, yes. cn... [10:47:35] !log dropping aft tables from all other hosts after archiving its contents T59185 [10:47:36] T59185: Archive and drop all article feedback related tables from all wikis - https://phabricator.wikimedia.org/T59185 [10:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:15] Operations: Provide wrapper script for account handling - https://phabricator.wikimedia.org/T142825#2547395 (MoritzMuehlenhoff) [11:00:04] _joe_: I'm not convinced it was the legit user [11:00:16] <_joe_> Platonides: oh ok [11:00:32] !log upgrading httpd to 2.4.10-10+deb8u6+wmf2 on mw126[5678] [11:00:34] too many connections from dissimilar places [11:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:00:41] but anyway, can be unbanned now [11:01:38] actually...
[11:02:13] looking at the messages by "Az1567" on #wikimedia-es [11:02:14] I'm pretty sure it was the troll [11:02:46] last time I checked he was using free proxies [11:03:27] jynus: I had banned that ip precisely because it came from Digital Ocean [11:03:42] a few people have bouncers there, though [11:04:04] did you figure out the kind of proxy he uses? [11:06:52] Operations, Traffic: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547422 (ema) Steps to reproduce while varnish.service is running fine with two processes, assuming varnish-modules is installed: - apt-get remove varni... [11:07:25] Operations: Annotate data.yaml user information with an email address - https://phabricator.wikimedia.org/T142826#2547423 (MoritzMuehlenhoff) [11:10:29] Platonides, I just did a couple of whois on a couple of ips and they showed on anonymous proxy lists [11:10:46] Operations: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827#2547437 (MoritzMuehlenhoff) [11:15:40] Operations: Annotate data.yaml user information with an email address - https://phabricator.wikimedia.org/T142826#2547453 (MoritzMuehlenhoff) Or alternatively store the mail address in labs LDAP (and validate that it's present there) since the users between data.yaml and labs LDAP are identical. [11:17:47] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet has 1 failures [11:21:17] hmmm did you really mean whois?
[11:27:34] checking puppet on 1266 [11:27:41] I think it is due to apache being reinstalled [11:28:46] moritzm: fwiw, labs ldap already exposes wikitech email addresses (but it's not clear to me if those are re-usable for your use case) [11:28:47] yes all good now [11:28:59] mw126[5678] taking traffic regularly [11:29:56] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:29:58] valhallasw`cloud: yeah, that would be an option to use, but to really use it as a synchronisation point with other data we'd at least need to make sure that all existing entries are filled out [11:36:40] moritzm: *nod*. mail definitely not present for all users [11:38:26] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:10] Operations: mw2086 & mw2087 do not respond to IPMI commands - https://phabricator.wikimedia.org/T142726#2547515 (faidon) Pretty sure @RobH is referring to the IPMI cipher 0 vulnerability. This was fixed across the fleet at the time by disabling cipher 0 (not disabling IPMI in general). This was fixed by the... [11:43:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [11:45:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:54:20] Operations: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2547551 (MoritzMuehlenhoff) [11:56:09] Operations: Require/track Phabricator username - https://phabricator.wikimedia.org/T142830#2547553 (MoritzMuehlenhoff) [12:01:21] Operations, DBA: script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609#2547574 (Krenair) [12:01:23] Operations, DBA: script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609#916322 (Krenair) Thanks.
[12:03:45] (CR) Mobrovac: [C: 1] Add the fatalmonitor query to logstash_checker [puppet] - https://gerrit.wikimedia.org/r/304327 (owner: Thcipriani) [12:04:17] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:34:48] Operations: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#2547672 (MoritzMuehlenhoff) [13:00:05] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3006 is CRITICAL: Connection refused [13:04:31] (PS1) Ema: cache_maps: switch to file storage backend [puppet] - https://gerrit.wikimedia.org/r/304466 (https://phabricator.wikimedia.org/T142810) [13:04:42] ema: I suppose cp3006 is you then [13:05:20] akosiaris: indeed [13:05:40] akosiaris: well it's not me but I know what's going on :) [13:08:05] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 176 bytes in 0.169 second response time [13:09:22] (PS2) Gehel: Maps - categorize maps1002 in site.pp [puppet] - https://gerrit.wikimedia.org/r/304324 (https://phabricator.wikimedia.org/T138092) [13:09:50] Operations, WMF-Legal, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2547754 (Aklapper) Relatedly, {T142830} proposes to store Phab username in LDAP. [13:10:35] Operations, Discovery, Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2547758 (Gehel) Open>Resolved No new issues seen in icinga history, all looks good, closing. Thanks @faidon and @Cmjohnson for your help!
[13:11:09] (CR) BBlack: [C: 1] cache_maps: switch to file storage backend [puppet] - https://gerrit.wikimedia.org/r/304466 (https://phabricator.wikimedia.org/T142810) (owner: Ema) [13:11:13] (CR) Gehel: [C: 2] Maps - categorize maps1002 in site.pp [puppet] - https://gerrit.wikimedia.org/r/304324 (https://phabricator.wikimedia.org/T138092) (owner: Gehel) [13:20:01] (PS2) Ema: cache_maps: switch to file storage backend [puppet] - https://gerrit.wikimedia.org/r/304466 (https://phabricator.wikimedia.org/T142810) [13:20:10] (CR) Ema: [C: 2 V: 2] cache_maps: switch to file storage backend [puppet] - https://gerrit.wikimedia.org/r/304466 (https://phabricator.wikimedia.org/T142810) (owner: Ema) [13:21:39] !log initial configuration of maps1002 [13:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:16] Operations: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#2547672 (AlexMonk-WMF) Actually, cn=wmf can contain more than just people with shell access. > Users who are no longer in data.yaml should not be in cn=wmf or cn=nda in LDAP (meaning they were forgotten to handle when re...
[13:23:24] Operations, Documentation, LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#2547784 (AlexMonk-WMF) [13:23:25] Operations, LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2547785 (AlexMonk-WMF) [13:23:27] Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2547783 (AlexMonk-WMF) [13:24:35] Operations: Require/track email addresses - https://phabricator.wikimedia.org/T142826#2547423 (AlexMonk-WMF) If we're using the mailing address in LDAP then for my account you'll always get my @gmail.com [13:25:39] Operations: Provide wrapper script for account handling - https://phabricator.wikimedia.org/T142825#2547395 (AlexMonk-WMF) I wrote some code (probably under my @Krenair account) a while ago that would look for any intersection between keys allowed in labs and keys allowed in production. It might be useful here. [13:26:32] Operations, LDAP: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#2547796 (AlexMonk-WMF) Ah, cn=ops, yes, good point. [13:28:27] Operations, Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2547816 (AlexMonk-WMF) To get input about deployment-prep you need to add #Beta-Cluster-Infrastructure (excluding the list of members/sudoUser) ```dn: cn=under_NDA,ou=sudoers,cn=deploymen... [13:35:52] (PS1) Mobrovac: Parsoid: Switch to Scap3 deployments [puppet] - https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) [13:50:06] Operations, Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2547853 (Aklapper) Created user https://phabricator.wikimedia.org/p/ops-monitoring-bot/ API Token for @volans at P3824.
@Volans: Please close the task once it works as expected. See https://... [13:51:32] (CR) Ottomata: [C: 2] Update camus job to use new check_jar [puppet] - https://gerrit.wikimedia.org/r/304195 (https://phabricator.wikimedia.org/T142717) (owner: Joal) [13:51:38] (PS3) Ottomata: Update camus job to use new check_jar [puppet] - https://gerrit.wikimedia.org/r/304195 (https://phabricator.wikimedia.org/T142717) (owner: Joal) [13:52:24] (CR) Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/3697/" [puppet] - https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) (owner: Mobrovac) [13:54:28] !log cache_maps varnish backends rolling restart (T142810) [13:54:29] T142810: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810 [13:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:03] Operations, Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#2547873 (AlexMonk-WMF) [13:59:18] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2547874 (mkroetzsch) I have signed the Acknowledgement of Wikimedia Server Access Responsibilities. My labs u... [13:59:56] (PS1) Mobrovac: Admin: Add Parsoid deployers to the deploy-service group [puppet] - https://gerrit.wikimedia.org/r/304471 (https://phabricator.wikimedia.org/T120103) [14:01:34] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2547880 (mkroetzsch) P.S. Alex is on vacation and possibly disconnected. His reply might therefore be delayed....
[14:03:12] (PS1) Elukey: Add python-docopt to the Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/304472 [14:07:33] (PS2) Alex Monk: Fixes and improvements for maintain-meta_p [software] - https://gerrit.wikimedia.org/r/304425 [14:12:13] (CR) Ottomata: [C: 2] Add python-docopt to the Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/304472 (owner: Elukey) [14:13:22] (CR) Jcrespo: [C: -1] "I am very sorry I made you lose some time. While this would, in fact, work, watchlist was detected as a very sensible table and it is not " [software] - https://gerrit.wikimedia.org/r/295751 (owner: Alex Monk) [14:16:01] (CR) Alex Monk: "Isn't that what I'm doing? Why have you set Code-Review-1?" [software] - https://gerrit.wikimedia.org/r/295751 (owner: Alex Monk) [14:16:30] (CR) Jcrespo: [C: 1] "I'm confused, this actually works. What did I review instead?" [software] - https://gerrit.wikimedia.org/r/295751 (owner: Alex Monk) [14:16:43] !log reboot labstore1004/1005 [14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:06] Krenair, for some reason I saw a completely different diff on gerrit, one with count(*), maybe I clicked on the wrong link [14:18:27] you might've been looking at the wrong side of the diff [14:18:51] oh, it was that! [14:19:06] lol [14:19:14] I reviewed the current code!
[14:22:10] (PS2) Jcrespo: Remove db1027 from internal dns entries [dns] - https://gerrit.wikimedia.org/r/289168 (https://phabricator.wikimedia.org/T135253) [14:25:40] (PS2) Alex Monk: Replace impossible watchlist_counts custom view with full view of already-filtered watchlist_count [software] - https://gerrit.wikimedia.org/r/295751 [14:26:23] (CR) Jcrespo: [C: 2] Replace impossible watchlist_counts custom view with full view of already-filtered watchlist_count [software] - https://gerrit.wikimedia.org/r/295751 (owner: Alex Monk) [14:27:19] Operations, Ops-Access-Requests: Access for platonides to chanops - https://phabricator.wikimedia.org/T142668#2547901 (Dzahn) Honestly it seems to me like you listed the disadvantages of not having a process or tracking. Anyways, i was actually trying to stop arguments over this. [14:27:19] (PS1) Muehlenhoff: Disable unprivileged user namespaces on trusty systems [puppet] - https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) [14:33:57] !log zotero restarted, mem usage was at 11% [14:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:08] Operations, Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (BBlack) [14:39:29] Operations, Traffic, Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2547959 (BBlack) [14:39:31] Operations, Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547958 (BBlack) [14:39:59] Operations, Traffic, Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (BBlack) [14:40:01] Operations, Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (BBlack) [14:40:14] Operations, Traffic: Sideways
Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (BBlack) [14:40:16] Operations, Traffic, Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (BBlack) [14:43:44] (PS15) Alex Monk: Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) [14:44:39] "This (so far) deals with the following issues: * It was written in Perl." [14:45:00] ❤️ [14:45:42] :) [14:46:37] Don't think I'd ever done anything serious with perl until I started porting that script away from it. [14:46:47] oh, I did [14:46:51] I think that makes me lucky. [14:46:54] well, not anything serious [14:47:08] Though I did have so much fun trying to read it [14:47:30] but serious enough to handle root access to large clusters and lots of regular expressions [14:47:52] heh [14:48:15] In fact, I think I recently extended pt-heartbeat (percona toolkit), and I think that was perl [14:48:37] but I cannot be sure [14:50:30] Actually, it looks like I do perl and I never realized it: https://gerrit.wikimedia.org/r/#/c/302469/4/modules/icinga/files/check_mariadb.pl [14:54:11] Operations, Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2547990 (Volans) Open>Resolved All good, thanks! [15:02:09] (PS1) Muehlenhoff: hhvm::admin: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/304476 [15:13:03] !log upgrading openjdk on restbase-test/xenon/praseodymium/cerium) (along with cassandra restart) [15:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:50] gehel, i will scap the tilerator and initiate it on maps1002 [15:14:32] yurik: as far as I can see, maps1002 already has the same version as maps1001, do you see something different?
[15:14:57] gehel, i'm not sure how that's possible - it was disabled yesterday [15:15:29] yurik: it pulled the code during initial puppet run [15:15:55] gehel, that's what's strange - i removed it from the targets file [15:16:02] in any case, doing it now, will see [15:16:51] done [15:17:19] yurik: kool, thanks! [15:18:00] gehel, yep, it created all the needed namespaces, and kartotherian also restarted ok [15:18:18] !log deployed tilerator to fix kartotherian restarts on maps100* [15:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:13] gehel, i will start the tile generation over the weekend. will see if i can optimize it somehow [15:26:05] (PS1) Ottomata: Include sqoop on all hadoop workers [puppet] - https://gerrit.wikimedia.org/r/304479 [15:31:13] Operations: mw2086 & mw2087 do not respond to IPMI commands - https://phabricator.wikimedia.org/T142726#2548051 (RobH) >>! In T142726#2547515, @faidon wrote: > Pretty sure @RobH is referring to the IPMI cipher 0 vulnerability. > > This was fixed across the fleet at the time by disabling cipher 0 (not disabl...
[15:35:30] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [15:36:20] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [15:36:21] !log reset postgresql maps slave for maps1003 [15:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:28] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [15:41:39] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [15:44:19] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [15:45:38] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [15:59:25] (PS1) Muehlenhoff: elasticsearch: Use domain networks [puppet] - https://gerrit.wikimedia.org/r/304483 [16:00:39] (CR) jenkins-bot: [V: -1] elasticsearch: Use domain networks [puppet] - https://gerrit.wikimedia.org/r/304483 (owner: Muehlenhoff) [16:04:40] (PS2) Muehlenhoff: elasticsearch: Use domain networks [puppet] - https://gerrit.wikimedia.org/r/304483 [16:07:16] (CR) jenkins-bot: [V: -1] elasticsearch: Use domain networks [puppet] - https://gerrit.wikimedia.org/r/304483 (owner: Muehlenhoff) [16:08:26] (PS3) Muehlenhoff: elasticsearch: Use domain networks [puppet] - https://gerrit.wikimedia.org/r/304483 [16:13:32] (PS1) Dzahn: remove exception for gerrit from enforce-users-groups.sh [puppet] - https://gerrit.wikimedia.org/r/304484 [16:16:03] Operations: Install puppetDB at
WMF - https://phabricator.wikimedia.org/T139476#2548141 (Joe) [16:16:05] Operations: create puppetDB puppet role + debian package - https://phabricator.wikimedia.org/T142363#2548140 (Joe) Open>Resolved [16:21:04] (CR) Dzahn: "I don't think we need this anymore. We have working https in labs using LE." [puppet] - https://gerrit.wikimedia.org/r/303435 (https://phabricator.wikimedia.org/T141803) (owner: Paladox) [16:21:25] Operations: Create replacement for our scripts that depend on exported resources - https://phabricator.wikimedia.org/T142846#2548142 (Joe) [16:21:34] (CR) Chad: "I removed the 444:444 hardcoding at Moritz's suggestion, but yeah it should always be a system user now. The old one was a legacy accident" [puppet] - https://gerrit.wikimedia.org/r/304484 (owner: Dzahn) [16:21:39] (CR) Chad: [C: 1] remove exception for gerrit from enforce-users-groups.sh [puppet] - https://gerrit.wikimedia.org/r/304484 (owner: Dzahn) [16:22:27] (CR) Chad: "This is already done on lead anyway, this just codifies it in puppet." [puppet] - https://gerrit.wikimedia.org/r/302949 (owner: Chad) [16:23:17] (CR) Chad: "Yeah not as it exists right now. We could do some more cleanup of the apache config sure, but otherwise yeah let's abandon this." [puppet] - https://gerrit.wikimedia.org/r/303435 (https://phabricator.wikimedia.org/T141803) (owner: Paladox) [16:24:15] Operations: Create replacement for our scripts that depend on exported resources - https://phabricator.wikimedia.org/T142846#2548161 (Joe) We imported the resources, although running the `puppet storeconfigs export` command required around 25 GB of RAM (see https://tickets.puppetlabs.com/browse/PDB-165), whi... [16:24:22] (CR) Chad: "Similar to Ibf7dbc, but makes it explicit for these config files."
[puppet] - https://gerrit.wikimedia.org/r/303204 (owner: Chad) [16:24:38] (CR) Chad: "Er, copy+paste fail, that's Ibf7dbc04" [puppet] - https://gerrit.wikimedia.org/r/303204 (owner: Chad) [16:24:57] (CR) Dzahn: "Chad, should we still have this? LDAP auth works in labs too.." [puppet] - https://gerrit.wikimedia.org/r/303355 (owner: Paladox) [16:25:10] (CR) Chad: "@Paladox can you test this on your install?" [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad) [16:26:47] Operations: Create replacement for our scripts that depend on exported resources - https://phabricator.wikimedia.org/T142846#2548178 (Joe) For `naggen2`, it is incredibly easy and fast to fetch resources, probably with some jq wizardry it would be enough to do ``` curl https://nitrogen.eqiad.wmnet/v3/resourc... [16:26:58] Operations: Create replacement for our scripts that depend on exported resources - https://phabricator.wikimedia.org/T142846#2548179 (Joe) a:Joe [16:33:39] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:42] ^me [16:34:50] !log reboot labstore1005 to test failure mode [16:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:09] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [16:37:51] (CR) Chad: "I think it's useful for one-off testing where using DEVELOPMENT_BECOME_ANY_ACCOUNT is useful."
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/303355 (owner: Paladox) [16:38:28] (CR) Chad: Gerrit: make auth_type configurable for labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/303355 (owner: Paladox) [16:39:45] (Abandoned) Chad: Gerrit: Simplify dependencies [puppet] - https://gerrit.wikimedia.org/r/302950 (owner: Chad) [16:40:09] (Abandoned) Chad: Contint: remove contint-users group [puppet] - https://gerrit.wikimedia.org/r/298832 (owner: Chad) [16:41:49] (PS1) Giuseppe Lavagetto: sshknowngen: add puppetdb-compatible version [puppet] - https://gerrit.wikimedia.org/r/304485 (https://phabricator.wikimedia.org/T142846) [16:42:16] <_joe_> paravoid: if you're around, take a look :P ^^ [16:43:49] looking [16:44:32] can't we do it without sshknowngen? [16:44:51] have you looked into puppetdbquery instead? [16:44:52] (PS2) Giuseppe Lavagetto: sshknowngen: add puppetdb-compatible version [puppet] - https://gerrit.wikimedia.org/r/304485 (https://phabricator.wikimedia.org/T142846) [16:45:43] <_joe_> paravoid: well, yes, but that was a puppet face IIRC? [16:46:24] (PS1) Chad: gerrit (2.12.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - https://gerrit.wikimedia.org/r/304486 [16:46:37] no [16:46:44] well yes, that too, but I'm talking about the parser functions [16:47:07] https://github.com/dalen/puppet-puppetdbquery [16:47:10] <_joe_> so you mean building a template in erb based on those parser functions?
[16:47:10] query_resources() [16:47:33] (also, a hiera backend too apparently, although probably not that useful here) [16:47:39] <_joe_> yeah ok, that works too, but I've seen ebay has gone the python way for naginator on puppetdb [16:47:54] (PS20) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) [16:48:44] <_joe_> paravoid: I thought of this as very useful for doing things "let's find all the nodes that have this resource declared or smt like that [16:49:48] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:13] <_joe_> but yes, let's see how well it performs [16:50:44] Operations, Traffic: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (BBlack) [16:52:59] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [16:53:14] about the same I'd guess [16:53:18] but big words for puppet :P [16:53:53] <_joe_> paravoid: I've just seen a puppet script take 2 hours to generate a tarfile of less than 2 megabytes [16:53:59] (CR) Chad: "Even with this applied I still get "gbp:error: upstream/2.12.2 is not a valid treeish"" [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 (owner: Paladox) [16:54:01] <_joe_> :P [16:54:02] yeah I heard :P [16:54:16] but what puppetdbquery does, I'm guessing, is exactly what you did [16:54:21] /v3/resources/$resource [16:54:23] <_joe_> about the time it would've taken me to type it I guess [16:54:24] <_joe_> yes [16:54:27] <_joe_> it does the same [16:54:42] <_joe_> so it's ruby, not puppet [16:54:52] well a function in ruby [16:54:55] <_joe_> only a very large data structure to keep in memory for puppet [16:54:57] plus, presumably, erb/epp [16:55:01] <_joe_> erb [16:55:10] <_joe_> epp is a horror of our future :P [16:55:10] hating epp already? :) [16:55:27] <_joe_> not /hating/; not seeing the point?
yes [16:55:38] the reason I'm leaning towards that instead of sshknowngen, is that we have other places where we use that kind of code [16:55:54] shinkengen, filippo wanted something similar for prometheus, etc. [16:56:02] <_joe_> ok [16:56:11] <_joe_> yes makes sense [16:56:17] and in general, it's a useful thing to have, even if we (rightfully) use exported resources in very limited places [16:56:21] <_joe_> naginator being the exception here I guess [16:56:35] I'm not sure! [16:56:49] <_joe_> for that, I kinda like the ebay naginator.py which does exactly what we need [16:57:00] ok [16:57:27] <_joe_> anyways, now that it has no load at all puppetdb is very fast [16:57:29] icinga2's config (for instance) is a bit different [16:57:53] so it'd be neat to be able to arbitrarily modify the generated config using a template language [16:58:06] <_joe_> jinja2? [16:58:10] <_joe_> they do that :P [16:58:12] heh ok [16:58:20] I was about to ask :) [16:58:21] <_joe_> like we did with naggen2 etc [16:58:30] Operations, Traffic: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548382 (BBlack) [16:58:57] <_joe_> but yes, for the rest of things it's a good idea to remain within the puppet borders if possible [16:59:03] <_joe_> I'll do some tests [16:59:05] Operations, Traffic: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (BBlack) (Added two forgotten items (xkey + purge/ban) in the downsides list) [16:59:17] did you guys migrate us to puppetdb entirely already? [16:59:22] <_joe_> no [16:59:28] <_joe_> we joked about doing it today [16:59:31] heh [16:59:41] <_joe_> we just ran an export/import of resources [16:59:50] ok :) [16:59:54] <_joe_> which was a precondition for me testing conversion of naggen/etc [17:00:00] yeah, smart :) [17:00:05] did alex figure out what to do with servermon?
[17:00:30] <_joe_> i don't think so, but it's gonna be one curl for the nodes, a few more for other things [17:00:37] <_joe_> SOA [17:00:49] I suggested another alternative, but I'm not sure which one is better at this point [17:01:07] the other alternative being, writing a puppet report handler that just submitted all the facts to a servermon endpoint [17:01:26] and servermon just keeping its own database of facts in its own database (using a better schema than storedconfig's) [17:01:41] <_joe_> uhm [17:02:29] <_joe_> well, either solution works [17:02:51] yeah :) [17:03:14] I suggested it as a safer option because puppetdb is largely unknown to me [17:03:23] <_joe_> but I mean services querying each other via REST interfaces, how can you resist being able to claim we're moving to an SOA? [17:03:26] I'm guessing it's not to you anymore, so I'll leave you guys to decide :) [17:03:34] heh [17:03:36] <_joe_> well, it is basically unknown [17:03:46] <_joe_> it's just very fast for simple queries [17:03:56] <_joe_> which is already more than i expected [17:04:21] <_joe_> lemme try to grind it down with ab :P [17:06:01] ok :) [17:08:43] <_joe_> heh, not that bad but will probably need tuning [17:09:16] <_joe_> anyways, the mean response time for all the sshkey resources at concurrency 10 (which is way higher than what I'd expect) is 380 ms [17:09:27] <_joe_> and again, it's largely untuned [17:27:22] (PS21) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) [17:29:03] (CR) jenkins-bot: [V: -1] beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: Alex Monk) [17:38:28] (PS7) Paladox: Add gbp.conf file for debian [debs/gerrit] - https://gerrit.wikimedia.org/r/301841 [17:38:58] (CR) Paladox: "@Chad it should now work."
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/301841 (owner: 10Paladox) [17:39:16] (03Abandoned) 10Paladox: Testing [debs/gerrit] - 10https://gerrit.wikimedia.org/r/302371 (owner: 10Paladox) [17:50:51] (03PS2) 10Dzahn: Gerrit: Allow gerrit2 user/group to write to its config files [puppet] - 10https://gerrit.wikimedia.org/r/303204 (owner: 10Chad) [17:51:29] (03CR) 10Dzahn: "uhm..rebased into nothing-ness?" [puppet] - 10https://gerrit.wikimedia.org/r/303204 (owner: 10Chad) [17:53:13] (03PS2) 10Dzahn: remove exception for gerrit from enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/304484 [17:59:59] (03PS22) 10Alex Monk: beta: Use Let's Encrypt cert [puppet] - 10https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) [18:06:47] (03CR) 10BBlack: [C: 031] Disable unprivileged user namespaces on trusty systems [puppet] - 10https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: 10Muehlenhoff) [18:10:26] (03CR) 10Paladox: "Ok yep, once my pic has installed its windows 10 build, just restarted when I wasn't looking." 
[puppet] - 10https://gerrit.wikimedia.org/r/302980 (owner: 10Chad) [18:10:34] (03Abandoned) 10Chad: Gerrit: Allow gerrit2 user/group to write to its config files [puppet] - 10https://gerrit.wikimedia.org/r/303204 (owner: 10Chad) [18:11:00] (03PS23) 10Alex Monk: beta: Use Let's Encrypt cert [puppet] - 10https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) [18:15:04] (03PS1) 10Yuvipanda: tools: Stop checking labs puppetmaster via toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/304492 (https://phabricator.wikimedia.org/T142452) [18:15:15] (03CR) 10Dzahn: [C: 032] "double checked with "dryrun" too" [puppet] - 10https://gerrit.wikimedia.org/r/304484 (owner: 10Dzahn) [18:15:36] (03PS2) 10Yuvipanda: tools: Stop checking labs puppetmaster via toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/304492 (https://phabricator.wikimedia.org/T142452) [18:15:46] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Stop checking labs puppetmaster via toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/304492 (https://phabricator.wikimedia.org/T142452) (owner: 10Yuvipanda) [18:17:02] mutante I merged your change too [18:17:36] yuvipanda: this must have been exceptional timing, i also did puppet-merge [18:17:45] i did not see the "multiple changes" warning [18:17:51] but then i saw toollabs [18:18:05] :) [18:18:16] either way, both are merged it tells me too [18:18:17] :) [18:18:24] yuvipanda, could you take a look at https://gerrit.wikimedia.org/r/#/c/303938/ at some stage? might not be a friday thing, not sure [18:19:49] krenair oh yeah, I looked at it - in general nginx 'if's are a bad idea (https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/) so I was thinking we should just add another server {} block (or even a whole nginx::site) for it [18:19:53] (03CR) 10Paladox: "Why not go with gerrit 2.12.4 since it includes some more fixes and they look like they are gearing to release 2.12.4." 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/304486 (owner: 10Chad) [18:20:13] so I think the right thing to do is to add a nginx::site in the novaproxy role [18:20:24] ok [18:21:44] krenair another question I have is if our *.wmflabs.org cert covers wmflabs.org by itself [18:21:53] (03CR) 10Paladox: "https://gerrit.googlesource.com/gerrit/+/stable-2.12/ReleaseNotes/ReleaseNotes-2.12.4.txt" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/304486 (owner: 10Chad) [18:22:37] (03CR) 10Chad: "Yeah well I've been waiting and week and it hasn't come out yet. I need 2.12.3 more than 2.12.4. When it comes out we'll upgrade again." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/304486 (owner: 10Chad) [18:22:40] (03CR) 10Paladox: "https://gerrit.googlesource.com/gerrit/+/stable-2.12/ReleaseNotes/ReleaseNotes-2.12.4.txt#76" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/304486 (owner: 10Chad) [18:22:53] (03CR) 10Paladox: [C: 031] "Ok" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/304486 (owner: 10Chad) [18:22:55] yuvipanda, CN is *.wmflabs.org but it has wmflabs.org on SAN [18:23:31] like how our *.wikipedia.org cert magically covers mediawiki.org :) [18:23:54] (03Abandoned) 10Dzahn: amire80 .bashrc, add alias for sql host lookup [puppet] - 10https://gerrit.wikimedia.org/r/301326 (https://phabricator.wikimedia.org/T141255) (owner: 10Dzahn) [18:24:30] nice [18:27:27] Krenair: *.wikipedia.org covers mediawiki.org? LOL? [18:27:36] yes Luke081515 [18:27:59] interesting [18:28:14] got chrome or FF? [18:28:16] Magic! [18:28:19] FF, yep [18:28:25] I'm ucrrently looking at it [18:28:27] *currently [18:28:33] oh okay, you found the list? 
[18:28:51] "covers *.wikipedia.org" [18:28:52] Luke081515: you can go to certificate details and the "subject alt name" field [18:32:49] our primary prod cert covers something like 30-ish SANs [18:32:57] heh, that certificate covers for nearly all wikis [18:33:20] (for *.project.org, *.m.project.org, and plain project.org, for all the main project domains, plus a few other odds and ends) [18:33:26] Krenair: you see, not only mw.org ;) [18:33:34] yes, a lot more than just mw.o [18:33:37] that was an example [18:33:57] :D [18:34:08] wmfusercontent and *.planet are on their own separate certs, but arguably could be merged up on the next go-round [18:34:47] meh, I get SSL_ERROR_BAD_CERT_DOMAIN for beta again. It says that the certificate only covers beta.wmflabs.org, not *.beta.wmflabs.org :-/ [18:35:10] SAN wildcards only do one level of hostname [18:35:29] whyyyyyyy :P [18:36:14] well for most of the world it doesn't matter much anyways [18:36:26] we're the only major crazies that do en.wikipedia.org instead of wikipedia.org/en/ :P [18:38:08] wtf? beta was working just now [18:38:18] (that decision also screws us on getting an EV cert for the better-looking address bar icon like banks and such. because the standards say EV can't be wildcard. 
the value is a little dubious anyways, though) [18:38:31] (03PS5) 10Paladox: Gerrit: make auth_type configurable for labs [puppet] - 10https://gerrit.wikimedia.org/r/303355 [18:38:32] Krenair: the cert it's showing only has wmflabs.org as the single SAN [18:38:37] (03PS6) 10Paladox: Gerrit: make auth_type configurable for labs [puppet] - 10https://gerrit.wikimedia.org/r/303355 [18:38:39] (03PS2) 10Alex Monk: dynamicproxy: Add nginx config to redirect www.wmflabs.org/wmflabs.org to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/303938 (https://phabricator.wikimedia.org/T38885) [18:38:43] yeah I'm guessing it's dumped my hiera changes and gone back to default [18:38:44] but why [18:40:00] maybe I missed role:: [18:40:06] LE doesn't do wildcards, EV doesn't do wildcards. most major sites don't need wildcards because they're not supporting ~300 language subdomainnames. [18:40:11] (03PS7) 10Paladox: Gerrit: make auth_type configurable for labs [puppet] - 10https://gerrit.wikimedia.org/r/303355 [18:40:16] one flick of the canonical URLs could fix it all :P [18:41:29] (we could start by supporting both redirect-free and just having rel=canonical indicate which is prefered for search engines. then wait for traffic patterns to shit to majority /lang, then start redirecting from the old URLs, etc..) [18:41:32] ugh [18:41:39] (it would be years before we could actually kill them though, too many links all over) [18:43:10] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 304 bytes in 0.077 second response time [18:44:00] I wonder what URL that check is actually looking up [18:44:05] yuvipanda? 
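On the earlier point that SAN wildcards only cover one level of hostname: a minimal sketch of that matching rule, simplified from how TLS clients generally treat a leading `*.` label. The function names are made up for illustration, and `foo.beta.wmflabs.org` is a hypothetical host used only to show the two-level case.

```python
def san_matches(pattern: str, hostname: str) -> bool:
    """Return True if one SAN entry covers hostname.

    Simplified rule: a leading "*" stands for exactly one label,
    never zero labels and never several.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def cert_covers(san_list, hostname):
    """A cert covers a host if any of its SAN entries matches."""
    return any(san_matches(p, hostname) for p in san_list)


# SAN list as described in the channel: wildcard plus the bare domain.
sans = ["*.wmflabs.org", "wmflabs.org"]
for host in ("wmflabs.org", "beta.wmflabs.org", "foo.beta.wmflabs.org"):
    print(host, cert_covers(sans, host))
```

This is why the bare domain has to be listed as its own SAN entry, and why a host two levels below the wildcard needs a separate `*.beta.wmflabs.org` entry.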
[18:44:13] yuvipanda: ^ may want to silence not sure where you are in the process [18:44:24] Krenair: I think it checks on the tools checker host itself that it can fetch the catalogue [18:44:48] (03CR) 10Chad: "One last inline nit, otherwise lgtm." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303355 (owner: 10Paladox) [18:45:29] Okay, beta is back [18:45:46] This was the real issue: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep/host/deployment-cache-text04&diff=817194&oldid=817184 [18:46:32] (03PS8) 10Paladox: Gerrit: make auth_type configurable for labs [puppet] - 10https://gerrit.wikimedia.org/r/303355 [18:46:48] (03CR) 10Paladox: Gerrit: make auth_type configurable for labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303355 (owner: 10Paladox) [18:47:57] (03PS1) 10Ottomata: Render analytics-research-user mysql password in hdfs so that we can automate sqoop via Oozie [puppet] - 10https://gerrit.wikimedia.org/r/304494 [18:48:54] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2548670 (10Nuria) @DarTar: we do not deal directly with reserachers trying to get access so our docs are geared to develo... 
[18:50:07] (03PS2) 10Ottomata: Render analytics-research-user mysql password in hdfs so that we can automate sqoop via Oozie [puppet] - 10https://gerrit.wikimedia.org/r/304494 [18:52:04] (03PS3) 10Ottomata: Render analytics-research-user mysql password in hdfs so that we can automate sqoop via Oozie [puppet] - 10https://gerrit.wikimedia.org/r/304494 [18:57:13] (03PS4) 10Ottomata: Render analytics-research-user mysql password in hdfs so that we can automate sqoop via Oozie [puppet] - 10https://gerrit.wikimedia.org/r/304494 [18:57:19] (03CR) 10Ottomata: [C: 032 V: 032] Render analytics-research-user mysql password in hdfs so that we can automate sqoop via Oozie [puppet] - 10https://gerrit.wikimedia.org/r/304494 (owner: 10Ottomata) [19:01:50] (03PS1) 10Ottomata: Qualify path to echo in hdfs_put_mysql-analytics-research-client-pw.txt exec [puppet] - 10https://gerrit.wikimedia.org/r/304497 [19:02:03] (03CR) 10Ottomata: [C: 032 V: 032] Qualify path to echo in hdfs_put_mysql-analytics-research-client-pw.txt exec [puppet] - 10https://gerrit.wikimedia.org/r/304497 (owner: 10Ottomata) [19:03:39] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail [19:03:59] nuh uh i just fixed it. [19:05:28] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:53] 06Operations, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2548760 (10Dzahn) >>! In T123525#2543709, @jcrespo wrote: > I would like to bring your attention to https://gerrit.wikimedia.org/r/304203 I checked the one for gerrit can be removed... 
[19:23:55] (03PS3) 10Alex Monk: dynamicproxy: Add nginx config to redirect www.wmflabs.org/wmflabs.org to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/303938 (https://phabricator.wikimedia.org/T38885) [19:35:24] (03PS3) 10Dzahn: Fix path to jenkins homedir for nodepool slaves [puppet] - 10https://gerrit.wikimedia.org/r/299029 (owner: 1020after4) [19:35:54] (03CR) 10Dzahn: [C: 032] Fix path to jenkins homedir for nodepool slaves [puppet] - 10https://gerrit.wikimedia.org/r/299029 (owner: 1020after4) [19:36:47] (03CR) 1020after4: "I never was able to figure out exactly how it gets set on the slave." [puppet] - 10https://gerrit.wikimedia.org/r/299029 (owner: 1020after4) [19:38:40] chasemp Krenair I merged a change earlier that removed that check, neon's just catching up I guess. [19:47:13] (03PS1) 10Mattflaschen: Set one year login in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304501 (https://phabricator.wikimedia.org/T68699) [19:47:55] OK. looks like we have an issue in prod for ORES that won't be fixed by a rollback. We have an issue where we have bad values in our cache. My plan is to deploy a minor fix (well tested in beta) and clear ORES' cache to resolve the issue. [19:48:17] Happily this issue is not affecting the ORES extension, so that's why we missed it. [19:48:24] Instead, it's affecting 3rd party tool devs. [19:49:50] So, we're geared up and ready for this bug fix deploy right now. And I've confirmed a pattern for clearing the caches. 
[19:50:35] We sent an email to ai-l (for our users) [19:50:52] and I tested the deployment in ores-beta so it won't break the ores review tool [19:51:08] ^ which is our primary concern as that affects the majority of users [19:51:14] 07Puppet, 10Continuous-Integration-Config, 07Jenkins: There is no sane way to get arcanist's conduit tokens onto nodepool CI slaves - https://phabricator.wikimedia.org/T140417#2548797 (10mmodell) [19:52:14] 07Puppet, 10Continuous-Integration-Config, 07Jenkins: There is no sane way to get arcanist's conduit tokens onto nodepool CI slaves - https://phabricator.wikimedia.org/T140417#2463982 (10mmodell) a:05hashar>03mmodell [19:52:21] 07Puppet, 10Continuous-Integration-Config, 07Jenkins: There is no sane way to get arcanist's conduit tokens onto nodepool CI slaves - https://phabricator.wikimedia.org/T140417#2463982 (10mmodell) p:05Normal>03Low [19:55:19] halfak: do you need somebody to merge the minor bugfix? [19:55:31] do you still need "Lowers ores-redis maxmemory setting to 2.5GB" ? [19:57:01] mutante, I don't need that, no. I'll go get rid of that change. [19:57:21] I think that Amir1 was able to merge the bugfix to the deploy repo and that's how we tested in beta. [19:57:29] Amir1, can oyu confirm? [19:57:30] (03CR) 10Mattflaschen: [C: 032] Set one year login in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304501 (https://phabricator.wikimedia.org/T68699) (owner: 10Mattflaschen) [19:57:40] I already merged it [19:57:47] I'm in tin [19:57:51] Great. Thanks anyway mutante [19:57:58] (03Merged) 10jenkins-bot: Set one year login in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304501 (https://phabricator.wikimedia.org/T68699) (owner: 10Mattflaschen) [19:59:11] halfak: If you give me a minute. I want to be sure I'm not accidentally deploying anything more than the fix [19:59:51] No problem. 
Please do confirm [19:59:53] sorry it takes some time [19:59:58] Na :) [20:00:12] alright, cool [20:03:11] !log deploying ores 2ef24f2 to scb2001.codfw.wmnet (canary node) T142857 [20:03:12] T142857: ORES format issue (whole score document is cached) - https://phabricator.wikimedia.org/T142857 [20:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:31] I love stashbot [20:03:51] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: T68699: Enable one-year login on Beta Cluster. Scheduled for prod on Tuesday. (duration: 00m 52s) [20:03:52] T68699: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis - https://phabricator.wikimedia.org/T68699 [20:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:41] deployed [20:05:42] testing [20:06:15] (the canary node) [20:06:41] works fine [20:06:43] going to prod [20:07:05] !log deploying ores 2ef24f2 to all nodes T142857 [20:07:07] T142857: ORES format issue (whole score document is cached) - https://phabricator.wikimedia.org/T142857 [20:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:30] We should expect to see a minor CPU bump when I clear the cache. See https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen [20:10:41] * halfak is excited to have this issue resolved. [20:10:50] I've got a URL queued up so that we can confirm it right away [20:10:51] :) [20:10:53] :)))) [20:10:59] :D [20:11:06] The deploy is not finished yet [20:11:13] final stages though [20:11:14] I know. Just hanging out and waiting [20:11:32] * halfak has a big mess of graphs, terminals and browser tabs waiting for testing and monitoring [20:12:17] halfak: done, let the tests begin! 
[20:12:26] (clear the cache first though) [20:12:51] log here, that's important [20:12:59] !log running FLUSHALL on oresrdb1001 T142857 [20:12:59] T142857: ORES format issue (whole score document is cached) - https://phabricator.wikimedia.org/T142857 [20:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:04] (actually that's much more important than the deploy) [20:13:13] OK cache cleared [20:13:28] Response format fixed [20:13:31] Confirmed. [20:13:35] Watching CPU usage [20:13:44] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen [20:15:59] Everything seems nominal [20:16:42] (03CR) 10Paladox: [C: 031] "This is a success." [puppet] - 10https://gerrit.wikimedia.org/r/302980 (owner: 10Chad) [20:16:54] I think that we can declare victory [20:17:18] Oh! There goes CPU [20:17:27] From 20% to 35% [20:19:16] (03PS1) 10Yuvipanda: k8s: Make controller-manager & scheduler be HA [puppet] - 10https://gerrit.wikimedia.org/r/304503 [20:19:53] (03PS2) 10Yuvipanda: k8s: Make controller-manager & scheduler be HA [puppet] - 10https://gerrit.wikimedia.org/r/304503 [20:20:27] (03CR) 10Paladox: "@Muehlenhoff could you review and merge and upload to apt Wikimedia please?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/304486 (owner: 10Chad) [20:22:59] (03PS3) 10Yuvipanda: k8s: Make controller-manager & scheduler be HA [puppet] - 10https://gerrit.wikimedia.org/r/304503 [20:29:27] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wqds200[12] - https://phabricator.wikimedia.org/T142864#2549225 (10RobH) [20:29:50] @Platonides doesn't take shit from anybody. ;D [20:30:27] lol [20:30:28] whenever i see somenoe handling irc stuff so i dont have to, it makes me happy. 
[20:30:49] lol [20:30:51] xD [20:31:11] (PS4) Yuvipanda: k8s: Make controller-manager & scheduler be HA [puppet] - https://gerrit.wikimedia.org/r/304503 (https://phabricator.wikimedia.org/T142862) [20:31:13] (PS1) Yuvipanda: tools: Allow multiple k8s master to access etcd [puppet] - https://gerrit.wikimedia.org/r/304504 (https://phabricator.wikimedia.org/T142862) [20:31:19] this troll doesn't respect any nick! [20:33:14] Looks like ORES didn't suffer much for its cache hit rate during this deploy :) [20:33:14] https://grafana-admin.wikimedia.org/dashboard/db/ores?panelId=12&fullscreen [20:41:03] (PS1) Ottomata: Use require_package and make sure python3-docopt is on analytics cluster clients [puppet] - https://gerrit.wikimedia.org/r/304506 [20:48:36] (CR) Ottomata: [C: 2] Use require_package and make sure python3-docopt is on analytics cluster clients [puppet] - https://gerrit.wikimedia.org/r/304506 (owner: Ottomata) [21:00:52] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2549364 (leila) @mkroetzsch your labs username should be enough. Please don't take any further step unless we... [21:11:27] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.937 second response time [21:46:09] Operations: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2089539 (Dzahn) fwiw, strontium is now dead. but we have rhodium. i wonder if that changed this ticket at all.
[21:55:19] 06Operations, 10Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#971192 (10Dzahn) http://www.freebsddiary.org/smart.php [21:58:24] 06Operations, 10Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#2549452 (10Dzahn) original check_smartmon https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_smartmon/details extension to the above, check_smartmon2 https://exchange.nagios.o... [22:08:30] 06Operations, 10Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#2549510 (10Dzahn) checked if we really don't have one in the distro nagios-plugins-* packages, and there is: @neon:/usr/lib/nagios/plugins# ./check_ide_smart ``` @neon:/usr/lib/nagios/plugins# file... [22:16:23] 06Operations, 10Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#2549571 (10Dzahn) so yea, we already have this plugin installed everywhere across the board: on canary appserver: ``` root@mw1099:/usr/lib/nagios/plugins# ./check_ide_smart /dev/sda2 Id= 1, Status=4... [22:22:41] 06Operations, 10Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#2549581 (10Dzahn) using the -n (nagios) option: [mw1099:~] $ sudo /usr/lib/nagios/plugins/check_ide_smart -n -d /dev/sda2 OK - Operational (17/17 tests passed) (but on jessie there is no -n (?) this i... [22:38:08] 06Operations, 10Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10faidon) That would add at least a round-trip latency on every true miss that hits eqiad/esams/ulsfo (new or just purged page), won't it? 
[22:41:54] (03PS1) 10Dzahn: base/monitoring: add optional SMART disk check [puppet] - 10https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) [22:47:42] (03PS2) 10Dzahn: base/monitoring: add optional SMART disk check [puppet] - 10https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) [22:48:33] (03PS5) 10Yuvipanda: k8s: Make controller-manager & scheduler be HA [puppet] - 10https://gerrit.wikimedia.org/r/304503 (https://phabricator.wikimedia.org/T142862) [22:48:35] (03PS2) 10Yuvipanda: tools: Allow multiple k8s master to access etcd [puppet] - 10https://gerrit.wikimedia.org/r/304504 (https://phabricator.wikimedia.org/T142862) [22:48:37] (03PS1) 10Yuvipanda: labs: Don't open everything to prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/304581 [22:48:47] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/3702/" [puppet] - 10https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) (owner: 10Dzahn) [22:49:46] (03CR) 10Faidon Liambotis: [C: 04-1] Disable unprivileged user namespaces on trusty systems (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: 10Muehlenhoff) [22:52:13] 06Operations, 10Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549720 (10BBlack) On a true miss, yes, it would add a codfw<->eqiad round-trip. That's ~35ms though, which may be hard for MW to beat on average. True miss should be rare though, except whe... 
[22:54:51] (CR) Yuvipanda: [C: 2] k8s: Make controller-manager & scheduler be HA [puppet] - https://gerrit.wikimedia.org/r/304503 (https://phabricator.wikimedia.org/T142862) (owner: Yuvipanda) [22:55:01] (CR) Yuvipanda: [C: 2] tools: Allow multiple k8s master to access etcd [puppet] - https://gerrit.wikimedia.org/r/304504 (https://phabricator.wikimedia.org/T142862) (owner: Yuvipanda) [22:55:08] (CR) Yuvipanda: [C: 2] labs: Don't open everything to prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/304581 (owner: Yuvipanda) [22:55:40] (PS3) Dzahn: base/monitoring: add optional SMART disk check [puppet] - https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) [22:55:59] Operations, Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549747 (BBlack) Oh, re-reading your question, you mentioned specific DCs. In the current layout where only eqiad is "primary", the side-checks from eqiad to codfw would only happen for req... [22:57:25] (PS4) Dzahn: base/monitoring: add optional SMART disk check [puppet] - https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) [23:05:16] Operations, Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549781 (BBlack) Of course, if MW can beat an eqiad<->codfw trip for the same page... we could look at other ways to structure this so it doesn't kick in all the time. Perhaps trigger it on... [23:54:04] (PS24) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501)