[00:04:11] (PS1) BBlack: update-ocsp: refactor validation, check cert life [puppet] - https://gerrit.wikimedia.org/r/232873 (https://phabricator.wikimedia.org/T109737)
[00:08:51] (CR) BryanDavis: "> any opinions? or would it be more reasonable to just assign ::vagrant and ::vagrant::lxc directly to a labs instance and skip the role?" [puppet] - https://gerrit.wikimedia.org/r/230928 (owner: EBernhardson)
[00:12:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[00:15:20] (CR) BBlack: [C: 2] update-ocsp: refactor validation, check cert life [puppet] - https://gerrit.wikimedia.org/r/232873 (https://phabricator.wikimedia.org/T109737) (owner: BBlack)
[00:20:03] hey RoanKattouw, do you remember why we have both foreachwikiindblist and mwscriptwikiset?
[00:20:22] (PS1) BBlack: Revert "disable ocsp updater cron for now" [puppet] - https://gerrit.wikimedia.org/r/232874 (https://phabricator.wikimedia.org/T109740)
[00:20:30] Krenair: No, I don't know
[00:20:50] Krenair: foreachwikiindblist takes a newline-separated list of wikis in a text file, maybe mwscriptwikiset takes its list in some other form?
[00:21:26] parameters are opposite ways around
[00:21:48] foreachwikiindblist tries to read dblist from current directory
[00:22:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[00:22:33] mwscriptwikiset forces them to be relative to $MEDIAWIKI_DEPLOYMENT_DIR, and makes Ctrl-C work
[00:23:30] Oh OK
[00:23:36] So it's a more useful version of foreachwikiindblist
[00:23:51] Then we should probably remove foreachwikiindblist and alias it to mwscriptwikiset
[00:24:15] unless you have a dblist in the current directory you want to use
[00:24:33] but <Krenair> parameters are opposite ways around
[00:25:06] `mwscriptwikiset scriptfile listfile` vs. `foreachwikiindblist listfile scriptfile`
[00:26:05] Urgh
[00:26:41] You could reduce foreachwikiindblist to a wrapper script around mwscriptwikiset that just flips the arguments I suppose?
[00:26:49] But yeah you might also have dblists in your home dir or something
[00:27:50] these are 4 or more years old - https://gerrit.wikimedia.org/r/559
[00:31:59] (CR) BBlack: [C: 2] Revert "disable ocsp updater cron for now" [puppet] - https://gerrit.wikimedia.org/r/232874 (https://phabricator.wikimedia.org/T109740) (owner: BBlack)
[00:32:33] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 1 failures
[00:32:48] operations, Incident-20150820-OCSP, Traffic, Patch-For-Review: ocsp updater: re-enable automatic updates - https://phabricator.wikimedia.org/T109740#1559438 (BBlack)
[00:32:50] operations, Incident-20150820-OCSP, Traffic, Patch-For-Review: ocsp updater: handle openssl "trylater" and similar more-gracefully - https://phabricator.wikimedia.org/T109737#1559437 (BBlack) Open>Resolved
[00:32:58] operations, Incident-20150820-OCSP, Traffic, Patch-For-Review: ocsp updater: re-enable automatic updates - https://phabricator.wikimedia.org/T109740#1557846 (BBlack)
[00:33:00] operations, Incident-20150820-OCSP, Traffic, Patch-For-Review: ocsp updater: validate the signature expiry lifetime - https://phabricator.wikimedia.org/T109738#1559439 (BBlack) Open>Resolved
[00:33:28] operations, Incident-20150820-OCSP, Traffic, Patch-For-Review: ocsp updater: re-enable automatic updates - https://phabricator.wikimedia.org/T109740#1559441 (BBlack) Open>Resolved
[00:40:18] operations: Merge scap scripts "mwscriptwikiset" and "foreachwikiindblist" into one - https://phabricator.wikimedia.org/T109798#1559451 (Krenair) NEW
[00:41:49] operations, Mobile-Apps, RESTBase, Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1559460 (bearND) >>! In T102524#1384229, @GWicke wrote: >> These pertain to MobileWeb, not MobileApps from what I can tell. > > My und...
[00:44:31] (PS3) Alex Monk: Make foreachwiki accept dblist expressions [puppet] - https://gerrit.wikimedia.org/r/232675 (https://phabricator.wikimedia.org/T101213)
[00:46:44] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1559466 (BBlack)
[00:47:52] (PS2) Alex Monk: General maintenance script cleanup [puppet] - https://gerrit.wikimedia.org/r/232871
[00:48:02] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1544965 (BBlack) The hardest parts of the first two checkboxes were resolved with the mobile VCL patches that were merged this morning, and there doesn't appear to be any negat...
[00:50:02] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 647361 msg: ocg_render_job_queue 4224 msg (=3000 critical)
[00:50:13] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 647605 msg: ocg_render_job_queue 4322 msg (=3000 critical)
[00:50:53] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 648240 msg: ocg_render_job_queue 4489 msg (=3000 critical)
[00:57:54] RECOVERY - puppet last run on mw1058 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:12:12] (CR) Ori.livneh: [C: 1] "LGTM, but needs manual rebase." [puppet] - https://gerrit.wikimedia.org/r/232470 (owner: Giuseppe Lavagetto)
[01:40:53] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 677365 msg: ocg_render_job_queue 499 msg
[01:41:04] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 677375 msg: ocg_render_job_queue 410 msg
[01:41:44] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 677428 msg: ocg_render_job_queue 116 msg
[01:44:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[01:54:02] (PS1) Tim Landscheidt: Tools: Add missing motd banner for Toolsbeta's submit host [puppet] - https://gerrit.wikimedia.org/r/232884
[01:54:32] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:56:28] (CR) Chad: [C: -1] "Should've put a -1 with my comment. I really really don't think exposing this via a one-off hack on tin is the right thing to do at all." [puppet] - https://gerrit.wikimedia.org/r/232668 (https://phabricator.wikimedia.org/T71489) (owner: Ori.livneh)
[01:56:33] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 14.81% of data above the critical threshold [100000000.0]
[02:05:12] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[02:06:05] RoanKattouw: so, does gerrit need diffie-hellman-group1-sha1 now or what?
[02:06:27] AaronSchulz: ??
[02:06:53] my "curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256" is suddenly rejected now
[02:07:12] I have no idea if anyone's been messing with it
[02:07:32] * AaronSchulz sees others committing around...hmm
[02:10:07] * AaronSchulz looks at bblack
[02:17:01] * AaronSchulz reads https://code.google.com/p/gerrit/issues/detail?id=3517
[02:21:50] (PS1) Tim Landscheidt: Tools: Puppetize missing intermediate directory [puppet] - https://gerrit.wikimedia.org/r/232886
[02:27:52] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[02:29:23] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 15 not-conn: cp4011_v6
[02:33:32] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 16 ESP OK
[02:34:38] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 11m 19s)
[02:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:40:33] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (101720s 100000s)
[02:43:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:43:52] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[03:15:06] (PS2) Tim Landscheidt: Tools: Puppetize missing intermediate directory [puppet] - https://gerrit.wikimedia.org/r/232886 (https://phabricator.wikimedia.org/T87387)
[03:36:53] PROBLEM - puppet last run on mw2074 is CRITICAL Puppet has 1 failures
[03:41:44] bblack: rolling back openssh fixed my gerrit woes \o/
[03:41:58] rolling back what openssh?
[03:42:13] I guess 7.2 just killed that keyx method, so adding it to .ssh/config wasn't enough
[03:42:22] bblack: on my cygwin ;)
[03:42:26] oh
[03:42:29] heh
[03:42:53] there's a 7.2?
[03:43:00] I only see 7.0 on openssh
[03:43:41] (which was only released ~10 days ago)
[03:44:07] oops, I meant 7.0p1-1
[03:44:17] ah
[03:44:39] the workaround in https://code.google.com/p/gerrit/issues/detail?id=3517 didn't help with that version
[03:44:47] how nice of them to kill weak crypto that's the strongest we can get with some other clients
[03:46:42] actually the release notes don't sound like it should have killed curve25519-sha256
[03:46:58] perhaps yeah it's the kx:
[03:46:59] Support for the 1024-bit diffie-hellman-group1-sha1 key exchange is disabled by default at run-time. It may be re-enabled using the instructions at http://www.openssh.com/legacy.html
[03:47:00] * AaronSchulz had to use -G to notice that my config was ignored
[03:47:35] so maybe one version removed it by default and then 1-1 removed it completely? not sure what was going on there.
[03:47:54] we should support something better too, though
[03:48:14] well, it applied the config, but just ignored the sha1 part, the rest applied as expected
[03:48:21] well like curve25519-sha256@libssh.org
[03:48:28] it's in both sides' configs I think
[03:49:56] maybe the c25519 option doesn't actually work under the cygwin build, so the other (dh) option is the only one in effect?
[03:50:42] supposedly "ssh -Q kex" would tell you
[03:57:26] * AaronSchulz would have to upgrade again and lazy :)
[03:58:07] I do see 'curve25519-sha256@libssh.org' in that output
[03:58:32] for 6.9p1
[04:03:54] RECOVERY - puppet last run on mw2074 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[04:05:12] PROBLEM - Last backup of the maps filesystem on labstore1002 is CRITICAL - Last run result was exit-code
[04:07:27] (PS2) Tim Landscheidt: Tools: Puppetize updatetools [puppet] - https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858)
[04:11:47] (CR) Tim Landscheidt: "Puppetry tested on Toolsbeta for the files and service to be set up." [puppet] - https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: Tim Landscheidt)
[04:19:20] (CR) Tim Landscheidt: "Oh, and for future testing in Toolsbeta I need to have a separate database for Toolsbeta that is selected per $labsproject, but due to me " [puppet] - https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: Tim Landscheidt)
[04:22:43] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 15.38% of data above the critical threshold [100000000.0]
[04:29:23] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (10212 100000s)
[04:36:34] operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1559762 (Moushira)
[04:42:12] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[04:51:13] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=78%)
[05:10:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[06:08:32] PROBLEM - puppet last run on mw2185 is CRITICAL Puppet has 1 failures
[06:11:31] (CR) Giuseppe Lavagetto: service: add deployment_script define (9 comments) [puppet] - https://gerrit.wikimedia.org/r/231790 (owner: Giuseppe Lavagetto)
[06:12:16] (PS6) Giuseppe Lavagetto: service: add deployment_script define [puppet] - https://gerrit.wikimedia.org/r/231790
[06:15:29] https://git.wikimedia.org/ - down?
[06:23:15] It looks like it's down.
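(Editor's aside on the key-exchange thread, [02:06] to [03:58].) The check suggested there, comparing a `KexAlgorithms` preference list against what `ssh -Q kex` prints for the local build, can be sketched roughly as below. This is only an illustration: the helper name `usable_kex` and the sample `ssh -Q kex` output are made up here and are not taken from any machine in the log. Note also that `ssh -Q kex` lists what the build supports, which is not necessarily what is enabled by default.

```python
def usable_kex(configured: str, ssh_q_kex_output: str) -> tuple[list[str], list[str]]:
    """Split a KexAlgorithms preference list into (supported, unsupported).

    configured: comma-separated value as written after `KexAlgorithms` in ssh_config.
    ssh_q_kex_output: text printed by `ssh -Q kex` (one algorithm name per line).
    """
    supported_by_build = {
        line.strip() for line in ssh_q_kex_output.splitlines() if line.strip()
    }
    wanted = [name.strip() for name in configured.split(",") if name.strip()]
    supported = [name for name in wanted if name in supported_by_build]
    unsupported = [name for name in wanted if name not in supported_by_build]
    return supported, unsupported


# Hypothetical `ssh -Q kex` output; a real build will usually list more entries.
sample_output = """\
curve25519-sha256@libssh.org
diffie-hellman-group-exchange-sha256
diffie-hellman-group14-sha1
"""

ok, missing = usable_kex(
    "curve25519-sha256@libssh.org,diffie-hellman-group1-sha1",
    sample_output,
)
print("usable:", ok)
print("not offered by this build:", missing)
```

In practice the second argument would come from running `ssh -Q kex` on the client in question; an empty "usable" list would explain a rejected configuration.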
[06:29:53] PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail
[06:31:53] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:33:24] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:34:22] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:35:44] RECOVERY - puppet last run on mw2185 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:57:22] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0
[06:58:32] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:23] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:22] <_joe_> kart_: sigh, restarting it
[07:06:34] <_joe_> !log restarting gitblit, because it will be decommissioned "soon"...
[07:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:09:03] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61428 bytes in 0.601 second response time
[07:28:07] _joe_: thanks
[07:32:08] <_joe_> kart_: yw - it's my job too
[07:32:38] <_joe_> well, my job would be better done if I redirect git.wm.org to a blackhole instead of to gitblit...
[07:33:53] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=78%)
[07:40:33] (CR) Merlijn van Deen: [C: 1] Tools: Add missing motd banner for Toolsbeta's submit host [puppet] - https://gerrit.wikimedia.org/r/232884 (owner: Tim Landscheidt)
[07:41:24] (CR) Merlijn van Deen: [C: 1] Tools: Puppetize missing intermediate directory [puppet] - https://gerrit.wikimedia.org/r/232886 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[07:52:30] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1559894 (MaxSem) Oops, noticed just now: >>! In T109286#1554694, @BBlack wrote: > 3. The "disable" Cookie doesn't seem to be in use. IIRC it may have been from `ZeroOpts=disa...
[07:53:58] (CR) Merlijn van Deen: [C: -1] Tools: Puppetize updatetools (4 comments) [puppet] - https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: Tim Landscheidt)
[07:59:33] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 14.29% of data above the critical threshold [100000000.0]
[07:59:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[08:28:00] operations, Wikimedia-General-or-Unknown, Availability: Consider using Cassandra/restbase in place of external store - https://phabricator.wikimedia.org/T100705#1559906 (MoritzMuehlenhoff) p:Triage>Normal
[08:28:28] operations, Performance-Team: Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434#1559908 (MoritzMuehlenhoff) p:Triage>Normal
[08:28:54] operations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#1559910 (MoritzMuehlenhoff) p:Triage>Normal
[08:29:23] operations, MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1559911 (MoritzMuehlenhoff) p:Triage>Normal
[08:29:42] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0
[08:36:48] Ops-Access-Requests, operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1559917 (MoritzMuehlenhoff) Resolved>Open Reopen, this isn't resolved yet and should go through the regular approval process.
[08:47:30] operations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#1559946 (Joe) I like this, but I'd like to retain role-based lookups as well, for the reasons @faidon pointed out and others. What would probably make it easier to not mess those up with the rest of the lookups so my proposal...
[08:47:31] (PS1) Faidon Liambotis: Remove/ensure=> absent *_pass scripts [puppet] - https://gerrit.wikimedia.org/r/232903
[08:47:33] (PS1) Faidon Liambotis: Remove class misc::deployment::passwordscripts [puppet] - https://gerrit.wikimedia.org/r/232904
[08:48:21] (CR) jenkins-bot: [V: -1] Remove/ensure=> absent *_pass scripts [puppet] - https://gerrit.wikimedia.org/r/232903 (owner: Faidon Liambotis)
[08:48:44] Ops-Access-Requests, operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1559948 (Krenair) >>! In T109640#1559917, @MoritzMuehlenhoff wrote: > Reopen, this isn't resolved yet and should go through the regular approval process. Looks resolved to me. What approval process i...
[08:51:06] (PS2) Faidon Liambotis: Remove class misc::deployment::passwordscripts [puppet] - https://gerrit.wikimedia.org/r/232904
[08:51:08] (PS2) Faidon Liambotis: Remove/ensure=> absent *_pass scripts [puppet] - https://gerrit.wikimedia.org/r/232903
[08:51:13] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[08:55:38] (PS4) Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253)
[08:57:34] (CR) Filippo Giunchedi: cassandra: WIP support for multiple instances (3 comments) [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: Filippo Giunchedi)
[08:59:49] (CR) Faidon Liambotis: [C: -1] move misc/labsdebrepo out of misc to module (3 comments) [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[09:05:02] (CR) Alex Monk: [C: -1] "modules/scap/files/sqldump:MP=`wikiadmin_pass`" [puppet] - https://gerrit.wikimedia.org/r/232903 (owner: Faidon Liambotis)
[09:05:13] dbstore1002 is ok, lag is expected there and does not affect production traffic
[09:05:28] *sigh*
[09:05:30] and thanks :)
[09:05:52] (PS2) Filippo Giunchedi: Rename swift_new to swift [puppet] - https://gerrit.wikimedia.org/r/231240 (owner: Faidon Liambotis)
[09:06:35] godog: no need to rebase, you can just re-run the commands that are in the commit msg
[09:06:44] and just reuse the Change-Id :)
[09:07:18] heheh that's true, heh
[09:07:36] the compiler is running, if that's fine I'm going to change private too and merge it \o/
[09:08:10] :D
[09:08:41] paravoid, thanks for helping with the cleanup!
[09:08:54] which cleanup?
[09:09:08] the pass scripts
[09:09:19] oh that
[09:09:23] broken, though :)
[09:09:33] as Krenair pointed out :)
[09:10:02] I only found references to that one script
[09:10:08] the others might be ok
[09:10:51] so we could move this to templates/mw-deployment-vars.erb, right?
[09:11:07] although hardcoding passwords like that doesn't thrill me as an idea
[09:11:14] (but the current situation isn't any better)
[09:12:47] that is what I want to change
[09:13:10] * Krenair grumbles about modules/admin/files/home/akosiaris/.my.cnf showing up in every grep
[09:15:53] I thin that would work, yes paravoid
[09:15:55] think*
[09:22:32] paravoid, the other thing that could be done is have it get the password from mediawiki
[09:23:38] how do you mean?
[09:24:13] it uses mwscript to get a database hostname
[09:24:50] could probably get the wikiadmin pass the same way
[09:25:24] interesting
[09:38:51] operations, Wikimedia-Mailing-lists: Delete unused list wiktionary-fr - https://phabricator.wikimedia.org/T109817#1560036 (JohnLewis) p:Triage>Normal
[09:43:05] Ops-Access-Requests, operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1560045 (MoritzMuehlenhoff) Open>Resolved Nevermind, I thought this ticket was also intended to add him to the ops group in modules/admin, so this is in fact fully resolved.
[09:49:36] !log disable puppet on ms-fe1/ms-be1 before merging https://gerrit.wikimedia.org/r/#/c/231240/
[09:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:50:11] (CR) Filippo Giunchedi: [C: 2 V: 2] Rename swift_new to swift [puppet] - https://gerrit.wikimedia.org/r/231240 (owner: Faidon Liambotis)
[09:55:53] PROBLEM - puppet last run on ms-be2013 is CRITICAL Puppet has 2 failures
[10:02:03] RECOVERY - puppet last run on ms-be2013 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[10:05:19] Phabricator down?
[10:06:21] WFM
[10:06:29] ah, it just started working for me again
[10:06:56] for a minute or so it wouldn't connect
[10:07:11] !log enable puppet on ms-fe1/ms-be1
[10:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:07:48] paravoid: ^ \o/ all merged
[10:07:53] \
[10:07:57] \o/
[10:08:51] \o/ \o/ rare sensation to get some closure on puppet!
[10:17:20] (PS1) Alex Monk: Book namespaces for Urdu Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/232915 (https://phabricator.wikimedia.org/T109505)
[10:20:00] <_joe_> oh man
[10:20:11] <_joe_> ganglia_new and swift_new merged in a month
[10:20:17] <_joe_> now hell will freeze over
[10:20:26] heh
[10:20:26] Krenair: so yeah, if you want to fix that like you said, that would work for me :P
[10:20:43] what's next, mysql::core disappearing? Anybody think of the children?
[10:21:25] _joe_: I have changes up in gerrit for stages.pp, deployment.pp and removing nfs.pp stuff
[10:27:49] (PS3) Alex Monk: Remove/ensure=> absent *_pass scripts [puppet] - https://gerrit.wikimedia.org/r/232903 (owner: Faidon Liambotis)
[10:28:37] paravoid, done
[10:28:43] heh
[10:34:03] (CR) Steinsplitter: [C: -1] "this should be on a third party domain if it is a third party service." [dns] - https://gerrit.wikimedia.org/r/232669 (https://phabricator.wikimedia.org/T99216) (owner: Dzahn)
[10:36:03] operations, Wikimedia-DNS, Wikimedia-Video, Patch-For-Review: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1560128 (Steinsplitter) this should be on a third party domain if it is a third party service. privacy issues.
[10:38:19] (PS2) Yuvipanda: Tools: Add missing motd banner for Toolsbeta's submit host [puppet] - https://gerrit.wikimedia.org/r/232884 (owner: Tim Landscheidt)
[10:38:27] (CR) Yuvipanda: [C: 2 V: 2] Tools: Add missing motd banner for Toolsbeta's submit host [puppet] - https://gerrit.wikimedia.org/r/232884 (owner: Tim Landscheidt)
[10:39:01] (PS3) Yuvipanda: Tools: Puppetize missing intermediate directory [puppet] - https://gerrit.wikimedia.org/r/232886 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[10:39:10] (CR) Yuvipanda: [C: 2 V: 2] Tools: Puppetize missing intermediate directory [puppet] - https://gerrit.wikimedia.org/r/232886 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[10:40:40] (CR) Mobrovac: [C: 1] ":thumbup:" [puppet] - https://gerrit.wikimedia.org/r/231790 (owner: Giuseppe Lavagetto)
[10:43:06] (PS1) Alex Monk: Localise Kannada Wikiquote logo and site name [mediawiki-config] - https://gerrit.wikimedia.org/r/232919 (https://phabricator.wikimedia.org/T104260)
[11:04:20] (PS1) Alex Monk: Fix noc.wikimedia.org/db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/232920 (https://phabricator.wikimedia.org/T109045)
[11:06:25] operations, WMF-Legal, Wikimedia-General-or-Unknown, Database: dbtree loads third party resources (from jquery.com) - https://phabricator.wikimedia.org/T96499#1560194 (Krenair)
[11:10:02] operations, WMF-Legal, Wikimedia-General-or-Unknown, Database: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1560200 (Krenair)
[11:13:42] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[11:15:39] operations, WMF-Legal, Wikimedia-General-or-Unknown, Database: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1560208 (Krenair) It's not just jQuery, but also the Google Visualisation API
[11:39:04] operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1560234 (NahidSultan) [[ https://bn.wikipedia.org/w/index.php?title=%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A6%BF%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:%E0%A6%86%E0%A6%B2%E0%A7%8B%E0%A6%9A%E0%...
[13:10:42] PROBLEM - RAID on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:10:42] PROBLEM - salt-minion processes on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:11:38] (CR) Zfilipin: [C: 1] contint: Install chromedriver for running MW-Selenium tests [puppet] - https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: Dduvall)
[13:12:22] PROBLEM - SSH on analytics1021 is CRITICAL - Socket timeout after 10 seconds
[13:12:54] PROBLEM - puppet last run on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:02] PROBLEM - dhclient process on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:13] PROBLEM - configured eth on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:52] operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1560428 (BBlack) Can someone clearly explain exactly what this ticket is about?
[13:13:52] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:57] PROBLEM - jmxtrans on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:52] RECOVERY - puppet last run on analytics1021 is OK Puppet is currently enabled, last run 17 minutes ago with 0 failures
[13:14:52] RECOVERY - dhclient process on analytics1021 is OK: PROCS OK: 0 processes with command name dhclient
[13:15:02] RECOVERY - configured eth on analytics1021 is OK - interfaces up
[13:15:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL 42.86% of data above the critical threshold [10.0]
[13:15:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL 50.00% of data above the critical threshold [10.0]
[13:15:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL 28.57% of data above the critical threshold [10.0]
[13:15:42] RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[13:16:12] what's going on?
[13:16:12] RECOVERY - SSH on analytics1021 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0)
[13:16:15] ottomata: that you?
[13:16:23] RECOVERY - salt-minion processes on analytics1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:16:23] RECOVERY - RAID on analytics1021 is OK no disks configured for RAID
[13:17:50] paravoid: no, just signing on...
[13:17:57] checking
[13:18:01] getting pages
[13:18:08] me too!
[13:18:11] phone was away
[13:19:53] what about analytics1052 + 1056 that have been crit for ~1d in icinga?
[13:21:36] bblack, those are new hadoop worker nodes that came online recently, meant to look at them yesterday, mutante even reminded me, but forgot. those are non critical (hence no pages), but this kafka one is a little more bad. looks like a bad disk, i think system is ok though, other brokers took over
[13:22:17] the critical alerts are non-critical :P
[13:22:27] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162&nostatusheader
[13:22:41] if it shows up there (which it won't if it's downtimed or ACK'd), it's critical!
[13:23:26] I think the odd-looking tilerator entries are because the maps-test hosts are downtimed, but the tilerator service is newer than the host-wide downtime so it's not
[13:23:57] bblack, you are right, i just mean i am prioritizing this kafka thing atm
[13:24:05] will look into the other ones.
[13:25:12] cmjohnson1: yt?
[13:25:18] oh no service-level, just host. will "and all services" the maps-test ones
[13:25:20] i am
[13:25:30] [root@analytics1021:/var/spool/kafka] 1 # ls -l /dev/sdg
[13:25:30] brw-rw---- 1 root disk 8, 96 Jan 27 2015 /dev/sdg
[13:25:30] [root@analytics1021:/var/spool/kafka] # fdisk /dev/sdg
[13:25:30] fdisk: unable to open /dev/sdg: No such device or address
[13:25:36] sdg seems busted on analytics1021
[13:26:48] okay, I will get a new one...still under warranty for 45 more days
[13:26:58] ha, ok. what's the eta?
[13:27:10] Monday
[13:27:32] looking at uranium disk space...
[13:27:33] this is a kafka broker (one of the ones we were going to decom as broker anyway)
[13:27:33] ooo
[13:27:33] hm, ok
[13:27:33] might do a thing on friday that i was going to wait til monday to do then....
[13:27:33] :
[13:27:33] :/
[13:27:33] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[13:27:36] eh?
[13:27:39] hmmm
[13:27:46] oh cause i umounted that disk, puppet just started it
[13:27:47] hm.
[13:28:26] oh, please don't unmount until you provide a megacli report showing the bad pd
[13:28:26] uranium looks like an explosion in the latest ganglia apache log, due to someone hitting it from comcast heavily
[13:28:29] it is what I use to request a new one
[13:30:52] in oregon?
[13:30:52] hm, cmjohnson1 ok. i got I/O errors on it, I can try to remount it
[13:30:52] oh that's just very recent and probably legit
[13:30:52] nah..that's okay
[13:30:52] well maybe
[13:30:52] hm, actually this is cool, haven't seen this done before. kafka died because of disk io errors on one disk. i umounted. kafka started back up, and is working 100% fine for all partitions not on that disk.
[13:30:52] so its partitions that aren't on that disk will just be way out of sync.
[13:30:52] hm.
[13:30:52] RECOVERY - Disk space on uranium is OK: DISK OK
[13:30:52] maybe I can try to move those around today to the new brokers.
[13:30:52] !log wiped ganglia apache access log on uranium, to free up half of the (full) rootfs
[13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:30:57] <_joe_> bblack: there is apache::logrotate::rotate we can set to something less than 52 weeks maybe :)
[13:32:16] it looks like it was rotating semi-regularly, probably on size
[13:33:18] but the currently-writing log (that I wiped) was at 4.2G, way past the size of previous rotated logs
[13:33:27] !log stopping kafka broker on analytics1021 due to bad disk.
[13:34:36] <_joe_> bblack: I'm taking a look then
[13:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:34:53] PROBLEM - Kafka Broker Replica Max Lag on analytics1021 is CRITICAL 100.00% of data above the critical threshold [5000000.0]
[13:34:53] PROBLEM - High load average on labstore1002 is CRITICAL 100.00% of data above the critical threshold [24.0]
[13:35:38] <_joe_> bblack: so just apache access logs?
[13:36:12] cmjohnson1: what do you want from megacli?
[13:36:14] ottomata: i actually can replace that disk today [13:36:30] ! that would be grand. [13:36:41] that would make me feel way safer [13:36:42] We have spares of them...from when ceph was killing disks once a week [13:36:53] otherwise i'd have to start this partition migration that i was intentionally waiting until monday to do [13:37:14] okay, I will ping you once its replaced [13:37:54] cmjohnson1: awesome, eta? just need to time some other things [13:38:43] around 1030ish [13:39:23] PROBLEM - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [13:39:47] _joe_: specifically the ganglia access log [13:39:48] yes! perfect, thank you cmjohnson1 [13:40:05] /var/log/apache2/ganglia.wikimedia.org-access.log [13:41:49] 6operations, 10ops-eqiad: Failed disk analytics1021 Kafka Broker - https://phabricator.wikimedia.org/T109832#1560478 (10Cmjohnson) 3NEW a:3Cmjohnson [13:42:31] ACKNOWLEDGEMENT - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties ottomata /dev/sdg failed. [13:42:34] i appreciate that analytics1021 decided to wait until I had just signed on to start working to let its disk fail. [13:42:41] (03PS1) 10Giuseppe Lavagetto: uranium: don't keep more than two weeks of apache logs [puppet] - 10https://gerrit.wikimedia.org/r/232929 [13:42:47] <_joe_> bblack: ^^ [13:43:21] probably as good idea, but still wouldn't stop the current log with only a few days in it from filling the rootfs? [13:43:47] the historical ones weren't huge, just the current [13:44:19] (which was aug 17 -> 21 basically) [13:44:19] well I guess daily rotation would've broken it up, you're right [13:44:29] but a size limit to make it rotate even faster would be ideal too [13:44:39] hm, still getting paged, do I have to do more than ack? 
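(Aside on the rotation point above: keeping fewer weeks of logs only bounds the rotated files, so the log currently being written also needs an age or size trigger. A minimal sketch of that decision, not the apache::logrotate implementation; the function name and the 1 GiB / one day thresholds are assumptions for illustration.)

```python
def should_rotate(size_bytes, age_seconds,
                  max_bytes=1 << 30, max_age_seconds=24 * 3600):
    """Return True when the live log is either too big or too old.

    Both thresholds are hypothetical defaults, not values from the channel.
    """
    return size_bytes >= max_bytes or age_seconds >= max_age_seconds
```

(With these assumed limits, a 4.2G log that is a few days old trips both conditions, while a small log written a minute ago trips neither.)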
[13:44:43] (03PS1) 10BBlack: ganglia: collapse metric groups by default [puppet] - 10https://gerrit.wikimedia.org/r/232931 [13:45:21] hm, i think that's an old text of mine [13:45:23] <_joe_> bblack: rotation daily [13:45:24] <_joe_> :) [13:45:44] ottomata: if you ack something that's already paged when it went down, it still pages when it comes back up [13:45:58] the ack just takes it off the "critical and not downtimed/acked" list in the web ui [13:46:12] ditto for downtiming something after it already died: it will still page when it comes back up [13:46:38] aye, ok, so i shoudl schedule downtime? [13:47:09] "for this host and all services", if you intend it to be down and flapping for a relatively fixed period of time [13:47:31] or just ack the alerts if they're real alerts and we're now aware and working on them [13:48:14] (03CR) 10BBlack: [C: 031] uranium: don't keep more than two weeks of apache logs [puppet] - 10https://gerrit.wikimedia.org/r/232929 (owner: 10Giuseppe Lavagetto) [13:48:26] (03CR) 10BBlack: [C: 032] ganglia: collapse metric groups by default [puppet] - 10https://gerrit.wikimedia.org/r/232931 (owner: 10BBlack) [13:48:26] aye, but i don't want it to page others while i'm working on it [13:49:40] I don't even know what "it" is, but services are separate from hosts, and if it's going to flap, downtime will keep the flapping at bay for the period of downtime [13:49:44] whereas ack only acks the single failure, doesn't stop more pages from future failures [13:50:07] regardless, anything that pages or channel-alerts a failure, will page or channel alert the related recovery no matter what you change afterwards [13:50:14] aye, have scheduled downtime. [13:50:14] 6operations: Linking a bn.wikipedia.org button to G+ page. 
- https://phabricator.wikimedia.org/T109810#1560501 (10Aklapper) @Moushira: Please describe what this task is about and how to reproduce some problem or where which specific suggestion should be shown and what "WP link ownership" refers to and what kind... [13:50:41] bblack, yes, 2 unrelated things. kafka broker disk failure, and weirdness on 2 new hadoop worker nodes. [13:50:50] hundreds of nagios processes running on one of the new hadoop nodes [13:50:55] load average > 2500 [13:52:34] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL 55.56% of data above the critical threshold [60.0] [13:54:05] apparently my ganglia change did nothing. it's possible that config file is just no longer puppetized in practice :/ [13:54:08] looking into it [13:57:58] !log restarting restbase1001 to apply temporary GC setting [13:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:14] RECOVERY - Persistent high iowait on labstore1002 is OK Less than 50.00% above the threshold [40.0] [13:59:31] (03CR) 10Alexandros Kosiaris: [C: 031] add parsoid/ocg/bastiononly user groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 (owner: 10Dzahn) [13:59:53] (03CR) 10Alexandros Kosiaris: [C: 031] enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 (owner: 10Dzahn) [14:00:47] (03PS2) 10Giuseppe Lavagetto: uranium: don't keep more than two weeks of apache logs [puppet] - 10https://gerrit.wikimedia.org/r/232929 [14:00:58] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] uranium: don't keep more than two weeks of apache logs [puppet] - 10https://gerrit.wikimedia.org/r/232929 (owner: 10Giuseppe Lavagetto) [14:06:04] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce new labs role for vagrant+lxc [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [14:06:11] (03PS3) 10Alexandros Kosiaris: Introduce new labs role for vagrant+lxc [puppet] - 
10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [14:06:17] (03CR) 10Alexandros Kosiaris: [V: 032] Introduce new labs role for vagrant+lxc [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [14:07:40] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1560522 (10NahidSultan) Recently We had a discussion on Bengali Wikipedia Community whether we can add our social network links in bnwp's main page. Now we'd like to add verification badge next to bnwp's websit... [14:08:19] (03PS2) 10Giuseppe Lavagetto: mediawiki: cleanup the config of hhvm on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/232470 [14:09:32] 6operations, 6Services, 5Patch-For-Review: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1560535 (10akosiaris) Uploaded patch merged. The various *oid services have now their logs stored in /srv/log. cxserver should be converted to service::node/service-runner at which point the log con... [14:10:28] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: cleanup the config of hhvm on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/232470 (owner: 10Giuseppe Lavagetto) [14:12:54] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL 77.78% of data above the critical threshold [60.0] [14:13:18] !log rebooting analytics1056 after upgrading kernel to linux-image-3.13.0-61-generic [14:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:33] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [14:17:21] (03CR) 10Alexandros Kosiaris: "merged. 
I 'll fix the naming problems altogether in LDAP+puppet in a later patch" [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [14:20:03] RECOVERY - Persistent high iowait on labstore1002 is OK Less than 50.00% above the threshold [40.0] [14:24:24] RECOVERY - High load average on labstore1002 is OK Less than 50.00% above the threshold [16.0] [14:25:05] Krenair: Etherpad and Planet have named service clusters now? I thought they were on misc. boxen. [14:25:40] ostriches, etherpad1001 and planet1001 [14:25:51] Ah, #til [14:26:36] I'm not sure why. [14:26:44] But those are the names, so... [14:27:43] ganeti VMs [14:28:03] created by akosiaris back in May/June [14:30:47] yup. there was a discussion back then [14:30:58] turns out those were a bit of a mistake [14:31:06] on my part of course [14:31:43] we could reinstall them with misc boxen names. Not sure if it's worth the trouble though [14:32:01] probably not [14:32:17] (03CR) 10Filippo Giunchedi: "3) can't we separate transitioning to a systemd unit out of this change? The "one thing at a time" principle holds here as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [14:35:01] (03CR) 10Giuseppe Lavagetto: "@godog fair enough, I just wanted to minimize change if possible :)" [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [14:35:03] * YuviPanda bugs akosiaris with https://phabricator.wikimedia.org/T107576 again :) [14:36:13] RECOVERY - Hadoop NodeManager on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:36:13] RECOVERY - Hadoop DataNode on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:36:22] YuviPanda: I am honestly not sure what to comment in there [14:36:31] I 've re-read the entire ticket [14:36:43] RECOVERY - salt-minion processes on analytics1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:36:44] I see no addressing of my one single concern [14:37:03] RECOVERY - dhclient process on analytics1052 is OK: PROCS OK: 0 processes with command name dhclient [14:37:46] I thought the concern was 'this might leak private data' and the answer was 'we already have a http setup for similar things, just need an rsync one so there is no additional exposure'? 
[14:39:27] YuviPanda: last comment from ottomata has me thinking otherwise [14:40:03] as in ppl will do it manually, not the setup accessible via HTTP we already got [14:40:32] akosiaris: people put stuff in the http 'manually', except they have to wait for a cron [14:40:33] RECOVERY - salt-minion processes on analytics1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:33] RECOVERY - dhclient process on analytics1056 is OK: PROCS OK: 0 processes with command name dhclient [14:40:42] not sure what the difference is between running rsync and waiting for a cron to run rsync [14:40:53] akosiaris: 'manually' as in 'scp it to their home computers and then back to labs', but that's because the http exporter is a cron that runs once a day or something and people don't want to wait [14:40:53] RECOVERY - Hadoop DataNode on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:40:53] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:41:57] ottomata: I was hoping it was not 'manually' in the first place to put it in http [14:42:15] YuviPanda: yeah I got that, but that's not what I was referring to [14:42:28] akosiaris: they just cp into a directory in /srv, that gets rsynced via cron [14:42:36] to datasets.wm.org on stat1001 [14:43:01] so we are actually just waiting for someone to cp something wrong by mistake ? [14:43:20] scratch the "wrong" word from there. bad translation [14:43:33] yup, but what would you prefer? 
: [14:43:47] something with less human interference to be honest [14:43:55] but that's not the scope of that task [14:44:07] - productionize every one off visualization service that people use in labs [14:44:07] - puppetize every one off dataset that people generate in hadoop / stat* boxes [14:44:51] none of those 2 has anything to do with that [14:45:03] !!log restarting each of analyitcs1050-analytics1056 to load newer kernel version [14:45:15] akosiaris: folks generate from hadoop [14:45:20] anyway, I get we want ppl to have access to /srv/datasets.wikimedia.org [14:45:22] for consumption by their teams, managers, public, whoever [14:45:25] but in labs [14:45:29] akosiaris: not really. [14:45:41] random data then ? [14:46:20] newly generated data sounds better btw [14:46:23] haha, i just read this page: https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue [14:46:26] 'Banish 'Random' from Your Vocabulary [14:46:27] ' [14:46:27] heheheh [14:47:00] akosiaris: use case: a researcher is tasked by lila to generate a graph. [14:47:18] they use hadoop to aggregate some data, this data has no PII and can be public [14:47:32] (03PS6) 10Giuseppe Lavagetto: Introduce ConfigurationObserver class [debs/pybal] - 10https://gerrit.wikimedia.org/r/230931 (owner: 10Ori.livneh) [14:47:36] now they want to use some fancy custom thing to share the graph [14:47:39] and that they can do in labs [14:48:08] currently, they have to scp to laptop and then scp back to labs [14:48:21] so it has nothing to do with halfak's last comment in https://bn.wikipedia.org/w/index.php?title=%E0%A6%AA%E0%A7%8D%E0%A6%B0%E0%A6%A7%E0%A6%BE%E0%A6%A8_%E0%A6%AA%E0%A6%BE%E0%A6%A4%E0%A6%BE&oldid=1889460 [14:48:24] sigh [14:48:26] wrong link [14:48:27] annoying, but fine for small datasets. 
this ticket was motivated by a larger one that ellery wanted to copy [14:48:48] I meant https://phabricator.wikimedia.org/T107576#1519641 [14:49:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "I added inotify.IN_ATTRIB so that it's possible to trigger a reload of the config by using touch(1)." [debs/pybal] - 10https://gerrit.wikimedia.org/r/230931 (owner: 10Ori.livneh) [14:49:16] akosiaris: that would be a way to solve this problem, but i think it would be less than ideal. [14:49:22] that workflow would be: [14:49:27] well, first [14:49:41] datasets is less for one offs, even though there are probbably one offs inthere [14:49:47] but, that workflow would be: [14:50:02] ok ok one offs [14:50:09] cp into /srv/aggregate-datasets (or whatever it is), wait for cron to rsync -> stat1001. wait for cron on stat1001 to rsync to labstore [14:50:09] I think I got your point now [14:50:35] much better to allow direct rsync to labs of public data [14:50:37] aye :) [14:51:14] !llog rebooting each of analytics1050-analytics1056 to apply newer kernel [14:52:45] * akosiaris still pondering what to do [14:53:43] RECOVERY - RAID on ms-be2009 is OK optimal, 13 logical, 13 physical [14:55:36] (03CR) 10Tim Landscheidt: Tools: Puppetize updatetools (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [14:56:14] ottomata: i swapped the disk if you want to remount [14:56:20] oo! [14:56:21] k [14:56:49] cmjohnson1: i assume it is unformatted? [14:57:00] i would assume that [14:57:11] hmmm cmjohnson1 fdisk: unable to open /dev/sdg: No such device or address [15:02:51] cmjohnson1: you lookin? [15:04:42] (03CR) 10Merlijn van Deen: Tools: Puppetize updatetools (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [15:08:52] oh, no I wasn't. 
Had to go p/up some new servers [15:09:20] i see it on the controller [15:09:25] cmjohnson1: should I reboot the node? [15:09:32] that worked last time iirc [15:09:44] k, will try. on the way i'm going to turn on hyperthreading :) [15:09:54] k [15:10:09] !log rebooting kafka broker analytics1021 to hopefully reload /dev/sdg with new disk, also will turn on hyperthreading [15:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:57] 6operations, 10Wikimedia-Mailing-lists: Delete unused list wiktionary-fr - https://phabricator.wikimedia.org/T109817#1560628 (10Dzahn) a:3Dzahn [15:12:32] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:47] pssh, missed the bios prompt in time i guess....oh well. we are decomming this node next week [15:13:53] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [15:14:50] 6operations, 10Wikimedia-Mailing-lists: Delete unused list wiktionary-fr - https://phabricator.wikimedia.org/T109817#1560631 (10Dzahn) I disabled it. I ran the commands from our new script to disable lists in a unified manner. That means: advertised=0 (not on listinfo page anymore) emergency=1 (emergency mode... [15:14:53] RECOVERY - Kafka Broker Server on analytics1021 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [15:17:49] 6operations, 10Wikimedia-Mailing-lists: Delete unused list wiktionary-fr - https://phabricator.wikimedia.org/T109817#1560634 (10Dzahn) @JohnLewis testing our script under "real life" conditions (on sodium which has tons of data in ./data) it seemed to hang at first and just not finish or throw an error. so at...
[15:18:00] (03CR) 10Ori.livneh: [C: 032] "Yay, thanks _joe_" [debs/pybal] - 10https://gerrit.wikimedia.org/r/230931 (owner: 10Ori.livneh) [15:18:17] 6operations, 10ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1560635 (10Papaul) @Fgiunchedi will have drive on site on Monday. [15:19:21] 6operations, 10Wikimedia-Mailing-lists: Delete unused list wiktionary-fr - https://phabricator.wikimedia.org/T109817#1560640 (10Dzahn) 5Open>3Resolved @Darkdadaah see above no more mails should be sent for this list and it's not advertised anymore. yet the archives are still here for history [15:19:34] (03CR) 10Tim Landscheidt: Tools: Puppetize updatetools (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [15:24:47] (03Merged) 10jenkins-bot: Introduce ConfigurationObserver class [debs/pybal] - 10https://gerrit.wikimedia.org/r/230931 (owner: 10Ori.livneh) [15:26:31] (03PS1) 10Giuseppe Lavagetto: Add HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/232941 [15:27:38] cmjohnson1: sdg up and going, thank you :) [15:27:55] 6operations, 10Wikimedia-Mailing-lists: Delete unused list wiktionary-fr - https://phabricator.wikimedia.org/T109817#1560648 (10Darkdadaah) That was quick, thank you! 
[15:28:43] (03CR) 10Tim Landscheidt: Tools: Puppetize updatetools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [15:39:27] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1560668 (10Dzahn) 3NEW a:3Dzahn [15:41:27] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1560668 (10Dzahn) /var/lib/mailman/data on sodium has an extreme number of files root@sodium:/var/lib/mailman/data# find . | wc -l 585015 584975 of them are h... [15:46:11] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1560695 (10GWicke) @bblack, I'm curious about your thoughts on enabling the zero detection in text varnishes. Do you think that's a good... [15:52:05] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1560705 (10BBlack) Yeah, I think it is, and I think we're going there for reasons orthogonal to RB anyways (see also T89177). Basically,... [15:56:32] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1560726 (10BBlack) >>! In T109286#1559894, @MaxSem wrote: > Oops, noticed just now: > >>>! In T109286#1554694, @BBlack wrote: >> 3. The "disable" Cookie doesn't seem to be in us... 
[15:57:12] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK Less than 1.00% above the threshold [1.0] [15:58:37] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1560729 (10Dzahn) also see: T83967 for a similar task in the past --- 08:53 i'm afraid i can't paste that on phab 08:54 it's 17MB plai... [15:59:58] 6operations, 10ops-codfw: mw2180 has a faulty disk - https://phabricator.wikimedia.org/T109687#1560732 (10Papaul) @Joe I will have the drive on site on Monday. [16:01:52] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1560734 (10Dzahn) talking with Robh we agreed it's best to delete all messages older than X and also we could find out which lists don't have active admins/mods... [16:02:06] grrrit-wm: needs coffee [16:02:35] (I think it's been borked since it rejoined half an hour ago?) [16:03:23] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1560737 (10GWicke) @bblack: Thanks, that's great news. I'll mark this task as a dupe of T89177, then. [16:03:43] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK Less than 1.00% above the threshold [1.0] [16:03:46] * Krenair will poke it [16:04:12] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 25.93% of data above the critical threshold [100000000.0] [16:06:43] !log checksumming dewiki database, higher write rate/dbstore lag expected temporarily [16:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:37] ^this has the same impact than a batched update, and it throttles automatically in case of lag, but you never know... 
[16:08:20] I am testing it site-wide manually before confirming it can be run unattended [16:10:52] (you probably do not know what I am talking about, but ELI5: this is like Christmas for DBAs) [16:13:10] merry christmas! [16:14:42] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1560763 (10greg) @JAlexander: This is in regards to Google Webmaster Tools, hence the ping for you. [16:23:52] RECOVERY - Host mc2001 is UP: PING OK - Packet loss = 0%, RTA = 52.21 ms [16:24:04] (03CR) 10Alex Monk: "test 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232920 (https://phabricator.wikimedia.org/T109045) (owner: 10Alex Monk) [16:24:11] okay, worked the second time [16:24:17] (03CR) 10Tim Landscheidt: "I didn't want to set up a complete toolsbeta-static, so I tested the principle on toolsbeta-webproxy-01 with:" [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt) [16:24:59] 6operations, 10ops-eqiad: Failed disk analytics1021 Kafka Broker - https://phabricator.wikimedia.org/T109832#1560791 (10Cmjohnson) 5Open>3Resolved Replaced the disk with good spare and will swap the failed disk with new when it arrives on Monday. [16:27:57] (03PS2) 10Andrew Bogott: Add root keypair connecting labvirt hosts. [puppet] - 10https://gerrit.wikimedia.org/r/232948 [16:28:32] PROBLEM - Host mc2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:05] (03CR) 10Andrew Bogott: [C: 032] Add root keypair connecting labvirt hosts. [puppet] - 10https://gerrit.wikimedia.org/r/232948 (owner: 10Andrew Bogott) [16:31:38] (03CR) 10Merlijn van Deen: "Shouldn't it still also require both?"
[puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt) [16:34:26] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra internode TLS encryption - https://phabricator.wikimedia.org/T108953#1560850 (10fgiunchedi) > 4K is puppet's default unless `keylength` is specified, https://docs.puppetlabs.com/references/latest/configuration.html#keylength (default changed in https:/... [16:38:12] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0] [16:45:52] RECOVERY - Host mc2001 is UP: PING OK - Packet loss = 0%, RTA = 51.93 ms [16:51:20] (03CR) 10Tim Landscheidt: "I thought so once as well, but @ori pointed out to me that subscribe is a superset of require; cf. http://docs.puppetlabs.com/references/3" [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt) [17:02:57] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1561000 (10mobrovac) 3NEW [17:03:06] akosiaris: ^^ [17:04:59] 6operations, 10ops-eqiad: Change racktables entries for renamed analytics -> kafka names - https://phabricator.wikimedia.org/T109856#1561012 (10Ottomata) 3NEW a:3Cmjohnson [17:07:09] 6operations, 10ops-eqiad: Failed disk analytics1021 Kafka Broker - https://phabricator.wikimedia.org/T109832#1561023 (10Ottomata) Thanks so much for the quick turnaround on this! Having a working disk asap saved me a lot of dangerous work I didn't want to start on a Friday. :) Is there anything particularly...
[17:07:43] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Replace dbrant with mholloway for MobileApps production access - https://phabricator.wikimedia.org/T109857#1561024 (10mobrovac) 3NEW [17:10:22] PROBLEM - Host mc2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:12] (03CR) 10Merlijn van Deen: [C: 031] Tools: Only execute cdnjs-packages-gen on changes [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt) [17:11:22] PROBLEM - puppet last run on analytics1056 is CRITICAL puppet fail [17:12:53] (03CR) 10Merlijn van Deen: [C: 031] Tools: Puppetize updatetools [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [17:14:01] (03CR) 10Merlijn van Deen: "Hm, could you test one more thing? Does this correctly handle utf8 toolsinfo.json contents?" [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [17:17:02] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [17:17:48] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561062 (10RobH) So update from out of band email converstations with Joel/HR and our hangout this morning. Joel (as OIT) is currently implementing a script into the google sheets with HR... [17:17:48] RECOVERY - Host mc2001 is UP: PING OK - Packet loss = 0%, RTA = 53.28 ms [17:19:03] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561074 (10RobH) Also since this will be automated, we do not need to add any more ops members to viewing those sheets. Once this process is fully ironed out and the automatic notificatio...
[17:19:15] RECOVERY - puppet last run on analytics1056 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:19:34] 6operations, 10ops-codfw: mc2001 not coming up after reboot - https://phabricator.wikimedia.org/T102222#1561078 (10Papaul) 5Open>3Resolved After reboot the system was going back to PXE boot. changed the boot order not to but from PXE but HDD. This system is back up. [17:22:15] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561090 (10JKrauska) @robh I disagree about limited access to the sheets. As Daniel points out, the scripts will eventually/occasionally fail, and we need robust backup options to get as... [17:23:58] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561100 (10RobH) @Jkrauska: You are saying you think every single ops should have access? That was my original stance as well, but then we're asking HR to maintain a ~20 person access lis... [17:27:01] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561115 (10RobH) Also, I didn't realize that IT was asking for Ops support in writing the scripts. If thats the case we'll have to bring this up at the ops meeting. [17:27:49] (03CR) 10Tim Landscheidt: "If you point me to/provide me with an example, sure :-)." [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [17:32:03] 6operations, 7Mail: add kfrancis to trademarks@ alias - https://phabricator.wikimedia.org/T109736#1561159 (10eliza) Hello Daniel, Just confirmed with Manprit that she would like the Trademark list to be handled on the Google side of things. But before you delete - let's wait for @JKrauska to say when. He's ou... 
[17:32:39] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561160 (10JKrauska) @robh Not everyone, but anyone who regularly handles off boarding tasks, for sure. If Ops is going to depend on these scripts, it would be good for them to understand... [17:32:41] 6operations, 7Mail: add kfrancis to trademarks@ alias - https://phabricator.wikimedia.org/T109736#1561161 (10eliza) 5Resolved>3Open [17:33:33] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561175 (10RobH) The entire stance of HR was they didn't want to give ops access as a whole group, and that was the impression i just got from them during our meeting. It seems we aren't... [17:34:06] (03PS1) 10Andrew Bogott: Use ssh::userkey rather than trying to get the pub key installed by hand. [puppet] - 10https://gerrit.wikimedia.org/r/232960 [17:34:43] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1561187 (10RobH) Again, I don't think anyone in ops wants to start maintaining google sheet scripts. I know I certainly do not. The reason I supported this idea was I thought IT was supp... [17:34:48] cajoel: if you are areound we can actually chat [17:34:51] rather than back and forth on the task [17:34:55] (03CR) 10Merlijn van Deen: "I added some unicode emoticons plus an emoji to tools.gerrit-reviewer-bot:" [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [17:35:07] but i didnt get the impression you wanted ops to help with the script [17:35:20] i dont think we have any interest in supporting HR on that level [17:35:22] not ops domain. [17:35:33] 6operations, 7Mail: add kfrancis to trademarks@ alias - https://phabricator.wikimedia.org/T109736#1561197 (10eliza) Sorry Daniel - just grasping on how to use PHAB. 
Wanted to re-open this ticket in order to message you on the new developments. (above). Let me know if I'm doing this correctly. :) Eliza [17:39:31] 6operations, 7Mail: Move trademark@ alias to Google Mail - https://phabricator.wikimedia.org/T109868#1561222 (10JohnLewis) 3NEW a:3JKrauska [17:39:38] (03CR) 10Andrew Bogott: [C: 032] Use ssh::userkey rather than trying to get the pub key installed by hand. [puppet] - 10https://gerrit.wikimedia.org/r/232960 (owner: 10Andrew Bogott) [17:41:11] 6operations, 7Mail: add kfrancis to trademarks@ alias - https://phabricator.wikimedia.org/T109736#1561236 (10JohnLewis) 5Open>3Resolved @eliza Keeping tasks focused on a single goal is best and helps manage work flows easier. As such I've created T109868 for moving the alias to Google Mail and assigned it... [17:42:33] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:43:00] (03PS1) 10Andrew Bogott: Avoid duplicate definition for Ssh::Userkey[root] [puppet] - 10https://gerrit.wikimedia.org/r/232961 [17:43:04] (03PS1) 10Legoktm: Use wfLoadExtension() directly for loading some extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232962 [17:43:06] (03PS1) 10Legoktm: Use wfLoadSkin(s) to load all skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232963 [17:44:09] legoktm: Omm-gee. [17:44:14] :) [17:44:42] legoktm: Why not VisualEditor? ;-P [17:45:03] more patches are coming :) [17:45:10] Good. 
:-) [17:45:33] PROBLEM - puppet last run on labvirt1005 is CRITICAL puppet fail [17:45:44] (03PS3) 10Thcipriani: Add servicedeploy user; Modifiy keyholder service [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [17:46:33] PROBLEM - puppet last run on labvirt1006 is CRITICAL puppet fail [17:49:12] (03Abandoned) 10Andrew Bogott: Avoid duplicate definition for Ssh::Userkey[root] [puppet] - 10https://gerrit.wikimedia.org/r/232961 (owner: 10Andrew Bogott) [17:49:33] (03PS1) 10Andrew Bogott: Revert "Use ssh::userkey rather than trying to get the pub key installed by hand." [puppet] - 10https://gerrit.wikimedia.org/r/232965 [17:50:23] PROBLEM - puppet last run on labvirt1009 is CRITICAL puppet fail [17:51:26] (03CR) 10Andrew Bogott: [C: 032] Revert "Use ssh::userkey rather than trying to get the pub key installed by hand." [puppet] - 10https://gerrit.wikimedia.org/r/232965 (owner: 10Andrew Bogott) [17:53:22] RECOVERY - puppet last run on labvirt1005 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:53:54] PROBLEM - puppet last run on labvirt1007 is CRITICAL puppet fail [17:56:12] PROBLEM - NTP on mc2001 is CRITICAL: NTP CRITICAL: No response from NTP server [17:57:45] (03PS1) 10Legoktm: Load some more extensions directly through wfLoadExtension() (B-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232966 [17:59:22] James_F: ok, there were a lot more of these than I was expecting :P [18:06:03] RECOVERY - puppet last run on labvirt1006 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:17:26] 6operations, 7Mail: add kfrancis to trademarks@ alias - https://phabricator.wikimedia.org/T109736#1561358 (10Dzahn) @eliza no worries. yea, we have to wait for @jkrauska to tell us when we can delete it or it would break. let's use T109868 to continue. 
[18:17:43] RECOVERY - puppet last run on labvirt1009 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:18:32] 6operations, 7Mail: Move trademark@ alias to Google Mail - https://phabricator.wikimedia.org/T109868#1561361 (10Dzahn) [18:21:13] RECOVERY - puppet last run on labvirt1007 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:21:31] (03CR) 1020after4: [C: 031] Add servicedeploy user; Modifiy keyholder service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [18:23:51] (03PS1) 10Tim Landscheidt: Tools: Remove obsolete entries from host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) [18:26:27] (03CR) 10Tim Landscheidt: "tools-login wasn't in "qconf -sh", and I could safely remove it from "qconf -ss" and check that submitting jobs from tools-login.wmflabs.o" [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) (owner: 10Tim Landscheidt) [18:33:27] (03CR) 10Tim Landscheidt: "http://tools.wmflabs.org/#toollist-gerrit-reviewer-bot looks okay to me. Were you only concerned with the input part, i. e. if updatetool" [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [19:03:49] ottomata: ping [19:04:37] hiya [19:05:01] hey, I have some questions about the event bus work next quarter [19:05:09] mind a quick h-o? [19:05:23] not at all! [19:05:32] wanna come to analytics batcave? :) [19:05:32] cool [19:05:38] https://plus.google.com/hangouts/_/wikimedia.org/a-batcave [19:15:18] 6operations, 10ops-ulsfo: RIPE Atlas Anchor @ ulsfo is down - https://phabricator.wikimedia.org/T107691#1561579 (10RobH) It seems that we lost power on the B side tower of cabinet 1.22 in ulsfo on the 1st, resulting in the atlas going offline. 
I've put in a ticket to get power restored, as its an unplanned fa... [19:15:39] 6operations, 10ops-ulsfo: RIPE Atlas Anchor @ ulsfo is down - https://phabricator.wikimedia.org/T107691#1561580 (10RobH) I'm intentionally leaving it in place, as its the easiest way to confirm power is restored. [19:18:22] 6operations: Get rid of all Ubuntu Lucid (10.04) installs - https://phabricator.wikimedia.org/T80945#1561582 (10Dzahn) a:3Dzahn [19:18:40] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1561584 (10Dzahn) [19:18:42] 6operations: Get rid of all Ubuntu Lucid (10.04) installs - https://phabricator.wikimedia.org/T80945#881579 (10Dzahn) [19:20:41] 6operations, 10Wikimedia-Mailing-lists: go through all directories in /var/lib/mailman and decide if migration is needed - https://phabricator.wikimedia.org/T109399#1561586 (10JohnLewis) Below is my final compiled list of what should and shouldn't go. ``` archives; yes bin; no cgi-bin; no cron; no data; only... 
[19:24:43] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1561598 (10Dzahn) [19:24:45] 6operations, 10Wikimedia-Mailing-lists: go through all directories in /var/lib/mailman and decide if migration is needed - https://phabricator.wikimedia.org/T109399#1561597 (10Dzahn) 5Open>3Resolved [19:35:49] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1561632 (10Dzahn) [19:36:07] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1450894 (10Dzahn) [19:37:02] PROBLEM - Certificate expiration on nembus is CRITICAL: SSL CRITICAL - Certificate ldap-codfw.wikimedia.org valid until 2015-09-20 19:36:03 +0000 (expires in 29 days) [19:41:32] PROBLEM - Certificate expiration on neptunium is CRITICAL: SSL CRITICAL - Certificate ldap-eqiad.wikimedia.org valid until 2015-09-20 19:41:02 +0000 (expires in 29 days) [19:41:40] ^ https://phabricator.wikimedia.org/T103590 [19:41:50] upped prio yesterday because of that [19:43:26] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1561658 (10Dzahn) 12:38 < icinga-wm> PROBLEM - Certificate expiration on nembus is CRITICAL: SSL CRITICAL - Certificate ldap-codfw.wikimedia.org valid until 2015-09-20 19:36:03 +0000... 
[19:52:33] (03PS1) 10Ori.livneh: Drop inotify; improve documentation and handling of configuration formats [debs/pybal] - 10https://gerrit.wikimedia.org/r/233034 [19:52:35] (03PS1) 10Ori.livneh: Make util.py PEP8-compliant [debs/pybal] - 10https://gerrit.wikimedia.org/r/233035 [19:52:38] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1561686 (10Dzahn) https://rt.wikimedia.org/Ticket/Display.html?id=9504 [19:53:00] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1561687 (10Dzahn) 3NEW a:3Dzahn [19:53:23] 6operations, 10Wikimedia-Mailing-lists: announce mailman downtime - https://phabricator.wikimedia.org/T109891#1561694 (10Dzahn) 3NEW a:3Dzahn [19:57:54] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK Less than 1.00% above the threshold [1.0] [19:57:55] (03CR) 10Ori.livneh: Drop inotify; improve documentation and handling of configuration formats (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/233034 (owner: 10Ori.livneh) [20:08:42] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 73.37 ms [20:12:27] robh: ^ success [20:12:56] we have the power! [20:14:12] 6operations, 10ops-ulsfo: RIPE Atlas Anchor @ ulsfo is down - https://phabricator.wikimedia.org/T107691#1561768 (10RobH) 5Open>3Resolved Back online after we fixed the b tower power. [20:14:26] the fact we lost power to one side for that long and unitedlayer didnt notice is not good [20:14:29] but meh, working with them to fix [20:14:40] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1561770 (10JohnLewis) @akosiaris can we modify and reinstall the current VM or do we need to delete it and re-create it from scratch with a vlan?
[20:14:59] * Nemo_bis likes ripe-atlas icinga checks [20:16:46] (03PS1) 10Ori.livneh: Make configuration parsing maximally forgiving of minor errors [debs/pybal] - 10https://gerrit.wikimedia.org/r/233043 [20:17:02] (03CR) 10jenkins-bot: [V: 04-1] Make configuration parsing maximally forgiving of minor errors [debs/pybal] - 10https://gerrit.wikimedia.org/r/233043 (owner: 10Ori.livneh) [20:17:36] 6operations, 10Wikimedia-Mailing-lists: announce mailman downtime - https://phabricator.wikimedia.org/T109891#1561783 (10Dzahn) who to notify: wikitech-l - https://lists.wikimedia.org/mailman/listinfo/wikitech-l list of listadmins - https://lists.wikimedia.org/mailman/listinfo/listadmins tech newsletter (Gui... [20:17:52] (03PS2) 10Ori.livneh: Make configuration parsing maximally forgiving of minor errors [debs/pybal] - 10https://gerrit.wikimedia.org/r/233043 [20:18:13] 6operations, 10Wikimedia-Mailing-lists: announce mailman downtime - https://phabricator.wikimedia.org/T109891#1561788 (10Dzahn) maintenance window, max length: 4 hours ? [20:20:02] 6operations, 10Wikimedia-Mailing-lists: announce mailman downtime - https://phabricator.wikimedia.org/T109891#1561792 (10JohnLewis) 4 hours seems fair of a maintenance window. Looking at the plan we've drafted [need to make public :)]; we're just doing a 'what has changed' rsync after stopping mailman and hold... [20:26:02] 7Blocked-on-Operations, 10Flow, 3Collaboration-Team-Current, 5Patch-For-Review, and 2 others: Separate reference tables by wiki - https://phabricator.wikimedia.org/T107204#1561814 (10DannyH) [20:27:14] (03CR) 10BBlack: [C: 031] "Looks conceptually sane, and I think the switch from inotify() is a positive() one." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/233034 (owner: 10Ori.livneh) [20:30:28] (03CR) 10Ori.livneh: [C: 032] Drop inotify; improve documentation and handling of configuration formats [debs/pybal] - 10https://gerrit.wikimedia.org/r/233034 (owner: 10Ori.livneh) [20:31:32] bblack: how good would gdnsd cope with a TTL of less than a second ;) [20:31:38] mutante: ^ :P [20:31:42] *how well [20:31:55] ori: bah fine [20:32:00] :P [20:32:28] awesomely, because it will ignore you by default :) [20:32:49] if we change it of course :) [20:32:54] (the default) [20:33:27] there's an options => { min_ttl => X } which defaults to 5 [20:33:42] so if you don't change that, it will clamp anything <5 to 5. [20:34:17] (03Merged) 10jenkins-bot: Drop inotify; improve documentation and handling of configuration formats [debs/pybal] - 10https://gerrit.wikimedia.org/r/233034 (owner: 10Ori.livneh) [20:34:18] but also, the DNS protocol in general doesn't have resolution lower than 1s, and 0s is likely to just break something or other. [20:34:46] min_ttl doesn't even allow the zero value in gdnsd [20:35:10] so we break things? awesome - mutante I think I found our TTL ;) [20:35:14] 30 seconds or something should be good enough ?:) [20:35:15] (03CR) 10Ori.livneh: [C: 032] "Purely cosmetic change." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/233035 (owner: 10Ori.livneh) [20:35:27] JohnFLewis: haha [20:35:30] (03Merged) 10jenkins-bot: Make util.py PEP8-compliant [debs/pybal] - 10https://gerrit.wikimedia.org/r/233035 (owner: 10Ori.livneh) [20:35:42] he said 0 isn't allowed :) [20:35:53] it will just warn you in syslog and clamp it to 5 basically [20:36:06] either that or fail loading the zonefile update, depending on other settings [20:37:18] seriously; I think 1m would be enough if we like playing things extremely safe but the usual 5m we use would be sufficient [20:37:22] mutante: ^ [20:37:41] yea, agreed [20:37:46] !log ori@tin Synchronized php-1.26wmf19/includes: I1eb8dfc: Revert Count API and hook calls, with 1:1000 sampling (duration: 01m 09s) [20:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:04] even down to 10s or so is probably sane, but below that it's not worth it IMHO in the general case [20:38:20] just don't forget the TTL continues to apply for whoever last fetched it [20:38:50] mutante: shall we leave TTL as-is for now or merge a patch to bring it down to 5m so that's one less thing we need to focus on later? [20:39:10] eitherway actually, I'll make a patch now so it can be merged when ever [20:39:12] so you want to reduce that 1H TTL at least 1H *before* the real change of address [20:39:33] probably better to do it in stages considerably before, though, because some broken caches basically ignore TTLs [20:39:49] maybe drop it to 5 or 10 minutes the day before? [20:39:58] JohnFLewis: yes, uploading a patch and one day before [20:40:02] what both of you said :) [20:40:06] ok [20:40:43] you could drop it down further to e.g. 5s or 10s, several minutes before the real switch, and make it smooth for conforming clients. [20:41:30] ok, let's do that. 1 hour to 5 minutes to 10 seconds [20:41:35] the whole reason for the default 5s floor is that, below that you can run into real issues. 
like adding up a bad client latency + packet loss scenario and normal retry-backoff or TCP fallback.... [20:41:57] you could get a client who can't refresh a 5s TTL fast enough to prevent total loss of lookup [20:42:25] *nod*. thanks for the advice. i'd stick to 10s [20:42:28] and then throw in possibly multiple intermediary caches. it's a mess [20:42:49] yeah [20:43:18] (03PS1) 10John F. Lewis: lists: lower A[AAA] records to 5M [dns] - 10https://gerrit.wikimedia.org/r/233049 [20:44:26] mutante: so dns wise - thats is apart from we need 2 A and 2 AAAAs when fermium is reinstalled [20:45:13] bblack: ptr wise - we can have 2 A / 2 AAAAs returning lists.wikimedia.org right with zero chance of any issues correct? :) [20:45:30] eh? [20:46:00] mutante: eh? :) [20:46:01] JohnFLewis: yes, correct. [20:46:12] nevermind, i was just slow parsing that [20:46:22] I'd go ahead and just set that part up ahead of time [20:46:26] half listening to the talk [20:47:14] mutante: on that note; shall I steal [well reserve :)] service IPs for list.wm.o for use on fermium then we can patch hiera then we're just waiting on fermium to reinstall really [20:47:45] JohnFLewis: yes, so we know the right row meanwhile [20:47:51] looking it up [20:48:07] public1-eqiad-c according to alex [20:48:15] yes, ack [20:51:32] JohnFLewis: i'll go back to the "clean up ./data/ dir" stuff then [20:52:06] finding the worst offenders, most messages who never got moderated [20:54:12] okay :) [20:54:39] and renaming those 2 lists with bad names [20:54:53] we dont want those issues on the actual final import [20:55:21] why did we want to copy "qfiles" again? [20:55:44] oh, and a small script that lists what we decided to copy and whatnot so we don't have to check that again [20:55:58] the rsync commands [20:58:13] (03PS1) 10John F. Lewis: lists: add service IPs for lists on fermium [dns] - 10https://gerrit.wikimedia.org/r/233050 [20:59:14] mutante: mailman uses qfiles for queuing. 
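Editorial sketch of the TTL advice above: gdnsd's `min_ttl` floor (default 5) clamps anything lower, and the plan is to step the record from 1 hour to 5 minutes to 10 seconds before the address change. This is not gdnsd code. `clamp_ttl` and `lowering_deadlines` are hypothetical helper names, and the switch time is invented for illustration. The deadlines it prints are the strict minimum for caches that honor TTLs; the discussion recommends lowering much earlier (for example the day before) because some caches ignore them.

```python
from datetime import datetime, timedelta

# Floor described in the discussion: TTLs below min_ttl are clamped up to it.
DEFAULT_MIN_TTL = 5


def clamp_ttl(ttl_seconds, min_ttl=DEFAULT_MIN_TTL):
    """Return the TTL that would be served once the floor is applied."""
    return max(ttl_seconds, min_ttl)


def lowering_deadlines(switch_time, ttl_stages):
    """Latest time each lower TTL may be published before switch_time.

    ttl_stages lists the TTLs in order, e.g. [3600, 300, 10]. A new TTL only
    reaches every conforming cache after the previous TTL has expired, so the
    deadlines are computed backwards from the switch.
    """
    deadlines = []
    deadline = switch_time
    for old_ttl, new_ttl in reversed(list(zip(ttl_stages, ttl_stages[1:]))):
        # Caches may hold the old record for up to old_ttl seconds.
        deadline -= timedelta(seconds=clamp_ttl(old_ttl))
        deadlines.append((new_ttl, deadline))
    deadlines.reverse()
    return deadlines


if __name__ == "__main__":
    switch = datetime(2015, 9, 1, 18, 0)  # hypothetical switch time
    for ttl, latest in lowering_deadlines(switch, [3600, 300, 10]):
        print(f"publish TTL {ttl}s no later than {latest:%Y-%m-%d %H:%M}")
```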
technically if we hold exim then let mailman run for a few more seconds after - we won't need qfiles [21:00:30] (03Abandoned) 10John F. Lewis: fermium: override role default IPs [puppet] - 10https://gerrit.wikimedia.org/r/230240 (https://phabricator.wikimedia.org/T108080) (owner: 10John F. Lewis) [21:07:55] 6operations, 10Traffic, 7Monitoring, 7Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1561905 (10BBlack) [21:10:11] (03PS1) 10John F. Lewis: fermium: add service IPs to hiera [puppet] - 10https://gerrit.wikimedia.org/r/233052 [21:11:44] (03CR) 10Tim Landscheidt: "@ori: validate_array_re() doesn't allow the customized error message that the class statsdlb currently uses, so I stayed with validate_re(" [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:11:56] (03PS2) 10John F. Lewis: fermium: add service IPs to hiera [puppet] - 10https://gerrit.wikimedia.org/r/233052 [21:14:59] (03PS1) 10BBlack: pybal: switch healthchecks to Special:BlankPage [puppet] - 10https://gerrit.wikimedia.org/r/233053 [21:15:11] (03PS1) 10Ori.livneh: Make pybal accept 30[12] for ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/233054 (https://phabricator.wikimedia.org/T102393) [21:15:31] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1561937 (10Dzahn) lists with the highest number of held messages: 71729 ./heldmsg-wikiru 66819 ./heldmsg-wikinews 43642 ./heldmsg-maps 40495 ./heldmsg-... [21:15:41] JohnFLewis: https://phabricator.wikimedia.org/T109838#1561937 [21:15:42] (03CR) 10Chad: [C: 031] pybal: switch healthchecks to Special:BlankPage [puppet] - 10https://gerrit.wikimedia.org/r/233053 (owner: 10BBlack) [21:16:17] we can leave that for monday for risk purposes. 
we've been living this way a while, a few more days won't hurt [21:16:54] Sounds like a plan [21:25:44] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1562070 (10Dzahn) ``` 66557 ./heldmsg-wikinews-l 64079 ./heldmsg-wikiru-l 43642 ./heldmsg-maps-l 40208 ./heldmsg-wikimedia-in 38696 ./heldmsg-wikifa-l... [21:27:24] 6operations, 6Performance-Team, 10Traffic, 7Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1562072 (10ori) a:3ori [21:30:22] 6operations, 6Performance-Team, 10Traffic, 7Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1562097 (10ori) The logs corroborate bblack's suspicion: `fluorine:/a/mw-log/slow-parse.log`: ``` 2015-08-21 16:33:37 mw2077 enwiki slow-parse INFO: 5.76 Main_Page {"private":fal... [21:31:49] ottomata: setting up rsync daemon is just this, right? Or are there other bits? https://etherpad.wikimedia.org/p/rsyncnova [21:34:05] (03CR) 10BBlack: [C: 031] Make pybal accept 30[12] for ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/233054 (https://phabricator.wikimedia.org/T102393) (owner: 10Ori.livneh) [21:35:06] (03PS1) 10Andrew Bogott: Revert "Add root keypair connecting labvirt hosts." [puppet] - 10https://gerrit.wikimedia.org/r/233058 [21:36:01] (03PS2) 10Andrew Bogott: Revert "Add root keypair connecting labvirt hosts." [puppet] - 10https://gerrit.wikimedia.org/r/233058 [21:37:18] andrewbogott, isn't it wmnet instead of wmflabs for those hosts? [21:37:51] Krenair: yes :) In any case that list needs to be automatically generated, somehow [21:38:12] (03CR) 10Andrew Bogott: [C: 032] Revert "Add root keypair connecting labvirt hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/233058 (owner: 10Andrew Bogott) [21:39:37] 6operations, 7Monitoring: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#1562129 (10RobH) 3NEW [21:40:35] (03CR) 10Merlijn van Deen: [C: 031] "Yes, mostly whether the entire pipeline worked for non-ascii characters, and it was mainly the read->mysql part I was worrying about. But " [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [21:40:37] ja i think that's it andrewbogott [21:40:42] then you rsyncto it like [21:41:08] oh, that would allow you to rsync from labvirt1001, etc., so [21:41:19] ottomata: and that lets any random user rsync a file to root? [21:41:39] myhostwhereIwantstuff $ rsync -av labvirt1001.eqiad.wmnet::novainstances/whatever/path/ /whatever/path/ [21:41:40] (03CR) 10Merlijn van Deen: [C: 031] Tools: Remove obsolete entries from host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) (owner: 10Tim Landscheidt) [21:42:24] ja, the rsync daemon will run as root, i think you can restrict what users it can copy files as, but if you don't, -a will 'archive' the files you are copying, i.e. try to keep everything the same. owner, group perms, mtimes, etc. [21:42:40] does the :: indicate ‘use server port instead of ssh’? [21:42:44] i'm about 95% sure about that [21:42:49] yeah, :: means use rsync protocol [21:42:58] and rsync to the module name ::novainstances [21:42:59] where? 
[21:43:09] so ::novainstances == /var/lib/nova/instances [21:43:24] on the node where the rsync daemon is running [21:43:49] oh I see heh [21:44:04] oh, andrewbogott and you don't need read_only => no unless you are trying to rsync TO a module [21:44:14] sorry [21:44:16] uhhh [21:44:19] yeah, either way yeah [21:44:30] you can do from labvirt1001 [21:44:34] ottomata: well, I need to be able to go from any labvirt host to any other labvirt host [21:44:51] then you can include that on all labvirts and specify all labvirts in hosts_allow [21:44:57] !log had to reset list creator password for mailman - ask me if you think you should have it and don't (this is not the master pass) [21:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:41] so yeah, readonly no is good if you want to write, if you want to be more careful, you could make it so that you only allow pulls [21:45:46] with read_only yes [21:46:11] that way you would need sudo perms to run the rsync command that does the pull [21:46:16] read_only yes implies that I’m pulling the files rather than pushing them [21:46:22] right [21:46:23] but pulling only works if I’m running as root [21:46:27] right [21:46:45] well, you could probably pull as another user, but you wouldn't be able to keep the same ownership when it is writing them [21:46:49] So I /want/ to push because that allows me to run the command as a non-root user, hence not need a root login to launch the whole process. [21:46:52] so rsync -a pull as non root would probably fail [21:47:05] aye ok [21:47:44] Confusingly this will all be launched from yet another box, labcontrol1001. Which has a keypair to connect as user ‘nova’ on labvirt* [21:51:42] (03CR) 10BBlack: [C: 04-1] "Actually now that I looked back at this again, I don't think this is right.
We don't want to follow the redirect (switch hostnames, resol" [debs/pybal] - 10https://gerrit.wikimedia.org/r/233054 (https://phabricator.wikimedia.org/T102393) (owner: 10Ori.livneh) [21:56:24] ottomata: can I apply multiple rsync::server::modules to a given host, with each one specifying a single hosts_allow? [21:56:39] (That would let me use exported resources to share among every node with a given role) [21:59:20] yes [21:59:43] ottomata: https://etherpad.wikimedia.org/p/rsyncnova <- look right? [21:59:48] the rsync puppet module will make module config fragments in a temp dir, and then assemble them into a single /etc/rsyncd.conf file [22:00:12] haha, so you don't have to list all the hostnames!? interesting! [22:00:41] i think that would work, yeah, makes the config kinda ugly, but i guess it makes puppet nicer [22:00:59] hmmm, andrewbogott [22:01:17] wouldn't that collect all instances of exported/virtual(?) rsync::server::modules? [22:01:42] Yeah, maybe. [22:01:42] maybe you should target them somehow? can you do something like name =>novainstances_*? [22:01:43] dunno [22:01:51] So I might need to make a wrapper class that’s specific to this cause [22:01:53] i don't use that puppet feature much [22:01:58] me neither [22:02:19] andrewbogott: i think what you are trying will work, but it might be simpler to just stick all your labvirt hostnames in hiera [22:02:59] that seems like setting a trap for later, though — I add a new virt node and then later on this one feature doesn’t work there because I forgot to update the list [22:03:16] Unless I’m misunderstanding? [22:03:47] yeah, true! [22:03:50] happens to me all the time [22:03:52] that exact thing [22:03:57] have to document what to do when adding new nodes [22:04:02] your way is fancy and slick for sure :) [22:04:06] ok i gotta run [22:04:09] good luck, seeya! 
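Editorial sketch of the rsync daemon setup discussed above: a module named `novainstances` mapped to `/var/lib/nova/instances`, restricted with a hosts-allow list, and pulled from with the `host::module/path` syntax. The Python helpers `render_module` and `pull_command` are hypothetical and are not part of the rsync puppet module; the host list is an assumption. Only the `path`, `read only` and `hosts allow` settings of `rsyncd.conf` are used.

```python
def render_module(name, path, hosts_allow, read_only=True):
    """Render one rsyncd.conf module stanza as text."""
    lines = [
        f"[{name}]",
        f"    path = {path}",
        # "read only = yes" permits pulls from the module but not pushes to it.
        f"    read only = {'yes' if read_only else 'no'}",
        f"    hosts allow = {' '.join(hosts_allow)}",
    ]
    return "\n".join(lines) + "\n"


def pull_command(source_host, module, subpath, dest):
    """Build the argv for pulling from a daemon module (the '::' form)."""
    return ["rsync", "-av", f"{source_host}::{module}/{subpath}", dest]


if __name__ == "__main__":
    # Hypothetical host list; the log worries about keeping such a list current.
    labvirts = ["labvirt1001.eqiad.wmnet", "labvirt1005.eqiad.wmnet"]
    print(render_module("novainstances", "/var/lib/nova/instances", labvirts))
    print(" ".join(pull_command("labvirt1001.eqiad.wmnet", "novainstances",
                                "whatever/path/", "/whatever/path/")))
```

Generating the stanza from a single host list is one way to avoid the trap raised in the conversation, where a newly added virt node is missing from a hand-maintained list.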
[22:04:16] thanks, so long [22:06:22] (03CR) 10Tim Landscheidt: "Testing shows: Parameters with default values, parameters with explicit values and paramters overwritten by hieradata/ have "type($paramet" [puppet] - 10https://gerrit.wikimedia.org/r/226652 (owner: 10Yuvipanda) [22:12:38] (03PS1) 10Andrew Bogott: Set up an rsync daemon that allows rsyncing of nova instances among virt hosts. [puppet] - 10https://gerrit.wikimedia.org/r/233068 [22:13:27] (03CR) 10jenkins-bot: [V: 04-1] Set up an rsync daemon that allows rsyncing of nova instances among virt hosts. [puppet] - 10https://gerrit.wikimedia.org/r/233068 (owner: 10Andrew Bogott) [22:16:06] (03PS1) 10Ori.livneh: Add wpt.wmftest.org [dns] - 10https://gerrit.wikimedia.org/r/233069 [22:17:06] (03PS2) 10Ori.livneh: Add wpt.wmftest.org [dns] - 10https://gerrit.wikimedia.org/r/233069 [22:18:16] (03CR) 10Tim Landscheidt: "Further testing: args[0].class is "String" for default parameter and explicit parameter, but "Fixnum" for hieradata/ parameter." [puppet] - 10https://gerrit.wikimedia.org/r/226652 (owner: 10Yuvipanda) [22:29:00] (03CR) 10Andrew Bogott: [C: 04-2] "This is super wrong, but I'd like to do something like this with exported resources" [puppet] - 10https://gerrit.wikimedia.org/r/233068 (owner: 10Andrew Bogott) [22:31:19] (03CR) 10Andrew Bogott: "Also https://phabricator.wikimedia.org/T108987 is of some concern." 
[puppet] - 10https://gerrit.wikimedia.org/r/233068 (owner: 10Andrew Bogott) [22:33:37] (03CR) 10BBlack: [C: 032] Add wpt.wmftest.org [dns] - 10https://gerrit.wikimedia.org/r/233069 (owner: 10Ori.livneh) [22:33:51] bblack: thanks :) [22:35:32] !log deleting held messages on mailman that are older than 1 year [22:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:35:48] ..which takes foreeeever [22:38:55] 6operations, 10Wikimedia-Mailing-lists: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1562568 (10Dzahn) starting to delete held messages that are older than 1 year.. starting with wikiru-l.. just super slooow [22:47:34] (03PS1) 10BryanDavis: Change memcached icinga alert from anomaly to threshold [puppet] - 10https://gerrit.wikimedia.org/r/233071 (https://phabricator.wikimedia.org/T69817) [22:50:06] (03PS1) 10Tim Landscheidt: cassandra: Mute strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/233073 (https://phabricator.wikimedia.org/T87132) [22:52:24] (03PS4) 10Thcipriani: Add servicedeploy user; Modifiy keyholder service [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [22:54:33] (03PS2) 10Tim Landscheidt: cassandra: Mute strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/233073 (https://phabricator.wikimedia.org/T87132) [22:56:16] (03CR) 10Tim Landscheidt: "Redone in I1b5c95f078aa7709f660abeb1b5c5e197a0feacf." 
[puppet] - 10https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [23:06:21] (03PS1) 10Thcipriani: Pass user to deploy functions [tools/scap] (scap3) - 10https://gerrit.wikimedia.org/r/233074 [23:06:49] (03CR) 10jenkins-bot: [V: 04-1] Pass user to deploy functions [tools/scap] (scap3) - 10https://gerrit.wikimedia.org/r/233074 (owner: 10Thcipriani) [23:07:49] (03Abandoned) 10Thcipriani: Pass user to deploy functions [tools/scap] (scap3) - 10https://gerrit.wikimedia.org/r/233074 (owner: 10Thcipriani) [23:11:49] !log synced kartotherian [23:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:19] 6operations, 6Collaboration-Team-Backlog, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1562728 (10DannyH) p:5High>3Normal [23:14:27] 6operations, 6Collaboration-Team-Backlog, 10Flow, 7WorkTypeMaintenance: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1562731 (10DannyH) p:5Normal>3High [23:22:07] (03CR) 10Ori.livneh: [C: 04-1] "Instead of adding an additional keyholder instance, I think it would be better to make ssh-agent-proxy capable of serving multiple keys. 
I" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [23:22:50] thcipriani: let me know if you want to chat about keyholder [23:23:48] ori: was just googling for SO_PEERCRED :) [23:25:02] I can explain [23:25:33] that'd be helpful for me [23:25:40] basically right now we assume that ssh-agent-proxy serves one key to members of some particular group, and we restrict access to the agent by controlling the file permissions of the socket [23:26:10] so if you look in tin's /run/keyholder, you'll see: [23:26:11] srwxrwx--- 1 keyholder wikidev 0 Mar 11 10:06 proxy.sock [23:26:23] yes indeed [23:26:52] now there's a nifty trick you can do with unix domain sockets, and that is to get the user and group of the other party [23:27:19] we already use this trick to log the user / group of people who use the socket -- see the get_peer_credentials() method: https://github.com/wikimedia/operations-puppet/blob/production/modules/keyholder/files/ssh-agent-proxy#L60-L64 [23:28:27] but with a little bit of work, we could make proxy.sock world readable/writeable, and use the SO_PEERCRED trick to enforce access control (i.e., to ensure that only members of group X can use key Y) [23:29:17] ah, so instead of just logging that information, we can use it to restrict requests [23:29:22] yep, exactly [23:30:15] gotcha, yeah, that does seem nicer than expanding to a bunch of agent/proxy sockets [23:31:23] +1 [23:32:04] i'd be happy to collaborate on that, so feel free to delegate me part of the work if you like, or poke me for CR [23:32:13] plus tinkering with this file seems like more fun :P [23:33:22] I'm going to take an initial stab at it because I'm a masochist, but I may have to tag you if I fall on my face with it. [23:33:51] sweet [23:34:07] thanks! [23:34:24] thank you. 
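Editorial sketch of the SO_PEERCRED idea discussed above: read the peer's pid, uid and gid from a unix domain socket and use them for access control rather than only logging them. This is not the actual `ssh-agent-proxy` code. The function bodies, the `may_use_key` policy check and the ACL layout are assumptions for illustration, and `SO_PEERCRED` is Linux-specific.

```python
import os
import pwd
import socket
import struct

# Linux struct ucred: pid, uid, gid as three native ints.
_UCRED = struct.Struct("3i")


def get_peer_credentials(sock):
    """Return (pid, uid, gid) of the process on the other end of a unix socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, _UCRED.size)
    return _UCRED.unpack(raw)


def peer_group_ids(uid, gid):
    """All group ids of the peer's user, including supplementary groups."""
    username = pwd.getpwuid(uid).pw_name
    return set(os.getgrouplist(username, gid))


def may_use_key(key_name, peer_gids, key_acl):
    """True if any of the peer's groups is allowed to use the named key.

    key_acl is a hypothetical mapping of key name -> set of allowed gids.
    """
    allowed = key_acl.get(key_name, set())
    return bool(allowed & set(peer_gids))


if __name__ == "__main__":
    # A socketpair stands in for a client connected to proxy.sock.
    server_side, client_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        pid, uid, gid = get_peer_credentials(server_side)
        try:
            gids = peer_group_ids(uid, gid)
        except KeyError:
            gids = {gid}  # uid has no passwd entry; fall back to primary gid
        acl = {"example_key": {gid}}  # hypothetical policy
        print(pid, uid, gid, may_use_key("example_key", gids, acl))
    finally:
        server_side.close()
        client_side.close()
```

With a check like this in the request path, the socket itself could be world readable and writeable, as suggested in the conversation, while each key stays limited to its own group.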
[23:34:41] !log restarting Cassandra on restbase1001 to restore baseline settings [23:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master