[00:03:26] (03PS1) 10Yuvipanda: tools: Setup a cdnjs mirror [puppet] - 10https://gerrit.wikimedia.org/r/205788 (https://phabricator.wikimedia.org/T96799) [00:03:48] (03PS2) 10Yuvipanda: tools: Setup a cdnjs mirror [puppet] - 10https://gerrit.wikimedia.org/r/205788 (https://phabricator.wikimedia.org/T96799) [00:04:07] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Setup a cdnjs mirror [puppet] - 10https://gerrit.wikimedia.org/r/205788 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda) [00:05:15] (03PS1) 10Dzahn: dumps: add htmldumps admins in role::dumps::zim [puppet] - 10https://gerrit.wikimedia.org/r/205789 (https://phabricator.wikimedia.org/T94093) [00:15:09] (03CR) 10Dzahn: [C: 032] "setting everything up but adding group members will be the last step" [puppet] - 10https://gerrit.wikimedia.org/r/205789 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [00:30:39] (03PS1) 10Dzahn: dumps: let htmldumps-admin own HTML dumps docroot [puppet] - 10https://gerrit.wikimedia.org/r/205790 (https://phabricator.wikimedia.org/T94093) [00:34:06] still/again no bug bot? [00:34:25] oh, just no component on it [00:34:36] 6operations: puppet-compiler fails with "Unrecognized operating system" - https://phabricator.wikimedia.org/T96802#1226462 (10Dzahn) [00:36:17] randomcat: not afaik [00:37:24] springle: I can assign you https://phabricator.wikimedia.org/T85266 ;) [00:39:35] randomcat: great! 
so long as you don't expect that will have any effect :) [00:39:51] well, maybe it will after the new hire starts [00:39:56] heh [00:40:48] haha :) [00:42:13] (03PS1) 10Dzahn: dumps::zim: include role in Hiera-friendly way [puppet] - 10https://gerrit.wikimedia.org/r/205795 (https://phabricator.wikimedia.org/T94093) [00:42:47] (03PS2) 10Dzahn: dumps::zim: include role in Hiera-friendly way [puppet] - 10https://gerrit.wikimedia.org/r/205795 (https://phabricator.wikimedia.org/T94093) [00:43:21] * randomcat can't do too much more with the style of slave lag checks in JobRunner [00:43:34] (03CR) 10Dzahn: [C: 032] dumps::zim: include role in Hiera-friendly way [puppet] - 10https://gerrit.wikimedia.org/r/205795 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [00:43:41] * randomcat is tempted to serialize the commit step with a "wait on a slave" check [00:46:59] (03PS3) 10Dzahn: dumps::zim: include role in Hiera-friendly way [puppet] - 10https://gerrit.wikimedia.org/r/205795 (https://phabricator.wikimedia.org/T94093) [00:49:56] randomcat: like wfWaitForSlaves() ? [00:52:00] (03CR) 10Dzahn: "why do i still not get that group on francium?" [puppet] - 10https://gerrit.wikimedia.org/r/205795 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [00:52:29] (03PS1) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) [00:52:35] (03CR) 10Dzahn: "why does this group not exist on francium yet?" [puppet] - 10https://gerrit.wikimedia.org/r/205790 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [00:53:20] springle: e.g. 
GET_LOCK()...MASTER_POST_WAIT(newpos) on a generic non-0 load slave, COMMIT, RELEASE_LOCK() ;) [00:53:25] (03CR) 1020after4: "I've proposed a better solution in Ia4afa0759cfadc078ce74da6194cb4531e448b61" [puppet] - 10https://gerrit.wikimedia.org/r/205338 (https://phabricator.wikimedia.org/T548) (owner: 10Bartosz Dziewoński) [00:53:50] maybe there's a way to simply cripple the master via my.conf to match the slaves [00:54:44] sucks to have so much master parallelism and have to throttle writes to a one-write-at-time equivalent system [00:55:09] it's like some misleading nosql db :) [00:56:06] your problem is you want to have a cake and eat it too :) [00:56:06] the current lag checks have the problem that a bunch of workers see low lag, all do something slow and *then* try to wait or bail out [00:56:12] the lag is already there by then [00:56:47] (03PS1) 10Dzahn: htmldumps: include admins on francium [puppet] - 10https://gerrit.wikimedia.org/r/205799 (https://phabricator.wikimedia.org/T94093) [00:56:48] well, what I want is mariadb 10 :) [00:57:17] you'll be pleased to note the last S3 slave upgraded yesterday, so we can start on masters soon [00:58:15] S1 is about ready too [01:12:39] that should help with the "16 runners do something that takes 2 seconds at once" case [01:12:54] of course truly slow single queries will still need fixing in MW [01:15:56] true [01:48:04] (03PS1) 10Springle: repool db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205805 [01:49:12] (03CR) 10Springle: [C: 032] repool db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205805 (owner: 10Springle) [01:49:32] (03Merged) 10jenkins-bot: repool db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205805 (owner: 10Springle) [01:50:49] !log springle Synchronized wmf-config/db-eqiad.php: repool db1019, warm up (duration: 00m 13s) [01:51:01] Logged the message, Master [01:59:57] (03CR) 10Dzahn: [C: 032] htmldumps: include admins on francium [puppet] - 
10https://gerrit.wikimedia.org/r/205799 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [02:01:10] (03CR) 10Dzahn: "Notice: /Stage[main]/Admin/Admin::Hashgroup[htmldumps-admin]/Admin::Group[htmldumps-admin]/Group[htmldumps-admin]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/205799 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [02:05:10] (03CR) 10Dzahn: [C: 032] "it does now" [puppet] - 10https://gerrit.wikimedia.org/r/205790 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [02:22:07] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 05m 45s) [02:22:16] Logged the message, Master [02:26:43] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-22 02:25:40+00:00 [02:26:49] Logged the message, Master [02:46:31] YuviPanda: can tools-cdn or whatever also include some free fonts? [02:46:49] i'm using Open Sans from google cdn on http://performance.wikimedia.org/ but i don't think that's allowed [02:48:09] It's not. [02:48:29] ...he said, severely. [02:48:41] * James_F grins. [02:48:49] It's not, o glorious ori. [02:48:59] (03PS1) 10Negative24: Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) [02:49:16] a minor indiscretion [02:49:19] i'll rectify it [02:49:30] ori: it does have free fonts I think? [02:49:59] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 08m 31s) [02:50:05] If not I'll find a stash somewhere [02:50:07] YuviPanda: what is it? [02:50:08] Logged the message, Master [02:50:25] google's CDN? Yes, but we are generally not allowed to link to third-party resources [02:50:36] since it is in effect disclosing visitors to our sites to third parties [02:50:41] (IIRC) [02:50:48] Oh no tools-cdn [02:50:48] Yeah totally [02:50:53] I meant tools-cdn should have fotns [02:50:54] Fonts [02:51:12] oh did you mirror cdnjs in full? 
[02:55:25] ori: yes :D [02:55:34] ori: not fully set up yet [02:55:34] oh, cool! [02:56:07] ori: I'm doing 'small things to make tools authors life better' this week [02:56:48] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-22 02:55:44+00:00 [02:56:54] Logged the message, Master [02:58:18] YuviPanda: no fonts, though [02:58:36] that's OK, though -- having cross-dependencies across labs / prod is probably a bad idea [03:01:35] ori: true that. [03:04:47] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [03:15:25] (03PS1) 10Ori.livneh: Increment a counter on fatals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205812 [03:15:56] (03PS2) 10Ori.livneh: Increment a counter on fatals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205812 [03:16:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [03:16:08] (03CR) 10Ori.livneh: [C: 032] Increment a counter on fatals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205812 (owner: 10Ori.livneh) [03:16:13] (03Merged) 10jenkins-bot: Increment a counter on fatals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205812 (owner: 10Ori.livneh) [03:16:59] (03PS1) 10Yuvipanda: tools: Fix nginx order matching for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/205813 (https://phabricator.wikimedia.org/T96799) [03:17:09] (03PS2) 10Yuvipanda: tools: Fix nginx order matching for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/205813 (https://phabricator.wikimedia.org/T96799) [03:20:40] !log ori Synchronized hhvm-fatal-error.php: I528e5384c: Increment a counter on fatals (duration: 00m 12s) [03:20:50] Logged the message, Master [03:21:20] (03PS3) 10Yuvipanda: tools: Fix nginx order matching for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/205813 (https://phabricator.wikimedia.org/T96799) [03:21:32] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix nginx order matching for cdnjs [puppet] - 
10https://gerrit.wikimedia.org/r/205813 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda) [03:22:24] 6operations, 10ops-eqiad, 10Wikimedia-Logstash: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1226699 (10bd808) [03:24:11] ori: I think I’ll mirror google fonts too [03:24:56] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [03:24:58] (03PS4) 10Yuvipanda: tools: Fix nginx order matching for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/205813 (https://phabricator.wikimedia.org/T96799) [03:25:20] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix nginx order matching for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/205813 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda) [03:25:33] YuviPanda: SULF is causing issues per. http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User:UBX_-.3E_User:UBX.7Eenwiki_Broke_Everything [03:25:45] * YuviPanda waves magic wand [03:25:48] It’s all fixed now [03:25:50] Who do ask about fixes? [03:25:50] err [03:25:54] legoktm: ^ [03:26:26] T13|mobile: um, someone should have migrated UBX to a global account? :P [03:26:55] Probably, but how do we move it all back? [03:27:05] use a bot? [03:27:29] To rename the account? [03:27:30] these role accounts really suck. [03:27:31] no [03:27:32] to move pages [03:27:56] Shouldn't the account be renamed back? [03:28:01] no.... [03:28:56] Ugh. [03:29:17] I'll move pages back tomorrow. [03:29:35] Bed time now [03:29:43] T13|mobile: how bad is the breakage? [03:29:52] as in, can I have dinner first and then fix it? [03:30:02] 7000+ userboxes [03:30:04] "7,243 userbox templates" [03:30:13] Not 'that' important [03:30:24] Just userboxes [03:30:28] ok, I'll be back in an hour or two and can move them back then [03:30:30] . 
[03:30:33] Kk [03:30:35] eat, then write the 4 line pybot to fix [03:30:44] bd808: moveBatch.php ;) [03:31:14] * bd808 only pretends to know what you wiki folks really do ;) [03:32:25] (03PS1) 10Yuvipanda: tools: Set 30d expiry for all things from tools-static cdn [puppet] - 10https://gerrit.wikimedia.org/r/205814 (https://phabricator.wikimedia.org/T96799) [03:32:38] (03CR) 10jenkins-bot: [V: 04-1] tools: Set 30d expiry for all things from tools-static cdn [puppet] - 10https://gerrit.wikimedia.org/r/205814 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda) [03:32:50] (03PS2) 10Yuvipanda: tools: Set 30d expiry for all things from tools-static cdn [puppet] - 10https://gerrit.wikimedia.org/r/205814 (https://phabricator.wikimedia.org/T96799) [03:33:31] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Set 30d expiry for all things from tools-static cdn [puppet] - 10https://gerrit.wikimedia.org/r/205814 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda) [03:36:00] ori: Krinkle yay http://tools-static.wmflabs.org/cdnjs/ajax/libs/ [03:36:08] hmm, so I do gzip + set a 30d expires [03:36:11] I guess that should be enough [03:36:15] (and CORS) [03:36:58] not the most attractive way to browse, of course :| [03:42:47] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:43:52] YuviPanda: tools-static has broken favicon.ico btw :) [03:43:57] Would be nice to fix that :) [03:44:23] oh, yeah :) [03:45:01] YuviPanda: It seems tools-static already has ETag 304 [03:45:10] so additional max-age seems redundant [03:45:32] ETag is hard to do for dynamic content which is why Wikimedia usually does max-age w/ Last-Modified [03:45:50] right [03:45:52] so that we can update non-content while still maintaining the same cache/timestamp [03:45:57] But for static, ETag is awesome [03:46:42] Actually, scrap that. 
[03:46:57] ETag requires clients to roundtrip with the server to verify the hash is still the same [03:46:58] if you put in expires [03:47:02] there’s no roundtrip [03:47:03] heh :) [03:47:08] Exactly [03:48:38] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:49:36] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:50:17] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [03:51:07] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [04:05:38] (03CR) 10Glaisher: Change project name to 'Wikipedia' at astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201897 (https://phabricator.wikimedia.org/T94341) (owner: 10Glaisher) [04:30:26] PROBLEM - puppet last run on mw2163 is CRITICAL Puppet has 1 failures [04:46:46] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [04:59:54] 6operations, 6Labs: Shinken down - https://phabricator.wikimedia.org/T96817#1226829 (10yuvipanda) 3NEW [05:01:11] 6operations, 6Labs: Shinken down - https://phabricator.wikimedia.org/T96817#1226836 (10yuvipanda) There seem to be two entries for these hosts on ldap? @Andrew is this possibly due to the stuff you've been doing about new virt* hosts? 
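Editorial note: the ticket above reports that some hosts appear twice in LDAP, which is what took Shinken down. A minimal sketch of one way to collapse such duplicates is shown here. It assumes each entry is a plain dict and that the key is called `instancename`; both the data shape and the helper name `unique_hosts` are hypothetical, and this is not taken from the actual puppet change.

```python
def unique_hosts(entries):
    """Return entries with duplicate instance names dropped.

    The first entry seen for each name wins; later ones are skipped.
    'instancename' is an assumed (hypothetical) key, not a confirmed schema.
    """
    seen = set()
    result = []
    for entry in entries:
        name = entry["instancename"]
        if name in seen:
            continue  # a second record for the same host: ignore it
        seen.add(name)
        result.append(entry)
    return result


# Made-up sample data: two records for host "a", one for host "b".
sample = [
    {"instancename": "a", "dn": "first"},
    {"instancename": "b", "dn": "second"},
    {"instancename": "a", "dn": "third"},
]
deduped = unique_hosts(sample)
```

Keeping the first record is an arbitrary choice made for the sketch; which of the two LDAP entries is the right one to keep depends on how they differ, which the ticket does not settle at this point in the log.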
[05:05:12] (03PS1) 10Yuvipanda: shinken: Make sure hosts are unique [puppet] - 10https://gerrit.wikimedia.org/r/205817 (https://phabricator.wikimedia.org/T96817) [05:05:22] (03PS2) 10Yuvipanda: shinken: Make sure hosts are unique [puppet] - 10https://gerrit.wikimedia.org/r/205817 (https://phabricator.wikimedia.org/T96817) [05:05:37] (03CR) 10Yuvipanda: [C: 032 V: 032] shinken: Make sure hosts are unique [puppet] - 10https://gerrit.wikimedia.org/r/205817 (https://phabricator.wikimedia.org/T96817) (owner: 10Yuvipanda) [05:07:22] 6operations, 6Labs, 5Patch-For-Review: Shinken down - https://phabricator.wikimedia.org/T96817#1226852 (10yuvipanda) ^ is a temporary fix only, however. [05:39:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [05:49:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [06:29:57] PROBLEM - puppet last run on mw1166 is CRITICAL Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures [06:30:16] PROBLEM - puppet last run on db1034 is CRITICAL Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on db1067 is CRITICAL Puppet has 1 failures [06:30:36] PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 1 failures [06:30:37] PROBLEM - puppet last run on ms-fe1004 is CRITICAL Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on db1046 is CRITICAL Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:34:37] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:35:16] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures [06:35:17] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:36:16] PROBLEM - puppet 
last run on mw2143 is CRITICAL Puppet has 1 failures [06:36:16] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:36:16] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures [06:36:16] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [06:36:16] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures [06:39:16] (03CR) 1020after4: [C: 031] Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [06:40:26] (03Abandoned) 1020after4: Parameterize the path to /var/lib/l10nupdate (References T95564) [puppet] - 10https://gerrit.wikimedia.org/r/203286 (https://phabricator.wikimedia.org/T95564) (owner: 1020after4) [06:41:03] (03CR) 1020after4: "Can we get a +2 on" [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [06:42:27] (03CR) 1020after4: "I'm sorry not I8693f52bc266d5590fa7cde89c5933c25c304e56," [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [06:45:57] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:45:57] RECOVERY - puppet last run on db1046 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:46:27] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:47] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:47:06] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on ms-fe1004 is OK Puppet is currently enabled, 
last run 1 minute ago with 0 failures [06:47:16] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:48] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:48:16] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:36] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:27] (03CR) 10Nikerabbit: "I would like +1 on this. Then I will put this into a SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/204032 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [07:05:50] (03CR) 10Aaron Schulz: [C: 031] Re-enable Special:SupportedLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204032 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [07:06:36] RECOVERY - puppet last run on db1067 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:28:44] !log SULF is done, post-rename notifications are being sent out on the last large wikis [07:28:55] Logged the message, Master [07:30:09] <_joe_> legoktm: \o/ [07:30:19] :D [07:30:27] <_joe_> legoktm: kudos, impressive work [07:30:35] thanks :) [07:39:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Apr 22 07:38:22 UTC 2015 (duration 38m 21s) [07:39:32] Logged the message, Master [07:43:22] 6operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1226962 (10akosiaris) [07:43:24] 6operations, 5Patch-For-Review: Introduce Virtualization in our infrastructure - https://phabricator.wikimedia.org/T87258#1226960 (10akosiaris) 5Open>3Resolved This has been done for a while now [07:44:32] Could someone add the two users I cc'ed in https://phabricator.wikimedia.org/T96548 to the view/edit policy? Thanks! [07:49:22] valhallasw`cloud: done [07:49:24] ori: thanks! 
[07:56:16] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 761714 msg: ocg_render_job_queue 3669 msg (=3000 critical) [07:56:16] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 761714 msg: ocg_render_job_queue 3668 msg (=3000 critical) [07:56:27] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 761742 msg: ocg_render_job_queue 3508 msg (=3000 critical) [08:01:17] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 762255 msg: ocg_render_job_queue 125 msg [08:01:17] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 762255 msg: ocg_render_job_queue 125 msg [08:01:36] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 762274 msg: ocg_render_job_queue 0 msg [08:02:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [08:05:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The approach seems fine to me, various nitpicks inline. I am also not fond of the -> ordering syntax while the rest of the file (and modul" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [08:06:58] that telia link is 4 hrs late... [08:26:49] (03PS1) 10Filippo Giunchedi: reimage xenon, cerium and praseodymium as jessie [puppet] - 10https://gerrit.wikimedia.org/r/205827 (https://phabricator.wikimedia.org/T90955) [08:27:27] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [08:27:47] paravoid: ^ good to go with reimage re: new d-i release? [08:28:01] godog: no, I haven't updated d-i yet [08:28:14] in the middle of something, can this wait say 30'? 
[08:28:37] paravoid: sure no problem, +1 the change when good to go [08:28:42] k [08:30:06] (03PS2) 10Filippo Giunchedi: logging: update CirrusSearch thresholds [puppet] - 10https://gerrit.wikimedia.org/r/205603 (https://phabricator.wikimedia.org/T84163) [08:30:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] logging: update CirrusSearch thresholds [puppet] - 10https://gerrit.wikimedia.org/r/205603 (https://phabricator.wikimedia.org/T84163) (owner: 10Filippo Giunchedi) [08:32:35] 6operations, 5Patch-For-Review: adjust CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1227015 (10fgiunchedi) 5Open>3Resolved change merged, resolving for now. will reopen if it crops up again [08:41:13] PROBLEM - puppet last run on cp3036 is CRITICAL puppet fail [08:41:40] 6operations, 7Graphite, 5Patch-For-Review: something (reqstats?) puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#1227022 (10fgiunchedi) [08:58:53] RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:01:50] 6operations, 7Graphite, 5Patch-For-Review: Counters now only provide rates (multiplied by 1000?) - https://phabricator.wikimedia.org/T95703#1227052 (10fgiunchedi) @yuvipanda tried extended counters in labs and are working as expected AFAICT (?) enabling extended counters will require some renaming, namely m... 
[09:03:23] (03PS2) 10Alexandros Kosiaris: Create a shim module for citoid around service::node [puppet] - 10https://gerrit.wikimedia.org/r/204744 [09:03:49] (03CR) 10Alexandros Kosiaris: "Addressed comments and removed the WIP designation" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/204744 (owner: 10Alexandros Kosiaris) [09:05:59] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1227054 (10fgiunchedi) 5Open>3stalled [09:12:53] 6operations: etcd evaluation - https://phabricator.wikimedia.org/T96825#1227056 (10Joe) 3NEW a:3Joe [09:19:11] 6operations: etcd evaluation - https://phabricator.wikimedia.org/T96825#1227072 (10fgiunchedi) +1, thanks! [09:23:02] godog: good morning. Krinkle found out we forgot to provide a Zuul package for trusty-wikimedia :D [09:23:08] (03CR) 10Mobrovac: [C: 031] Create a shim module for citoid around service::node [puppet] - 10https://gerrit.wikimedia.org/r/204744 (owner: 10Alexandros Kosiaris) [09:23:44] akosiaris: was about to ask you if i can go ahead and make the changes (re citoid shim) :) [09:23:47] thnx [09:26:28] hashar: ah, indeed, I take it debian/trusty-wikimedia branch is up to date? [09:27:23] godog: yes sir! [09:27:35] godog: want me to rebuild them and put them on terbium? [09:27:49] actually I just did [09:27:59] http://people.wikimedia.org/~hashar/debs/zuul/ [09:28:12] or terbium.eqiad.wmnet:/home/hashar/public_html/debs/zuul/ [09:28:20] sorry should have noticed it :( [09:30:19] <_joe_> mh marko fell off [09:34:45] 6operations: etcd evaluation - https://phabricator.wikimedia.org/T96825#1227094 (10mobrovac) Great news! Thnx, @Joe . 
And, yes, please, gimme access :) [09:37:34] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, and 3 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1227102 (10hashar) I have poked @fgiunchedi about the Trusty packages. Rebuild it out of the integration/zuul.git deb... [09:43:41] can somone disable this stuff on commons: https://commons.wikimedia.org/wiki/Commons:Village_pump#Help_topicons_linking_to_external_help_pages_soon_displayed_everywhere.3F [09:43:41] ? [09:43:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [09:45:39] jzerebecki or robh? [09:46:12] (03Abandoned) 10Bartosz Dziewoński: Preserve order of 'maniphest.statuses' in Phabricator settings [puppet] - 10https://gerrit.wikimedia.org/r/205338 (https://phabricator.wikimedia.org/T548) (owner: 10Bartosz Dziewoński) [09:46:41] marktraceur? [09:48:59] hashar: kk I'll take a look later [09:49:52] godog: thx. Just poke https://phabricator.wikimedia.org/T48552#1227102 whenever you have uploaded the package :) [09:55:49] (03CR) 10Alexandros Kosiaris: [C: 032] "Catalogcompiler says noop, comments addressed, merging" [puppet] - 10https://gerrit.wikimedia.org/r/204744 (owner: 10Alexandros Kosiaris) [09:57:42] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1227139 (10Nemo_bis) >>! In T88662#1224304, @fgiunchedi wrote: > I've restored median and changed p99 with p95, p98 seems redudant and I'm not sure about p75 either. thoug... 
[09:58:33] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [10:00:03] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [10:03:46] i hate it when wmf is adding crap to prodwikis [10:03:46] * Steinsplitter is going to remove it via global .css [10:05:02] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:06:55] 6operations: consul evaluation - https://phabricator.wikimedia.org/T96832#1227154 (10Joe) 3NEW [10:17:27] 6operations, 6Labs, 5Patch-For-Review, 7Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1227179 (10Aklapper) [10:19:33] RECOVERY - Apache HTTP on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.092 second response time [10:21:56] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1227188 (10fgiunchedi) FWIW I think we should log/collect metrics locally on the machines, reason being that we're fine with per-machine stats and then aggregate at the graphite layer, no external dependencies and easy to in... [10:22:18] (03CR) 10Faidon Liambotis: [C: 031] reimage xenon, cerium and praseodymium as jessie [puppet] - 10https://gerrit.wikimedia.org/r/205827 (https://phabricator.wikimedia.org/T90955) (owner: 10Filippo Giunchedi) [10:27:08] Steinsplitter: neither robh nor I are probably the correct person to talk to. anyone "can" remove/change it, as in submit a patch. who did the original change? 
[10:27:48] can't find the patch [10:28:59] Steinsplitter: it is probably a change in core, as wikidata.org also has it [10:33:59] Steinsplitter: there it is https://gerrit.wikimedia.org/r/#/c/194414/ so perhap talk to Nemo_bis the author [10:34:34] thx for the link :) [10:34:57] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1227193 (10faidon) I'm not very fond of a pipeline of: varnish -> varnishkafka -> kafka brokers -> kafkatee -> statsd -> carbon to collect metrics. It sounds fragile and relying unnecessarily on multiple pieces of centralize... [10:38:16] RECOVERY - NTP on mw2128 is OK: NTP OK: Offset 0.006875634193 secs [10:51:37] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1227204 (10akosiaris) >>>! In T95229#1224477, @Anomie wrote: >> Above Gabriel claimed "v1" was the name. If it's the version then it's not following the `/api/{... [10:53:58] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, and 3 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1227205 (10fgiunchedi) 5Open>3Resolved {{done}} [11:05:15] (03CR) 10Alexandros Kosiaris: [C: 032] graphite: introduce carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908) (owner: 10Filippo Giunchedi) [11:05:24] (03CR) 10Alexandros Kosiaris: [C: 031] graphite: introduce carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908) (owner: 10Filippo Giunchedi) [11:08:09] (03PS2) 10Filippo Giunchedi: reimage xenon, cerium and praseodymium as jessie [puppet] - 10https://gerrit.wikimedia.org/r/205827 (https://phabricator.wikimedia.org/T90955) [11:08:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] reimage xenon, cerium and praseodymium as jessie [puppet] - 10https://gerrit.wikimedia.org/r/205827 
(https://phabricator.wikimedia.org/T90955) (owner: 10Filippo Giunchedi) [11:11:13] !log begin reimagining xenon, cerium and praseodymium [11:11:18] Logged the message, Master [11:16:00] 6operations: Multiple PHP security issues - https://phabricator.wikimedia.org/T96586#1227224 (10akosiaris) As far as the Zend PHP runtime goes, php5 packages have been upgraded throughout the fleet the whatever the latest version is depending on the distro version. Wikimedia specific packages have been built and... [11:28:53] (03PS1) 10Filippo Giunchedi: wmf-reimage: handle unset IPMI_PASSWORD [puppet] - 10https://gerrit.wikimedia.org/r/205835 [11:29:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] wmf-reimage: handle unset IPMI_PASSWORD [puppet] - 10https://gerrit.wikimedia.org/r/205835 (owner: 10Filippo Giunchedi) [11:30:37] paravoid: mhh I'm getting the double nic detection again on xenon, tried rebooting a couple of times, I've jumped off the console if you want to take a look [11:30:51] grumble grumble [11:33:48] looking... [11:35:48] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1227257 (10fgiunchedi) >>! In T88662#1227139, @Nemo_bis wrote: >>>! In T88662#1224304, @fgiunchedi wrote: >> I've restored median and changed p99 with p95, p98 seems redud... [11:36:07] thanks! [11:36:12] <- lunch [11:39:51] PROBLEM - puppet last run on mw2061 is CRITICAL puppet fail [11:40:26] (03PS4) 10Hashar: Labs: Mute client-side notifications for wikitech Puppet status [puppet] - 10https://gerrit.wikimedia.org/r/203062 [11:43:11] RECOVERY - salt-minion processes on mw2128 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:43:11] RECOVERY - nutcracker port on mw2128 is OK: TCP OK - 0.000 second response time on port 11212 [11:43:13] (03CR) 10Hashar: "Cherry picked on integration labs puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [11:43:30] RECOVERY - nutcracker process on mw2128 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:43:31] RECOVERY - HHVM processes on mw2128 is OK: PROCS OK: 1 process with command name hhvm [11:44:01] RECOVERY - configured eth on mw2128 is OK - interfaces up [11:44:10] RECOVERY - RAID on mw2128 is OK no RAID installed [11:44:22] RECOVERY - dhclient process on mw2128 is OK: PROCS OK: 0 processes with command name dhclient [11:44:31] RECOVERY - DPKG on mw2128 is OK: All packages OK [11:44:41] RECOVERY - Disk space on mw2128 is OK: DISK OK [11:48:41] RECOVERY - puppet last run on mw2128 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:49:00] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 63792 bytes in 5.092 second response time [11:52:44] 6operations: zookeeper evaluation - https://phabricator.wikimedia.org/T96839#1227293 (10Joe) 3NEW a:3Joe [11:59:51] RECOVERY - puppet last run on mw2061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:12:04] 6operations, 6Labs, 5Patch-For-Review, 7Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1227315 (10Andrew) new instances have two ldap entries -- one with the ec2 id, and valid associated domains, one with the fqdn and invalid associated domains. Does shinken search for both dns? [12:12:17] Hiya... it appears that meta might be... challenged. https://meta.wikimedia.org/wiki/Wikimedia_Conference_2015 [12:13:05] Yeah, a local admin broke everything I think [12:13:38] https://phabricator.wikimedia.org/T96835 [12:13:39] That takes talent. [12:14:26] On this page it's the languages template at the top [12:14:40] 10:47, 22 April 2015 MarcoAurelio (talk | contribs) imported Template:Languages from another wiki (68 revisions imported from commons:Module:Languages: per request) [12:14:41] etc. 
[12:14:49] I don't intend to clean it up from the server-side [12:15:06] D'oh, that's a steward. [12:15:10] 6operations, 6Labs, 5Patch-For-Review, 7Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1227316 (10Andrew) Ah, yeah, since you're searching by instancename I probably broke things for you :( Those instances are there to test the new ldap host-entry generation and I presumed them to be har... [12:15:21] Yeah, it's not a server side problem, I guess. Quite right. [12:16:12] Well, if nobody has fixed it by the time JAlexander wakes, I'll ask him to. [12:17:43] they're a local admin and I assume this was not really a steward action [12:17:57] they should probably know better [12:18:04] I dislike the way the import function works [12:18:14] Yeah, not a steward action. [12:19:42] I'm about to go away for a few hours, so no point me going around deleting/selectively restoring stuff [12:20:14] at that scale.. [12:22:01] Looks like Vituzzo did it :) [12:22:27] https://meta.wikimedia.org/w/index.php?title=Special:Log/import not the first time i see this, recursive report schould be disabled on "non new" wikis (imho) [12:22:35] Philippe, for that one page [12:22:54] and the nonsense revisions will still be in the history [12:28:56] godog: fix was wrong, I found the new bug and reopened https://bugs.debian.org/765577 (see there for more) [12:29:12] godog: I also applied the new patch to the image and confirmed it works for xenon [12:29:36] godog: (but d-i on xenon stops at some RAID misconfig or something, I'll leave that for you ;)) [12:30:24] hi Philippe [12:30:49] it seems like you guys have this handled already but do let me/us know if we can help [12:31:06] (this is Faidon, btw) [12:31:12] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1227327 (10Yurik) I second @akosiaris sentiment - #1 (`/api/content/v1/...`) seems like a much cleaner choice out of 
the ones mentioned above. Alternatively, un... [12:31:15] paravoid: ok thanks! I'll take a look [12:31:43] I don't think there's much for operations to do. It's just a local admin causing a big mess that the admins will have to clean up... [12:31:47] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1227329 (10BBlack) >>! In T83580#1227193, @faidon wrote: > varnish -> varnishkafka -> kafka brokers -> kafkatee -> statsd -> carbon I think it's worse than that actually. Locally on the machine, there's also a chain of var... [12:42:36] (03CR) 10Negative24: "Whoops. Forgot that you don't have +2." [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [12:43:29] akosiaris: any progress on the graphoid front perhaps? [12:43:29] :D [12:43:50] mobrovac: yup, I 'll be posting some patches shortly for IPs/DNS and such [12:43:55] then I am reviewing your patch [12:44:02] yuhu [12:44:03] cool [12:44:06] akosiaris: thnx [12:45:50] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1227337 (10Anomie) >>! In T95229#1227327, @Yurik wrote: > Alternatively, unless there is a good reason not to, we could put the `api` into the domain, e.g. `//a... [12:47:44] paravoid: still on xenon's console by chance? [12:48:04] godog: ah, yes, sorry; logged off now [12:49:27] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1227338 (10Yurik) Thanks @anomie. So back to `/api/content/v1/`, or a shorter `/a/c/1/` since some browsers might not support header compression, and it will be...
[12:50:09] np [12:50:28] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1227340 (10Anomie) Personally I'd much rather see the reasonably clear /api/content/ than the cryptic /a/c/. [12:53:18] 6operations, 10ops-eqiad, 10Wikimedia-Logstash: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1227341 (10Manybubbles) @bd808 are you still taking the post-ops setup? I can certainly help. I know your pretty crazy busy at this point. [13:10:52] 6operations, 10ops-eqiad: additional ssd for xenon, cerium and praseodymium ? - https://phabricator.wikimedia.org/T96841#1227377 (10fgiunchedi) 3NEW a:3Christopher [13:20:24] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1227390 (10BBlack) 3NEW [13:25:39] !log switched eqiad<->ulsfo link to Giglinx [13:25:45] Logged the message, Master [13:27:25] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1227398 (10faidon) At esams, the aggregator would be better to be moved from hooft to nescio/maerlant. These are recdns backends & NTP, so Ganglia fits better there. These are already jessie. 
[13:29:35] (03PS2) 10Rush: Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [13:37:31] !log ms-be101[678] weight to 2820 [13:37:37] Logged the message, Master [13:39:58] (03CR) 10Rush: [C: 04-1] Initialize phd directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [13:40:36] 6operations, 10Traffic: Update prod custom varnish package for upstream 3.0.7 + deploy - https://phabricator.wikimedia.org/T96846#1227434 (10BBlack) 3NEW [13:43:38] 6operations, 10Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1227445 (10BBlack) 3NEW [13:44:31] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [13:47:51] 6operations, 10Traffic: Support ALPN + HTTP/2 - https://phabricator.wikimedia.org/T96848#1227458 (10BBlack) 3NEW [13:52:47] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x - https://phabricator.wikimedia.org/T96850#1227475 (10BBlack) 3NEW [13:53:48] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x - https://phabricator.wikimedia.org/T96850#1227483 (10BBlack) [13:53:50] 6operations, 10Traffic: Support ALPN + HTTP/2 - https://phabricator.wikimedia.org/T96848#1227482 (10BBlack) [13:58:00] 6operations, 10Traffic: Evaluate limited caching inside nginx - https://phabricator.wikimedia.org/T96851#1227492 (10BBlack) 3NEW [13:58:23] 6operations, 10Traffic: Evaluate limited caching inside nginx - https://phabricator.wikimedia.org/T96851#1227499 (10BBlack) [13:58:25] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x - https://phabricator.wikimedia.org/T96850#1227500 (10BBlack) [14:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T1400). Please do the needful. 
[14:00:42] 6operations, 10Traffic: Evaluate limited caching inside nginx - https://phabricator.wikimedia.org/T96851#1227511 (10BBlack) p:5Triage>3Low [14:00:50] 6operations, 10Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1227514 (10BBlack) [14:00:51] 6operations, 10Traffic: Evaluate limited caching inside nginx - https://phabricator.wikimedia.org/T96851#1227492 (10BBlack) [14:02:33] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1227521 (10Ottomata) > We could easily create a pipeline of "varnishncsa | parse-and-send-stats" or even create a thin varnishstats daemon specially built for this. +1 > Locally on the machine, there's also a chain of varni... [14:02:40] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:56] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1227550 (10fgiunchedi) >>! In T83580#1227521, @Ottomata wrote: >> We could easily create a pipeline of "varnishncsa | parse-and-send-stats" or even create a thin varnishstats daemon specially built for this. > +1 +1, also n... [14:15:58] 6operations, 10Traffic: Deploy recdns + ntp @ ulsfo - https://phabricator.wikimedia.org/T96852#1227571 (10BBlack) 3NEW [14:19:44] 6operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#1227616 (10BBlack) 3NEW [14:19:54] 6operations, 10Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1227625 (10BBlack) [14:19:55] 6operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#1227624 (10BBlack) [14:24:07] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1227653 (10akosiaris) There are a couple of things missing from this ticket. 
Those are: - A clear description of what this service does. Preferably... [14:27:41] (03PS1) 10Alexandros Kosiaris: Assign LVS IPs to the graphoid service [dns] - 10https://gerrit.wikimedia.org/r/205856 (https://phabricator.wikimedia.org/T90487) [14:29:46] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1227662 (10BBlack) 3NEW [14:30:52] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1227673 (10BBlack) [14:33:21] (03CR) 10Anomie: [C: 04-1] "There are still multiple objections to "v1" as a name on the associated task." [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [14:35:53] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Fix ipv6 autoconf issues - https://phabricator.wikimedia.org/T94417#1227684 (10BBlack) [14:40:09] (03PS1) 10Cmjohnson: Adding dns entries for logstash1004-6 (https://phabricator.wikimedia.org/T96692) [dns] - 10https://gerrit.wikimedia.org/r/205859 [14:40:54] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review, 7Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1227717 (10BBlack) [14:41:31] 6operations, 10Traffic: implement better failure-scenario geoip mapping in gdnsd - https://phabricator.wikimedia.org/T94697#1227725 (10BBlack) [14:41:43] 6operations, 10Traffic, 7HTTPS: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1227728 (10BBlack) [14:41:55] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1227730 (10mobrovac) >>! In T90487#1227653, @akosiaris wrote: > There are a couple of things missing from this ticket. Those are: > > - A clear desc... 
[14:42:04] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1227733 (10BBlack) [14:42:27] 6operations, 10Traffic, 7Varnish: Varnish GeoIP is broken for HTTPS+IPv6 traffic - https://phabricator.wikimedia.org/T89688#1227741 (10BBlack) [14:42:44] 6operations, 10Traffic, 7Mobile, 7Varnish: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#1227744 (10BBlack) [14:42:53] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for logstash1004-6 (https://phabricator.wikimedia.org/T96692) [dns] - 10https://gerrit.wikimedia.org/r/205859 (owner: 10Cmjohnson) [14:43:11] 6operations, 10Traffic, 7Varnish: Varnish: the lower the Age value, the slower the request - https://phabricator.wikimedia.org/T84980#1227752 (10BBlack) [14:43:46] mobrovac: which endpoints does graphoid provide ? [14:44:11] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1227761 (10BBlack) [14:44:13] 6operations, 10Traffic, 7Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1227763 (10BBlack) [14:44:37] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1227766 (10BBlack) [14:44:44] akosiaris: {domain}/v1/{title}/{revid}/{id}.png [14:44:50] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1227769 (10BBlack) [14:45:05] hm on that note, that should probably be changed to {id}/png [14:45:14] .png is no bueno really [14:45:17] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 7Performance: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1227773 (10BBlack) [14:45:29] 6operations, 10Traffic, 7HTTPS, 
5HTTPS-by-default: HTTPS performance & UA adoption metrics - https://phabricator.wikimedia.org/T86664#1227776 (10BBlack) [14:45:41] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1227778 (10BBlack) [14:46:40] mobrovac: hmmm so no / ? [14:46:52] akosiaris: ? [14:46:52] it is going to make it difficult to monitor [14:47:00] at the end you mean? [14:47:01] we need a GET /something [14:47:07] for monitoring purposes [14:47:10] ah yeah right [14:47:18] /{domain}/v1/blabla [14:47:26] forgot the initial one sorry [14:47:35] blabla ? [14:47:41] haha [14:47:42] that is a variable ? [14:47:48] or an actuall blabla ? [14:47:55] no, was lazy re-writing it all [14:48:00] lemme c/p [14:48:07] lol [14:48:14] 6operations, 10Traffic, 7HTTPS, 10Security-Core: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298#1227789 (10BBlack) [14:48:18] akosiaris: /{domain}/v1/{title}/{revid}/{id}.png [14:48:31] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1227791 (10BBlack) [14:48:32] I am anyway commenting on your change so better take it there [14:48:43] akosiaris: i'll try to change that into /png instead of .png (at the end) [14:48:44] 6operations, 10Traffic, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1227792 (10BBlack) [14:48:55] k [14:51:22] manybubbles, marktraceur, ^d, thcipriani: Who wants to SWAT this morning? [14:51:48] Hrm [14:51:58] I can, but I'd prefer not to [14:52:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall looks quite good, here's a first round of comments. The most important seems to be the split in 3 patches so we can do a staged ou" (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [14:52:38] mobrovac: thanks! 
[14:52:48] that is a quite nice commit btw^ [14:52:54] hehe thnx [14:53:00] it took me a while to put that together [14:53:03] <^d> jouncebot: next [14:53:03] In 0 hour(s) and 6 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T1500) [14:53:08] I can imagine ;-) [14:53:17] and i agree that all of that should be split in various commits [14:53:29] <^d> anomie: I can [14:53:34] i was kind of curious how big a commit that would be if we put everything there :) [14:53:41] now i know [14:53:42] :P [14:53:59] yeah, Roan made that mistake too with citoid :P [14:54:09] and me before him with mathoid IIRC [14:54:16] hehe [14:54:22] <^d> James_F|Away: Ping for swat in ~5m [14:54:36] akosiaris: speaking of mathoid, imho we shold create a deploy repo for it just as for other services [14:55:10] akosiaris: now we have the ability to generate exact trusty/jessie deps for services in a container to be sure the versions match prod [14:55:10] hmmm I am not against it [14:55:48] that is for binary node modules, right ? [14:55:55] and afaik, it has already integrated service-runner so we might be able to wrap it in a nice service::node class too [14:55:59] akosiaris: yes [14:56:07] I 'd love that! [14:56:31] akosiaris: also, some node modules install diff versions depending on the nodejs installed on the system [14:56:54] so having a build env that's got all the same things as prod is rather important [14:57:23] hmm, so depending on whether you run node 0.8 or 0.10 npm is capable of that ? Nice [14:57:32] hopefully in the long term we will be able to test and build them automatically using hashar 's isolated CI [14:57:49] akosiaris: in some instances yes, it's not a real general rule of thumb [14:58:03] (you need to say so in the package.json or some shim like that) [14:58:09] yeah, makes sense [14:58:45] ^d: Yup, here. 
[14:58:59] <^d> Starting the jenkins dance now [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, James_F: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T1500). Please do the needful. [15:01:37] <_joe_> mobrovac: I'll take your puppet patch as a base for what needs to be simplified in setting up a service [15:02:01] _joe_: taking words out of my mouth :) [15:02:09] thnx [15:02:21] <_joe_> mobrovac: my dream is you just set up the service::node instance, and include a service description in hiera, for now [15:02:47] ah we dream the same dreams it seems [15:03:04] <_joe_> which means, I need to do the big refactor of the lvs classes after all [15:03:07] 6operations, 10Traffic: VCL support for Last-Access cookie - https://phabricator.wikimedia.org/T96861#1227837 (10BBlack) 3NEW [15:03:13] (03CR) 1020after4: Move maniphest status settings into custom/wmf-defaults.php (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [15:03:59] btw akosiaris _joe_ you can probably expect some patches for service::node and base::service_unit , i'd like service::node to be able to serve systemd init configs as well, not only upstart ones [15:04:04] (probably not today though) [15:04:17] too may "probably" here lol [15:04:22] s/may/many/ [15:04:23] 6operations, 10Traffic: VCL support for Last-Access cookie - https://phabricator.wikimedia.org/T96861#1227850 (10BBlack) a:3BBlack [15:07:50] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1227892 (10Eevans) >>! In T92471#1223912, @fgiunchedi wrote: >>>! In T92471#1220602, @GWicke wrote: >> We can enforce localhost-only access from a specif... 
[15:08:17] (03CR) 10Alexandros Kosiaris: WIP: Graphoid: Puppet bits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [15:09:48] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 7Nodepool, and 2 others: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1227905 (10hashar) A status update: The initial Debian packaging is ready for review https://gerrit.wikimedia.... [15:10:12] (03PS2) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) [15:10:20] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 914.408012717 [15:12:08] !log demon Synchronized php-1.26wmf1/extensions/WikiEditor/: (no message) (duration: 00m 13s) [15:12:13] Logged the message, Master [15:12:26] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1227916 (10hashar) 3NEW [15:13:31] <^d> James_F: wmf1 was ok. wmf2 didn't actually update any submodules after pulling to tin [15:13:35] ^d: WikiEditor seems ffine. [15:13:41] ^d: Huh. Odd. [15:14:05] ^d: Neither of them? [15:14:09] <^d> Nope [15:14:15] Hmmmmm. [15:14:19] <^d> WikiEditor is still wmf2 branch point + 1 cherry pick [15:14:26] 6operations, 10Traffic: Update prod custom varnish package for upstream 3.0.7 + deploy - https://phabricator.wikimedia.org/T96846#1227935 (10BBlack) [15:14:31] Maybe they were accidentally pushed early by someone else? [15:14:39] <^d> VE is your A/B commits on top [15:15:08] Bleh. Do I need to re-make the commits? [15:15:38] <^d> Why the fuck is wmf2 tracking master on tin?!? [15:15:46] Oh, what? [15:15:49] That's not good. 
[15:15:50] <^d> # On branch master [15:15:50] <^d> # Your branch and 'origin/master' have diverged, [15:15:50] <^d> # and have 36 and 139 different commits each, respectively. [15:16:00] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1227948 (10hashar) I have poked our internal ops list to get some tips and hints. [15:16:08] 6operations, 10ops-eqiad, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1227949 (10Cmjohnson) IP's have been setup logstash1004 1H IN A 10.64.0.162 logstash1005 1H IN A 10.64.16.185 logstash1006 1H IN A 10.64.48.109 logstash1004 1H... [15:16:09] For MW core? [15:16:14] <^d> Yes [15:16:20] <^d> demon@tin /srv/mediawiki-staging/php-1.26wmf2 (master)$ [15:16:46] * James_F sighs. [15:16:52] Well, that might break things. [15:17:02] <^d> srsly [15:17:26] (03CR) 10Alexandros Kosiaris: WIP: Graphoid: Puppet bits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [15:17:28] can you reflog and figure out what happened? [15:17:31] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 4843.4698005 [15:18:12] * bd808 guesses it would be related to a submodule bump gone horribly wrong in deploy [15:18:31] <^d> 8ad1344 HEAD@{11}: checkout: moving from wmf/1.26wmf2 to master [15:19:23] !log demon Synchronized php-1.26wmf2/extensions/WikiEditor/: (no message) (duration: 00m 11s) [15:19:26] Logged the message, Master [15:19:43] !log demon Synchronized php-1.26wmf2/extensions/VisualEditor/: (no message) (duration: 00m 12s) [15:19:45] <^d> James_F: You good now [15:19:46] Logged the message, Master [15:19:52] Aha. Awesome. Thanks, ^d. 
[15:20:47] 6operations, 10ops-eqiad, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1227966 (10bd808) >>! In T96692#1227341, @Manybubbles wrote: > @bd808 are you still taking the post-ops setup? I can certainly help. I know your pretty crazy bus... [15:20:51] <^d> bd808: deploying master -- living on the edge :p [15:21:04] :-) [15:21:06] <^d> I wonder if it's worth a scap to make sure everything's back where it should be [15:21:19] <^d> Can't hurt [15:21:51] !log demon Started scap: 1.26wmf2 was tracking master. should be fixed, being paranoid and doing full sync + i18n rebuild [15:21:56] Logged the message, Master [15:22:43] (03PS3) 1020after4: Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [15:23:30] (03CR) 1020after4: Initialize phd directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [15:23:44] (03CR) 1020after4: [C: 031] Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [15:25:49] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1227969 (10GWicke) @fgiunchedi, the point is that we are making this user / group-based in any case: In one case by making the password file readable to... [15:26:21] (03PS2) 10Andrew Bogott: Remove icinga rights for a couple of departed employees. [puppet] - 10https://gerrit.wikimedia.org/r/202759 [15:27:50] (03CR) 10Andrew Bogott: [C: 032] Remove icinga rights for a couple of departed employees. 
[puppet] - 10https://gerrit.wikimedia.org/r/202759 (owner: 10Andrew Bogott) [15:28:26] (03PS6) 10Andrew Bogott: puppetmaster: Cleanup autosigner script some more [puppet] - 10https://gerrit.wikimedia.org/r/198790 (owner: 10Yuvipanda) [15:28:32] (03CR) 1020after4: "Isn't there a way to ship just a list of the dependencies and pull those in at runtime?" [software/sentry] - 10https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [15:29:57] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1227981 (10mobrovac) [15:30:03] !log demon Finished scap: 1.26wmf2 was tracking master. should be fixed, being paranoid and doing full sync + i18n rebuild (duration: 08m 11s) [15:30:07] Logged the message, Master [15:30:30] !log stopped restbase on restbase1002 in preparation for cmjohnson1 checking the hardware [15:30:33] Logged the message, Master [15:30:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [15:32:37] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1227993 (10akosiaris) >> - Who's running point >> I am assuming the services team, but I may be wrong and we definitely want to have that documented... [15:32:57] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1227995 (10mobrovac) Current state of affairs: we are getting ready for deployment, but we need these blocking tasks to be resolved first: - {T96126} - It contains some impo... 
[15:37:08] 6operations, 5Patch-For-Review: Upgrade xenon, cerium and praseodymium to jessie - https://phabricator.wikimedia.org/T90955#1228002 (10fgiunchedi) checking for SSDs in T96841 [15:37:32] (03CR) 10Andrew Bogott: [C: 031] "This looks great. Do you mind adding one more feature? Break out the code that purges certs with no match in ldap and wrap it with 'pupp" [puppet] - 10https://gerrit.wikimedia.org/r/198790 (owner: 10Yuvipanda) [15:38:07] 6operations, 10ops-eqiad: additional ssd for xenon, cerium and praseodymium ? - https://phabricator.wikimedia.org/T96841#1228004 (10fgiunchedi) a:5Christopher>3Cmjohnson [15:38:48] (03CR) 10Andrew Bogott: [C: 032] Labs: Mute client-side notifications for wikitech Puppet status [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [15:44:02] 6operations, 10Traffic: reinstall/rename dysprosium as cp1099 (upload eqiad) - https://phabricator.wikimedia.org/T96873#1228018 (10BBlack) 3NEW [15:46:21] (03CR) 10Andrew Bogott: [C: 04-1] mediawiki: Add test to verify redirects.conf has been regenerated from redirects.dat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: 10Legoktm) [15:49:27] (03CR) 10Andrew Bogott: "Other than a minor whitespace complaint, I think this can be merged whenever it suits you." 
(031 comment) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) (owner: 10Hashar) [15:50:03] (03CR) 10Andrew Bogott: [C: 032 V: 032] Merge remote-tracking branch 'wikimedia/upstream' into debian [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203960 (owner: 10Hashar) [15:50:37] (03PS1) 10BBlack: remove amssq*/dysprosium from installer custom stuff [puppet] - 10https://gerrit.wikimedia.org/r/205873 [15:54:06] (03CR) 10BBlack: [C: 032] remove amssq*/dysprosium from installer custom stuff [puppet] - 10https://gerrit.wikimedia.org/r/205873 (owner: 10BBlack) [15:57:10] (03PS4) 10Rush: Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [15:57:25] (03CR) 10Rush: [C: 032 V: 032] Initialize phd directory [puppet] - 10https://gerrit.wikimedia.org/r/205810 (https://phabricator.wikimedia.org/T96567) (owner: 10Negative24) [15:57:57] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x or higher - https://phabricator.wikimedia.org/T96850#1228063 (10BBlack) [15:59:44] 6operations: zookeeper evaluation - https://phabricator.wikimedia.org/T96839#1228084 (10Joe) As dynamic reconfiguration is only supported in ZK 3.5, we need to use that, and jessie only has 3.4.5 at the moment. 
[16:01:46] 6operations, 10ops-esams, 10Traffic: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1228091 (10BBlack) [16:06:34] PROBLEM - Host restbase1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:38] (03PS12) 10Ottomata: Puppetize impala [puppet/cdh] - 10https://gerrit.wikimedia.org/r/205446 (https://phabricator.wikimedia.org/T96329) [16:10:33] (03PS13) 10Ottomata: Puppetize impala [puppet/cdh] - 10https://gerrit.wikimedia.org/r/205446 (https://phabricator.wikimedia.org/T96329) [16:11:45] RECOVERY - Host restbase1002 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [16:13:25] 6operations, 10ops-eqiad: additional ssd for xenon, cerium and praseodymium ? - https://phabricator.wikimedia.org/T96841#1228110 (10Cmjohnson) Yes I have 7 or 8 spare on-site SSDSA2M160G2GN [16:14:57] 6operations, 7network: setup network switch ports / vlans for db2053-2070 - https://phabricator.wikimedia.org/T96385#1228113 (10RobH) So, any and all changes seem to invoke these errors: {master:2}[edit] robh@asw-d-codfw# show | compare [edit interfaces] + ge-6/0/0 { + description db2052; + ena... [16:16:36] 6operations, 10ops-eqiad: additional ssd for xenon, cerium and praseodymium ? - https://phabricator.wikimedia.org/T96841#1228116 (10fgiunchedi) thanks @cmjohnson, we'd need to swap one of the HDDs in xenon, cerium and praseodymium and replace with the SSD, there should be another HDD per machine which should b... [16:20:29] 6operations, 10ops-eqiad: additional ssd for xenon, cerium and praseodymium ?
- https://phabricator.wikimedia.org/T96841#1228118 (10fgiunchedi) also machines are good to be powered off, not in service [16:24:28] (03PS1) 10BBlack: update pybal weight descriptions [puppet] - 10https://gerrit.wikimedia.org/r/205878 [16:24:30] (03PS1) 10BBlack: Add storage cfg for cp1099 T96873 [puppet] - 10https://gerrit.wikimedia.org/r/205879 [16:24:38] (03CR) 10Tim Landscheidt: "Ah, finally:" [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [16:24:55] (03CR) 10BBlack: [C: 032 V: 032] update pybal weight descriptions [puppet] - 10https://gerrit.wikimedia.org/r/205878 (owner: 10BBlack) [16:25:30] (03CR) 10BBlack: [C: 032 V: 032] Add storage cfg for cp1099 T96873 [puppet] - 10https://gerrit.wikimedia.org/r/205879 (owner: 10BBlack) [16:27:30] 6operations, 10Traffic: reinstall/rename dysprosium as cp1099 (upload eqiad) - https://phabricator.wikimedia.org/T96873#1228132 (10BBlack) [16:29:23] 7Puppet, 6Phabricator: Initialize phd working directory for phd services - https://phabricator.wikimedia.org/T96567#1228138 (10Negative24) [16:29:32] 7Puppet, 6Phabricator: Initialize phd working directory for phd services - https://phabricator.wikimedia.org/T96567#1228140 (10Negative24) 5Open>3Resolved [16:31:07] 7Puppet, 6Phabricator: Initialize phd working directory for phd services - https://phabricator.wikimedia.org/T96567#1220826 (10Negative24) [16:37:56] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to francium for gwicke,mobrovac,eevans (htmldumps-admins) - https://phabricator.wikimedia.org/T94093#1228143 (10Dzahn) via the changes above we now have: - a new group "htmldump-admins" on francium: [francium:/srv/www]... 
[16:40:05] PROBLEM - puppet last run on bast2001 is CRITICAL puppet fail [16:43:15] (03PS1) 10Dzahn: admin: make gwicke,mobrovac,eevans htmldumps-admin [puppet] - 10https://gerrit.wikimedia.org/r/205883 (https://phabricator.wikimedia.org/T94093) [16:52:22] hasharAway: o/ [16:55:47] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to francium for gwicke,mobrovac,eevans (htmldumps-admins) - https://phabricator.wikimedia.org/T94093#1228215 (10Dzahn) @RobH @ArielGlenn @mobrovac Here's what i did: - created a new empty admin group (https://gerrit.wik... [16:56:35] RECOVERY - puppet last run on bast2001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:57:43] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1228222 (10GWicke) The problem with 'content' is that it is not general enough, as this API will expose non-content information as well. Using `v1`... [16:57:50] (03CR) 10Ottomata: [C: 032] Puppetize impala [puppet/cdh] - 10https://gerrit.wikimedia.org/r/205446 (https://phabricator.wikimedia.org/T96329) (owner: 10Ottomata) [17:01:47] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1228242 (10Anomie) I'd still prefer something more descriptive than "rest"—as you said on IRC, it's the underlying technology rather than any indic... [17:03:29] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to francium for gwicke,mobrovac,eevans (htmldumps-admins) - https://phabricator.wikimedia.org/T94093#1228243 (10Dzahn) Once that is merged the first 2 parts of the request should be resolved. And the third part "ability... 
[17:04:15] James_F: heh, damn you are fast :) [17:06:04] 6operations, 7network: setup network switch ports / vlans for db2053-2070 - https://phabricator.wikimedia.org/T96385#1228254 (10RobH) a:5mark>3RobH mark fixed this via irc conversation, so stealing this back [17:10:06] (03PS4) 10Dzahn: sshd: don't use NIST key exchange protocols [puppet] - 10https://gerrit.wikimedia.org/r/185321 [17:10:19] heya _joe_, really unsure of where to set a hiera variable. [17:10:35] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to francium for gwicke,mobrovac,eevans (htmldumps-admins) - https://phabricator.wikimedia.org/T94093#1228262 (10mobrovac) > Wanna confirm that looks sane by reviewing the last one? LGTM (together with the others) > And... [17:10:38] i've got this: [17:10:39] https://gist.github.com/ottomata/80c4877c901ee67314ac [17:10:53] 1. not sure if I should scrap the role parameter and just set the cdh module one directly in hiera [17:10:54] and [17:10:59] 2. 
no idea where the proper place to do that would be [17:11:01] this will not be a main role [17:11:07] (03CR) 10Dzahn: "Replaced 'else' with '<% elsif scope.function_os_version(['ubuntu == precise'])%>' as on the other patch" [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn) [17:14:12] 6operations: install/setup/deploy db2052-db2070 - https://phabricator.wikimedia.org/T96383#1228272 (10RobH) [17:14:13] 6operations, 7network: setup network switch ports / vlans for db2053-2070 - https://phabricator.wikimedia.org/T96385#1228270 (10RobH) 5Open>3Resolved and these are all set to go [17:14:51] 6operations: install/setup/deploy db2052-db2070 - https://phabricator.wikimedia.org/T96383#1215693 (10RobH) [17:15:40] 6operations: install/setup/deploy db2052-db2070 - https://phabricator.wikimedia.org/T96383#1215693 (10RobH) @papaul: Please set the raid10 up with them set for all disks per host in each array, read ahead, write back, 256kb stripe [17:15:57] (03CR) 10Mobrovac: [C: 031] "LGTM (including the previous work done on this)" [puppet] - 10https://gerrit.wikimedia.org/r/205883 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [17:22:05] (03CR) 10Dzahn: "@Giuseppe: How about this patch nowadays? Is your -1 still current?" [puppet] - 10https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: 10Dzahn) [17:23:52] (03CR) 10Dzahn: "@JohnLewis any idea which services were lacking holes for this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/194802 (owner: 10Dzahn) [17:25:47] (03PS16) 10GWicke: Set up /api/v1/ entry point for restbase [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) [17:28:47] (03PS17) 10GWicke: Set up /api/rest_v1/ entry point for restbase [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) [17:29:15] (03PS3) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202980 [17:30:14] (03CR) 10Ori.livneh: "but should it be 'rest_v1' or 'rest-v1'? (I'm kidding, I'm kidding!)" [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [17:30:47] (03CR) 10Dzahn: "won't be dangerous until base::firewall is on nodes" [puppet] - 10https://gerrit.wikimedia.org/r/202980 (owner: 10Dzahn) [17:32:28] (03CR) 10GWicke: "@Ori: Hey, you owe us a +1 for that trolling!" [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [17:32:45] (03PS4) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202980 [17:33:21] (03PS5) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202980 [17:37:24] (03CR) 10Dzahn: "can go anytime but it's more like a reminder that we still have to create a dump of it. 
https://phabricator.wikimedia.org/T90679#1143780" [puppet] - 10https://gerrit.wikimedia.org/r/205457 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn) [17:39:23] (03CR) 10Dzahn: "so it's currently used as a work around on the integration puppetmaster (where it works) but not in actual labs and i don't know what to d" [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [17:48:02] ottomata https://phabricator.wikimedia.org/P481 - which I can't reproduce despite of all try :D [17:48:13] while akosiaris can :) [17:48:58] kart_: tbh i was looking at something else while you were talking! muchos sorry, soooo what's up? where did this come from? is this something that jenkins is trying to build automated? [17:53:21] kart_: fill me in! [17:55:20] ottomata: nope. This is local build. We've no jenkins there. [17:56:22] (03PS1) 10Dzahn: dumps: move hiera data to correct location [puppet] - 10https://gerrit.wikimedia.org/r/205888 [17:56:35] you build on local machine? [17:56:48] ottomata: yes [17:56:49] is this a patch for review? [17:57:13] ottomata: one minute. let me find. old dusty patch now. [17:57:52] ottomata: https://gerrit.wikimedia.org/r/#/c/195897/ [17:57:56] apergos: hey! around? [17:58:34] (03PS2) 10Dzahn: dumps: move hiera data to correct location [puppet] - 10https://gerrit.wikimedia.org/r/205888 [17:58:53] (03PS3) 10Dzahn: dumps: move hiera data to correct location [puppet] - 10https://gerrit.wikimedia.org/r/205888 [18:00:15] 10Ops-Access-Requests, 6operations: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1228368 (10Mholloway) 3NEW [18:00:29] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T1800). 
[18:00:37] (03PS1) 10Alex Monk: Allow wikitech to use local username blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205889 [18:00:53] YuviPanda: yes though about to eat (early here), what's up? [18:01:12] apergos: dewiki dumps broken in some form (rsync to labs or dumps itself) [18:01:14] reported on labs-l [18:01:38] apergos: oh, damn, forgot it’s probably very late for you [18:01:40] the rsync if it were broken would be broken for all of em [18:01:47] (03CR) 10Andrew Bogott: [C: 031] "plus one!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205889 (owner: 10Alex Monk) [18:01:54] what was the report exactly? [18:02:11] YuviPanda: [18:02:25] apergos: 20150321/ was last dewiki dump available on labs [18:02:27] that’s a month ago [18:02:33] woo exactly almost even [18:02:38] ok, I'll see if it just didn't complete a good run [18:02:46] cool :) [18:02:47] and take care of that (tomorrow) [18:02:47] 6operations, 5Continuous-Integration-Isolation: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1228381 (10RobH) a:5RobH>3chasemp I'm going to assign this to chase, only while the discussion is pending about the networking. (Since he is discussing with @mark).... 
[18:03:58] (03PS7) 10Yuvipanda: puppetmaster: Cleanup autosigner script some more [puppet] - 10https://gerrit.wikimedia.org/r/198790 [18:04:04] apergos: thanks :) [18:04:07] (03CR) 10Dzahn: [C: 032] dumps: move hiera data to correct location [puppet] - 10https://gerrit.wikimedia.org/r/205888 (owner: 10Dzahn) [18:04:21] yw, it will be ni fact rerunning one step so no worries :-) [18:04:33] *in fact [18:04:40] :D [18:04:42] cool [18:04:48] we should setup monitoring for these in some form [18:04:49] hmm [18:04:50] yes, that change _looks_ like it can influence dumps rsync stuff and you happen to be talking about it , but it's not :) [18:05:07] (03PS8) 10Yuvipanda: puppetmaster: Cleanup autosigner script some more [puppet] - 10https://gerrit.wikimedia.org/r/198790 [18:05:14] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Cleanup autosigner script some more [puppet] - 10https://gerrit.wikimedia.org/r/198790 (owner: 10Yuvipanda) [18:05:51] folk who want to know what the most recent successful dumps run is should check the index page [18:06:12] it will list failures too, really the most helpful in case they don't see something showingup [18:06:55] yeah but we should have an automated script [18:10:14] I guess I'm to used to people camping on that page (or the rss feeds) and grabbing a file as soon as it becomes available [18:10:50] well, I'm off to try to eat... 
not so early I guess :-) [18:11:30] ottomata: https://gerrit.wikimedia.org/r/#/c/202801/ fyi [18:11:34] (my review, I mean) [18:12:49] (03PS1) 10Dzahn: dumps: use role-based hiera lookup on dumps nodes [puppet] - 10https://gerrit.wikimedia.org/r/205893 [18:13:07] (03PS2) 10Dzahn: dumps: use role-based hiera lookup on dumps nodes [puppet] - 10https://gerrit.wikimedia.org/r/205893 [18:15:20] (03PS1) 10Yuvipanda: toollabs: Include ldap labs config before using it [puppet] - 10https://gerrit.wikimedia.org/r/205894 [18:15:26] (03CR) 10Dzahn: "this needs https://gerrit.wikimedia.org/r/#/c/205893 so we can use role-based hiera lookup (i moved the data back to hieradata/role/common" [puppet] - 10https://gerrit.wikimedia.org/r/202980 (owner: 10Dzahn) [18:15:41] (03PS2) 10Yuvipanda: toollabs: Include ldap labs config before using it [puppet] - 10https://gerrit.wikimedia.org/r/205894 [18:15:55] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Include ldap labs config before using it [puppet] - 10https://gerrit.wikimedia.org/r/205894 (owner: 10Yuvipanda) [18:16:58] ori: revert of a revert you once reverted https://gerrit.wikimedia.org/r/#/c/178280/ ( i guess it's fine :) [18:17:40] eh, wrong link. https://gerrit.wikimedia.org/r/#/c/184637/ [18:17:48] that reverts the former [18:19:18] hm ori, i don' think we want multiple X-Analytics headers, no idea what varnish stuff would do there [18:19:20] i'm fine with the exception [18:20:15] PROBLEM - puppet last run on eventlog2001 is CRITICAL puppet fail [18:22:57] (03PS6) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 [18:23:05] (03PS1) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap. [puppet] - 10https://gerrit.wikimedia.org/r/205897 [18:23:54] (03CR) 10Chad: "PS6 restores PS4 + a rebase. 
While PS5 is probably a nicer end result it's a harder diff to review and could easily be done after this lan" [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad) [18:23:58] <^d> YuviPanda: ^ [18:24:10] I wonder if _joe_ is around [18:25:25] does this look right to get role-based hiera lookup and not change things? https://gerrit.wikimedia.org/r/#/c/205893/2/manifests/site.pp [18:26:27] hrmm, always this slight difference in ssh/userkeys that makes puppet compiler runs be not really identical when everything else would be [18:26:30] http://puppet-compiler.wmflabs.org/738/change/205893/html/ms1001.wikimedia.org.html [18:26:48] cant have an OK, would be nice [18:28:09] well the rest is identical so should be fine [18:30:27] (03PS2) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap. [puppet] - 10https://gerrit.wikimedia.org/r/205897 [18:31:25] (03CR) 10Andrew Bogott: "Todo:" [puppet] - 10https://gerrit.wikimedia.org/r/205897 (owner: 10Andrew Bogott) [18:32:59] (03CR) 10Dzahn: [C: 032] "checked in compiler http://puppet-compiler.wmflabs.org/738/change/205893/" [puppet] - 10https://gerrit.wikimedia.org/r/205893 (owner: 10Dzahn) [18:33:08] ottomata: can you comment on the patch? 
[18:34:46] ori ja [18:36:06] (03CR) 10Dzahn: "also watched on dataset1001 and ms1001 - noop" [puppet] - 10https://gerrit.wikimedia.org/r/205893 (owner: 10Dzahn) [18:36:25] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:36:53] (03CR) 10Ottomata: "I can reproduce this:" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [18:38:15] (03CR) 10Dzahn: [C: 032] "and now that https://gerrit.wikimedia.org/r/#/c/205893 is merged this should work just fine" [puppet] - 10https://gerrit.wikimedia.org/r/202980 (owner: 10Dzahn) [18:41:05] (03PS1) 10Dzahn: dumps: re-add prefix when naming variable in hiera [puppet] - 10https://gerrit.wikimedia.org/r/205900 [18:41:25] (03PS1) 10Cmjohnson: Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901 [18:41:28] (03PS2) 10Dzahn: dumps: re-add prefix when naming variable in hiera [puppet] - 10https://gerrit.wikimedia.org/r/205900 [18:41:57] (03CR) 10Dzahn: [C: 032] dumps: re-add prefix when naming variable in hiera [puppet] - 10https://gerrit.wikimedia.org/r/205900 (owner: 10Dzahn) [18:42:50] (03CR) 10Dzahn: [V: 032] dumps: re-add prefix when naming variable in hiera [puppet] - 10https://gerrit.wikimedia.org/r/205900 (owner: 10Dzahn) [18:43:04] PROBLEM - puppet last run on dataset1001 is CRITICAL puppet fail [18:43:22] i'm on that, no worries [18:43:28] it's running like now [18:44:12] ori: do you remember where the code that writes to graphite from eventlogging is? I couldn’t find it in the mw repo or puppet [18:44:35] RECOVERY - puppet last run on dataset1001 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:48:22] YuviPanda: to whisper or to graphite? [18:48:48] ori: graphite. 
the event success / failure counts [18:49:08] (03PS2) 10Cmjohnson: Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901 [18:49:17] YuviPanda: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/bin/eventlogging-reporter [18:49:49] aha! I’m an idiot and thank you [18:51:51] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901 (owner: 10Cmjohnson) [18:51:53] (03CR) 10RobH: [C: 031] admin: make gwicke,mobrovac,eevans htmldumps-admin [puppet] - 10https://gerrit.wikimedia.org/r/205883 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [18:52:07] (03PS1) 10Dzahn: dumps: put base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/205903 [18:52:19] (03PS2) 10Dzahn: dumps: put base::firewall on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/205903 [18:52:52] (03CR) 10RobH: [C: 032] admin: make gwicke,mobrovac,eevans htmldumps-admin [puppet] - 10https://gerrit.wikimedia.org/r/205883 (https://phabricator.wikimedia.org/T94093) (owner: 10Dzahn) [18:54:00] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to francium for gwicke,mobrovac,eevans (htmldumps-admins) - https://phabricator.wikimedia.org/T94093#1228669 (10RobH) 5Open>3Resolved I agree this all looks excellent. Thank you @dzahn for the patchsets/setup/puppeti... 
[18:54:02] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1228672 (10RobH) [18:54:04] 6operations, 5Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1228671 (10RobH) [18:54:14] robh: cool:) [18:54:31] are you looking at it on francium or should i [18:54:39] im about to run on francium [18:54:48] ok, *nod* [18:55:02] (03PS1) 10Dzahn: dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 [18:55:09] thx for all the patchsets dude, it really straightened out htat task [18:55:35] and i just watched it make all the shell accounts [18:55:44] gwicke: ^ you guys are set on francium [18:55:45] :) [18:56:20] it took a few attempts to get the role-based lookup from hiera right, glad it works [19:01:20] robh: hooray! [19:01:22] (03PS1) 1020after4: Remove 1.25wmf21 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205905 [19:01:24] (03PS1) 1020after4: Add 1.26wmf3 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205906 [19:01:26] (03PS1) 1020after4: Wikipedias to 1.26wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205907 [19:01:28] (03PS1) 1020after4: Group0 to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205908 [19:01:40] mutante: thanks! [19:02:04] gwicke: yw! [19:03:13] anyone around with hiera thoughts? i'm really confused about where I should put a parameter [19:04:36] ottomata: if you can, use role/common/foo.yaml [19:04:52] well, except. [19:04:56] ottomata: but for it to work you need to use the role keyword in site.pp vs. 
traditional include [19:05:04] right [19:05:41] ok so, i have 3 module classes [19:05:46] w of which require the first [19:05:47] ottomata: would you like my thoughts on hiera as a performance engineer, as a wikimedia foundation engineer, as a resident of san francisco, as a native speaker of hebrew, or as a human being? [19:05:48] and the first [19:06:03] is kinda like a client class [19:06:10] installs base config and client packages [19:06:13] it has one parameter [19:06:14] $master_host [19:06:30] so, i was going to make 3 role classes, one for each of these classes [19:06:41] client (base), worker, master [19:06:52] and i suppose, parameterise the client base class with $master_host as well [19:07:01] so I can set a hiera role variable as you suggest mutante [19:07:06] but, ok [19:07:27] only one or two nodes will really only have the client base class as a real role, the other ones include either of the other classes [19:07:29] like this: [19:07:35] https://gist.github.com/ottomata/80c4877c901ee67314ac [19:07:38] so, e.g. [19:07:42] stat1002 could do [19:07:42] (03PS1) 10Merlijn van Deen: Use exim4-heavy for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205910 (https://phabricator.wikimedia.org/T74867) [19:07:47] role analytics::impala [19:07:59] and i could set master_host in hierdata/role/common/analyitcs/impala.yaml [19:08:15] but, how does that work on nodes where the non base class would be included [19:08:18] do I have to do [19:08:22] role analytics::impala [19:08:25] on them, and then do [19:08:32] include role::analytics::impala::worker [19:08:32] ? [19:09:14] what happens with role hiera lookup if a role class inherits another one [19:09:25] say, if my ::worker class inhertied the base client class? 
[19:09:30] (03Abandoned) 10Tim Landscheidt: Tools: Include ldap::role::config::labs in toollabs::bastion [puppet] - 10https://gerrit.wikimedia.org/r/204543 (https://phabricator.wikimedia.org/T96266) (owner: 10Tim Landscheidt) [19:09:44] where would the parameter be looked up? [19:10:17] (03CR) 10Yuvipanda: [C: 032 V: 032] Use exim4-heavy for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205910 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [19:10:53] (03CR) 10coren: "That's toollabs::mailrelay and you need to remove the related package{} stanzas from there (otherwise you'll get multiple definitions)" [puppet] - 10https://gerrit.wikimedia.org/r/205910 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [19:10:54] ori, how about as a human being in the context of the fact that all of your evolutionary ancestors survived and here you are. [19:11:25] none of my evolutionary ancestors survived! except my parents and one grandmother. [19:11:32] hah [19:11:35] oh true [19:11:37] but they reproduced. gross. [19:11:38] you are right [19:11:43] maybe you will be the one to finally survive? [19:11:51] i hope so. it's been a long time coming. [19:12:00] (03CR) 1020after4: [C: 032] Remove 1.25wmf21 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205905 (owner: 1020after4) [19:12:13] (03CR) 1020after4: [C: 032] Add 1.26wmf3 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205906 (owner: 1020after4) [19:12:22] bblack, i think you have done some non insignificant hiera work. got a min? 
^^^ [19:12:31] (03PS1) 10Yuvipanda: Revert "Use exim4-heavy for tools-mail" [puppet] - 10https://gerrit.wikimedia.org/r/205911 [19:12:39] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Use exim4-heavy for tools-mail" [puppet] - 10https://gerrit.wikimedia.org/r/205911 (owner: 10Yuvipanda) [19:16:43] 7Puppet, 6Multimedia, 6Release-Engineering, 6Scrum-of-Scrums, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1228790 (10Tgr) [19:16:47] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests, 3Multimedia-Sprint-2015-03-25: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1228789 (10Tgr) [19:17:37] ottomata: a min for? [19:17:46] oh I see discussion up above... [19:17:58] give me a few to get stopped on what I'm staring at [19:18:18] k danke [19:19:08] (03PS1) 10Merlijn van Deen: Use exim4-heavy for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [19:19:36] (03PS1) 10Yuvipanda: tools: Ensure that exim-heavy only is on tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205915 [19:19:59] (03PS2) 10Merlijn van Deen: Use exim4-heavy for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [19:26:25] 10Ops-Access-Requests, 6operations: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1228833 (10RobH) @Mholloway, We'll need a few other items from you to make this happen, all outlined on https://wikitech.wikimedia.org/wiki/Requesting_shell_access. - We need your... [19:27:11] !log zuul gearman server is stalled [19:27:19] Logged the message, Master [19:27:55] hashar: I was just finishing a long question about what's up with zuul.... 
:) [19:28:19] (03PS1) 10RobH: mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) [19:28:28] (03CR) 10jenkins-bot: [V: 04-1] Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad) [19:28:59] SMalyshev: sorry :D [19:29:25] so I assume now it will be ok soon, great [19:31:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1228850 (10Mholloway) [19:31:07] hopefully [19:31:26] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1228368 (10Mholloway) Added @dr0ptp4kt for approval. [19:31:35] Zuul flushed all its queue and should be processing again soonish [19:32:16] (03CR) 10jenkins-bot: [V: 04-1] dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 (owner: 10Dzahn) [19:34:02] ottomata: so, first off: role inheritance + hiera don't mix well. Try to use an include pattern instead there. [19:34:28] (03PS3) 10Merlijn van Deen: Manually configure exim4 class for for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [19:34:30] ya it feels bad to me on its own [19:34:34] i'm not using inheritance here atm [19:35:06] ok [19:35:08] so, bblack, i would like to be able to do: [19:35:15] include role::analytics::impala [19:35:22] inside of role::analytics::clients [19:35:33] and then on specific nodes [19:35:41] include role::analytics::impala::worker [19:35:42] 6operations, 10Wikimedia-Logstash, 10hardware-requests: eqiad: (3) servers for logstash service - https://phabricator.wikimedia.org/T84958#1228865 (10RobH) 5stalled>3Resolved Resolving this request, as the servers were ordered and are now on-site being setup. 
Setup of systems can be followed on T96692. [19:35:44] and then on one node [19:35:45] either [19:35:54] (03PS4) 10Merlijn van Deen: Manually configure exim4 class for for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [19:36:00] include role::analytilcs::impala::master [19:36:00] or [19:36:00] role analytics::impala::master [19:36:01] well it's hard to understand from the gist, without seeing the ::cdh::impala classes... [19:36:04] 6operations, 10Wikimedia-Logstash, 10hardware-requests: eqiad: (3) servers for logstash service - https://phabricator.wikimedia.org/T84958#1228870 (10RobH) [19:36:06] 6operations, 10ops-eqiad, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1224681 (10RobH) [19:36:07] looking for those now [19:36:13] but, the parameter [19:36:14] https://github.com/wikimedia/puppet-cdh/blob/master/manifests/impala.pp [19:36:17] and [19:36:17] https://github.com/wikimedia/puppet-cdh/tree/master/manifests/impala [19:36:19] but, the parameter [19:36:22] is on the main base class [19:36:24] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1228871 (10Cmjohnson) Restoring the backup switch to factory config. [19:36:28] ::impala [19:36:29] (03PS2) 10Dzahn: dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 [19:37:59] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/739/change/205903/diff/dataset1001.wikimedia.org.diff.formatted" [puppet] - 10https://gerrit.wikimedia.org/r/205903 (owner: 10Dzahn) [19:38:00] bblack, i haven't merged that submodule change into ops/puppet yet [19:38:02] so you wont' find it there [19:38:29] yeah I don't know that cdh::impala stuff is right to begin with. How is it intended that consumers of cdh::impala::{worker|master} set the $master_host param of the required class cdh::impala? 
[19:39:11] (03CR) 10Dzahn: "to show this will work once base::firewall is enabled, see last lines of this compiler run on change 205903" [puppet] - 10https://gerrit.wikimedia.org/r/202980 (owner: 10Dzahn) [19:39:40] (03CR) 10jenkins-bot: [V: 04-1] puppetsigner: Clean up certs for instances we can't find in ldap. [puppet] - 10https://gerrit.wikimedia.org/r/205897 (owner: 10Andrew Bogott) [19:40:04] (03PS3) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap. [puppet] - 10https://gerrit.wikimedia.org/r/205897 [19:41:04] bblack, it is intended that uses of those classes include cdh::impala [19:41:18] actually, those role classes in my gist probably shoudl do that [19:41:19] (03CR) 10Dzahn: "better, here are the actual ferm rules:" [puppet] - 10https://gerrit.wikimedia.org/r/202980 (owner: 10Dzahn) [19:41:34] (03CR) 10jenkins-bot: [V: 04-1] Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901 (owner: 10Cmjohnson) [19:41:37] but, if they only wnated to used default, the reuqire will work [19:42:00] (03PS3) 10Cmjohnson: Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901 [19:42:33] (03CR) 10Dzahn: "search for "ferm::service" in http://puppet-compiler.wmflabs.org/739/change/205903/compiled/puppet_catalogs_3_205903/dataset1001.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/205903 (owner: 10Dzahn) [19:42:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 28.57% of data above the critical threshold [500.0] [19:42:40] but ja, i'm not 100% on this layout either bblack, so suggestions are very welcome [19:42:41] why not have cdh::impala::{master,worker} inherit cdh::impala, and then they all have a $master_host classparam, and users of those classes don't need to include cdh::impala? [19:43:21] i'm getting 503s on enwiki [19:43:33] hm, that would work too bblack, you think that is better? 
[19:43:36] (03CR) 10Dzahn: "to confirm this, search for "ferm::service" in http://puppet-compiler.wmflabs.org/740/change/205904/compiled/puppet_catalogs_3_205904/ms10" [puppet] - 10https://gerrit.wikimedia.org/r/205904 (owner: 10Dzahn) [19:43:39] but still, how would that work with hiera? [19:43:40] (03CR) 10jenkins-bot: [V: 04-1] mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [19:43:44] (and then I think with all 3 classes having a $master_host, the role+hiera -level problem might be easier, but I need to think that through, too) [19:43:49] oh [19:43:52] all three with that param? [19:44:03] !log Killing Jenkins cause .... we know [19:44:07] (03CR) 10jenkins-bot: [V: 04-1] Manually configure exim4 class for for tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [19:44:07] well the inheriting ones get it from the base one [19:44:08] (03CR) 10jenkins-bot: [V: 04-1] dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 (owner: 10Dzahn) [19:44:09] Logged the message, Master [19:44:18] ja, but you can't set it on the children [19:44:19] hm. [19:44:22] 19:43 < thedj> i'm getting 503s on enwiki [19:44:23] hm [19:44:23] but [19:44:24] hm [19:44:32] you think i could then do [19:44:37] role analytics::impala::worker [19:44:39] but set [19:44:41] shhhhhhhh [19:44:42] (03CR) 10jenkins-bot: [V: 04-1] Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901 (owner: 10Cmjohnson) [19:44:45] jenkins-bot the nihilist [19:44:57] anyone else confirm enwiki problems or 503s in general, and from where? [19:45:12] role/analytics/impala.yaml's master_host [19:45:13] ? 
[19:45:14] (03PS4) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap [puppet] - 10https://gerrit.wikimedia.org/r/205897 [19:45:17] https://gdash.wikimedia.org/dashboards/reqerror/ [19:46:00] bblack, I added a statsd logging call to HHVM's fatal handler yesterday: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1429731924.797&target=MediaWiki.errors.fatal [19:46:04] it is not showing a comparable spike [19:46:22] which suggests to me that the errors are originating elsewhere, in one of the outer layers [19:46:27] it's possible this is like the who wants to be a millionaire thing yesterday [19:46:52] what "who wants to be a millionaire" thing? [19:46:59] (03PS1) 10RobH: setting new codfw databases into partitioning [puppet] - 10https://gerrit.wikimedia.org/r/205920 (https://phabricator.wikimedia.org/T96383) [19:47:45] a huge spike of traffic 1 or a handful of image requests. A DDoS launched accidentally by a popular gameshow. It has happened many times in the past, and just happened earlier this week. [19:47:55] s/requests/URLs/ [19:48:07] got one again [19:48:10] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.echo.badge%7Cext.gadget.BugStatusUpdate%2CCollapsibleNav%2CHotCat%2CNavigation_popups%2CNoAnimations%2CPrintOptions%2CReferenceTooltips%2CTwinkle%2CUTCLiveClock%2Caddsection-plus%2Ccharinsert%2Cedittop%2Cfeatured-articles-links%2Cgeonotice%2Cmetadata%2Cpurgetab%2CrefToolbar%2Crevisionjumper%2Cteahouse%2Cwidensearch%7Cext.uls.nojs%7Cext.visualEditor.viewPageTa [19:48:16] bits??? [19:48:24] seems so [19:48:33] hm. loaded for me. anecdotally, people have been complaining about bits. [19:48:46] x-cachecp3020 miss (0) [19:48:47] * ori consults ganglia [19:48:50] thedj you're in europe right? [19:48:51] x-varnish2448841190 [19:48:54] bblack: yup [19:49:54] it's just some of the bits requests. 
many go just fine [19:50:36] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Bits+caches+esams&h=cp3020.esams.wmnet&jr=&js=&v=585542325&m=varnish.SMA.s0.c_req&vl=N%2Fs&ti=Allocator+requests [19:51:00] !log Zuul / Jenkins back up and processing the 1+ hour backlog of changes. Will take a while. Multiple causes: Zuul gearmand being stalled on a socket that has no more data to emit and Jenkins being deadlocked due to an IRC plugin [19:51:04] (03PS3) 10Dzahn: dumps: put base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/205904 [19:51:05] Logged the message, Master [19:51:15] graph is similar for all of the esams bits (3019-22) [19:51:22] (03CR) 10Mholloway: "copied from inline: I'm assuming this failed because when I created my Gerrit username I made it 'mholloway-shell'. Sorry, I don't rememb" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [19:51:44] but not eqiad [19:52:17] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+caches+esams&h=cp3020.esams.wmnet&jr=&js=&v=268030&m=kafka.rdkafka.brokers.analytics1022-eqiad-wmnet_9092.22.rtt.min&vl=microseconds&ti=kafka.rdkafka.brokers.analytics1022-eqiad-wmnet%3A9092.22.rtt.min [19:52:24] are they all going to the same page? [19:52:30] kafka reports increase in RTT to eqiad [19:52:51] yeah could be link issues again, then [19:53:39] seems to settle down now ? [19:53:54] rtt graph that is [19:54:05] yeah [19:54:16] gtt acting up again [19:54:21] http://smokeping.wikimedia.org/?displaymode=n;start=2015-04-22%2016:53;end=now;target=ESAMS.Hosts.hooft [19:54:27] yeah :/ [19:55:00] (03CR) 1020after4: [V: 032] Remove 1.25wmf21 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205905 (owner: 1020after4) [19:55:01] we may really need to come up with a quicker plan for avoiding/killing that GTT link :/ [19:55:09] this is getting ridiculous [19:55:33] yeah... 
[19:55:50] and alerts for cross-dc rtt or packet loss, no? [19:55:51] (03CR) 10Ottomata: [C: 032 V: 032] Use inheritance to include impala base class [puppet/cdh] - 10https://gerrit.wikimedia.org/r/205957 (https://phabricator.wikimedia.org/T96329) (owner: 10Ottomata) [19:55:57] *rtt spike [19:56:05] (03CR) 10Dzahn: "@robh @mholloway re: renaming users needs some steps in LDAP and in Gerrit and in Wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [19:56:15] and? [19:56:17] (03PS2) 10RobH: mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) [19:56:23] I think he meant "any" [19:56:55] oh no, he meant and, as in "in addition to trying to kill the link, let's alert it better" [19:57:02] yeah :) [19:58:44] PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail [19:58:54] probably for the same reason [19:58:56] (03CR) 10jenkins-bot: [V: 04-1] puppetsigner: Clean up certs for instances we can't find in ldap [puppet] - 10https://gerrit.wikimedia.org/r/205897 (owner: 10Andrew Bogott) [19:59:04] yeah [19:59:07] it's still at 122ms [19:59:16] compared to the normal 90ms or so [19:59:28] well it was at 87 just a bit before, 122 is new the past minute or two [19:59:35] now back to 87 [19:59:37] (03CR) 10Gergő Tisza: "Sure: put them in requirements.txt, make a virtualenv (or not) and run pip. The vagrant module for sentry does that:" [software/sentry] - 10https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [19:59:45] not exactly [19:59:52] it depends a lot on the payload [20:00:00] I guess. I'm just watching tiny pings [20:00:05] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T2000). [20:00:05] e.g. 
pings are 116ms now [20:00:09] UDP are 87ms [20:00:13] (aside: i wonder if kafka's behavior (that is, to buffer locally when a remote host is unreachable and then flush buffered data when connectivity is re-established) makes the impact worse) [20:00:31] if it flushes as fast as possible, then probably [20:00:32] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1228950 (10Dzahn) p:5Triage>3Normal [20:01:29] hm, it would probably do that, ya [20:01:41] varnishkafka's behavior, aj [20:01:42] ja [20:01:47] 4444444444444444444444444444444-4444444444444444444444444444444- Good IPv4 [20:01:50] XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX- Good Xmit [20:01:53] RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR-RRRRRRRRRRRRRRRRRRRRRR-RRRRRRRR- Good Recv [20:01:56] HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-HHHHHHHHHHHHHHHHHHHHHH-HHHHHHHH- Happy [20:02:01] ^ varnish backend health debug, sliding window with "-" as failure [20:02:08] it's still seeing intermittent loss [20:02:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:03:53] (somewhere on the order of 1-3 per 64 checks failing) [20:04:11] (03PS5) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap [puppet] - 10https://gerrit.wikimedia.org/r/205897 [20:05:35] I'm ready to scap ... should I wait or is now a good time [20:05:38] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1229009 (10RobH) @Jrogers-WMF: You guys are saying this needs to be removed? Doing so will cause a downtime window for ALL of our mailing lists. Removing items from archives also doesn't r... 
[20:05:59] (03PS1) 10Ottomata: Update cdh module with hiera and puppetize for Analytics Cluster [puppet] - 10https://gerrit.wikimedia.org/r/205960 [20:06:33] (03PS2) 10Ottomata: Update cdh module with hiera and puppetize for Analytics Cluster [puppet] - 10https://gerrit.wikimedia.org/r/205960 [20:06:52] twentyafterfour: what's in scap? [20:06:57] (03PS3) 10Ottomata: Update cdh module with impala and puppetize for Analytics Cluster [puppet] - 10https://gerrit.wikimedia.org/r/205960 [20:07:07] just pushing out the new branch to testwiki [20:07:56] then the next scap would be updating everything to 1.26wmf2 shortly after [20:08:04] !log deployed parsoid version 3311936a [20:08:08] (new branch is wmf3) [20:08:15] Logged the message, Master [20:08:23] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T2000 <- parsoid/ocg? [20:09:05] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [20:09:22] train deplot delayed by zuul backup [20:09:27] ah [20:09:43] well I don't see any harm in messing with testwikis [20:09:50] maybe hold off a bit on anything else [20:10:43] ok cuz this scap takes almost an hour, wouldn't want to delay it much further unless it's really necessary. I'll double check that things have calmed down before I bump the other branch [20:10:51] ok, thanks! [20:10:57] !log twentyafterfour Started scap: testwiki to php-1.26wmf3 and rebuild l10n cache [20:11:01] Logged the message, Master [20:12:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60601 bytes in 2.138 second response time [20:12:39] 6operations, 10Traffic: VCL support for Last-Access cookie - https://phabricator.wikimedia.org/T96861#1229076 (10BBlack) FYI, I've been actively working on this VCL this week. I expect to have something ready to test/eval in beta by the end of the week. 
[20:12:44] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1229078 (10Dzahn) Is there a list/etherpad/pastebin or something which are done/have to be done? Is this a thing where anyone can just do a few on the side? Other depool steps needed before typing reboot? [20:13:07] bblack, done with parsoid deploy [20:13:08] !log twentyafterfour scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.KaXyRl6UJi" ' returned non-zero exit status 1 (duration: 02m 10s) [20:13:11] Logged the message, Master [20:14:11] twentyafterfour: running that manually should tell you what is messed up. Usually an extension that is being dropped/added [20:14:46] bblack, paravoid: btw, if you guys do a postmortem, another follow-up action may be to have a log-scale composite graph of 5xxs and mw exceptions/fatals somewhere prominent, so that error spikes can quickly be isolated to a particular layer [20:15:04] RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:17:20] (03CR) 10Ottomata: [C: 032] "Ok, let's try this!" [puppet] - 10https://gerrit.wikimedia.org/r/205960 (owner: 10Ottomata) [20:18:14] PROBLEM - puppet last run on mw2040 is CRITICAL puppet fail [20:18:36] ori: yeah... it's really annoying that we have great global 5xx live stats on the dashboard, but can't drill into that to see site/cluster/host [20:19:27] ori: https://phabricator.wikimedia.org/T83580 [20:19:44] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1229117 (10Jrogers-WMF) @RobH It does warrant taking down the servers (though it hurts me to say it). Maybe we can meet up in person after this and discuss better ways to handle something li... 
[20:19:50] (I should have said, also, in addition to the fatal-vs-varnish bit) [20:21:00] ottomata: I don't think that 205960 is right. We shouldn't be setting module classparams in hiera, IMHO. [20:21:04] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1229125 (10ori) Suggestion from IRC: it'd be good to have a log-scale composite graph of 5XXs and MediaWiki exceptions/fatals somewhere prominent, so that error spikes can be quickly isolated to a particular layer. [20:21:52] bblack, no? [20:21:57] although I see others doing it in the tree [20:22:00] beats me, ask _joe_ [20:22:01] aye ok, i'm all for changing whatever needs to be done. [20:22:12] something isn't working right with my module anyway!!! it worked in labs. GRRRRR [20:22:22] has to do with my hadoop class inheritiance trickery [20:22:28] it just doesn't look right to me personally. The whole point of the module is to be abstract from our specific config and have a role set its parameters for our config [20:22:47] when hiera sets a module classparam with our config data, that seems to bypass all of that separation, to me. [20:22:47] yeah, i kinda think that too, except...role classes were created pre-hiera [20:22:56] and hiera almsot seems to take over for them...kinda [20:23:00] i dunno [20:23:01] beats me too [20:23:10] I think we still need role classes with hiera [20:23:17] to combine modules? [20:23:20] (which you have, anyways, right?) [20:23:22] yes [20:23:36] to combine modules and to do WMF-specific things [20:23:45] which I guess should mostly be configuration, but still! :) [20:23:54] ohHHh i know why this isnt' working, its cool. [20:23:59] labs is different(tm) also when it comes to the hiera lookup, there is hieradata/labs/* but can't have role based lookup there? no .labs/role/common/ [20:24:04] yeah, bblack, makes sense too [20:24:16] oh. [20:24:17] interseting. [20:24:20] interesting. [20:24:21] eally? [20:24:25] no role based lookup in labs? 
[20:24:32] kinda.... [20:24:34] ha, that's a reason to not use the role lookup here bblack [20:24:41] YuviPanda, legoktm can one of you review my gerrit bot patch? [20:24:43] i dunno how exactly, but it's different from prod [20:24:52] there is a wiki page to set hiera data too [20:24:56] PROBLEM - puppet last run on analytics1026 is CRITICAL puppet fail [20:25:03] and then there is hieradata/labs/stuff [20:25:06] and they get merged [20:25:08] if you look at hieradata/labs.yaml , you can see where I've stuffed role-class-parameters like role::cache::base::cluster_tier: 'one' in there [20:25:20] that's apparently a _joe_-approved pattern [20:25:48] ah, ok [20:26:26] well, but i kind of expected it would be ./labs/role/common/somerole.yaml because prod is ./role/common/ [20:26:35] role::cache::upload::upload_domain: 'upload.beta.wmflabs.org' [20:26:43] ^ is probably a closer match for your $master_host thing [20:27:20] (03PS1) 10Ottomata: Include analytics::clients on analytics1026 so that hadoop is configured propertly for impala [puppet] - 10https://gerrit.wikimedia.org/r/205963 [20:27:41] (03CR) 10Ottomata: [C: 032 V: 032] Include analytics::clients on analytics1026 so that hadoop is configured propertly for impala [puppet] - 10https://gerrit.wikimedia.org/r/205963 (owner: 10Ottomata) [20:27:44] so the way I've put things categorically in my head for now, is that I have 3 basic places to stuff data in hiera terms: [20:27:54] and then there is also this: "Hiera support for labs is still being completed. Currently, you can set only project-wide hiera data. You can do this by creating/editing a wiki page on wikitech, with the page name Hiera: " [20:28:36] 1) Actually role data that only machines inside that role itself can see: in role/common/foo.yaml (this seems to apply to labs hosts which have that role as well) [20:28:37] aye, i've used that [20:28:50] (but I'm not 100% sure about that!) 
[20:28:58] "Per-host hiera data can only be specified on ops/puppet repo" [20:29:00] yeahi think the common stuff gets applied in labs too? [20:29:33] subbu: in an hour maybe? [20:29:45] sure. no rush. [20:29:55] 2) role-class parameters in common/role/foo.yaml [20:30:38] 3) global data at e.g. common/foo/bar.yaml (whatever layout makes sense) [20:31:38] (and 3 is accessed with hiera() lookups) [20:32:14] hm, aye [20:33:15] PROBLEM - DPKG on analytics1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:34:21] so I would think you'd (1) have cdh::impala::foo inherit cdh::impala (which gives them a $master_host as well) (2) have your worker/master role include the basic role (3) have the basic role have a classparam, and (4) set the classparam in common/role/analytics.yaml:master_host + labs equiv in labs.yaml:role::analytics::master_host [20:34:28] but that's just my best guess! [20:34:45] RECOVERY - DPKG on analytics1026 is OK: All packages OK [20:34:57] oh wait the include thing doesn't work right in that pattern. but something close-ish to that... [20:36:14] RECOVERY - puppet last run on analytics1026 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:36:49] bblack, why would the worker/master roles include the base role if the classes they are including inherit from the base class included include in the base role? [20:36:54] I guess you have to choose whether to inherit down in modules/cdh, or include multiple module classes in roles/analytics/impala [20:36:57] (i can't believe I just wrote that as a sentence) [20:37:33] what I don't yet get, is how we'd define $master_host exactly once (per realm) and apply it to all 3 roles [20:37:39] yea [20:37:43] since role-inheritance is a nono apparently [20:37:45] RECOVERY - puppet last run on mw2040 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:37:57] hiera lookup? 
[20:38:00] I guess this would be a place where you'd use global data via hiera, yeah [20:38:03] in the role class [20:38:32] hm [20:38:48] patch coming....i think [20:39:04] data at e.g. common/analytics/cdh/impala.yaml:master_host , and role::analytics::impala classes all referencing hiera('analytics::cdh::impala::master_host') [20:39:34] and then labs-equiv at hieradata/labs.yaml with the full "analytics::cdh::impala::master_host: foo" [20:40:12] well, i don't need to commit anything to ops/puppet for labs [20:40:16] in that case I"d use the wikitech interface [20:40:19] but ja [20:40:28] hm, still not sure where the hiera var woudl go bblack? [20:40:31] common/analyticls/cdh? [20:40:34] I thought we did that as a rule, and e.g. betalabs picks it up there? [20:40:43] there's no hadoop betalabs cluster [20:40:44] soooo [20:40:47] no? [20:40:50] could be though! [20:40:57] but, ja, as a manualhiera() lookup [20:40:58] hm [20:41:02] class role::analytics::impala { [20:41:03] class { cdh::impala: [20:41:03] master_host => hiera('role::analytics::impala::master_host') [20:41:03] } [20:41:03] } [20:41:04] ? [20:41:07] or some other thing? [20:41:11] well, we should have a standard about that pattern regardless. There could be. If nothing else, labs data in puppet serves as a template for what to stuff in wikitech? [20:41:27] how is it our conversational line ratios are always like 10:1 ? :P [20:41:44] ha, hm? [20:42:49] ottomata: yes that role I think is what I was saying. [20:43:11] but, in this case, we wouldnt' be using the role keyword, right? [20:43:18] and ditto for the other two roles, which means the other two cdh::impala::{master,worker} need classparams as well, which they could get by inheriting instead of including? [20:43:19] to incldue the class, we'd just do normal include? 
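A minimal Puppet sketch of the layout being discussed here, assuming the hiera key `analytics::cdh::impala::master_host` suggested above and a hypothetical `role::analytics::impala::worker` role; it is not the code that was merged. The value is looked up once in the base role and handed to `cdh::impala`, and the worker role pulls in the base role before the inheriting module class:

```puppet
# Sketch only: the key name and role layout are assumptions taken from the
# conversation, not from the repository.
class role::analytics::impala {
    # hiera() with no default aborts the catalog compile if the key is missing,
    # which surfaces a forgotten per-realm value early.
    $master_host = hiera('analytics::cdh::impala::master_host')

    class { '::cdh::impala':
        master_host => $master_host,
    }
}

class role::analytics::impala::worker {
    # Declare the base role first so cdh::impala already has its parameter
    # by the time the inheriting worker class is evaluated.
    include role::analytics::impala
    include ::cdh::impala::worker
}
```

With this shape the value is defined once per realm in hiera data (for production, something like the common/analytics/cdh/impala.yaml file proposed above); where the labs equivalent should live is still an open question in the conversation.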
[20:43:26] WHAAAA [20:43:27] oh [20:43:30] that is already done bblack [20:43:43] https://github.com/wikimedia/puppet-cdh/blob/master/manifests/impala/worker.pp [20:43:43] (03PS3) 10Dzahn: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 [20:43:44] well I haven't had time to catch up on your torrent of typing and committing [20:43:47] haha [20:43:49] I still had shit to say back when esams blew up [20:44:03] haha, i know, that's when i torrenttypedcommitted :p [20:44:42] so um, like this? [20:44:43] https://gist.github.com/ottomata/80c4877c901ee67314ac [20:44:57] unsure abou tthe hiera lookup variable name [20:45:03] that would make me put a file in [20:45:14] hieradata/eqiad/analytics/impala.yaml [20:45:16] with master_host there [20:45:18] (03CR) 10Dzahn: "@Tim Landscheidt how to test it on tools-beta now?" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [20:45:24] hang on [20:45:27] k [20:45:34] I'm still trying to find the stupid submodule on github again :P [20:45:44] bblack [20:45:44] https://github.com/wikimedia/puppet-cdh/blob/master/manifests/impala/worker.pp [20:46:28] why this on L10 there ^ [20:46:36] $master_host = $::cdh::impala::master_host [20:47:05] oh, that's just to make the templating easier [20:47:08] could be necessary due to puppet wierdness + references to master-host in worker-specific templates/ [20:47:21] but I would think after inherits, the worker class itself would already have the param as well... [20:47:25] so I don't have to do...what is it? scope.lookup('...') [20:47:35] oh yes [20:47:36] have you tried just using it like it was declared locally in the worker class? 
[20:47:36] sorry [20:47:39] you are right [20:47:43] (03PS1) 10BryanDavis: logstash: Remove redis input [puppet] - 10https://gerrit.wikimedia.org/r/205968 [20:47:45] (03PS1) 10BryanDavis: logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969 [20:47:46] i didn't change it [20:47:47] (03PS1) 10BryanDavis: logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) [20:47:47] wow [20:47:49] (03PS1) 10BryanDavis: logstash: Convert Elasticsearch on logstash100[1-3] to client [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) [20:47:54] i'm suprised that even ran [20:48:02] that is reassinging a puppet var [20:48:12] papaul: Ok, so I think I'm missing the network config for the row c db systems [20:48:13] checking now [20:48:15] you are totally right about that bblack [20:48:43] (03CR) 10BryanDavis: [C: 04-1] "This needs to wait until we have attached logstash100[4-6] to the Elasticsearch cluster and moved all of the shards over." [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [20:49:07] papaul: yea, i dont have db2043+ setup in network, lemme fix that [20:49:15] and then in your gist paste of the roles, I think you want e.g. class { cdh::impala::worker: master_host => hiera('analytics::impala::master_host') }, in the bottom two, so they're like the top one? and then drop the inclusion of the more-basic role? [20:49:48] (03CR) 10Dzahn: "sigh, every time a maintenance cron jobs is changed this thing requires a manual rebase of the ugliest kind, since i am trying to delete t" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [20:49:54] nooo, i don't; think puppet will let you do that, will it? 
inheritance with parameters is weird [20:49:57] otherwise the way the gist is now, I don't think the $master_host value would show up correctly in worker/master-specific template usage [20:50:03] you can inherit from a paremterized class and get those as local vars [20:50:26] ottomata: we're not inheriting [20:50:28] but, you can't set parameters when you include a class, unless that class itself hast those parameters [20:50:31] pretty suuure [20:50:36] class { cdh::impala::worker: master_host => hiera('analytics::impala::master_host') } [20:50:37] you're doing it, at the top of your gist [20:50:42] cdh::impala::worker [20:50:45] inherits from cdh::impala [20:50:53] yes [20:51:02] unless I explicitly make cdh::impala::worker have a master_host parameter [20:51:06] i'm pretty sure I can't set that paramter on it [20:51:21] otherwise, what would it mean if I tried to do [20:51:23] cdh::impala::worker should already have a $master_host parameter, which it got by inheritance [20:51:26] right? [20:51:32] it has the value of it inherited from the class [20:51:40] but it doesn't have a settable parameter itself [20:51:52] cause if i could do that [20:51:53] then i could do [20:52:00] if that's true, that's nearly the dumbest puppetism I've ever heard of [20:52:20] class { 'cdh::impala': master_host => 'foo' } [20:52:20] class { 'cdh::impala::worker': master_host => 'bar' } [20:52:22] and what would that mean? [20:52:35] what would it mean anyways, to instantiate a class and its parent together? [20:53:00] i'm actually doing that in a nasty trick in that ::config class bblack [20:53:02] check the comment there [20:53:02] either way it makes no sense [20:53:05] it is nasty [20:53:18] https://github.com/wikimedia/puppet-cdh/blob/master/manifests/impala/config.pp [20:53:36] Ok, so we're going to have two conversations in here at once (just fair warning). 
Papaul and I will be installing the db servers in codfw and I'm walking him through it here [20:53:43] (cuz folks may not have done an install, so its useful ;) [20:54:00] well this cdh conversation is over anyways :) [20:54:05] hhah [20:54:32] because I'm done wasting time on this puppetmess and I need to look at other things. puppet sucks, I guess bend what could've been an elegant solution to "whatever works in puppet" [20:54:41] haha [20:54:43] papaul: So, the two tasks for these installs are: https://phabricator.wikimedia.org/T89365 & https://phabricator.wikimedia.org/T96383 [20:54:45] bblack, ok , thanks for your help [20:54:48] i do have something better than before [20:54:55] Since these are all identical, just rows, im closing the first older one [20:54:56] i'm going to try to change this to use hiera lookup directly [20:55:00] and updating the newer one to include both [20:55:09] FWIW, I run through this same kind of bullshit with all of my recent changes. Usually testing in puppet-compiler before merging lets me sort out the mess... [20:55:26] but I think submodules break compiler, don't they? 
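A small sketch of the inheritance point argued above, assuming Puppet 3 semantics and stripped-down stand-ins for the real `cdh::impala` classes (the class bodies and the example hostname are hypothetical). The child can read the parent's `$master_host` through the inherited scope, but parameters themselves are not inherited, so the child has no `master_host` attribute to set:

```puppet
# Stand-in for the real base class; only the parameter matters here.
class cdh::impala(
    $master_host,
) {
}

# Stand-in for the real worker class.
class cdh::impala::worker inherits cdh::impala {
    # Resolves through the inherited scope, so no local copy such as
    # "$master_host = $::cdh::impala::master_host" is required.
    notice("impala master is ${master_host}")
}

# Supported: set the parameter on the class that declares it.
class { 'cdh::impala':
    master_host => 'impala-master.example.org',
}
include cdh::impala::worker

# Not supported: the worker class declares no parameters of its own, so this
# declaration would be rejected at compile time.
# class { 'cdh::impala::worker':
#     master_host => 'impala-master.example.org',
# }
```

This is why the role-level hiera lookup ends up feeding the base class, not the worker or master classes directly.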
[20:55:49] 6operations, 5Patch-For-Review: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1229305 (10RobH) [20:55:56] (03CR) 10Dzahn: "still tries to use gerrit to ask for reviews instead of pinging on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [20:56:13] 6operations: deploy db2043-2066 - https://phabricator.wikimedia.org/T89365#1229310 (10RobH) 5Open>3Invalid this is now tracked on task T96383 [20:56:24] (03PS1) 10Ottomata: Now that ::worker inherits, I don't need to set a local variable for $master_host [puppet/cdh] - 10https://gerrit.wikimedia.org/r/205972 [20:56:29] 6operations, 5Patch-For-Review: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1215693 (10RobH) [20:56:35] (03CR) 10Ottomata: [C: 032] Now that ::worker inherits, I don't need to set a local variable for $master_host [puppet/cdh] - 10https://gerrit.wikimedia.org/r/205972 (owner: 10Ottomata) [20:56:54] 6operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1215693 (10RobH) [20:57:08] papaul: So, lets start with db2043 [20:57:17] ok [20:57:33] https://wikitech.wikimedia.org/wiki/HP_DL3N0#Reboot_and_boot_from_network_then_console [20:57:48] You'll want to follow the directions there to have it start to pxe boot [20:58:02] i'll login to the dhcp server and ensure it hits it [20:58:05] (03PS1) 10Ottomata: Use direct hiera lookup of impala master_host variable [puppet] - 10https://gerrit.wikimedia.org/r/205973 [20:58:08] (eventually you'll have shell to do this) [20:58:20] ok [20:58:43] So, there are a few network boot protocols out there to use [20:58:43] har har bblack, i test all this stuff in labs and in vagrant and had somethign working, but just nothing ideal [20:58:51] still not ideal, but better [20:58:57] but i've only ever really worked in depth with pxe (so dont ask me about the others ;) [20:59:15] when you added the mac 
addresses, it told it what the hsotnames are [20:59:22] the fqdn hostname [20:59:28] so it got the ip and network info from that [20:59:54] our install server module includes all the dhcpd config, plus our customized pxe install images (which are served via tftp) [21:00:05] so, we have that split up a bit for ease of administration [21:00:11] ALL of our dhcp requests go to eqiad. [21:00:29] I exclude labs/frack from that statement. [21:01:39] papaul: and we have some issue... [21:01:45] i can see 40:a8:f0:35:08:00 hit dhcpd/carbon [21:01:53] but gets no free leases, so something isnt right, typically its dns [21:02:14] (03PS2) 10Ottomata: Use direct hiera lookup of impala master_host variable [puppet] - 10https://gerrit.wikimedia.org/r/205973 [21:02:56] the error on the dhcpd server is: Apr 22 21:01:07 carbon dhcpd: DHCPDISCOVER from 40:a8:f0:35:08:00 via 10.192.32.2: network 10.192.32.0/22: no free leases [21:03:09] (03CR) 10Ottomata: [C: 032 V: 032] Use direct hiera lookup of impala master_host variable [puppet] - 10https://gerrit.wikimedia.org/r/205973 (owner: 10Ottomata) [21:03:14] so we can see the server hit, and it hits on the right network 10.192.32.2 [21:03:17] which is row c private [21:03:28] so i wonder if we didnt set dns... [21:03:50] (recall these were in row c and not on the ticket at the start, so we may have overlooked them ;) [21:04:26] ... bah. [21:04:29] does anyone know why MobileFrontend/Gather does not seem be getting updated on beta labs? 
[21:05:01] (03PS18) 10BBlack: Set up /api/rest_v1/ entry point for restbase [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [21:05:22] !log Starting deployment-prep rolling reboots [21:05:27] Logged the message, Master [21:05:55] (03CR) 10BBlack: "PS18 is rebase + removal of the erroneous modules/cdh update that slipped into PS17 (stupid submodules...)" [puppet] - 10https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [21:08:15] hahah [21:08:23] why does everyone have such troubles! [21:08:59] stopped using "-a" and adds every single file manually because of the suckmodules [21:10:00] ottomata: because master updates to submodule SHAs become uncomitted changes in the local checkout when you pull from origin, and then they get swept up in "git add ." unless you "git submodule update" first [21:10:44] it's a stupid pattern that it works like that at all. git pull on a clean previous checkout with nothing local, should leave you in a state with a clean local checkout and nothing local :P [21:10:53] submodules break that [21:11:53] (03CR) 10Dzahn: [C: 032] "already cherry-picked and working on https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-build and labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) (owner: 10JanZerebecki) [21:13:16] but anyways, it shouldn't be an issue, because submodules like cdh are basically stable upstream sources that rarely ever change as WMF's needs evolve, right? :) [21:14:14] papaul: we really need to get the wifi working for you there =] [21:14:25] yep [21:14:55] So, when I combined the tickets, i never actually had a task for you to set the dns for the row c databases [21:15:06] no [21:15:14] but i did [21:15:30] so we need to setup dns for db2043-2066 [21:15:31] you mean mgmt dns? [21:15:32] oh? 
[21:15:37] nope, production =] [21:15:44] the mgmt is indeed set and good [21:15:46] no [21:15:47] not production [21:15:55] so feel up to setting the production for those and flagging me for review? [21:16:06] we want them in the private1-c-codfw subnet [21:16:19] so the 10.blahblahblah and wmnet [21:16:32] ok [21:16:35] 10.in-addr.arpa [21:17:14] (03CR) 10Dzahn: "bump" [puppet] - 10https://gerrit.wikimedia.org/r/177427 (https://phabricator.wikimedia.org/T71604) (owner: 10Yuvipanda) [21:18:04] (03CR) 10Dzahn: "here's an example change automatically created by this: https://gerrit.wikimedia.org/r/#/c/205566/" [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) (owner: 10JanZerebecki) [21:19:14] http://en.wikipedia.org/wiki/DARPA is asking someone to fire warning missiles across your desk, to express their displeasure at your blahblahblah-ing of their name :P [21:19:28] bblack, don't you run git status before you commit? [21:19:33] that's usually when I catch it [21:19:35] also [21:19:43] before you add, you mean [21:19:46] i use liquidprompt, which shows me if things are dirty [21:19:47] sure [21:19:50] ya [21:19:56] before you add or git commit -a [21:20:07] and no, I don't tend to assume that my pulls will dirty my checkout. why would they, after all? [21:20:11] except for submodules :P [21:20:15] :) [21:20:45] it's less typing. I don't want to type more just because submodules suck. I know what's in my checkout, except when submodules screw it up. [21:21:11] Krinkle: is graphite no longer providing the data (where) this expects? http://codepen.io/Krinkle/full/zyodJ/ [21:22:54] the typical flow on a fast commit that screws it up goes something like this: "git status" (shows clean + up to date), "git pull" (re-updates me from master, injects horrible local change to modules/cdh), "vi blah/blah/blah.txt; git add .; git commit -m '....'". 
Then I do "git pull -r" again to see if I can rebase onto someone else's commit while I was editing. Then I push up and see that it [21:23:00] updates modules/cdh. [21:23:35] Nemo_bis: Which graph? [21:23:42] mehhhh, just get in the habit of running git submodule update [21:23:57] but there's similar and much harder-to-work-around issues with any kind of "git-review -d NNNNNN" checkout, too [21:24:03] Krinkle: all I could see stop at December [21:24:21] or! make a git hook that git submodule updates after git pull [21:24:39] or just don't use submodules, they're broken-by-design :P [21:25:25] I don't even know where you'd place a git hook to solve it for the git-review + rebase + amend case, etc [21:25:56] I could put "git submodule update >/dev/null 2>&1" as a command that's auto-run by bash before every prompt :P [21:27:35] "git submodule update" issues aside, the other problem is reviewing related work. You're clearly doing two interrelated things in your role::analytics + modules::cdh commits today that are intertwined, but nobody can see them easily in one patch, as in: https://gerrit.wikimedia.org/r/#/c/205973/ [21:27:59] https://gerrit.wikimedia.org/r/#/c/205973/2/modules/cdh,unified -> go find that in modules/cdh separately [21:29:03] bblack: I have been playing with git subtree with the goal of using that for the deployed branches [21:29:10] and people looking at the puppet repo (emails, or git log, or especially "git log -S" and such) can't see what happened in real code/template terms in your modules/cdh update either to debug [21:29:54] Nemo_bis: The statsd tx upgrade changed the graphite property path [21:29:59] and "git grep master_host" on ops/puppet only shows half the picture, because git grep doesn't descend submodules either [21:30:03] I could keep going.... 
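The failure mode described above (a pull leaves modules/cdh dirty, and a later "git add ." sweeps the stale submodule pointer into the commit) can be caught by inspecting the output of "git status --porcelain" before committing. A minimal sketch, assuming the porcelain v1 line layout (two status characters, a space, then the path); the function name is hypothetical and is not part of any ops/puppet tooling:

```python
from typing import Iterable, List


def swept_up_submodules(porcelain: str, submodule_paths: Iterable[str]) -> List[str]:
    """Return the submodule paths that show up as changed in porcelain v1 output.

    `porcelain` is the text printed by `git status --porcelain`.
    `submodule_paths` is the set of known submodule directories
    (for example the entries listed in .gitmodules).
    """
    known = set(submodule_paths)
    changed = []
    for line in porcelain.splitlines():
        # Each entry is "XY path"; the two status characters may be spaces,
        # so the line must not be stripped before slicing.
        if len(line) < 4:
            continue
        path = line[3:].strip()
        if path in known:
            changed.append(path)
    return changed


# Sample text only: an unstaged submodule pointer change plus a staged file.
sample = " M modules/cdh\nM  blah/blah/blah.txt\n"
print(swept_up_submodules(sample, {"modules/cdh", "modules/kafka"}))
```

In practice the text would come from running git status in the checkout. Running "git submodule update" after each pull, as suggested in the channel, avoids the dirty state in the first place.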
[21:30:07] it's just awful [21:30:36] and then people ask for reviews of a submodule change and you dont see what it actually changes, just a hash [21:30:57] (PS1) Andrew Bogott: Set live_migration_bandwidth to 300Mbps. [puppet] - https://gerrit.wikimedia.org/r/205978 [21:30:59] (PS1) Andrew Bogott: Add a couple of settings to the [libvirt] section. [puppet] - https://gerrit.wikimedia.org/r/205979 [21:31:18] hha [21:31:25] hmm the eventlogging alert i just got from icinga isn't output in here or in #wikimedia-analytics [21:31:29] !log reboot round of deployment-prep done [21:31:33] that seems suboptimal [21:31:33] Logged the message, Master [21:32:00] wikibugs: are you alive? [21:32:14] bblack, your complaints are valid, but i feel it is less awful than coupling repos [21:32:24] but, i am a loner in that viewpoint, i know :) [21:32:51] it's no more "coupling repos" than the other 10,000 things coupled in ops/puppet today [21:33:03] together, they form the functional config of our environment, they belong together. [21:34:06] modules/stdlib is probably the most stable/upstream sort of thing we have in ops/puppet that's not really "ours", and even that isn't a submodule. [21:34:13] yet: [21:34:16] bblack-mba:puppet bblack$ grep submodule .git/config [21:34:16] [submodule "modules/cassandra"] [21:34:17] [submodule "modules/cdh"] [21:34:17] [submodule "modules/jmxtrans"] [21:34:17] [submodule "modules/kafka"] [21:34:19] [submodule "modules/kafkatee"] [21:34:22] [submodule "modules/mariadb"] [21:34:24] [submodule "modules/nginx"] [21:34:26] [submodule "modules/varnishkafka"] [21:34:29] [submodule "modules/wikimetrics"] [21:34:31] [submodule "modules/zookeeper"] [21:35:01] I don't have any insight into any of those easily when grepping code and/or history to find related things for debugging, refactoring, etc [21:35:42] (CR) Andrew Bogott: "Testing shows that this is, at worst, harmless."
[puppet] - https://gerrit.wikimedia.org/r/205978 (owner: Andrew Bogott) [21:35:51] (CR) Andrew Bogott: [C: 2] Set live_migration_bandwidth to 300Mbps. [puppet] - https://gerrit.wikimedia.org/r/205978 (owner: Andrew Bogott) [21:36:15] jgage: what's the name of it [21:36:23] ** PROBLEM alert - graphite1001/Difference between raw and validated EventLogging overall message rates is CRITICAL ** [21:36:42] it's cuz i'm in contactgroup analytics [21:38:30] is wikibugs buggy or actively configured not to say it here [21:38:57] tests [21:39:17] (PS2) RobH: setting new codfw databases into partitioning [puppet] - https://gerrit.wikimedia.org/r/205920 (https://phabricator.wikimedia.org/T96383) [21:39:28] arghhhh, i clicked wrong button, nnoooooo [21:39:30] (PS5) Yuvipanda: Manually configure exim4 class for for tools-mail [puppet] - https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: Merlijn van Deen) [21:39:31] bleh, wait on gerrit. [21:41:19] CUSTOM - Host mw2212 is UPING OK - Packet loss = 0%, RTA = 42.94 ms [21:41:34] gah i saw that one the other day. UPING? wtf? [21:41:35] jgage: ^ eh, icinga-wm of course, but it still does stuff [21:41:58] missing space [21:42:10] it wants to say "is UP - PING OK" [21:42:15] more than that, "is UP ING OK" doesn't make sense either [21:42:39] so it's missing " - P"? weird. [21:42:45] (CR) RobH: [C: 2] setting new codfw databases into partitioning [puppet] - https://gerrit.wikimedia.org/r/205920 (https://phabricator.wikimedia.org/T96383) (owner: RobH) [21:44:27] jgage: and the actual question. what does "Difference between raw and validated EventLogging overall message rates" make a person do to fix it:) [21:44:42] yeah.
afaict the answer is "nothing" [21:45:04] if there is no action, then probably it should not page [21:45:04] luckily it doesn't trigger as much as it used to, a few weeks ago i was getting 12-20/day [21:45:05] My usual thing is - notify nuria and milimetric :) [21:45:13] because that basically means lots of events failing validation [21:45:16] might be client side issue, or not [21:45:39] (PS1) Papaul: added db2043-db2070 production dns in private1-c and d [dns] - https://gerrit.wikimedia.org/r/205982 [21:45:59] fwiw i think today is nuria's last day before maternity leave [21:47:01] robh: you can check for review [21:47:26] ah [21:47:36] papaul: checking [21:47:41] ok [21:49:35] (CR) RobH: [C: 2] added db2043-db2070 production dns in private1-c and d [dns] - https://gerrit.wikimedia.org/r/205982 (owner: Papaul) [21:49:45] papaul: looks good, let me merge and we can try pxe again [21:49:50] ok [21:50:20] operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1229493 (RobH) [21:50:27] maybe we should shortcut the notification thing [21:50:36] by notifying the people that we would notify when we get notified [21:52:58] papaul: ok, so lets try pxe bootig the db2043 again [21:53:03] i merged the dns changes live [21:53:09] ok [21:53:14] rebooting [21:55:03] there are so many things hitting dhcp constantly [21:55:12] its been over a year since i went hunting and killing systems and devices doing that [21:55:14] it seemsits overdue [21:55:19] trying to see this system hit dhcp is difficult [21:55:28] papaul: let me know if it starts to work and load installer [21:55:51] oh, i saw it fail..
hrmm [21:55:57] maybe bad dns cache, checkign [21:56:18] ok [21:56:53] papaul: yep, was bad dns cache [21:56:58] so when it hit the first time, it created a miss [21:57:01] robh@chromium:~$ sudo rec_control wipe-cache db2043.codfw.wmnet [21:57:01] wiped 0 records, 2 negative records [21:57:04] and a negative record [21:57:09] which i just wiped out, reboot once more =] [21:57:17] ok [21:58:50] papaul: i think a bunch of htese are rebooting [21:58:54] i just saw db2055 hit [21:59:10] you may want to power them off [21:59:13] since we arent to them yet [21:59:20] ok [21:59:35] so they'll all have created negative cache hits in dns [21:59:38] i'll wipe them out now [22:00:08] ok i am powering them off [22:01:26] ok, i wiped the negative cache out [22:01:32] so when you get to them, they should work [22:01:44] i think db2043 is good now [22:01:56] thats also why the dhcp server had a ton of stuff hitting it [22:02:05] all of them were hititng it at same time, made it hard to read it all, heh [22:02:21] so, basically we want to see the isntall complete, but once you see it get past partitioning, you are likely fine [22:02:31] most issues crop up in the network setup or partitioning of the install [22:03:08] so I'd advise we ensure one gets started through the install without issue [22:03:17] once the first is fully done, you can just queue up the rest to run [22:03:22] make sure you set the pxe boot to back off though [22:04:02] papaul: basically you need to run set /system1/bootconfig1/bootsource5 bootorder=5 [22:04:07] BEFORE it finishes its next reboot [22:04:10] or it reboots and pxe installs again [22:04:19] so, once you see the partitioning stuff go past, and its installing the os [22:04:24] back out to the mgmt cli and run that [22:04:28] (this is HP specific) [22:04:42] ok [22:06:01] so vsp is like console com2? [22:07:01] ok I need to scap again, is now ok? 
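The negative-cache cleanup above was done with "rec_control wipe-cache" against individual host names. A small sketch that only builds the command strings for a numbered host range, assuming the dbNNNN.codfw.wmnet naming seen in the log; the helper name is hypothetical, the range in the demo follows the title of the DNS change (db2043-db2070), and nothing here executes the commands:

```python
from typing import List


def wipe_cache_commands(first: int, last: int, domain: str = "codfw.wmnet") -> List[str]:
    """Build one 'rec_control wipe-cache' command per host in db<first>..db<last>.

    The command form mirrors the one pasted in the channel; this function
    only returns strings and never runs anything.
    """
    if last < first:
        raise ValueError("last must not be smaller than first")
    return [
        "sudo rec_control wipe-cache db{}.{}".format(number, domain)
        for number in range(first, last + 1)
    ]


# Print the commands so they can be reviewed before being pasted on the recursor host.
for command in wipe_cache_commands(2043, 2070):
    print(command)
```

Printing instead of executing keeps the review step manual, which matches how the wipe was done in the channel.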
[22:07:19] I fixed a problem in CiteThisPage extension that was breaking things [22:07:32] ( https://gerrit.wikimedia.org/r/#/c/205988/ ) [22:08:21] not aware of any ongoing (app)server work. seems ok? [22:09:14] jouncebot: next [22:09:14] In 0 hour(s) and 50 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T2300) [22:09:24] twentyafterfour: i suppose as long as it's done before that ^ [22:09:29] papaul: yep [22:09:50] well deployment train takes priority doesn't it? [22:09:59] it should run fast now [22:10:18] most of the localization stuff is already done I think [22:10:24] !log twentyafterfour Started scap: testwiki to php-1.26wmf3 and rebuild l10n cache [22:12:45] 6operations, 10Analytics-EventLogging: Add icinga-wm bot to #wikimedia-analytics - https://phabricator.wikimedia.org/T96928#1229552 (10Gage) 3NEW a:3Dzahn [22:13:59] huh weird wikibugs' output is colorized in here but not in #wikimedia-analytics [22:14:02] papaul: so when successfully installed and rebooted, it should be sitting on a login prompt for the system [22:14:10] once we are there, we can do the rest [22:14:45] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [22:14:58] jgage: there is a channel mode that filters color [22:15:19] https://freenode.net/using_the_network.shtml -> "color filter" [22:15:35] yeah they use +c :( [22:15:55] RECOVERY - Host mw2031 is UPING OK - Packet loss = 0%, RTA = 43.86 ms [22:16:25] colorize analytics channel .. :)? [22:16:32] robh: ok [22:17:37] jgage: Krenair: Zark stole them. (ref. https://en.wikipedia.org/wiki/Wizball#Gameplay) [22:19:06] robh: on my laptop when i type vsp it just stays at starting virtual serial port . pess ESC ( to to return to cli [22:19:23] it worked before right? you watched it boot into the isntaller? 
[22:19:35] on the server i have the monitor plug to it [22:19:38] oh [22:19:45] you have to test the vsp [22:19:49] as part of the onsite setup and testing [22:19:57] just like on the dells testing the console com2 [22:20:12] lemme connect and try it [22:20:52] (03PS1) 10Dzahn: ganglia: yaml file for zirconium, use ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/205997 [22:20:54] papaul: so, i thought you were doing the install via vsp [22:21:01] like the instructions said to [22:21:09] so, when we rack the hp's in the future [22:21:11] and you setup mgmt [22:21:16] you must test vsp beofre you say its done [22:21:20] its part of the onstie testing, ok? [22:21:34] onsite even [22:21:37] i just test power reset [22:21:46] did know about vsp [22:21:47] ok, well, on all systems, dell and hp [22:21:48] onto today [22:21:54] you need to test the mgmt serial console [22:21:58] ok [22:21:59] so console com2 on dell, and vsp on hp [22:22:05] ok [22:22:07] right now it says you are on vsp on db2043 [22:22:10] so i cannot connect to it [22:22:15] you need to detact from vsp [22:22:18] (not logout of mgmt) [22:22:24] esc + ( [22:22:34] i am out [22:22:53] ok, so we have no serial output [22:23:04] lets check out the bios settitngs, lemme build my tunnel to there... 
[22:25:03] (03CR) 10Dzahn: [C: 032] ganglia: yaml file for zirconium, use ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/205997 (owner: 10Dzahn) [22:27:06] papaul: we did the folowing right on these: [22:27:08] BIOS Serial Console & EMS [22:27:08] change to COM2 [22:27:09] change baud rate to 115200 [22:27:10] EMS Console change to COM2; [22:27:19] cuz if that isnt set right, we'd see no output on serial like this [22:27:24] (it may not be this though, just checking) [22:27:46] that is the setting i have in [22:27:50] but lets check [22:27:53] ok, let me reboot it and stay on console [22:27:55] lets see what it does [22:28:14] it should reboot now [22:28:22] im just checking to see if i get serial output during post [22:28:28] once the OS loads, its the OS settings [22:28:33] when its post, its bios serial redirection settings [22:28:39] so this will tell us which is right... [22:29:59] YuviPanda: we do get those alarms, The "failing validation" [22:30:08] aaah cool :) [22:30:19] YuviPanda: the ones our team doesn't get is sometimes "disk full" or "hardware failure" [22:30:19] papaul: is it rebooting there? im getting nothing on console [22:30:26] yes [22:30:33] something isnt right [22:30:34] YuviPanda: the ones that deal with infrastructure [22:30:39] go ahead and drop into bios and check settings [22:30:43] i'm disconnecting off vsp [22:30:44] nuria: oh, I see. do you think you should? [22:30:46] ok [22:30:50] so you can hop on it if needed [22:30:56] confirm the settings and see whats up [22:31:02] YuviPanda: i know otto tried to fix that and it was not so easy [22:31:14] cmjohnson1 has set these up before, so we may need to ping him for some help (though its late there) [22:31:18] later... [22:32:04] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:26] hrmmmm [22:32:31] nuria: ah, hmm. I’ve not dived into it at all. 
If you are ok with getting alerts for those things I can take a look in a few days [22:32:39] nuria: I want a similar things for labs too [22:32:46] (anything from certain hosts should come to us) [22:34:13] who;s on terbium? [22:34:52] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1229624 (10Yurik) I will create a doc page with a diagram in the next few days. Please ping for any other info. [22:35:05] ok, maintenance scripts very busy [22:35:14] robh:the settings are BIOS Serial Console & EMS [22:35:14] change to COM2 [22:35:15] change baud rate to 115200 [22:35:15] EMS Console change to COM2; [22:35:16] RECOVERY - RAID on terbium is OK optimal, 1 logical, 2 physical [22:35:17] Is there a known issue with Phabrictor atm? [22:35:40] papaul: so its set correctly then? [22:35:48] jamesofur: i dont think so, works for me [22:35:48] err nvm... I was able to get around it.. weird bug [22:35:53] as other working HP systems have been set? [22:35:59] jamesofur: did you just do something on terbium? [22:36:01] recall i have not worked on the hp as an onsite ;D [22:36:22] mutante: no?.... I did something there a little while ago. Why? [22:36:32] load spiked [22:36:37] its under load and throwing icinga errors [22:36:43] ahh, no, I just did an email look up on a slave db [22:36:47] Could be wikidata stuff, I'll check [22:36:47] so that couldn't have been it [22:36:47] so whatever is running is crippling the box [22:36:52] jamesofur: it was very busy for a moment and some php maintenance scripts and then it stopped again [22:37:05] weird.... [22:37:06] or was, my tense was incorrect [22:37:11] i guess one of the regular crons [22:37:17] after all terbium has them all [22:37:20] no I haven't done anything remotely touching scripts today [22:37:21] mutante: Can you tell which ones? 
[22:37:30] papaul: so those are the settings you've set for the other HP ssytems we have gotten, right? [22:38:00] hoo: getJobQueueLengths.php --totalonly .. maybe [22:38:08] I know logging into the enWiki db was signficantly slower today then it has been, not sure if related [22:38:09] robh: yes [22:38:17] hoo: or CirrusSearch/maintenance/saneitize.php --wiki frwiki . shrug [22:38:21] ok so scap seems hung with 1 remaining server to sync [22:38:34] sync-common: 99% (ok: 464; fail: 0; left: 1) [22:38:39] 13066 www-data 10 -10 30.9g 29g 1492 R 15 93.5 6394:54 php /srv/mediawiki-staging/multiversion/MWScript.php refreshLinks.php - [22:38:47] mutante: ^ that one is eating *all* the ram [22:38:57] papaul: ok, let me try out db2044 [22:39:03] perhaps we have an odd systeme issue [22:39:10] ok [22:39:11] twentyafterfour: does it give you the server name? [22:39:13] so im connecting to its mgmt and serial console [22:39:22] and powering it up to see it post via serial console [22:39:23] Who on earth started that for something as large as ptwiki [22:39:26] ^d: ^ [22:39:27] that reminds me, mariadb s1 lag has been in CRITICAL state for 1d10h. i didn't see mail about it. [22:39:31] Sounds like your thing [22:39:32] mutante: no at least not until it eventually times out [22:39:53] hoo: oh?! i dont know [22:40:05] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:40:11] papaul: it said db2044 was powered on, i thoguht you turned the rest of the db44+ off? [22:40:17] db2044 that is [22:40:30] it should now be posting, i just thought it was already off. 
[22:40:36] and i'm not getting serial output [22:40:37] /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -lmwdeploy terbium.eqiad.wmnet /srv/deployment/scap/scap/bin/sync-common --no-update-l10n mw1010.eqiad.wmnet mw1033.eqiad.wmnet mw1070.eqiad.wmnet mw1097.eqiad.wmnet mw1216.eqiad.wmnet mw1161.eqiad.wmnet mw1201.eqiad.wmnet mw2001.codfw.wmnet mw2041.codfw.wmnet mw2080.codfw.wmnet mw2119.codfw.wmnet mw2187.codfw.wmnet [22:41:28] i just turn it on to compare the settings with db2043 [22:41:34] mutante: Indeed ^d's script... walked up the ppids [22:42:41] papaul: so something isnt right on db2044 either [22:42:41] (PS1) Ori.livneh: Try to capture and send a backtrace to fluorine on fatal [mediawiki-config] - https://gerrit.wikimedia.org/r/206000 [22:42:47] hoo: thanks! ok [22:42:49] neither one show serial console output [22:42:50] YuviPanda: I think we are good cause for application alarm (validation, throughput) we pay attention and for hosts ottomata pays attention to EL hosts /cluster and such [22:42:54] when previous hp systems you setup have [22:42:57] so im nto sure whats not right [22:43:04] and we cannot really take offline the other hp systems to check [22:43:17] can you try to figure out the settings, set up vsp on a console [22:43:20] (CR) Ori.livneh: [C: 2] Try to capture and send a backtrace to fluorine on fatal [mediawiki-config] - https://gerrit.wikimedia.org/r/206000 (owner: Ori.livneh) [22:43:25] (Merged) jenkins-bot: Try to capture and send a backtrace to fluorine on fatal [mediawiki-config] - https://gerrit.wikimedia.org/r/206000 (owner: Ori.livneh) [22:43:27] and then mess with the redirection settings and see if you cannot get something to work [22:43:36] document what you do for easy review later [22:43:51] mw1161.eqiad.wmnet [22:43:58] papaul: also, its getting late there, it doesnt need to be tonight [22:44:00] !log Killed demon's "sudo -u www-data php /srv/mediawiki-staging/multiversion/MWScript.php
refreshLinks.php --wiki=ptwiki" on terbium, sending the box into swap [22:44:03] you can compare notes with cmjohnson1 tomorrow [22:44:05] Logged the message, Master [22:44:11] robh:ok [22:44:13] twentyafterfour: Still hung? I think you can safely kill a sync to terbium and just run sync-common there manually later [22:44:18] He said it's fine to kill (see SAL), so I went ahead [22:44:21] since he has also set them up in person, he is more familar with them than I am [22:44:39] bd808: so kill what exactly? the rsync process on terbium? [22:44:44] twentyafterfour: let me check that one now [22:45:01] twentyafterfour: the ssh call from tin to terbium [22:45:08] bd808: ok [22:45:20] robh: i do that tomorrow [22:45:26] wait it's working [22:45:36] must have been just extra slow [22:45:41] mw1161 is a scap proxy it seems [22:45:53] and up [22:45:55] twentyafterfour: hoo killed a nasty proc there [22:46:14] so terbium was just running super slow? [22:46:15] oh both things are connected, nice [22:46:24] the 2 bugs just merged into one [22:46:37] deployment issue because of the terbium job [22:46:39] twentyafterfour: probably. Its the host we run cron jobs and other nasty things from [22:46:47] I see [22:46:50] well, ^d .. [22:46:56] ran something [22:47:36] !log twentyafterfour Finished scap: testwiki to php-1.26wmf3 and rebuild l10n cache (duration: 37m 11s) [22:47:39] Logged the message, Master [22:48:48] 6operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1229645 (10RobH) these are failing to show serial console output, so some setting is wrong on them (db2043 onwards). 
@Papaul will keep troubleshooting this tomorrow with @cmjohnson [22:50:57] (CR) 20after4: [C: 2] Wikipedias to 1.26wmf2 [mediawiki-config] - https://gerrit.wikimedia.org/r/205907 (owner: 20after4) [22:51:03] (Merged) jenkins-bot: Wikipedias to 1.26wmf2 [mediawiki-config] - https://gerrit.wikimedia.org/r/205907 (owner: 20after4) [22:52:10] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf2 [22:52:15] Logged the message, Master [22:52:21] (CR) 20after4: [C: 2] Group0 to 1.26wmf3 [mediawiki-config] - https://gerrit.wikimedia.org/r/205908 (owner: 20after4) [22:53:22] Fatal error: Call to undefined method SkinMinerva::getLicenseLink() in /srv/mediawiki/php-1.26wmf2/extensions/ZeroBanner/includes/ZeroSpecialPage.php on line 232 [22:53:29] getting a lot of those [22:54:09] (Merged) jenkins-bot: Group0 to 1.26wmf3 [mediawiki-config] - https://gerrit.wikimedia.org/r/205908 (owner: 20after4) [22:54:09] yurik: ^ [22:54:28] maybe it's just intermittent though [22:54:43] no, still see them [22:54:57] * yurik looking [22:55:00] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf3 [22:55:03] Logged the message, Master [22:58:00] operations, Monitoring, Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1229686 (Dzahn) carbon is now an aggregator for misc eqiad. on uranium in /etc/ganglia/gmetad.conf there is: ``` data_source "Miscellaneous eqiad" carbon.wikimedia.org m...
[22:59:32] (PS1) Ori.livneh: Revert "Try to capture and send a backtrace to fluorine on fatal" [mediawiki-config] - https://gerrit.wikimedia.org/r/206003 [22:59:43] (CR) Ori.livneh: [C: 2] Revert "Try to capture and send a backtrace to fluorine on fatal" [mediawiki-config] - https://gerrit.wikimedia.org/r/206003 (owner: Ori.livneh) [23:00:05] RoanKattouw, ^d, Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150422T2300). [23:00:18] (Merged) jenkins-bot: Revert "Try to capture and send a backtrace to fluorine on fatal" [mediawiki-config] - https://gerrit.wikimedia.org/r/206003 (owner: Ori.livneh) [23:00:23] aude, mobile frontend broke up :(( db585ed5e57fa5531fb6098debff5513ab230564 [23:00:29] *us [23:00:40] yurik: :( [23:00:42] still going twentyafterfour? [23:00:44] are there issues? [23:01:01] Krenair: it's done [23:01:19] It sounds like aude or yurik are running into issues? [23:01:35] Krenair, mobile frontend remove a func we were calling, and didn't tell anyone :) [23:01:37] * aude is just watching the logs [23:01:46] looks like an easy fix [23:01:57] :D [23:02:10] Krenair: are you talking about the one I just submitted a task for? [23:02:16] https://phabricator.wikimedia.org/T96932 [23:02:20] * aude would also like unit tests for it, like in a follow up :) [23:02:26] to avoid it happening again [23:02:43] or something... [23:03:02] Krenair, can I change my mind again about the Flow patches? [23:03:25] superm401, ... you've changed your mind once? :) [23:03:35] Yes, I want to do it again though. [23:03:43] What Flow patches?
[23:04:12] https://gerrit.wikimedia.org/r/#/c/204735/ and https://gerrit.wikimedia.org/r/#/c/205246/ [23:04:21] or you added then removed the patches, I checked the history now [23:04:30] you can add them back, sure [23:04:46] (CR) Alex Monk: [C: 2] Allow wikitech to use local username blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/205889 (owner: Alex Monk) [23:05:59] almost done [23:06:09] yurik, done... making a patch? [23:06:17] Krenair, alright, doing the bump now. [23:08:50] (PS19) GWicke: Set up /api/rest_v1/ entry point for restbase [puppet] - https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) [23:09:59] (PS1) Dzahn: tiny formatting fix in apache_status.pyconf [puppet] - https://gerrit.wikimedia.org/r/206006 [23:11:30] (CR) GWicke: [C: 1] "Fixed the URL rewrite to also replace /rest_v1/ with /v1/ and tested in labs." [puppet] - https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: GWicke) [23:12:39] (Merged) jenkins-bot: Allow wikitech to use local username blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/205889 (owner: Alex Monk) [23:13:44] operations, Labs, Patch-For-Review, Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1229705 (yuvipanda) Open>Resolved a:yuvipanda Alright, so shinkengen handling duplicate hostnames is a thing it has to do during the DNS migration :) [23:14:47] Krenair, bump is https://gerrit.wikimedia.org/r/#/c/206008/ . Being gated now.
[23:16:20] !log krenair Synchronized php-1.26wmf2/extensions/OpenStackManager/nova/OpenStackNovaUser.php: https://gerrit.wikimedia.org/r/#/c/205887/ (duration: 00m 12s) [23:16:25] Logged the message, Master [23:17:34] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/205889/ (duration: 00m 12s) [23:17:38] Logged the message, Master [23:18:37] superm401, OK [23:19:04] superm401, my preferred way of writing these submodule update commit messages is to provide the change ID of the commit I'm backporting [23:19:12] anyone wants to do a broken prod deploy? [23:19:27] Krenair, want to push it? [23:19:30] Krenair, okay, will try to remember to do that in the future. [23:19:38] thanks :) [23:20:22] (03PS1) 10Kaldari: Turning on WikiGrok on English Wikipedia (for 2 week test) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206009 [23:20:25] RobH: fixed the problem on db2043 and db2044 the installation is in progress [23:20:29] yurik, ok [23:20:37] papaul: what was it? [23:20:43] yurik, we just need https://gerrit.wikimedia.org/r/#/c/206005/ for wmf3 right? [23:20:48] i guess so [23:21:11] Robh: vitural serial was set to com3 [23:21:17] Krenair, unless you want to backport mobile as well ;) [23:21:17] it suppose to be com2 [23:21:32] ok, i bet the rest are that way, but you'll find out [23:21:37] you'll want to do the installs via vsp [23:21:42] to ensure its working right =] [23:21:56] so, db2043 likely didnt need reinstall [23:21:57] yurik, ... what would we backport on mobile? [23:22:05] if it is, thats fine, but it may confuse signing later, but no big deal [23:22:16] robh: under System Options [23:22:16] Serial Port Options: Set to COM1 [23:22:17] never mind, only the new branch is broken [23:22:22] Krenair, ^ [23:22:23] there is virtual serial [23:22:45] * yurik did a poor attempt at a joke [23:23:04] Krenair: I have a config change to go out during the SWAT window. Just added it to the calendar. 
I’m happy to do it myself after you’re done. [23:23:16] https://gerrit.wikimedia.org/r/#/c/206009/ [23:23:16] So, queue: [23:23:19] If we're doing last minute config changes, can I add one too? :) [23:23:21] superm401/Flow team [23:23:32] papaul: cool, so once install finishes on those two and you confirm they are ok and still showing the login prompt on vsp [23:23:33] yurik/ZeroBanner [23:23:35] we can move onto the rest [23:23:44] (well, you can move onto it later, you dont need to do tonight) [23:23:52] kaldari/WikiGrok [23:24:06] robh: ok will confirm before i leave [23:24:43] Oh, and I think I missed an OpenStackManager change that needs deployment before tuesday [23:25:22] robh: db2043 and db2044 are up [23:25:29] awesome [23:25:39] Krenair, i was about to pass out, have an early flight to catch, can you do the depl? [23:25:41] feel free to contiue onto the rest of the installs db2045+ tomorrow [23:25:43] i am login out so you get try [23:26:02] yes yurik [23:26:07] thx ) [23:26:08] papaul: db2043 looks good =] [23:26:08] robh:ok [23:26:24] yurik: I assume you know that ZeroBanner is throwing errors currently? [23:26:38] (03PS1) 10Mattflaschen: Add editcontentmodel on testwiki temporarily for sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206015 [23:26:46] robh: ok [23:26:47] superm401, your change is syncing [23:26:55] yurik: Looks like it’s due to Florian’s update to the Licensing code [23:26:55] !log krenair Synchronized php-1.26wmf3/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/206008/ (duration: 00m 13s) [23:27:01] Logged the message, Master [23:27:04] Krenair, thanks, would also like to add above config change. Will add to calendar. [23:27:06] 6operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1229733 (10RobH) a:5RobH>3Papaul Papaul found the incorrect bios settings and fixed, assigning this to him to complete the installs for db2045-db2070. Once done, assign this back to me for the key signing. 
[23:28:03] yurik: I assume that what your update is to fix? [23:28:19] Oh I need to do the submodule update for ZeroBanner [23:28:38] I guess yurik officially passed out :) [23:29:19] kaldari, yep [23:29:22] i'm on top of it [23:29:25] was on top of it [23:29:25] cool [23:29:26] https://gerrit.wikimedia.org/r/#/c/206005/ [23:31:49] superm401, ah... so, editcontentmodel for sysop [23:32:00] Krenair, yeah, on testwiki. [23:32:01] superm401, wasn't editcontentmodel a security thing? [23:32:36] Because it could be used to break pages, I think? [23:32:45] Krenair, yeah, it's only temporary, for https://phabricator.wikimedia.org/T95381 . I'm trying to unbreak something, actually. [23:33:13] You have shell access, you don't need a silly sysop flag... [23:33:15] (03PS2) 10Mattflaschen: Add editcontentmodel on testwiki temporarily for sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206015 (https://phabricator.wikimedia.org/T95381) [23:33:27] Especially on testwiki.. [23:33:43] Krenair, well, how specifically do you want me to fix the bug? Because I have a way, but it involves editcontentmodel. [23:35:19] I think it could be done using an internal API request and a, uhm, 'customised' user object [23:35:34] But this is probably the saner way to do it, admittedly. [23:36:18] Krenair, deployed? [23:36:18] superm401: could give yourself editcontentmodel briefly [23:36:36] Yeah, I think that is Krenair's first thing above. [23:36:46] yurik, not yet [23:36:52] oki, i'm off to bed :) [23:37:00] I'm waiting for jenkins [23:37:07] to merge my submodule update [23:37:14] I can do that if you prefer. [23:37:24] Krenair, https://xkcd.com/303/ [23:37:46] But I already have JS code for it. [23:38:01] Whereas I would have to convert it to FauxRequest. 
[23:38:35] (PS20) BBlack: Set up /api/rest_v1/ entry point for restbase [puppet] - https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: GWicke) [23:39:50] superm401, alright so I reviewed the bug that led to the creation of that right, I think it's OK to grant this to sysops... [23:39:54] (CR) BBlack: [C: 2] "(noting anomie's removal of -1 here: https://phabricator.wikimedia.org/T95229#1228242 )" [puppet] - https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: GWicke) [23:40:23] I think sysops can already do far worse things [23:40:31] Krenair, yeah, and I can revert as soon as I'm done fixing it. [23:40:56] (CR) Alex Monk: [C: 2] "Per discussion, I think this is OK" [mediawiki-config] - https://gerrit.wikimedia.org/r/206015 (https://phabricator.wikimedia.org/T95381) (owner: Mattflaschen) [23:41:21] (Merged) jenkins-bot: Add editcontentmodel on testwiki temporarily for sysop [mediawiki-config] - https://gerrit.wikimedia.org/r/206015 (https://phabricator.wikimedia.org/T95381) (owner: Mattflaschen) [23:42:35] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:44:06] !log krenair Synchronized php-1.26wmf3/extensions/ZeroBanner/includes/ZeroSpecialPage.php: https://gerrit.wikimedia.org/r/#/c/206017/ (duration: 00m 13s) [23:44:12] Logged the message, Master [23:44:12] csteipp, oh hey [23:44:20] (PS1) Dzahn: ganglia: terbium -> ganglia_new [puppet] - https://gerrit.wikimedia.org/r/206018 [23:44:25] Hey Krenair [23:44:36] csteipp, the editcontentmodel right [23:44:44] That's OK to grant to sysops on testwiki temporarily, right? [23:45:31] Do sysops have the right to edit user's common.js there already? [23:45:57] yeah [23:46:10] Ok, so yeah, no difference in ability in that case. [23:46:13] thought so [23:46:22] thanks for confirming [23:47:21] Krenair: are you updating ZeroBanner on wmf2 also?
[23:47:46] aude, I was just wondering about that, I had assumed wmf3 only
[23:47:51] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/206015 (duration: 00m 12s)
[23:47:54] Logged the message, Master
[23:47:57] but then I checked, and the errors show for wmf2
[23:48:02] so I guess we need to do wmf2 as well
[23:48:04] Krenair: no, wmf2 also
[23:48:27] can you do the backport+submodule update for that please?
[23:48:34] if not I will
[23:48:51] prefer someone more awake do it
[23:48:54] ok
[23:49:26] hi kaldari
[23:49:38] ready for me?
[23:50:08] kaldari, I'm wondering about your commit...
[23:50:31] Is Jon Katz the product manager for that?
[23:50:51] Krenair: yes
[23:51:06] Has he approved this?
[23:51:44] Krenair: Yes: https://phabricator.wikimedia.org/T94444
[23:52:19] (CR) Dzahn: [C: 2] ganglia: terbium -> ganglia_new [puppet] - https://gerrit.wikimedia.org/r/206018 (owner: Dzahn)
[23:52:39] That's not really approval.
[23:53:30] Krenair: hmm, well, it’s in our current sprint, which he created :) Unfortunately, he’s already gone home for the day, today.
[23:54:27] (PS1) BBlack: varnish does not like duplicate names for a director + backend [puppet] - https://gerrit.wikimedia.org/r/206022
[23:54:53] Krenair: We can postpone until tomorrow if you need us to, but I’ll be on vacation then.
[23:55:19] kaldari, you have deployment access don't you?
[23:56:00] Krenair: Yes
[23:56:12] You can do it
[23:56:23] Krenair: Cool, thanks
[23:56:30] Krenair, it worked: https://test.wikipedia.org/wiki/User_talk:Hazard-SJ . I can revert now.
[23:56:40] You don't exactly need my permission, but... :)
[23:56:45] Krenair: Is anyone else after me?
[23:56:48] OK superm401
[23:57:05] kaldari, I'm just doing a ZeroBanner update, then superm401's revert
[23:57:13] then you
[23:57:56] (PS1) Mattflaschen: Revert "Add editcontentmodel on testwiki temporarily for sysop" [mediawiki-config] - https://gerrit.wikimedia.org/r/206024
[23:58:46] (CR) Alex Monk: [C: 2] Revert "Add editcontentmodel on testwiki temporarily for sysop" [mediawiki-config] - https://gerrit.wikimedia.org/r/206024 (owner: Mattflaschen)