[00:04:50] (03PS1) 10Tim Landscheidt: Tools: Remove gridengine aliases for some hosts [puppet] - 10https://gerrit.wikimedia.org/r/235157 (https://phabricator.wikimedia.org/T109485) [00:06:13] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 1 failures [00:19:18] 6operations, 5Patch-For-Review: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1591294 (10chasemp) Note, I am rolling out https://gerrit.wikimedia.org/r/#/c/235048/ which should fix wikitech search now [00:24:06] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1591298 (10Dzahn) 5Resolved>3Open [00:25:24] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1312654 (10Dzahn) reopened the ticket because of personal mail between Samir and John i was CCed on [00:28:27] 6operations, 6Discovery, 7Elasticsearch, 7Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1591312 (10EBernhardson) [00:28:29] Krenair: thanks [00:32:13] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:36:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [00:46:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:56:12] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 24 connecting: (unnamed) [00:58:12] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [00:58:13] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100852s 100000s) [00:58:30] legoktm, am I going mad or is the check for IP arrays in throttle.php completely broken? 
[00:58:40] * legoktm looks [00:58:43] if ( is_array( $options['IP'] ) && !in_array( $ip, $options['IP'] ) ) { [00:58:44] continue; [00:58:44] } elseif ( $ip != $options['IP'] ) { [00:58:44] continue; [00:58:44] } [00:59:23] If it's set up to use an array of IPs, and the user's IP *IS* in the array, we'll skip the throttle raise entry because 'ip' != array( 'ip' ) [01:01:01] I think you're right [01:01:30] same for IP range support [01:01:44] but not DB names [01:02:24] make it always an array? [01:03:06] Krenair: so the != check needs a !is_array( $options['IP'] ) in front [01:03:54] may be simpler to just cast it to an array and run in_array [01:06:00] what's the best phabricator project tag to involve ops responsible for nginx configuration? [01:06:05] "ops-eqiad" ? [01:06:34] no, ops-eqiad is for hardware at a specific datacenter [01:06:45] saper: what nginx config are you referring to exactly/ [01:07:55] (when in doubt, you can always tag with just "operations" and leave it at needs-triage priority so that someone notices it for ops duty and sorts out where it belongs) [01:08:33] bblack: er, "site" nginx? handling API requests to commons,test2,... ? [01:08:54] that runs through nginx? [01:08:58] 6operations, 6Commons, 6Multimedia: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#1591373 (10saper) 5stalled>3Open [01:09:09] all of the public termination runs through nginx [01:09:16] ah right, ssl termination [01:09:16] (and varnish) [01:09:40] but honestly even then, are we talking about the TLS termination nginx config, or actually about the apache config in front of mediawiki at a much deeper level? 
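The throttle.php bug discussed above (a scalar `!=` comparison against a whole array, so users whose IP *is* in the array skip the entry) and legoktm's suggested fix ("cast it to an array and run in_array") can be sketched outside PHP. This is an illustrative Python rendering of the corrected matching logic, not the actual MediaWiki patch:

```python
def ip_matches(user_ip, option_ip):
    """Corrected throttle IP check: normalize the configured value to a
    list first, so a single IP string and an array of IPs behave the
    same way.  The buggy version compared the scalar user IP against
    the whole array with !=, which never matched, so entries configured
    with an IP array were always skipped.
    """
    if not isinstance(option_ip, list):
        option_ip = [option_ip]
    return user_ip in option_ip
```

With this shape, the `!is_array` guard Krenair mentions becomes unnecessary: both configurations go through the same membership test.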
:) [01:10:12] saper, file it under #operations, and we'll see what the issue is and where it goes [01:10:48] yep [01:10:49] ^^ [01:11:02] writing a comment now with my findings [01:11:07] yeah the 413 could be at many levels actually [01:11:21] it would be helpful to get exact header/error output, may indicate which level is kicking the error [01:12:11] FWIW, the very front edge TLS terminators are configured as "client_max_body_size 100m;" [01:15:23] 6operations, 6Commons, 6Multimedia: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#1591385 (10saper) Here's a complete shell script using curl to reproduce the problem without any external tools: https://github.com/saper/upload-bug86436 This s... [01:16:48] 6operations, 6Commons, 6Multimedia: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#1591387 (10saper) I was connecting this way: ``` m> traceroute6 2620:0:862:ed1a::1 traceroute6 to 2620:0:862:ed1a::1 (2620:0:862:ed1a::1) from 2a01:4f8:a0:7383::... [01:18:15] bblack: ^^ more information cannot be extracted by us, mere mortals :) [01:22:12] that does look like our outer terminators doing it, if we can believe the Server header [01:25:56] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591398 (10JKrauska) @dzahn Is there an api/method to get a real-time dump of the wmf group in ldap? Is that simply in operations-puppet somewhere? [01:26:59] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591399 (10Krenair) You can run `ldaplist -l group wmf` from labs [01:27:33] (03CR) 10Dzahn: [C: 031] "this group will give read access to a mysql config file with a password to access the research database.
so that should be it" [puppet] - 10https://gerrit.wikimedia.org/r/235047 (https://phabricator.wikimedia.org/T110754) (owner: 10John F. Lewis) [01:30:11] 6operations, 6Commons, 6Multimedia: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#1591401 (10BBlack) a:3BBlack The pasted headers look like the TLS terminators (nginx 1.9.3) are returning the error, which is strange because they're configured... [01:30:22] 6operations, 6Commons, 6Multimedia, 10Traffic: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#1591403 (10BBlack) [01:33:04] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591410 (10JKrauska) @Krenair: Thanks. Can you think of any web or code based approach for that to be automated /without/ needing to ssh in and run that from time to time? [01:33:10] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591411 (10Krenair) That'll give you people's UIDs, although I guess you'll need to check `ldaplist -l passwd $uid` for some of them to find their mail entry, which would likely be a bette... [01:34:06] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591412 (10Krenair) You could probably stick something simple (a tool?) up in labs to query LDAP from. I'm not aware of an existing tool for it. [01:37:01] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591426 (10JKrauska) no experience with labs tools. @mark perhaps here's a good opportunity to have someone in ops help code up something?
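A side note on the `client_max_body_size 100m;` value quoted in the 413 discussion above: nginx size suffixes are powers of 1024, so `100m` is 104,857,600 bytes, which is not the same as a decimal 100 MB. A small illustrative helper (the function name is ours, not an nginx API):

```python
def nginx_size_to_bytes(value):
    """Parse an nginx size directive value (e.g. '100m', '64k', '1024')
    into bytes.  nginx's k and m suffixes are binary (powers of 1024);
    a bare number is bytes.  Only the k/m suffixes documented for
    client_max_body_size are handled here.
    """
    value = value.strip().lower().rstrip(';')
    multipliers = {'k': 1024, 'm': 1024 ** 2}
    if value and value[-1] in multipliers:
        return int(value[:-1]) * multipliers[value[-1]]
    return int(value)
```

That difference matters when comparing an upload's byte count against "the 100MB threshold": a file just under 100m can still exceed a decimal-100-MB limit enforced at another layer, or vice versa.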
[01:44:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [01:45:02] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591427 (10scfc) Isn't the LDAP server accessible from where you run your scripts? The code basically would probably follow the lines of the LDAP queries in [[http://git.wikimedia.org/blo... [01:50:09] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1591429 (10Dzahn) a:3ArielGlenn [01:50:27] bblack: thanks! [01:51:18] (03PS2) 10BBlack: clean up text/mobile whitespace-only diffs [puppet] - 10https://gerrit.wikimedia.org/r/235148 (https://phabricator.wikimedia.org/T109286) [01:51:20] (03PS2) 10BBlack: mobile: add bits compat code [puppet] - 10https://gerrit.wikimedia.org/r/235147 (https://phabricator.wikimedia.org/T109286) [01:51:22] (03PS2) 10BBlack: standardize hiera-overridable class/config params [puppet] - 10https://gerrit.wikimedia.org/r/235146 (https://phabricator.wikimedia.org/T96847) [01:51:24] (03PS2) 10BBlack: vcl: merge cluster_options into vcl_config, refactor [puppet] - 10https://gerrit.wikimedia.org/r/235145 (https://phabricator.wikimedia.org/T96847) [01:52:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:52:18] saper: np, thanks for the report :) [01:52:51] 6operations: disable shell account - Ananth Ramakrishnan / ananthrk - https://phabricator.wikimedia.org/T110984#1591434 (10RobH) 3NEW a:3RobH [01:53:01] bblack: now trying to repro the bug in the other direction: https://phabricator.wikimedia.org/T75200 :) [01:53:13] 6operations: disable shell account - Ananth Ramakrishnan / ananthrk - https://phabricator.wikimedia.org/T110984#1591443 (10RobH) [01:53:17] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - 
https://phabricator.wikimedia.org/T108131#1591442 (10RobH) [01:55:07] (03CR) 10Dzahn: "yea, i think against "database" is good for grant requests" [puppet] - 10https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183) (owner: 10Aklapper) [01:56:21] (03PS1) 10RobH: disable user ananthrk [puppet] - 10https://gerrit.wikimedia.org/r/235173 [01:57:43] (03CR) 10RobH: [C: 032] disable user ananthrk [puppet] - 10https://gerrit.wikimedia.org/r/235173 (owner: 10RobH) [01:59:18] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591448 (10RobH) [01:59:20] 6operations: disable shell account - Ananth Ramakrishnan / ananthrk - https://phabricator.wikimedia.org/T110984#1591446 (10RobH) 5Open>3Resolved The patchset is now submitted and merged live, disabling his access. https://gerrit.wikimedia.org/r/#/c/235173/ If it turns out wrong, someone can revert it. [02:05:59] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591449 (10Dzahn) I would say "ldaplist" is already the existing tool for it. 25 "An application that implements the functionality of Solaris's ldaplist." .. 37 parser.add_optio... [02:06:22] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:07:27] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1591450 (10Dzahn) >>! In T108131#1591412, @Krenair wrote: > You could probably stick something simple (a tool?) up in labs to query LDAP from. I'm not aware of an existing tool for it. `i... [02:08:13] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10848 bytes in 0.094 second response time [02:08:42] arghasdfadsf damn you icinga false alerts [02:09:06] and my clear pages beat my criticals [02:09:11] ahh sms. 
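Dzahn's point above is that `ldaplist` already covers the query side of the offboarding workflow; the missing piece is glue around its output. As an illustrative sketch of that glue, here is a parser for LDIF-style output. The attribute layout assumed here (`member:` lines carrying DNs with a `uid=` component) is a guess at the schema, not verified against the real directory:

```python
def member_uids(ldaplist_output):
    """Extract member UIDs from `ldaplist -l group wmf`-style output.

    Assumes one `member: uid=...,ou=people,...` line per member
    (an assumption about the schema); adjust the attribute name to
    the real directory layout before relying on this.
    """
    uids = []
    for line in ldaplist_output.splitlines():
        line = line.strip()
        if not line.startswith('member:'):
            continue
        dn = line.split(':', 1)[1].strip()
        # take the uid= component of the DN, if present
        for component in dn.split(','):
            component = component.strip()
            if component.startswith('uid='):
                uids.append(component[len('uid='):])
                break
    return uids
```

Something this small could run as a labs tool or a cron job diffing successive snapshots, which is roughly what the automation request in the ticket asks for.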
[02:09:59] (03PS3) 10BBlack: clean up text/mobile whitespace-only diffs [puppet] - 10https://gerrit.wikimedia.org/r/235148 (https://phabricator.wikimedia.org/T109286) [02:10:01] (03PS3) 10BBlack: mobile: add bits compat code [puppet] - 10https://gerrit.wikimedia.org/r/235147 (https://phabricator.wikimedia.org/T109286) [02:10:03] (03PS3) 10BBlack: standardize hiera-overridable class/config params [puppet] - 10https://gerrit.wikimedia.org/r/235146 (https://phabricator.wikimedia.org/T96847) [02:10:05] (03PS3) 10BBlack: vcl: merge cluster_options into vcl_config, refactor [puppet] - 10https://gerrit.wikimedia.org/r/235145 (https://phabricator.wikimedia.org/T96847) [02:10:25] robh: we've (I've?) been thinking that's a false-ish alert for a long time, but actually now I think it's real [02:10:36] the solution is complicated though, and there's nothing we can do about it tonight easily [02:10:45] oh [02:11:04] i stand corrected [02:11:09] I was thinking icinga was at fault. now I'm pretty sure it's our internal ipv6 networking in general that's flaky [02:11:16] asdfasldkjfas;kjfds damn you icinga not false but annoyingly not correctable alert! [02:11:48] I suppose I could downtime it, but I haven't even made a good phab summary of the issue, and it needs to get worked on soon. leaving it alive will pester me [02:11:54] so it's a real alert in our monitoring system that isn't being replicated to our users [02:11:54] ? [02:12:18] no, I think it's actually-flaky.
if we had more v6 traffic/users, they'd probably be complaining once in a blue moon [02:12:27] ah [02:12:38] but then they likely fail over to v4 for now like most folks [02:12:44] yeah maybe [02:13:04] it will become more of an issue than an annoying page then, good to know =] [02:14:31] (03CR) 10BBlack: [C: 031] "compiler results look good, but holding for tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/235145 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [02:16:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [02:25:23] !log l10nupdate@tin Synchronized php-1.26wmf20/cache/l10n: l10nupdate for 1.26wmf20 (duration: 06m 00s) [02:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:04] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:28:30] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf20) at 2015-09-01 02:28:30+00:00 [02:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:13] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:32:13] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10848 bytes in 0.089 second response time [02:59:59] 6operations, 7Surveys: Upload survey dataset to dumps.wikimedia.org - https://phabricator.wikimedia.org/T110746#1591474 (10Tbayer) Here are the three files. (As discussed in person today, there will also be some separate CSV in addition - but those don't need to go up today; will reopen this task then.) Thanks... 
[03:07:29] (03PS1) 10Alex Monk: Don't set wgEchoBundleEmailInterval if it won't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235177 (https://phabricator.wikimedia.org/T110985) [03:22:02] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [03:23:03] 6operations, 6Labs, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#1591503 (10Krenair) 3NEW [03:29:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:20:36] 6operations, 10MediaWiki-extensions-CentralNotice, 7Database, 7Schema-change, 7Tracking: Create CentralNotice campaign mixin tables - https://phabricator.wikimedia.org/T110963#1591593 (10greg) [04:20:54] 6operations, 10MediaWiki-extensions-CentralNotice, 7Database, 7Schema-change: Create CentralNotice campaign mixin tables - https://phabricator.wikimedia.org/T110963#1591596 (10greg) [04:27:43] (03PS4) 10BBlack: standardize hiera-overridable class/config params [puppet] - 10https://gerrit.wikimedia.org/r/235146 (https://phabricator.wikimedia.org/T96847) [04:31:33] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (4279 100000s) [04:31:41] (03PS5) 10BBlack: standardize hiera-overridable class/config params [puppet] - 10https://gerrit.wikimedia.org/r/235146 (https://phabricator.wikimedia.org/T96847) [04:32:17] (03CR) 10BBlack: [C: 032] vcl: merge cluster_options into vcl_config, refactor [puppet] - 10https://gerrit.wikimedia.org/r/235145 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [04:39:52] (03CR) 10BBlack: [C: 032] "compiler no-op on canaries" [puppet] - 10https://gerrit.wikimedia.org/r/235146 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [04:40:09] (03PS4) 10BBlack: mobile: add bits compat code [puppet] - 
10https://gerrit.wikimedia.org/r/235147 (https://phabricator.wikimedia.org/T109286) [04:44:26] (03CR) 10BBlack: [C: 032] mobile: add bits compat code [puppet] - 10https://gerrit.wikimedia.org/r/235147 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [04:44:34] (03PS4) 10BBlack: clean up text/mobile whitespace-only diffs [puppet] - 10https://gerrit.wikimedia.org/r/235148 (https://phabricator.wikimedia.org/T109286) [04:44:43] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: puppet fail [04:44:57] (03CR) 10BBlack: [C: 032 V: 032] clean up text/mobile whitespace-only diffs [puppet] - 10https://gerrit.wikimedia.org/r/235148 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [04:48:43] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:03:13] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [05:07:03] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail [05:07:48] (03PS1) 10BBlack: Bugfix for c6806db0 + misc-web VCL [puppet] - 10https://gerrit.wikimedia.org/r/235183 [05:08:01] (03CR) 10BBlack: [C: 032 V: 032] Bugfix for c6806db0 + misc-web VCL [puppet] - 10https://gerrit.wikimedia.org/r/235183 (owner: 10BBlack) [05:09:12] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [05:09:23] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [05:10:54] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:11:22] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:14:53] (03PS2) 10Glaisher: Clean up WikidataPageBanner related config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234944 [05:15:44] (03CR) 10Glaisher: Enable WikidataPageBanner extension on Russian Wikivoyage (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/234942 (https://phabricator.wikimedia.org/T110837) (owner: 10Glaisher) [05:18:09] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 1 05:18:09 UTC 2015 (duration 18m 8s) [05:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:40:53] (03Draft1) 10Ori.livneh: pybal (1.09) jessie-wikimedia; urgency=low [debs/pybal] - 10https://gerrit.wikimedia.org/r/235186 [06:01:03] (03CR) 10Ori.livneh: [C: 032] pybal (1.09) jessie-wikimedia; urgency=low [debs/pybal] - 10https://gerrit.wikimedia.org/r/235186 (owner: 10Ori.livneh) [06:01:20] (03Merged) 10jenkins-bot: pybal (1.09) jessie-wikimedia; urgency=low [debs/pybal] - 10https://gerrit.wikimedia.org/r/235186 (owner: 10Ori.livneh) [06:14:22] (03PS1) 10BBlack: trivial mobile/text.pp diff reductions [puppet] - 10https://gerrit.wikimedia.org/r/235187 (https://phabricator.wikimedia.org/T109286) [06:14:24] (03PS1) 10BBlack: add zero updater to text.pp, align position [puppet] - 10https://gerrit.wikimedia.org/r/235188 (https://phabricator.wikimedia.org/T109286) [06:14:26] (03PS1) 10BBlack: genericize the cluster_nodes-like variables to reduce diffs [puppet] - 10https://gerrit.wikimedia.org/r/235189 (https://phabricator.wikimedia.org/T96847) [06:17:43] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: puppet fail [06:21:26] 6operations, 7Surveys: Upload survey dataset to dumps.wikimedia.org - https://phabricator.wikimedia.org/T110746#1591723 (10Dzahn) p:5Triage>3High [06:24:13] 6operations, 7Surveys: Upload survey dataset to dumps.wikimedia.org - https://phabricator.wikimedia.org/T110746#1591728 (10Dzahn) was time critical before the 1st of month PST, per HaeB root@dataset1001:/data/xmldatadumps/public/other/surveys# mkdir editorsurvey2012 root@dataset1001:/data/xmldatadumps/public... 
[06:28:10] 6operations, 10Wikimedia-Mailing-lists: wikinews-l: no active listadmin - https://phabricator.wikimedia.org/T110956#1591731 (10Revi) I'm more active on kowikinews (not enwikinews) but if I can be sufficient or no others are volunteering I am willing to do it. (I am listowner of otrs-ko, if this matter.) [06:31:42] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:04] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:32] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:32] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:53] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:03] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:13] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:23] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:04] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:25] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:32] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:37] puppet o'clock? [06:36:39] !log uploaded survey2012 to dumps/dataset1001; ownership as it is for survey2011; - T110746 in time for midnight PST [06:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:37:18] 6operations, 7Surveys: Upload survey dataset to dumps.wikimedia.org - https://phabricator.wikimedia.org/T110746#1591753 (10Dzahn) 5Open>3Resolved added link on http://dumps.wikimedia.org/other/surveys/ 23:38 < mutante> !log uploaded survey2012 to dumps/dataset1001; ownership as it is for survey2011; - T11... 
[06:43:13] (03CR) 10BBlack: [C: 032] genericize the cluster_nodes-like variables to reduce diffs [puppet] - 10https://gerrit.wikimedia.org/r/235189 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [06:43:41] (03CR) 10BBlack: [C: 032] trivial mobile/text.pp diff reductions [puppet] - 10https://gerrit.wikimedia.org/r/235187 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [06:43:52] (03CR) 10BBlack: [C: 032] add zero updater to text.pp, align position [puppet] - 10https://gerrit.wikimedia.org/r/235188 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [06:45:32] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:55:43] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:55:43] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:56:03] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:56:12] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:56:24] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:56:42] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:53] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:57:22] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures 
[06:58:04] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:04] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:06:42] (03PS1) 10Jcrespo: Repool es1007, pool es1013 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235192 (https://phabricator.wikimedia.org/T105843) [07:09:54] (03CR) 10Jcrespo: [C: 032] "Confirmed the changes against our monitoring (buffer pool, ips, etc.)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235192 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [07:13:10] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1007, pool es1013 (duration: 00m 13s) [07:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:21:33] (03PS1) 10Jcrespo: Depool es1010 to clone it to es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235193 (https://phabricator.wikimedia.org/T105843) [07:22:03] (03CR) 10Jcrespo: [C: 032] Depool es1010 to clone it to es1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235193 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [07:23:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1010 (duration: 00m 12s) [07:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:39:07] _joe_: around? [07:39:53] _joe_: we need to talk about https://phabricator.wikimedia.org/T103911 (monitoring/alerting for WDQS) [07:43:37] SMalyshev, joe is on vacation for 2 weeks [07:43:58] jynus: ahh, thanks, I didn't know that :) [07:45:13] !log cloning mysql data from es1010 to es1017 [ETA: 6h] [07:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:46:21] which probably means we need some other ops person to work with Discovery team... 
[07:59:03] (03CR) 10Jcrespo: [C: 04-1] "The coredb class is a legacy class- right now it is only in production in the servers that have not been restarted/upgraded recently. Ir" [puppet] - 10https://gerrit.wikimedia.org/r/228806 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [08:03:18] (03CR) 10Jcrespo: [C: 031] "We can proceed with this, phabricator issues are gone for now, and this is a passive slave." [puppet] - 10https://gerrit.wikimedia.org/r/233670 (owner: 10Muehlenhoff) [08:05:11] 6operations: Assign salt grains to server groups for debdeploy - https://phabricator.wikimedia.org/T111006#1591873 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [08:09:12] moritzm: so, let's do 1003, then 2, then 1? [08:10:55] https://gerrit.wikimedia.org/r/#/c/227945/ needs to be rebased and its dependency on the 1001 patch removed, I guess [08:13:56] moritzm: jynus so I am not sure what connection tracking actually gives us in this case.... [08:14:05] if it doesn't give us much maybe we should just turn it off? [08:15:35] moritzm, right now on labs we have 60 active mysql connections [08:16:37] jynus, YuviPanda: ok, I'd say let's go ahead, check the data and reassess as needed [08:17:03] netstat -t | wc -l ---> this gives me 90 [08:18:55] I am doing a query every second from iron and also have a labs app handy [08:19:44] (03PS1) 10Muehlenhoff: Enable ferm on labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/235194 [08:20:09] jynus, YuviPanda: in addition, when we merge it, I'll add iptables rules to catch dropped traffic [08:20:33] moritzm, we now your job is difficult, so do not worry! :-) [08:20:37] *know [08:20:57] but it is much needed [08:21:31] merging 235194 now [08:21:37] ok!
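jynus's "query every second from iron" probe above is easy to approximate in a few lines; this is an illustrative sketch (the check callable, duration, and interval are placeholders, not the actual tooling used), handy for measuring the brief connection-drop window while ferm rules are regenerated:

```python
import time

def probe(check, duration=30, interval=1.0):
    """Run `check()` every `interval` seconds for `duration` seconds
    and return the offsets (seconds since start) at which it raised,
    i.e. the moments the probed service was unreachable.
    """
    failures = []
    start = time.monotonic()
    while time.monotonic() - start < duration:
        try:
            check()
        except Exception:
            failures.append(round(time.monotonic() - start, 1))
        time.sleep(interval)
    return failures
```

Passing a callable that opens a TCP connection to port 3306 would give the "maybe a 2 second drop?" question below a concrete answer: the returned offsets bound the window.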
[08:22:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/235194 (owner: 10Muehlenhoff) [08:23:41] making a puppet run on 1003 [08:24:03] ok, queries from iron stopped [08:24:17] and back again [08:24:23] rules are configured now [08:24:46] connections dropped from labs but back again [08:24:49] maybe a 2 second drop? [08:25:49] that's as expected; during the initial setup a DROP policy is put in effect and only after puppet has generated the rules files which allow the legitimate traffic these are allowed again [08:26:21] yes, my ssh connection also dropped [08:26:28] I will check replication [08:26:34] it should retry after a fail [08:27:11] ok, so far the logging rules haven't shown any dropped traffic [08:27:22] (it's logged to syslog with a prefix "iptables-dropped:") [08:27:59] em [08:28:06] something is wrong with replication [08:28:21] labsdb1003 [08:28:24] right? [08:28:37] yes [08:29:08] no, false alarm [08:29:11] !log repool mw1125 mw1142 after nutcracker failures [08:29:11] my fault [08:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:29:57] jynus, YuviPanda: let's give it 10 more mins to see whether anything pops up and continue with 1002 then? [08:29:59] let me wait for the graphs to update to see the impact [08:30:06] yes, please [08:30:18] yup! [08:30:21] so I can see the impact on connections, if any [08:30:28] jynus: or maybe just ping us when the graph indicates we're good to proceed [08:30:37] (my graphs have a 5 minute delay) [08:31:34] !log enabled ferm on labsdb1003 [08:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:06] so, as far as I can see, there is a small number of connections blocked temporarily, is that consistent/expected?
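The "iptables-dropped:" syslog prefix mentioned above makes dropped traffic greppable; extracting the interesting fields is a one-liner. The SRC/DST/DPT names follow the standard iptables LOG line format, but the sample line in the test is invented for illustration, not taken from a real host:

```python
import re

def parse_drop(line, prefix='iptables-dropped:'):
    """Return the SRC/DST/DPT fields of an iptables LOG syslog line
    carrying the given prefix, or None for unrelated lines.
    """
    if prefix not in line:
        return None
    # iptables LOG lines are KEY=value pairs separated by spaces
    fields = dict(re.findall(r'([A-Z]+)=(\S+)', line))
    return {key: fields.get(key) for key in ('SRC', 'DST', 'DPT')}
```

Run over the syslog stream, this would show at a glance whether anything legitimate (e.g. destination port 3306) was being caught by the new ruleset.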
[08:33:17] not now, when it was deployed [08:36:00] I also had a spike in lag (300s), but that is probably caused by the avalanche of connections and not so critical [08:37:52] jynus: I think the connections blocked temporarily is the effect of the way ferm enables the rules; when the initial DROP policy gets activated, it disrupts existing connections and only in a later step the 3306 connections are accepted again [08:38:40] https://phabricator.wikimedia.org/T110514 [08:39:22] ok, graphs are back to normal, ok to proceed [08:39:44] moritzm: hmm, is it not possible to not have the DROP followed by the accepts but the other way around? [08:40:59] YuviPanda: that could be done, but would also involve patching the ferm package in Debian [08:41:08] jynus: ok, I'll prepare the next change [08:41:23] moritzm: aaah, alright then [08:42:28] (03PS1) 10Muehlenhoff: Enable ferm on labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/235198 [08:42:43] merging ^ [08:43:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/235198 (owner: 10Muehlenhoff) [08:43:29] making a puppet run on 1002 [08:44:31] rules are set up [08:45:37] it is more than 2 seconds, actually [08:46:07] (seems to work from labs now) [08:47:59] jynus: indeed, the full puppet run on 1002 took 19s, and the gap is probably more towards 10s [08:48:13] no signs of dropped traffic so far [08:49:30] likewise for 1003 so far [08:50:32] 3 minute lag, going away [08:51:14] (03PS2) 10Muehlenhoff: Enable base::firewall on labsdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/227944 [08:51:39] !log enabled ferm on labsdb1002 [08:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:18] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1591979 (10fgiunchedi) >>! In T83580#1571705, @Ottomata wrote: > Natively share the dict? Hm. 
Just quickly tried this, and I get an immediate segfault: err, what I meant is to kee... [08:55:08] lag is gone now [08:55:35] jynus, YuviPanda: ok to proceed with 1001? [08:55:59] moritzm: yep [08:56:02] ok for me [08:56:05] ok, merging [08:56:18] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on labsdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/227944 (owner: 10Muehlenhoff) [08:56:55] making a puppet run on 1001 [08:58:01] rules are up [08:58:33] !log fixup current graphite retention for metrics under "servers" hierarchy T96662 [08:58:38] 17 seconds of new connections being blocked [08:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:58:52] !log enabled ferm on labsdb1001 [08:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:32] jynus: moritzm: so all good on that too, I guess? [09:06:19] (03PS3) 10ArielGlenn: schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 [09:06:59] (03PS4) 10ArielGlenn: schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 [09:08:07] (03PS5) 10ArielGlenn: schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 [09:08:21] moritzm: we still have to do labsdb1004, 5, 6, 7, right? [09:09:00] 5 needs some more work on the rules, but 4,6,7 are good to go [09:09:27] labsdb100[1-3] seem fine to me, yes [09:11:27] YuviPanda, jynus: any preference on the order? 7 -> 6 -> 4? [09:11:37] yeah [09:11:44] 7 and 6 are... postgres, right? [09:12:05] all three are postgres [09:12:17] right [09:12:28] 4 is also supposed to have mysql but doesn't have it yet [09:12:37] so 7 and 6 are used by maps [09:12:41] 4 is used by far fewer people [09:12:46] so let's do 4, 7, 6? [09:13:07] jynus: good morning :-) Is there any procedure to have a new misc MySQL database setup ? 
I am wondering whether you have a standard form listing the needs :D [09:13:48] jynus: that is for a misc service like phabricator or Gerrit. Not a wiki [09:13:57] (03PS6) 10ArielGlenn: schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 [09:14:03] hashar, if you need it on new hardware, you first need to buy it [09:14:20] jynus: does 4 -> 7 -> 6 work for you? [09:14:56] (03CR) 10ArielGlenn: [C: 032 V: 032] schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 (owner: 10ArielGlenn) [09:15:02] hashar, otherwise, just create a ticket with the needs, QPS, type of application, etc. [09:15:32] all details so I can accommodate it, if possible [09:15:49] (load expected) [09:16:58] moritz, as 4,6,7 were not mysql boxes, I haven't checked the rules [09:17:33] the rules have been reviewed, let me dig out the gerrit changes [09:17:58] then go for it, it is just that I do not have good monitoring for them [09:18:45] nor, to be fair, am I familiar with their application usage, like 1,2,3 [09:19:10] e.g. maps [09:19:35] YuviPanda: do we have some basic functionality tests for 4,6,7 (plus, we'd be able to spot dropped traffic) [09:20:04] so I think one of 6 or 7 is totally unused atm and is just a failover one. let me dig up what that is [09:20:16] I wonder if akosiaris is around, he's been the one dealing with all those boxes [09:20:47] jynus: nice, filing a ticket :-} [09:20:58] YuviPanda: according to git blame he's also the primary author of postgres.pp and osm.pp [09:21:04] yeah [09:21:50] I am around [09:21:58] ah postgres boxes [09:22:04] so what's up with them ? [09:22:07] but I suppose those are the old maps machines, right? [09:22:13] not the new setup [09:22:15] define old [09:22:17] oh [09:22:20] the one in labs [09:22:25] sigh... 
[09:22:29] not the one in codfw [09:22:34] so maps-test* is the "new" setup [09:22:39] yes [09:22:53] labsdb100{6,7} is the "new" setup for labs osm project [09:23:02] labsdb100{4,5} is the old one for the same reason [09:23:03] but [09:23:14] labsdb1005 does not have a postgres installed at the moment [09:23:24] jynus: we are actually waiting on you to upgrade it to jessie first [09:23:30] and then turn it to a postgres slave [09:23:34] waiting on me? [09:23:47] akosiaris: this is just enabling ferm on those boxes atm [09:23:49] well, you are going to mumble something about a ticket [09:23:53] and it will be correct [09:23:58] :-) [09:24:02] we did that with YuviPanda in wikimania [09:24:15] ah ok then [09:24:27] so don't enable ferm in the module or we are toast [09:24:35] there are roles around for that [09:24:44] i'm ok with doing that, just tell me :-) [09:24:44] so maps.pp for ferm in the maps project (the one in codfw) [09:25:06] I'm somewhat confused now. [09:25:32] so 4, 6, 7 are just postgres ones, right? and so we can basically just enable ferm on them assuming we have appropriate holes for the postgres ports [09:25:35] postgres.pp for labsdb100{4,5} and osm.pp for labsdb100{6,7} [09:26:03] the required ferm rules are all defined in postgres.pp and osm.pp, but the roles don't include base::firewall ATM [09:26:26] YuviPanda: yes they are. yes but do it per role [09:26:31] moritzm: perfect then [09:27:10] btw, sentry is coming along. not sure where it will be deployed but for some reason it also needs postgres [09:27:58] what does the "so don't enable ferm in the module or we are toast" comment refer to, only labsdb1005, so, we can enable ferm in the role for osm::master, osm::slave and role::postgres::master? 
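[editor's sketch] Per akosiaris's point above about flipping ferm per role rather than in the shared module, a role that opts in might look roughly like this. The class and rule names follow the ones mentioned in the discussion, but the body is illustrative, not the actual manifest in osm.pp/postgres.pp:

```puppet
# Illustrative only: the real definitions live in osm.pp / postgres.pp.
class role::osm::master {
    include ::osm
    include ::base::firewall   # flipped here, per role, so each set of
                               # boxes can be enabled and watched separately

    # The service holes stay with the service; something along these lines
    # for postgres (srange restricted to internal nets is an assumption):
    ferm::service { 'osm_postgres':
        proto  => 'tcp',
        port   => '5432',
        srange => '$INTERNAL',
    }
}
```

Keeping `include ::base::firewall` out of the module itself is what lets each group of hosts be converted one at a time.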
[09:29:01] moritzm: it refers to the fact that these 3 sets of boxes have different clients and we probably ain't gonna get the rules correctly on the first try [09:29:40] moritzm: so the 3 roles you just mentioned, those are the perfect place to enable ferm [09:30:22] sure, we'd flip these one by one to look out for regressions, if that's what you refer to [09:30:42] well, regressions would take a long time to show up [09:30:51] days if not weeks [09:31:08] as in VMs in labs use them and ppl might notice after a long time [09:31:18] and they will rightfully complain after that [09:31:43] select zoom,block,idx from tiles WHERE zoom >= 0 and block >= 0 limit 10; [09:31:44] InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key [09:31:51] congratulations cassandra!! [09:32:20] lol [09:33:08] CQL <> SQL [09:33:10] akosiaris: moritzm I think if we know that the postgres ports are open, that should be enough no? [09:33:27] YuviPanda: yes, I think so [09:33:37] jynus: tell me about it... [09:33:37] ok, so the ferm rules are there.... [09:33:57] jynus: it gets even better [09:34:04] akosiaris: moritzm I'd say let's just do it and see what happens :) I can test 4 easily enough (labels.wmflabs.org is on it) [09:34:07] I suppose that for low traffic nodes, logging can be left on [09:34:10] and I'm guessing we can test the maps stuff easily too [09:34:12] tables created with CQL3 and not viewable with CQL2 clients [09:34:53] YuviPanda: ok [09:35:07] moritzm: in that case, let's do 4 first since we have an easy verification mechanism for that [09:35:27] akosiaris: ah yeah, thrift interface vs the new one or sth like that? [09:35:39] YuviPanda: remember the labsdb1006,7 are only being used by the maps labs project [09:35:41] godog: yes [09:35:52] godog: the more I get to know cassandra, the less I like it btw [09:36:04] akosiaris: we'll be adding logging rules for dropped traffic, so in case there's e.g. 
queries from outside the internal network (as the current postgres rules are limited to), we'd spot it and will amend the rules [09:36:28] moritzm: that's nice. good idea [09:36:34] akosiaris, cassandra is way better than other distributed storages [09:36:55] jynus: like mongodb ? obviously!!! [09:37:00] but that does not say much [09:37:23] mongodb is useful if not used with the default storage backend [09:37:46] YuviPanda: which one shall we start with? [09:37:52] jynus: I suppose you are familiar with aphyr.com ? [09:38:00] 6operations, 5Continuous-Integration-Isolation, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1592079 (10hashar) [09:38:05] moritzm: 4 [09:38:14] jynus: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads [09:38:16] ok, I'll prepare the change [09:38:19] akosiaris, only by name [09:38:23] and so on [09:38:23] akosiaris: heh, there are some wtf alright [09:38:37] he has some very very good blog posts [09:38:42] jynus: worth reading [09:39:07] for instance, I am trying to fetch all cassandra rows... [09:39:22] so it has a builtin LIMIT of 10000 [09:39:28] I should override it [09:39:30] to what ? [09:39:34] moritzm: ok! [09:39:37] well..... not really possible to know [09:40:04] well, in its defence, it is a key-value store [09:40:15] the guesstimations of nodetool cfstats are obviously wrong [09:40:38] by design, you shouldn't be able to get any range [09:40:38] I think I am gonna override it with something like 999999999999999999999999999999999999999999999999999999999 [09:41:01] jynus: which is the perfect way to be vendor locked-in [09:41:15] (03PS1) 10Muehlenhoff: Enable ferm for postgres::maste role (currently only used by labsdb1004) [puppet] - 10https://gerrit.wikimedia.org/r/235201 [09:41:29] me: "hey, I want my data dumped". 
Cassandra: "I'm sorry, Dave, I can't let you do that" [09:41:44] I do not know, as I said many times, my main concern about many of those storages is the cluster part of it and its consistency model [09:42:03] not so much about its features [09:42:30] jynus: you asked for it. https://aphyr.com/posts/294-call-me-maybe-cassandra/ [09:42:33] it's a nice read [09:43:02] akosiaris: +1 https://gerrit.wikimedia.org/r/#/c/235201/1 [09:43:03] ? [09:43:15] we could do the same to mysql: "mysql replication, manual provisioning. WTF? is this 1960" [09:43:59] my philosophy is: there is no bad technology, only bad application [09:44:37] (03CR) 10Alexandros Kosiaris: [C: 031] Enable ferm for postgres::maste role (currently only used by labsdb1004) [puppet] - 10https://gerrit.wikimedia.org/r/235201 (owner: 10Muehlenhoff) [09:44:51] akosiaris: thanks [09:44:59] YuviPanda: there you go. I also have a gerrit review for pip in contint from you in the queue [09:45:07] jynus: well, that depends on the technology [09:45:15] for example. N-Bombs [09:45:18] (03PS1) 10Yuvipanda: labstore: Run the cleanup script every day [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) [09:45:25] not many good applications for that thing [09:45:41] akosiaris: yeah, that's from hashar tho [09:45:43] well, akosiaris, what if we need them to repel an alien invasion of borgs? [09:45:57] jynus: I was going to make that exact argument [09:46:04] :-D [09:46:07] I wasn't going to mention borgs specifically but still [09:46:20] cause I really can't think of anything else [09:46:24] when I reach this level I usually go to the sqlite argument [09:46:46] which one is that ? [09:47:00] what if we need to repel an alien invasion of sqlite? 
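[editor's sketch] The "fetch all cassandra rows" problem above is normally handled by paging rather than by overriding LIMIT with a huge number: fetch a page, remember where you stopped, repeat until a page comes back short. A minimal sketch of that loop in Python, with a stand-in `fetch_page` instead of a real driver call (real drivers, e.g. the DataStax driver's `fetch_size`/paging state, do this transparently):

```python
def page_all(fetch_page, page_size=10000):
    """Yield every row by fetching successive pages until the paging
    state is exhausted, instead of guessing one giant LIMIT up front."""
    state = None
    while True:
        rows, state = fetch_page(page_size, state)
        for row in rows:
            yield row
        if state is None:  # no more pages
            return

def make_fake_store(n):
    """Stand-in for a driver: pages over n integers, returning the
    next offset as an opaque paging state (None when done)."""
    data = list(range(n))
    def fetch_page(size, state):
        start = state or 0
        rows = data[start:start + size]
        nxt = start + size if start + size < len(data) else None
        return rows, nxt
    return fetch_page

rows = list(page_all(make_fake_store(25), page_size=10))
# rows now contains all 25 items, in order, fetched in three pages
```

The same shape works against any store that hands back an opaque "continue from here" token.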
"what a horrible database" -> but the one with most uses overall [09:47:21] look at your phone and the 200 instances of it [09:47:32] probably more [09:47:43] but yeah, it's put to good use [09:47:48] cause it's cheap [09:47:49] * YuviPanda used it a fair bit for the android apps [09:47:57] cassandra ain't exactly cheap though [09:48:19] YuviPanda: I'll go ahead with the merge for 1004, ok? [09:48:20] it is a great database for that specific usage [09:48:31] (sqlite) [09:48:31] moritzm: yup, I've a test case ready [09:48:39] yup [09:48:42] (also https://gerrit.wikimedia.org/r/235203 for a systemd related change if anyone wants to review) [09:48:48] (03PS2) 10Muehlenhoff: Enable ferm for postgres::maste role (currently only used by labsdb1004) [puppet] - 10https://gerrit.wikimedia.org/r/235201 [09:49:04] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for postgres::maste role (currently only used by labsdb1004) [puppet] - 10https://gerrit.wikimedia.org/r/235201 (owner: 10Muehlenhoff) [09:49:09] if you have a relational model, and a need for complex queries, go for mysql or postgres [09:49:18] YuviPanda: systemd timers ? [09:49:31] what happened to good old cron ? [09:49:35] making a puppet run [09:49:48] * akosiaris never used systemd timers yet [09:50:50] YuviPanda: rules are up [09:51:49] moritzm: testing now [09:51:57] hmm, something might've broken [09:52:17] moritzm: are you seeing any dropped traffic? [09:52:21] nope [09:53:11] moritzm: ah, yep. it's all good [09:53:18] I think the active connection it had dropped... [09:54:01] moritzm: so that's good, I think [09:56:16] moritzm: not sure how we can test 6 / 7 tho [09:57:39] YuviPanda: 1004 should be good, no signs of dropped packets so far, but will continue to check through the day, also doesn't have a slave to sync against ATM [09:58:16] yeah [10:00:40] not sure about 6/7, akosiaris, is there a test case/setup for osm::master and osm::slave? [10:01:22] moritzm: hmm, the maps labs project... 
but it is caching tiles anyway so it might be a bit difficult to test [10:01:24] lemme check [10:01:40] ah, I think I can help you [10:02:43] ok I got a test case [10:02:52] 6operations: dumps: update docs to reflect staged dumps and xml streams - https://phabricator.wikimedia.org/T111018#1592121 (10ArielGlenn) 3NEW a:3ArielGlenn [10:03:30] moritzm: I can test it, wanna do the change ? [10:04:00] 6operations, 5Continuous-Integration-Isolation, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1592130 (10jcrespo) a:3jcrespo @Hashar is this related to the other OpenStack-related databases that normally @Andrew works with? [10:04:03] sure, do you want to flip osm::master or osm::slave first, or both at the same time? [10:04:21] 6operations, 5Continuous-Integration-Isolation, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1592132 (10jcrespo) p:5Triage>3Normal [10:05:59] (03PS1) 10Muehlenhoff: Enable ferm for role::osm::master (currently only used by labsdb1006) [puppet] - 10https://gerrit.wikimedia.org/r/235205 [10:07:23] (03PS1) 10Muehlenhoff: Enable ferm for osm::slave role (currently only used for labsdb1007) [puppet] - 10https://gerrit.wikimedia.org/r/235206 [10:10:38] (03PS1) 10ArielGlenn: add apps development guidelines to legal text for dumps [puppet] - 10https://gerrit.wikimedia.org/r/235208 [10:11:17] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1592156 (10ArielGlenn) @Krenair, how about this? 
https://gerrit.wikimedia.org/r/235208 [10:11:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] labstore: Run the cleanup script every day (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) (owner: 10Yuvipanda) [10:20:31] (03PS1) 10Krinkle: Remove unused rl-test.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235209 [10:23:08] 6operations: Retention auditing: clean up rules db contents and use - https://phabricator.wikimedia.org/T111020#1592176 (10ArielGlenn) 3NEW a:3ArielGlenn [10:23:27] 6operations: Retention auditing: clean up rules db contents and use - https://phabricator.wikimedia.org/T111020#1592186 (10ArielGlenn) [10:23:28] 6operations: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066#1592185 (10ArielGlenn) [10:26:09] 6operations: Data retention: revise audit bash scripts - https://phabricator.wikimedia.org/T111021#1592187 (10ArielGlenn) 3NEW a:3ArielGlenn [10:26:22] 6operations: Data retention: revise audit bash scripts - https://phabricator.wikimedia.org/T111021#1592196 (10ArielGlenn) [10:27:25] 6operations: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066#1568234 (10ArielGlenn) [10:29:11] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1592215 (10akosiaris) >>! In T110514#1579712, @MoritzMuehlenhoff wrote: > Fixing the default config will only limit the window; there's still a window between loading /etc/ferm/conf.d/00_main (which sets up the DROP policy... [10:30:10] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1592219 (10akosiaris) my take is 1. I fear that 2. will end up being messy and unwieldy and will confuse people. [10:34:05] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1592233 (10MoritzMuehlenhoff) I have a ferm test setup in labs; I'll give the patched ferm a try in there. 
[10:36:48] (03CR) 10Alexandros Kosiaris: [C: 031] Enable ferm for osm::slave role (currently only used for labsdb1007) [puppet] - 10https://gerrit.wikimedia.org/r/235206 (owner: 10Muehlenhoff) [10:37:04] (03CR) 10Alexandros Kosiaris: [C: 031] Enable ferm for role::osm::master (currently only used by labsdb1006) [puppet] - 10https://gerrit.wikimedia.org/r/235205 (owner: 10Muehlenhoff) [10:43:23] (03PS2) 10Krinkle: contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [10:43:29] (03CR) 10Krinkle: [C: 031] contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [10:45:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor issue inline. Otherwise LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [10:46:22] (03CR) 10Alexandros Kosiaris: [C: 031] contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [10:53:23] akosiaris: please ping me we when you've time to test osm and I'll do the merges [11:09:22] (03Abandoned) 10Muehlenhoff: Enable base::firewall for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/229688 (owner: 10Muehlenhoff) [11:15:47] (03CR) 10Lydia Pintscher: "Can this be merged please?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [11:20:51] (03Abandoned) 10Muehlenhoff: Enable base::firewall for labsdb1006 [puppet] - 10https://gerrit.wikimedia.org/r/229697 (owner: 10Muehlenhoff) [11:21:11] (03Abandoned) 10Muehlenhoff: Enable base::firewall for labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/229698 (owner: 10Muehlenhoff) [11:27:50] moritzm: I am around, I can test [11:30:18] akosiaris: ok, starting with the slave first? 
[11:31:47] sure [11:32:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for osm::slave role (currently only used for labsdb1007) [puppet] - 10https://gerrit.wikimedia.org/r/235206 (owner: 10Muehlenhoff) [11:35:02] akosiaris: rules on 1007 are enabled now [11:37:29] moritzm: testing says "OK" [11:38:36] also no signs of dropped traffic (will monitor through the next days anyway) [11:38:41] doing the master now [11:39:20] (03PS2) 10Muehlenhoff: Enable ferm for role::osm::master (currently only used by labsdb1006) [puppet] - 10https://gerrit.wikimedia.org/r/235205 [11:39:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for role::osm::master (currently only used by labsdb1006) [puppet] - 10https://gerrit.wikimedia.org/r/235205 (owner: 10Muehlenhoff) [11:44:24] 6operations, 5Continuous-Integration-Isolation, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1592390 (10hashar) //This task is to pick a database for Nodepool// For continuous integration purposes, we are setting up a python based daemon named Nodepool. It maintains a... [11:45:43] akosiaris: ferm chokes on the rsync_from_labs srange, I'm having a look [11:45:59] 6operations, 5Continuous-Integration-Isolation, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1592393 (10hashar) >>! In T110693#1592130, @jcrespo wrote: > @Hashar is this related to the other OpenStack-related databases that normally @Andrew works with? Unrelated. It is... 
[11:46:17] PROBLEM - RAID on labsdb1006 is CRITICAL: Timeout while attempting connection [11:46:39] PROBLEM - Check if rsync server is running on labsdb1006 is CRITICAL: Timeout while attempting connection [11:46:56] PROBLEM - configured eth on labsdb1006 is CRITICAL: Timeout while attempting connection [11:46:57] PROBLEM - DPKG on labsdb1006 is CRITICAL: Timeout while attempting connection [11:47:07] PROBLEM - puppet last run on labsdb1006 is CRITICAL: Timeout while attempting connection [11:47:07] PROBLEM - Disk space on labsdb1006 is CRITICAL: Timeout while attempting connection [11:47:16] PROBLEM - dhclient process on labsdb1006 is CRITICAL: Timeout while attempting connection [11:47:27] PROBLEM - salt-minion processes on labsdb1006 is CRITICAL: Timeout while attempting connection [11:57:26] moritzm: ^ [11:57:37] err [11:57:50] akosiaris: ^ [11:57:53] is that just NRPE? [11:58:26] YuviPanda: not sure. moritzm? sudo service ferm stop [11:58:41] unless you've already done it and this is just stale [11:59:26] no [11:59:28] let me do that [12:00:17] https://www.irccloud.com/pastebin/wfsgvBkW/ [12:00:20] akosiaris: moritzm ^ [12:00:28] RECOVERY - RAID on labsdb1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [12:00:34] I've just stopped ferm (the syntax error prevented that) [12:00:47] RECOVERY - Check if rsync server is running on labsdb1006 is OK: PROCS OK: 1 process with command name rsync, regex args /usr/bin/rsync --no-detach --daemon [12:01:06] RECOVERY - configured eth on labsdb1006 is OK: OK - interfaces up [12:01:07] RECOVERY - DPKG on labsdb1006 is OK: All packages OK [12:01:13] !log disable puppet on labsdb1006 [12:01:17] RECOVERY - Disk space on labsdb1006 is OK: DISK OK [12:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:28] RECOVERY - dhclient process on labsdb1006 is OK: PROCS OK: 0 processes with command name dhclient [12:01:37] RECOVERY - salt-minion processes on labsdb1006 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:02:13] I'll revert the base::firewall inclusion for now so that we can properly sort out the broken srange in rsync_from_labs [12:03:26] (03PS1) 10Muehlenhoff: Revert inclusion of base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/235214 [12:03:38] moritzm: wait, lemme check it first [12:05:00] akosiaris: sure (the current /etc/ferm/conf.d/50_rsync_from_labs is hand-edited, otherwise ferm choked on the syntax error when stopping the service) [12:05:45] ok thanks [12:05:56] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:06:04] and sigh... there is a rule 10_rsync-server [12:06:07] I enabled puppet btw [12:06:29] due to ferm's design it can not apply the transaction so it can not create another problem [12:07:11] so even if the rsync_from_labs rule was ok, the 10_rsync_server rule would have priority altering the security policy for the service [12:07:23] which is why I don't want ferm::rules in modules [12:08:50] ok found it [12:08:51] ack, I'll doublecheck the use of the rsync rule in the module, all the recent ones have only been added to the roles [12:08:57] dash vs underscore [12:08:58] sigh [12:12:56] (03PS2) 10Yuvipanda: labstore: Run the cleanup script every day [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) [12:12:57] akosiaris: ^ updated. thanks for catching those! [12:13:10] (03PS1) 10Alexandros Kosiaris: osm: fix typo in ferm::service srange [puppet] - 10https://gerrit.wikimedia.org/r/235216 [12:14:52] YuviPanda: seems fine, just curious why not a cron ? 
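[editor's sketch] Both problems diagnosed above come down to the same mechanism: the precedence issue (`10_rsync-server` overriding the role's `50_rsync_from_labs` rule) and the dash-vs-underscore typo. Assuming, as the numeric prefixes suggest, that the `/etc/ferm/conf.d/` fragments are applied in sorted filename order, a quick illustration:

```python
# Fragments are applied in lexical filename order, so the numeric prefix
# decides precedence: 00_main first, then 10_*, then 50_*.
fragments = ["50_rsync_from_labs", "10_rsync-server", "00_main"]
print(sorted(fragments))

# A dash/underscore mixup is easy to miss and also affects ordering:
# '-' (0x2d) sorts before '_' (0x5f) in ASCII.
assert ord('-') < ord('_')
```

This is why a rule landing in the module rather than the role can silently take priority over the one the role defines.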
[12:15:10] (03CR) 10Alexandros Kosiaris: [C: 032] osm: fix typo in ferm::service srange [puppet] - 10https://gerrit.wikimedia.org/r/235216 (owner: 10Alexandros Kosiaris) [12:15:30] moritzm: ^ fixed [12:15:37] (03CR) 10Yuvipanda: labstore: Run the cleanup script every day (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) (owner: 10Yuvipanda) [12:15:40] thanks for caring and sorry for the typo [12:15:43] akosiaris: ^ there. forgot to hit review [12:15:57] akosiaris: ok, thanks! [12:16:13] akosiaris: mostly, 'systemctl start' in a cronjob felt weird, and [12:16:32] akosiaris: also later on this gives us the ability to say '1h after the previous one ended' which we can't really do with cron [12:16:41] (03CR) 10Alexandros Kosiaris: labstore: Run the cleanup script every day (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) (owner: 10Yuvipanda) [12:16:47] (03CR) 10Alexandros Kosiaris: [C: 031] labstore: Run the cleanup script every day [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) (owner: 10Yuvipanda) [12:17:03] akosiaris: with OnUnitActiveSec [12:17:20] no dropped traffic on 1006 either, but will have a look throughout the next days [12:17:25] (03PS3) 10Yuvipanda: labstore: Run the cleanup script every day [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) [12:17:33] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Run the cleanup script every day [puppet] - 10https://gerrit.wikimedia.org/r/235203 (https://phabricator.wikimedia.org/T109954) (owner: 10Yuvipanda) [12:17:46] (03Abandoned) 10Muehlenhoff: Revert inclusion of base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/235214 (owner: 10Muehlenhoff) [12:22:18] (03PS1) 10Yuvipanda: labstore: Make sure that the timer and service names match [puppet] - 10https://gerrit.wikimedia.org/r/235217 [12:22:26] (03CR) 
10jenkins-bot: [V: 04-1] labstore: Make sure that the timer and service names match [puppet] - 10https://gerrit.wikimedia.org/r/235217 (owner: 10Yuvipanda) [12:22:33] (03PS2) 10Yuvipanda: labstore: Make sure that the timer and service names match [puppet] - 10https://gerrit.wikimedia.org/r/235217 [12:26:52] 6operations, 6Engineering-Community, 3ECT-September-2015: date/budget proposal for 2015 Ops Offsite - https://phabricator.wikimedia.org/T89023#1592484 (10Qgil) [12:28:16] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Make sure that the timer and service names match [puppet] - 10https://gerrit.wikimedia.org/r/235217 (owner: 10Yuvipanda) [12:30:46] 7Puppet, 6operations: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031#1592500 (10yuvipanda) 3NEW [12:34:26] 6operations, 6Labs, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1592521 (10yuvipanda) Ok, so the problem was that the cleanup script wasn't being triggered by any means automatically. Should be fixed now - need to... 
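[editor's sketch] The timer being discussed (run the cleanup daily now, with the option of "N hours after the previous run" later) might look roughly like this. Unit names are illustrative; note the real directive names are `OnUnitActiveSec`/`OnUnitInactiveSec`:

```ini
# cleanup-snapshots.timer (name illustrative); pairs with a
# cleanup-snapshots.service that runs the actual cleanup script.
[Unit]
Description=Periodic labstore snapshot cleanup

[Timer]
OnCalendar=daily
# Alternatives relative to the previous run, which cron cannot express:
#   OnUnitActiveSec=1h     1h after the service last started
#   OnUnitInactiveSec=1h   1h after the service last finished

[Install]
WantedBy=timers.target
```

`systemctl start cleanup-snapshots.timer` (or enabling it) then schedules the matching `.service`, which is why the timer and service names have to agree.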
[12:34:39] 6operations, 6Labs, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1592523 (10yuvipanda) a:3yuvipanda [12:57:34] (03PS3) 10TheDJ: Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) [12:58:30] (03PS4) 10TheDJ: Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) [13:00:06] (03Abandoned) 10Muehlenhoff: Enable base::firewall on labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/227945 (owner: 10Muehlenhoff) [13:05:01] PROBLEM - Last cleanup of snapshots in the labstore vg on labstore1001 is CRITICAL: NRPE: Command check_cleanup-snapshots-labstore-state not defined [13:06:33] hmmm [13:09:21] !log enabled ferm on labsdb100[467] [13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:11] (03PS2) 10Muehlenhoff: Enable ferm on db1048.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/233670 [13:13:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on db1048.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/233670 (owner: 10Muehlenhoff) [13:14:22] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: puppet fail [13:15:58] 6operations, 10ops-eqiad: Change racktables entries for renamed analytics -> kafka names - https://phabricator.wikimedia.org/T109856#1592690 (10Cmjohnson) Common Names have been changed. 
Need to add labels [13:17:01] 6operations, 10hardware-requests, 7Database, 5Patch-For-Review: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1592699 (10Cmjohnson) a:5Cmjohnson>3jcrespo Assigning this to @jcrespo [13:17:05] !log enabled ferm on db1048 [13:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:03] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=533.10 Read Requests/Sec=0.00 Write Requests/Sec=468.88 KBytes Read/Sec=0.00 KBytes_Written/Sec=1875.10 [13:28:12] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.10 Read Requests/Sec=0.00 Write Requests/Sec=0.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=1.20 [13:31:29] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [13:32:17] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1592738 (10AndyRussG) @jrobell thanks a lot for mentioning this! I didn't know it was widespread e... [13:33:15] (03PS1) 10Muehlenhoff: Add a custom rsync ferm rule for swift storage [puppet] - 10https://gerrit.wikimedia.org/r/235221 (https://phabricator.wikimedia.org/T108987) [13:33:33] (03CR) 10Krinkle: [C: 032] Remove unused rl-test.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235209 (owner: 10Krinkle) [13:33:39] (03Merged) 10jenkins-bot: Remove unused rl-test.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235209 (owner: 10Krinkle) [13:35:12] (03PS1) 10Krinkle: Remove all files except README [apache-config] - 10https://gerrit.wikimedia.org/r/235222 [13:36:08] (03CR) 10Krinkle: "Keeps coming up in search results which is annoying. Let's clear out its HEAD. 
Can be kept for archival purposes in Git history if wanted." [apache-config] - 10https://gerrit.wikimedia.org/r/235222 (owner: 10Krinkle) [13:38:49] !log krinkle@tin Synchronized w/: Remove rl-test.php (duration: 00m 13s) [13:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:39:04] 7Puppet, 6operations: Kill manifests/realm.pp - https://phabricator.wikimedia.org/T85459#1592790 (10fgiunchedi) also some things don't seem to belong there at all, for example `$site` is currently autodetected from ip address whereas I think it should be set by provisioning. e.g. by writing `/etc/wikimedia/sit... [13:43:19] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:48:50] (03PS1) 10BBlack: Add missing codfw public icinga monitors [puppet] - 10https://gerrit.wikimedia.org/r/235225 [13:49:59] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1592844 (10zeljkofilipin) [[ https://gerrit.wikimedia.org/r/#/c/225238/ | 225238 ]] was reverted by [[ https://gerrit.wikimedia.org/r/#/c/226898/ | 2268... [13:50:19] (03PS2) 10Alexandros Kosiaris: Log for Apertium [puppet] - 10https://gerrit.wikimedia.org/r/230992 (https://phabricator.wikimedia.org/T108797) (owner: 10KartikMistry) [13:50:47] (03CR) 10BBlack: [C: 032] Add missing codfw public icinga monitors [puppet] - 10https://gerrit.wikimedia.org/r/235225 (owner: 10BBlack) [13:51:48] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1592846 (10Ottomata) When you say 'queue' and 'dict' do you mean the same thing? I had thought you were suggesting just using a regular old dict shared between the child (varnishlog... [13:52:01] ¡Hola jynus! How's it going? 
Just wondering if you might have a few minutes today to review a schema change that we're hoping to deploy soon... Thanks in advance! https://gerrit.wikimedia.org/r/#/c/222353 https://phabricator.wikimedia.org/T110963 [13:54:14] AndyRussG: did you see the email reply he gave via ops? [13:54:20] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1592852 (10zeljkofilipin) 5Open>3stalled [13:54:42] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1592857 (10faidon) Honestly, I'd prefer option (2). Reversing the Package->File relationships shouldn't be very hard (we probably need an additional File['/etc/ferm'] resource to make sure the directory exists) and is a one... [13:54:55] JohnFLewis: ah no, thanks! I recently got subscribed there, I think I can now check the archives :) [13:55:32] AndyRussG: yep https://lists.wikimedia.org/mailman/private/ops/2015-September/049876.html -- just auth and it's there :) [13:59:00] JohnFLewis: ahhh K... thanks much for pointing that out... mm ooops [14:00:04] jynus: ooops! I just saw your reply on the ops mailing list, sorry about that ;p [14:00:19] PROBLEM - mysqld processes on es1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:00:46] * mafk eyes JohnFLewis - says 'hi' [14:00:53] AndyRussG, I think I have a meeting now, will talk to you in 1 hour [14:00:54] mafk: hi :) [14:01:06] jynus: looks like he had the same idea :) [14:01:23] AndyRussG, will you be available in 1 hour? 
[14:01:34] hi jynus - and 'bye - /me gone [14:02:25] PROBLEM - mysqld processes on es1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:02:49] ignore icinga, those are not yet in production [14:03:43] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1592889 (10BBlack) 3NEW [14:04:01] jynus: thanks!!! :) exactly in 1 hour is a bit tough--I have an appointment that's not easy to change, but it's nearby. Maybe 2 hours from now? (16:00 UTC) [14:04:05] PROBLEM - Host upload-lb.codfw.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:27] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1592901 (10TFlanagan-WMF) I would like to join #Project-Creators to be able to create and manage projects for the [[ https://outreach.wikimedi... [14:04:34] I might be able to move it, though let me see [14:04:41] AndyRussG, I work 24/7, so not a big deal [14:05:02] AndyRussG, just ping me when you are available again [14:05:19] 24/7? Slacker. [14:05:27] jynus: K will do, thanks so much!!! [14:05:45] yeah working from home has benefits and drawbacks... 
[14:06:09] AndyRussG, join #wikimedia-databases for less noise when pinging me [14:06:23] that is 100% dedicated for databases [14:06:29] OK got it :) [14:06:40] and someone else will help you if I cannot at some point [14:06:46] but I will do [14:06:59] * ostriches adds a new channel to idle in [14:07:39] (03CR) 10Ottomata: [C: 031] Enable ferm on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/235001 (owner: 10Muehlenhoff) [14:08:03] (03PS2) 10Muehlenhoff: Enable ferm on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/235001 [14:08:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/235001 (owner: 10Muehlenhoff) [14:15:51] (03PS1) 10BBlack: fix text-lb.codfw config-geo typo [dns] - 10https://gerrit.wikimedia.org/r/235230 [14:16:05] (03CR) 10BBlack: [C: 032 V: 032] fix text-lb.codfw config-geo typo [dns] - 10https://gerrit.wikimedia.org/r/235230 (owner: 10BBlack) [14:20:24] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1592958 (10Gilles) 3NEW a:3Gilles [14:21:17] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1592972 (10Krenair) I was waiting for @VBaranetsky to comment [14:28:00] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1593008 (10Gilles) I've just realized that this will require having a tracking mechanism to purge articles when thumbnails they contain get purged themselves. Otherwise article caching wo... [14:32:27] 6operations: Error: 2013 Lost connection to MySQL server during query (10.64.16.8) - https://phabricator.wikimedia.org/T111052#1593034 (10Steinsplitter) 3NEW [14:35:41] 6operations, 10Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1593052 (10Dzahn) confirmed by putnik as well. 
< putnik> It not used anymore as I know. [14:37:35] !log reset elasticsearch cluster.routing.allocation.disk.high back to 90% [14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:27] RECOVERY - mysqld processes on es1010 is OK: PROCS OK: 1 process with command name mysqld [14:40:06] (03CR) 10Jdlrobson: [C: 031] Enable WikidataPageBanner extension on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234942 (https://phabricator.wikimedia.org/T110837) (owner: 10Glaisher) [14:41:02] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1593063 (10Milimetric) 3NEW [14:41:18] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1593072 (10Milimetric) [14:42:27] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1593073 (10Gilles) For now GlobalUsage might be sufficient: https://en.wikipedia.org/w/api.php?action=query&prop=globalusage&titles=File:Louis_Armstrong_restored.jpg Combined with https:... [14:43:31] 6operations, 10Wikimedia-Mailing-lists: wikimedia-ke: close list - https://phabricator.wikimedia.org/T110975#1593075 (10Dzahn) The last messages in archives were September 2014 (spam) and June 2014 (Manuel of wikimedia.ch asking what to do with mailboxes for wikimedia.or.ke on their servers since the domain na... [14:44:40] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593078 (10Dzahn) [14:44:41] 6operations, 10Wikimedia-Mailing-lists: wikimedia-ke: close list - https://phabricator.wikimedia.org/T110975#1593076 (10Dzahn) 5Open>3Resolved "wikimediake disabled. Archives should be available at current location, all mail should be moderated and the list should not be on the listinfo page." 
[14:44:48] 6operations, 10Wikimedia-Mailing-lists: wikimediake: close list - https://phabricator.wikimedia.org/T110975#1593080 (10Dzahn) [14:45:27] ACKNOWLEDGEMENT - Host upload-lb.codfw.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black router issues, no prod traffic yet, investigating [14:46:19] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1593086 (10JohnLewis) 3NEW a:3Dzahn [14:49:31] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1593096 (10akosiaris) I thought about the ferm package updates as well. It's mostly a one per Debian release thing though. So once every 2 years. That's why I preferred it. If we can do it nicely in the way of 2 I am fine w... [14:52:38] RECOVERY - Host upload-lb.codfw.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 52.63 ms [14:52:40] 6operations, 10Wikimedia-Mailing-lists: wikifi-l: close the list - https://phabricator.wikimedia.org/T111055#1593107 (10JohnLewis) 3NEW a:3Dzahn [14:53:38] !log labstore1002: mdadm --stop /dev/md3 [14:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:45] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1593118 (10Dzahn) Could somebody mail the list itself or ask on wiki? 
[14:54:53] (03PS2) 10BBlack: Switch US/TX to codfw [dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [14:56:45] (03PS1) 10Muehlenhoff: Enable ferm on analytics1028 [puppet] - 10https://gerrit.wikimedia.org/r/235235 [14:57:54] (03CR) 10Ottomata: [C: 031] Enable ferm on analytics1028 [puppet] - 10https://gerrit.wikimedia.org/r/235235 (owner: 10Muehlenhoff) [14:58:44] !log labstore1002: mdadm --zero-superblock /dev/sday1 [14:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:58] !log labstore1002: mdadm /dev/md/slice15 --re-add /dev/sday [14:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:35] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1593133 (10Dzahn) @JohnLewis please add the info from the mail you sent, so we can use it to refer to when other lists have similar questions [14:59:37] jouncebot: next [14:59:37] In 0 hour(s) and 0 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150901T1500) [14:59:54] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1593134 (10Dzahn) a:3JohnLewis [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150901T1500). Please do the needful. [15:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:18] Hey [15:00:45] James_F, want to remove your -1? [15:00:52] Oh, point. :-) [15:01:07] is the VE increase all that's happening in this window? [15:01:08] (03CR) 10Jforrester: [C: 031] "Go." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/231465 (https://phabricator.wikimedia.org/T90664) (owner: 10Jforrester) [15:01:18] bblack: Looks like. [15:01:24] ok! [15:01:38] (03PS2) 10Reedy: Enable VisualEditor for all new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231465 (https://phabricator.wikimedia.org/T90664) (owner: 10Jforrester) [15:01:42] zomg it rebased :P [15:01:48] Reedy: Shock, etc. [15:01:51] (03CR) 10Yuvipanda: "Will this actually switch the in-use encoder across the cluster? If so I don't think we should do this in PuppetSWAT" [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [15:01:53] (03CR) 10BBlack: [C: 032] Switch US/TX to codfw [dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [15:01:56] \o/ [15:02:06] (03CR) 10Alex Monk: [C: 032] Enable VisualEditor for all new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231465 (https://phabricator.wikimedia.org/T90664) (owner: 10Jforrester) [15:02:12] (03Merged) 10jenkins-bot: Enable VisualEditor for all new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231465 (https://phabricator.wikimedia.org/T90664) (owner: 10Jforrester) [15:02:19] (03CR) 10Yuvipanda: "or is the current imagescalers all precise still?" [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [15:02:30] YuviPanda: They are. [15:02:37] all of them? [15:03:07] RECOVERY - mysqld processes on es1017 is OK: PROCS OK: 1 process with command name mysqld [15:03:23] I think so? [15:03:23] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/231465/ - VE for all new enwiki accounts (duration: 00m 13s) [15:03:25] (03PS2) 10Muehlenhoff: Enable ferm on analytics1028 [puppet] - 10https://gerrit.wikimedia.org/r/235235 [15:03:27] (03CR) 10Yuvipanda: " They are." 
[puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [15:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:29] James_F, ^ [15:03:31] James_F: thank you. [15:03:33] Whee. [15:03:36] Krenair: Thank you. [15:03:39] np [15:03:43] YuviPanda: No problem. [15:03:43] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on analytics1028 [puppet] - 10https://gerrit.wikimedia.org/r/235235 (owner: 10Muehlenhoff) [15:04:11] Actually, James_F, YuviPanda: [15:04:12] # mw1153-1160 are imagescalers (trusty) [15:04:21] Oh. Huh. [15:04:24] and all of the codfw imagescalers are trusty, of course [15:04:28] !log labstore1002: mdadm --zero-superblock /dev/sdax1 && mdadm /dev/md/slice15 --re-add /dev/sdax [15:04:29] Is it just the videoscalers? [15:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:35] !log enabled ferm in analytic1028 (initial hadoop worker) [15:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:39] James_F, I think it's tin, terbium, silver, snapshot100[1-4], and tmh100[12] [15:05:53] tmh machines are eqiad videoscalers [15:06:01] yeah, and that change affects only videoscalers [15:07:36] !log labstore1002: mdadm --zero-superblock /dev/sd{aw,bh,bg,bf,be,bd,bc,bb,ba,az}1 [15:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:46] 6operations, 10Wikimedia-Mailing-lists: wikiia-l: close list - https://phabricator.wikimedia.org/T111057#1593161 (10JohnLewis) 3NEW a:3Dzahn [15:12:31] 6operations, 10Wikimedia-Mailing-lists: wikilb-l: close list - https://phabricator.wikimedia.org/T111059#1593187 (10JohnLewis) 3NEW a:3Dzahn [15:13:12] So now codfw caches are serving user traffic in Texas, what's the plan for starting to use the mw/db and related servers there? 
search isn't expected to be ready to serve out of Dallas til EOM, fwiw [15:13:47] 6operations, 10Wikimedia-Mailing-lists: wikiia-l: close list - https://phabricator.wikimedia.org/T111057#1593196 (10Dzahn) Actually i don't see this list in ./list_lists or in ./archives/private. Already deleted but held messages were left?? [15:14:43] !log labstore1002: mdadm /dev/md/slice15 --re-add /dev/sdaw [15:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:47] 6operations, 10Wikimedia-Mailing-lists: exyu-tech: Delete list - https://phabricator.wikimedia.org/T111060#1593198 (10JohnLewis) 3NEW a:3Dzahn [15:15:01] (search could probably be up a little faster...but wanted to use the time between now and then for testing on a cluster that can fall over) [15:15:05] * aude missed swat but would like to deploy something soonish (once i get it ready) [15:15:36] 6operations, 10Wikimedia-Mailing-lists: wikiia-l: close list - https://phabricator.wikimedia.org/T111057#1593207 (10Dzahn) Nevermind, it is in archives/private. Last messages in 2009. Then a January 2004 and that is all. Running the disable_list.sh "wikiia-l disabled. Archives should be available at current... [15:15:42] 6operations, 10Wikimedia-Mailing-lists: wikiia-l: close list - https://phabricator.wikimedia.org/T111057#1593208 (10Dzahn) 5Open>3Resolved [15:15:44] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593209 (10Dzahn) [15:15:56] 6operations, 10Wikimedia-Mailing-lists: wikiia-l: close list - https://phabricator.wikimedia.org/T111057#1593213 (10JohnLewis) https://lists.wikimedia.org/mailman/listinfo/wikiia-l and it is listed on the main listinfo page. Archives exist https://lists.wikimedia.org/pipermail/wikiia-l/2009-April/000007.html. 
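[editor's note] The labstore1002 !log entries above follow a standard md recovery pattern: wipe the stale superblock on the old member partition, then re-add the whole disk so the array resyncs it. A minimal sketch of that sequence, wrapped in a hypothetical helper (the `readd_disk` function and DRY_RUN flag are illustrative, not from the log); with DRY_RUN=1 it only prints the commands:

```shell
#!/bin/sh
# Sketch of the disk re-add sequence logged above. Assumes the member
# partition is "${disk}1", as in the /dev/sday / /dev/sday1 log entries.
readd_disk() {
    array="$1"
    disk="$2"
    run() {
        # With DRY_RUN=1 just echo the command instead of executing mdadm.
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$@"
        else
            "$@"
        fi
    }
    run mdadm --zero-superblock "${disk}1"
    run mdadm "$array" --re-add "$disk"
}

# Show what would be run for one of the disks from the log:
DRY_RUN=1 readd_disk /dev/md/slice15 /dev/sday
```

This matches the pairs of commands in the log (zero-superblock on the partition, re-add of the whole device) without actually touching any array.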
[15:15:57] (03PS1) 10Muehlenhoff: Add rule for journalnode on hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/235238 [15:16:23] !log labstore1002: mdadm /dev/md/slice15 --re-add /dev/sd{bb,ba,az} [15:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:23] (03CR) 10Ottomata: [C: 031] Add rule for journalnode on hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/235238 (owner: 10Muehlenhoff) [15:17:52] (03PS2) 10Muehlenhoff: Add rule for journalnode on hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/235238 [15:17:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add rule for journalnode on hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/235238 (owner: 10Muehlenhoff) [15:18:03] 6operations, 10Wikimedia-Mailing-lists: wikifi-l: close the list - https://phabricator.wikimedia.org/T111055#1593221 (10Dzahn) No messages in archives within this decade :) ack -- running disable_list.sh "wikifi-l disabled. Archives should be available at current location, all mail should be moderated and... [15:18:11] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593223 (10Dzahn) [15:18:12] 6operations, 10Wikimedia-Mailing-lists: wikifi-l: close the list - https://phabricator.wikimedia.org/T111055#1593222 (10Dzahn) 5Open>3Resolved [15:18:49] mutante: JohnFLewis should also close the mediawiki-india list [15:18:53] 6operations, 5Continuous-Integration-Scaling, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1593228 (10chasemp) @Andrew, @hashar -- I would think we treat nodepool as we do other openstack services? Why would it be treated differently from a db perspective? [15:19:05] YuviPanda: is there a ticket? 
:) [15:19:09] JohnFLewis: yes [15:19:30] ah https://phabricator.wikimedia.org/T110428 [15:19:32] it was done [15:19:33] I missed that [15:19:34] sorry [15:19:35] thanks you :) [15:19:44] :) [15:20:38] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593238 (10Dzahn) [15:20:39] 6operations, 10Wikimedia-Mailing-lists: exyu-tech: Delete list - https://phabricator.wikimedia.org/T111060#1593236 (10Dzahn) 5Open>3Resolved "exyu-tech disabled. Archives should be available at current location, all mail should be moderated and the list should not be on the listinfo page. " [15:21:07] (03CR) 10Rush: "btw, thanks jan seems to have worked out" [puppet] - 10https://gerrit.wikimedia.org/r/235048 (https://phabricator.wikimedia.org/T110635) (owner: 10JanZerebecki) [15:21:17] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [15:21:50] mark: ^ is that you? [15:22:01] yes [15:22:17] unfortunately necessary [15:23:17] 6operations, 10Wikimedia-Mailing-lists: wikilb-l: close list - https://phabricator.wikimedia.org/T111059#1593251 (10Dzahn) last messages in archive in 2012-Feb, before that just 2004-Oct. "wikilb-l disabled. Archives should be available at current location, all mail should be moderated and the list should not... [15:23:23] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593260 (10Dzahn) [15:23:25] 6operations, 10Wikimedia-Mailing-lists: wikilb-l: close list - https://phabricator.wikimedia.org/T111059#1593259 (10Dzahn) 5Open>3Resolved [15:24:05] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1593262 (10Dzahn) Could somebody mail the list itself and ask for volunteers? 
[15:24:19] 6operations, 10Wikimedia-Mailing-lists: disable mailman list - https://phabricator.wikimedia.org/T111063#1593264 (10JohnLewis) 3NEW a:3Dzahn [15:24:39] 6operations, 10Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1593271 (10Dzahn) Could somebody mail the list itself and ask them? [15:24:59] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1593272 (10Dzahn) Maybe @yurik is interested in this one.? [15:25:17] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [15:25:57] 6operations, 10Wikimedia-Mailing-lists: wikinews-l: no active listadmin - https://phabricator.wikimedia.org/T110956#1593281 (10Dzahn) Any other takers? Thanks @Revi for offering, i think we should go ahead then and give it to you. [15:27:24] 6operations, 10Wikimedia-Mailing-lists: wikiskan-l: disable list - https://phabricator.wikimedia.org/T111065#1593292 (10JohnLewis) 3NEW a:3Dzahn [15:28:33] 6operations, 10Wikimedia-Mailing-lists: wikiskan-l: disable list - https://phabricator.wikimedia.org/T111065#1593302 (10Dzahn) last message in archive from 2012-Jan, before just 2005/2006 and nothing in between. [15:29:02] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593315 (10Dzahn) [15:29:04] 6operations, 10Wikimedia-Mailing-lists: wikiskan-l: disable list - https://phabricator.wikimedia.org/T111065#1593313 (10Dzahn) 5Open>3Resolved "wikiskan-l disabled. Archives should be available at current location, all mail should be moderated and the list should not be on the listinfo page." 
[15:30:55] 6operations, 10Wikimedia-Mailing-lists: wikiru-a: disable list - https://phabricator.wikimedia.org/T111067#1593322 (10JohnLewis) 3NEW a:3Dzahn [15:31:22] 6operations, 10Wikimedia-Mailing-lists: disable mailman list - https://phabricator.wikimedia.org/T111063#1593329 (10Dzahn) deleted all held messages on list "mailman". let's talk about the disabling procedure though for this one. are we sure it's not needed as it's the meta sitelist? [15:33:12] (03PS1) 10Muehlenhoff: Disable ferm on analytics1028, mapreduce.v2.mrappmaster needs to be researched further [puppet] - 10https://gerrit.wikimedia.org/r/235246 [15:33:27] (03PS2) 10Muehlenhoff: Disable ferm on analytics1028, mapreduce.v2.mrappmaster needs to be researched further [puppet] - 10https://gerrit.wikimedia.org/r/235246 [15:33:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Disable ferm on analytics1028, mapreduce.v2.mrappmaster needs to be researched further [puppet] - 10https://gerrit.wikimedia.org/r/235246 (owner: 10Muehlenhoff) [15:36:48] 6operations, 10Wikimedia-Mailing-lists: disable mailman list - https://phabricator.wikimedia.org/T111063#1593339 (10JohnLewis) It's used as the origination email for things like bounces, password reminders and so on. It exists mostly as a 'method users can use to communicate with site administrators'. Since @m... [15:36:56] !log disabled ferm in analytic1028, needs some more work on possibly dynamic mapreduce ports [15:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:20] !log labstore1002: mdadm /dev/md/slice51 --add /dev/sd{bh,bg,bf,be,bd,bc} [15:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:22] 6operations, 5Continuous-Integration-Scaling, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1593350 (10Andrew) I think everyone is right. 
It's a separate database, strictly speaking, but it should be hosted on the same server (m5-master) as the other labs db services, wi... [15:41:13] (03PS8) 10Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) [15:43:50] (03PS1) 10BBlack: cache_mobile: limit 4xx to 1m like text [puppet] - 10https://gerrit.wikimedia.org/r/235251 (https://phabricator.wikimedia.org/T109286) [15:43:52] (03PS1) 10BBlack: cache_mobile: raise max conns to 1k like text [puppet] - 10https://gerrit.wikimedia.org/r/235252 (https://phabricator.wikimedia.org/T109286) [15:44:56] !log labstore1002: update-initramfs -k all -u [15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:02] 6operations, 10Wikimedia-Mailing-lists: wikiru-a: disable list - https://phabricator.wikimedia.org/T111067#1593371 (10Dzahn) last message in archive 2013-May with almost no content, nothing at all in 2012 or 2009. [15:45:35] (03CR) 10Filippo Giunchedi: "puppet compiler for last PS shows no unexpected changes, https://puppet-compiler.wmflabs.org/869/" [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [15:45:54] 6operations, 10Wikimedia-Mailing-lists: wikiru-a: disable list - https://phabricator.wikimedia.org/T111067#1593374 (10Dzahn) 5Open>3Resolved "wikiru-a disabled. Archives should be available at current location, all mail should be moderated and the list should not be on the listinfo page." [15:45:56] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1593376 (10Dzahn) [15:47:08] 6operations, 5Continuous-Integration-Scaling, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1593380 (10jcrespo) I'm on it, assigning space on the m5 shard. 
BTW, the `FLUSH PRIVILEGES;` of the Openstack documentation is a bug: http://dbahire.com/stop-using-flush-privileges/ [15:47:29] !log labstore1002: echo 10000 > /sys/block/md123/md/sync_speed_min [15:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:51] !log reedy@tin Synchronized php-1.26wmf20/extensions/SecurePoll/: Stop cronspam (duration: 00m 13s) [15:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:36] 6operations, 10hardware-requests, 7Database, 5Patch-For-Review: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1593388 (10jcrespo) Thanks again, I've already seen the entries on racktables! Was waiting for that to fully own it. [15:50:31] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1593392 (10BBlack) Remaining interesting diffs, aside from the basics (differing "special" cookies, zero/X-CS/X-F-B, mobile-redirect vs X-Subdomain, and Vary-variances): mobile... 
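[editor's note] The point jynus makes above (that the `FLUSH PRIVILEGES;` step in the OpenStack docs is unneeded) is worth spelling out; the account name and host below are hypothetical, for illustration only:

```sql
-- Account-management statements (CREATE USER, GRANT, SET PASSWORD, ...)
-- update the in-memory grant tables directly, so no flush is needed:
CREATE USER 'nodepool'@'10.0.0.%' IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON nodepool.* TO 'nodepool'@'10.0.0.%';

-- FLUSH PRIVILEGES is only required after editing the grant tables by
-- hand, e.g. with a direct UPDATE against mysql.user:
-- UPDATE mysql.user SET ... WHERE User = 'nodepool';
-- FLUSH PRIVILEGES;
```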
[15:51:34] !log stopped replicate-tools on labstore1002, and cleaned out lockdir [15:51:38] mark: ^ [15:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:44] thanks [15:51:49] RECOVERY - Disk space on labstore1002 is OK: DISK OK [15:53:30] (03PS1) 10Rush: elasticsearch: ferm for one and all [puppet] - 10https://gerrit.wikimedia.org/r/235255 [15:54:00] (03CR) 10Alex Monk: [C: 032] Raise account creation throttle for events in Santiago, 2015-09-04 and 2015-09-05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235168 (https://phabricator.wikimedia.org/T110979) (owner: 10Alex Monk) [15:55:20] (03Merged) 10jenkins-bot: Raise account creation throttle for events in Santiago, 2015-09-04 and 2015-09-05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235168 (https://phabricator.wikimedia.org/T110979) (owner: 10Alex Monk) [15:55:30] (03PS1) 10Krinkle: asset-check: Wait for async modules to finish [puppet] - 10https://gerrit.wikimedia.org/r/235256 [15:57:00] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/235168/ (duration: 00m 13s) [15:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:12] (03CR) 10Krinkle: "Tested locally by invoking phantomjs directly (similar to what asset-check.py would do) for one of the urls. 
Seems to resolve the variance" [puppet] - 10https://gerrit.wikimedia.org/r/235256 (owner: 10Krinkle) [15:57:51] (03CR) 10Filippo Giunchedi: [C: 04-1] Add a custom rsync ferm rule for swift storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/235221 (https://phabricator.wikimedia.org/T108987) (owner: 10Muehlenhoff) [15:59:17] (03PS1) 10Ottomata: Puppetize yarn.app.mapreduce.am.job.client.port-range [puppet/cdh] - 10https://gerrit.wikimedia.org/r/235257 [15:59:22] (03CR) 10jenkins-bot: [V: 04-1] Puppetize yarn.app.mapreduce.am.job.client.port-range [puppet/cdh] - 10https://gerrit.wikimedia.org/r/235257 (owner: 10Ottomata) [15:59:30] 6operations, 5Continuous-Integration-Scaling, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1593417 (10Andrew) For people following along at home... the archive in question was /srv/others/orphan-volume/visualeditor.tgz [15:59:31] (03CR) 10Rush: [C: 032] elasticsearch: ferm for one and all [puppet] - 10https://gerrit.wikimedia.org/r/235255 (owner: 10Rush) [15:59:40] 6operations, 5Patch-For-Review: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1593420 (10chasemp) [15:59:59] !log ferm for elastic1003/2/1(master) [16:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:04] YuviPanda robh: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150901T1600). [16:00:05] hashar hashar bd808 brion Krinkle: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
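[editor's note] For context on the "elasticsearch: ferm for one and all" rollout logged above, a ferm rule of this general shape restricts the elasticsearch ports to known peers; the addresses and the exact rule below are illustrative, not the actual change:

```
# Illustrative ferm fragment: allow elasticsearch HTTP (9200) and
# transport (9300) traffic only from cluster peers.
@def $ELASTIC_NODES = (10.64.48.10 10.64.48.11 10.64.48.12);
chain INPUT {
    proto tcp dport (9200 9300) saddr $ELASTIC_NODES ACCEPT;
}
```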
[16:00:11] (03PS2) 10Ottomata: Puppetize yarn.app.mapreduce.am.job.client.port-range [puppet/cdh] - 10https://gerrit.wikimedia.org/r/235257 [16:00:11] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [16:00:15] hello jouncebot [16:00:23] (03CR) 10Ottomata: [C: 032 V: 032] Puppetize yarn.app.mapreduce.am.job.client.port-range [puppet/cdh] - 10https://gerrit.wikimedia.org/r/235257 (owner: 10Ottomata) [16:00:32] hashar: around? [16:01:06] bd808: brion Krinkle around? [16:01:10] nod [16:01:19] Krenair: alright, let me do yours first [16:01:47] You mean Krinkle? [16:02:04] (03PS3) 10Yuvipanda: apache: Remove unused 'title' parameters from extract2.php urls [puppet] - 10https://gerrit.wikimedia.org/r/234174 (owner: 10Krinkle) [16:02:05] 6operations, 5Continuous-Integration-Scaling, 7Database: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1593432 (10hashar) >>! In T110693#1593380, @jcrespo wrote: > BTW, the `FLUSH PRIVILEGES;` of the Openstack documentation is a bug: http://dbahire.com/stop-using-flush-privileges/... [16:02:10] isn't this above the 8 patch limit? [16:02:26] There's a limit? [16:02:29] :P [16:02:32] max 8 patches, yeah [16:02:32] hmm [16:02:33] it is [16:02:43] Antoine cheated :P [16:03:05] (03CR) 10Yuvipanda: [C: 032] apache: Remove unused 'title' parameters from extract2.php urls [puppet] - 10https://gerrit.wikimedia.org/r/234174 (owner: 10Krinkle) [16:03:14] Krenair: gotta wait 20mins for it to change everywhere... [16:03:15] bag [16:03:17] bah [16:03:18] Krinkle: [16:03:56] Krenair: not sure how to deal with the limit... [16:04:02] (03PS2) 10Filippo Giunchedi: xenon additional instances [dns] - 10https://gerrit.wikimedia.org/r/234286 (https://phabricator.wikimedia.org/T95253) [16:04:08] also, none of the other two people who proposed patches are here [16:04:21] Krinkle: puppet-merged. 
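[editor's note] The port-range puppetization merged above addresses the problem noted earlier with analytics1028: the MapReduce ApplicationMaster picks dynamic client ports, which a static ferm ruleset cannot cover. Pinning the range makes it firewallable. In plain Hadoop terms the setting lives in mapred-site.xml; the value here is taken from the follow-up "Set Yarn AppMaster possible port range to 55000-55199" change:

```xml
<!-- mapred-site.xml: confine MapReduce AM client ports to a fixed
     range so a firewall rule can open just these ports -->
<property>
  <name>yarn.app.mapreduce.am.job.client.port-range</name>
  <value>55000-55199</value>
</property>
```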
* YuviPanda pokes hashar, bd808 and brion [16:04:51] wot [16:04:56] hi brion [16:05:07] brion: some of your patches are on puppetswat! https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150901T1600 [16:05:12] \o/ [16:05:14] YuviPanda: I'm not really here. no devices meeting starts in 1 minute :/ [16:05:27] bd808: if brion is here that's good enough for me [16:05:31] heh [16:05:32] coolio [16:05:41] brion: can you respond to my question on https://gerrit.wikimedia.org/r/#/c/234699/ [16:06:32] (03PS5) 10Yuvipanda: beta: Replace deployment-videoscaler01 with deployment-tmh01 [puppet] - 10https://gerrit.wikimedia.org/r/234599 (https://phabricator.wikimedia.org/T110707) (owner: 10BryanDavis) [16:06:52] (03CR) 10Brion VIBBER: "This'll affect anything on Trusty... if prod video & image scalers are all Precise then it shouldn't explode them yet. I thought we'd upda" [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [16:06:54] (03CR) 10Yuvipanda: [C: 032] "Note that you can also use per-host hiera from wikitech now. Page template is Hiera:$project/host/$hostname" [puppet] - 10https://gerrit.wikimedia.org/r/234599 (https://phabricator.wikimedia.org/T110707) (owner: 10BryanDavis) [16:07:43] are image scalers still precise? i thought those had been updated (but i'm only vaguely following image scalers) [16:07:43] brion: yes but imagescalers don't use ffmpeg no? [16:07:44] brion: yes, imagescalers are all trusty now [16:07:46] (but it looks like image scalers do the thumbnails, so they use ffmpeg/avconv for that) [16:08:03] oh well that sounds fun [16:08:12] brion: ah. then I'm going to decline this for puppetswat, needs more co-ordination I'm afraid. [16:08:30] yeah we need to swap in the new trusty videoscalers [16:08:42] and update the config to use ffmpeg instead of avconv [16:08:48] then it should be safe to do em all at once? 
[16:09:03] so i believe joe set up *one* new videoscaler on trusty [16:09:04] (03CR) 10Yuvipanda: "Ah, so they are used in the imagescalers as well for thumbnailing videos." [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [16:09:22] brion: yeah, probably. but I want to keep PuppetSWAT to simple patches and I don't think this qualifies... [16:09:24] and we haven't put it in rotation yet (wanted to get beta sorted out first) [16:09:25] yeah :D [16:09:27] ok! [16:09:42] brion: :) ok! [16:09:52] hashar: around? [16:10:07] o/ [16:10:11] YuviPanda: yeah pretty much [16:10:12] hashar: hello! [16:10:38] hashar: doing your patches now [16:10:39] (03PS2) 10Yuvipanda: admin: remove dupe 'haithams' from statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/234669 (owner: 10Hashar) [16:10:39] (03CR) 10Yuvipanda: [C: 032 V: 032] admin: remove dupe 'haithams' from statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/234669 (owner: 10Hashar) [16:10:41] should be straight forward [16:10:54] robh -1 ed one but I removed it from PuppetSWAT iirc [16:10:54] (03PS2) 10Yuvipanda: admin: add 'demon' to gerrit-admins group [puppet] - 10https://gerrit.wikimedia.org/r/234670 (owner: 10Hashar) [16:11:05] (03CR) 10Yuvipanda: [C: 032 V: 032] admin: add 'demon' to gerrit-admins group [puppet] - 10https://gerrit.wikimedia.org/r/234670 (owner: 10Hashar) [16:11:27] hashar: akosiaris -1'd https://gerrit.wikimedia.org/r/#/c/226729/ as well [16:11:52] (03PS2) 10Yuvipanda: contint: drop /data/project/debianrepo [puppet] - 10https://gerrit.wikimedia.org/r/234239 (owner: 10Hashar) [16:11:59] (03CR) 10Yuvipanda: [C: 032 V: 032] contint: drop /data/project/debianrepo [puppet] - 10https://gerrit.wikimedia.org/r/234239 (owner: 10Hashar) [16:12:17] !log policy.wikimedia.org dns change happening now [16:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:28] (03PS4) 10RobH: 
policy.wikimedia.org dns record change [dns] - 10https://gerrit.wikimedia.org/r/234296 [16:12:38] YuviPanda: skip the -1ed patch. Will amend later [16:12:49] hashar: so I've merged all the patches that don't have a -1 [16:12:52] hashar: https://gerrit.wikimedia.org/r/#/c/233413/ also has one [16:12:56] YuviPanda: awesome thank you very much [16:13:02] hashar: and https://gerrit.wikimedia.org/r/#/c/226730/ depends on a -1'd patch [16:13:11] YuviPanda: I will rework the other patches and add them to next PuppetSWAT once done :-} [16:13:19] hashar: :) ok! [16:13:22] yup they are related [16:13:24] thanks a ton ! [16:13:28] I think that's everyone! [16:13:31] {done} [16:13:36] * YuviPanda declares puppet swat {{done}} [16:13:45] {{{done}}} [16:14:16] * YuviPanda surrounds ori in {s [16:14:21] combo breaker! [16:14:26] (03CR) 10RobH: [C: 032] policy.wikimedia.org dns record change [dns] - 10https://gerrit.wikimedia.org/r/234296 (owner: 10RobH) [16:14:40] (03PS1) 10Ottomata: Set Yarn AppMaster possible port range to 55000-55199 [puppet] - 10https://gerrit.wikimedia.org/r/235261 [16:17:39] (03PS3) 10Ori.livneh: contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:17:54] YuviPanda: i fixed hashar's patch, if you're still up for swatting it [16:19:00] ori: sure! 
that lets me swat both of 'em [16:19:07] (03PS4) 10Yuvipanda: contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:19:14] (03CR) 10Yuvipanda: [C: 032 V: 032] contint: install pip 7.1.0 from pypi [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:19:24] (03PS3) 10Yuvipanda: contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [16:19:34] (03CR) 10Yuvipanda: [C: 032 V: 032] contint: drop pip obsolete download-cache option [puppet] - 10https://gerrit.wikimedia.org/r/226730 (owner: 10Hashar) [16:19:37] hashar: ^ [16:19:40] ori: <3 thank you! [16:20:33] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [16:20:41] ori: thanks ! [16:21:48] (03CR) 10Hashar: "Thank you akosiaris / ori / yuvi" [puppet] - 10https://gerrit.wikimedia.org/r/226729 (owner: 10Hashar) [16:22:37] (investigating the fail) [16:24:19] (Just a transient failure) [16:24:25] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:25:59] YuviPanda, any chance of getting https://gerrit.wikimedia.org/r/#/c/234404/ in? [16:26:23] Krenair: thursday :) [16:26:28] hmm [16:27:00] Krenair: it's also slightly more involved, and I don't fully understand the implications of that [16:27:07] otherwise I would've probably been ok doing it now [16:27:12] (I don't fully know our apache infrastructure) [16:27:33] Krenair: I would suggest getting a +1 from someone who knows the apache infrastructure (off the top of my head, or.i and mutant.e, _joe._ is on vacation) [16:27:50] ok [16:27:55] could anybody look at https://gerrit.wikimedia.org/r/#/c/234673/?
It's a really small config change, shouldn't influence anything [16:28:26] (03PS2) 10Ori.livneh: Set batch size to default for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/234673 (owner: 10Smalyshev) [16:28:33] (03CR) 10Ori.livneh: [C: 032 V: 032] Set batch size to default for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/234673 (owner: 10Smalyshev) [16:28:54] ori: thank you! [16:31:07] SMalyshev: I see that patch's on the thursday puppetswat, can you remove it? thanks :) [16:31:19] YuviPanda: sure, will do [16:31:29] YuviPanda: awesome work [16:31:35] managing the puppet swats [16:32:10] YuviPanda: indeed, thank you for handling that [16:32:18] i wanted to but i had a meeting conflict i forgot about =P [16:32:39] I'll make sure to be point on Thursday's [16:32:39] ori: :D [16:32:45] robh: ok [16:32:56] I just emailed the ops@ list asking for volunteers from next week [16:33:07] I'll be back in PDT and don't know if I'll be up for SWAT that early (9AM!) [16:34:20] heh, im on vacation next week [16:34:23] so not it ;] [16:35:25] robh: heh :( [16:35:27] err [16:35:27] :) [16:35:31] I can always move it to the evening [16:35:35] or even move it an hour down [16:37:11] yea just move it around wheneves [16:37:21] its swatting, it should take place in the working hours of whoever is running it really [16:37:38] as long as it's announced it should be cool (just announce window the week before for the upcoming week on deployments page) [16:37:57] imo.
[16:38:44] (03CR) 10Muehlenhoff: [C: 031] Set Yarn AppMaster possible port range to 55000-55199 [puppet] - 10https://gerrit.wikimedia.org/r/235261 (owner: 10Ottomata) [16:39:18] robh: true, but I think the current time fits a wide variety of people (Europeans + East Coast + SF) and I'd like to keep it there [16:42:24] i think its too late for EU folks really [16:42:24] but thats just me [16:42:31] Asking them to do things in their evenings is kind of shitty [16:42:48] if something breaks, then they are stuck around laptop for dinner. [16:42:55] swat should be low chance of breaking of course [16:43:06] indeed [16:43:10] * aude waves [16:43:18] i agree that early in the SF day is better for all involved except possibly SF (though im awake at 7am so whatevs ;) [16:43:37] yeah, I don't think I should be allowed on a terminal at 7AM unless I've not slept the night before [16:43:50] i would like to deploy a somewhat important bug fix for wikidata and not wait until swat at 1am [16:43:59] is anyone else deploying or would that be ok? [16:44:30] heh, aude came up in my audit of offboarding stuff yesterday. 'who is this' 'its katie, shes wmde, its cool' 'do we have any phab or rt ticket for the access?' 'no, shes been around longer than 99.99% of the folks you know' [16:44:30] aude: afaict nobody else is deploying, we're in the PuppetSWAT window and I'm not doing anything [16:44:30] YuviPanda: when are you coming to visit our office? :) [16:44:39] aude: tomorrow! [16:44:44] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [16:44:56] robh: i think there was a rt ticket [16:45:06] YuviPanda: \o/ [16:45:15] !log performing schema change on testwiki and metawiki [16:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:44] * aude waits for jenkins then.... [16:45:57] aude: heh, well, i didnt find it but im still tempted to comment the admin file with 'its katie, shes cool.'
[16:46:04] and leave it as the only reasoning. [16:46:05] robh: :) [16:47:14] (we're actually going to end up just notating the file of the WMDE folks and pointing it at a newer phabricator ticket where i go on record listing them) [16:47:21] sounds good [16:49:24] robh: https://github.com/wikimedia/operations-puppet/commit/6bb785ed382837c5c1fd49363bed0481f24f38b0 :) [16:49:51] dear lazyirc, thanks! JohnFLewis =] [16:50:02] even better, i am amused [16:50:45] hrmm [16:50:58] shouldnt there be a phab task for rt6460? [16:51:13] There's a load of stuff missing from phabricator that was in RT [16:51:20] no, access requests and procurement were never migrated [16:51:26] urgh. [16:51:43] access requests need to be migrated. [16:51:43] (set to private, but migrated) [16:51:43] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1593581 (10faidon) >>! In T109286#1593392, @BBlack wrote: > mobile backend and frontend VCL has this in vcl_fetch, which text doesn't, which I don't fully understand yet: > ``` >... [16:51:45] My access request was migrated :/ [16:51:46] chasemp: Who do I chat with about that? [16:52:06] oh? [16:52:12] oh? I may be misinformed though but last I checked they weren't [16:52:15] Krenair: link? i wanna compare [16:52:21] maybe yours wasnt in the same project as others? [16:52:28] https://phabricator.wikimedia.org/T84818 [16:52:35] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [16:52:39] its not a big deal mind you, i'd just rather we kill rt eventually and this data migration will be a blocker [16:52:56] ahh [16:52:57] Might be because this was in ops-requests?
[16:52:59] Krenair: yours was in ops [16:53:01] it's been so long I can't remember if they were or not, but last time I talked to mark I thought the conclusion was [16:53:02] hers is in access- [16:53:03] yep [16:53:08] remaining things in rt could stay there as long as we can reference them [16:53:11] chasemp: seems indeed, they were not [16:53:16] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [60.0] [16:53:31] well, wouldnt it be easier to migrate access-requests into private tasks and kill the need to keep rt for referencing them? [16:53:39] afaik we don't plan on moving anything else from rt to phab [16:53:40] (maybe not, im asking not telling) [16:53:57] well, then we leave all the historical access requests out of phab [16:54:01] that seems less than ideal. [16:54:06] well, easier idk, how often do we reference them and will it trail off over a short period of time? [16:54:07] why would we care? [16:54:48] folks audit the file and if its non staff, its hard to determine by all third parties who they are [16:55:04] so putting in references for those folks in the file is nice, but if rt is eventually gone those references are null [16:55:08] compared to the trouble of migrating that stuff, I think we can manage [16:55:10] but if they were imported to phab, they are ok [16:55:15] i don't intend on removing RT anytime soon [16:55:15] RECOVERY - Persistent high iowait on labstore1002 is OK: OK: Less than 50.00% above the threshold [40.0] [16:55:18] just not moving it very much [16:55:21] s/moving/using/ [16:55:23] cool [16:55:30] it can just sit there [16:55:32] yea i imagine it'll soon be in a vm [16:55:33] and sit [16:55:37] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1593610 (10VBaranetsky) This looks great! Thank so much for the help. -Vickie [16:55:41] do ops still get onboarded in rt?
[16:55:45] nope [16:56:03] i disabled most of the queues as well so they dont get incoming crap [16:56:16] but we didnt go and disable any accounts really. (other when offboarding) [16:56:23] other than offboarding. [17:01:09] 7Puppet, 10Continuous-Integration-Config, 6Scrum-of-Scrums, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1593617 (10zeljkofilipin) [17:05:25] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [60.0] [17:07:37] 6operations, 10MediaWiki-extensions-CentralNotice, 7Database, 7Schema-change: Create CentralNotice campaign mixin tables - https://phabricator.wikimedia.org/T110963#1593622 (10jcrespo) Gerrit change has been deployed in an online fashion, with the following new structure for both metawiki and testwiki: ``... [17:09:24] RECOVERY - Persistent high iowait on labstore1002 is OK: OK: Less than 50.00% above the threshold [40.0] [17:11:07] 6operations, 10MediaWiki-extensions-CentralNotice, 7Database, 7Schema-change: Create CentralNotice campaign mixin tables - https://phabricator.wikimedia.org/T110963#1593624 (10Glaisher) [17:12:15] ok, ready to +2 my change [17:18:47] 6operations, 6Discovery, 10Maps, 10Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1593669 (10MaxSem) [17:23:02] !log aude@tin Synchronized php-1.26wmf20/extensions/Wikidata: Fix for change dispatcher (duration: 00m 20s) [17:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:32] done :) [17:28:52] !log freezing elasticsearch indices before applying ferm rules on master [17:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:58] !log bouncing Cassandra on restbase1001 to apply temporary GC setting [17:32:03] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:25] 6operations, 7Wikimedia-log-errors: Memcached TIMEOUT error spam from memcached log for global:slave_lag keys - https://phabricator.wikimedia.org/T108982#1593732 (10demon) p:5Triage>3Normal [17:39:45] (03PS1) 10Yuvipanda: labstore: Check for instances being able to rw NFS [puppet] - 10https://gerrit.wikimedia.org/r/235272 [17:39:46] ^ if anyone wants to review? [17:40:08] (03PS2) 10Yuvipanda: labstore: Check for instances being able to rw NFS [puppet] - 10https://gerrit.wikimedia.org/r/235272 [17:41:00] (03CR) 10Yuvipanda: [C: 032] labstore: Check for instances being able to rw NFS [puppet] - 10https://gerrit.wikimedia.org/r/235272 (owner: 10Yuvipanda) [17:41:45] YuviPanda: why ask for reviews then merge? :P [17:44:58] 6operations, 10ops-codfw: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#1593759 (10RobH) 3NEW a:3Papaul [17:46:10] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet last ran 21 hours ago [17:46:11] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1593769 (10RobH) [17:46:11] 6operations, 10ops-codfw: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#1593768 (10RobH) [17:47:44] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:52:09] (03PS2) 10BBlack: cache_mobile: limit 4xx to 1m like text [puppet] - 10https://gerrit.wikimedia.org/r/235251 (https://phabricator.wikimedia.org/T109286) [17:52:25] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 1809 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1720, number_of_pending_tasks: 2961, number_of_in_flight_fetch: 0, timed_out: False, 
active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4701, initializing_shards: 88, number_of_data_ [17:52:25] PROBLEM - ElasticSearch health check for shards on elastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 1809 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1720, number_of_pending_tasks: 2962, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4701, initializing_shards: 88, number_of_data_ [17:52:35] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch inactive shards 1788 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1699, number_of_pending_tasks: 3029, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4722, initializing_shards: 88, number_of_data_ [17:52:36] PROBLEM - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch inactive shards 1788 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1699, number_of_pending_tasks: 3035, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4722, initializing_shards: 88, number_of_data_ [17:52:55] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 3230, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:52:55] PROBLEM - ElasticSearch health check 
for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 3230, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:52:55] PROBLEM - ElasticSearch health check for shards on elastic1027 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 3230, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:52:55] PROBLEM - ElasticSearch health check for shards on elastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 3232, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:52:55] PROBLEM - ElasticSearch health check for shards on elastic1025 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 3232, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:52:56] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 
3232, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:52:56] PROBLEM - ElasticSearch health check for shards on elastic1028 is CRITICAL: CRITICAL - elasticsearch inactive shards 1729 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1640, number_of_pending_tasks: 3232, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4781, initializing_shards: 88, number_of_data_ [17:53:05] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch inactive shards 1711 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 1622, number_of_pending_tasks: 3314, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4799, initializing_shards: 88, number_of_data_ [17:53:06] PROBLEM - ElasticSearch health check for shards on elastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 1711 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1622, number_of_pending_tasks: 3323, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4799, initializing_shards: 88, number_of_data_ [17:53:06] PROBLEM - ElasticSearch health check for shards on elastic1022 is CRITICAL: CRITICAL - elasticsearch inactive shards 1711 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1622, number_of_pending_tasks: 3323, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4799, initializing_shards: 88, 
number_of_data_ [17:53:06] PROBLEM - ElasticSearch health check for shards on elastic1019 is CRITICAL: CRITICAL - elasticsearch inactive shards 1711 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1622, number_of_pending_tasks: 3323, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4799, initializing_shards: 88, number_of_data_ [17:53:06] PROBLEM - ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch inactive shards 1711 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1622, number_of_pending_tasks: 3323, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4799, initializing_shards: 88, number_of_data_ [17:53:15] PROBLEM - ElasticSearch health check for shards on elastic1023 is CRITICAL: CRITICAL - elasticsearch inactive shards 1690 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1613, number_of_pending_tasks: 3354, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4820, initializing_shards: 76, number_of_data_ [17:53:24] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 1685 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1593, number_of_pending_tasks: 3439, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4825, initializing_shards: 91, number_of_data_ [17:53:34] PROBLEM - ElasticSearch health check for shards on elastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 1666 threshold =0.1% breach: status: yellow, 
number_of_nodes: 31, unassigned_shards: 1574, number_of_pending_tasks: 3513, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4844, initializing_shards: 91, number_of_data_ [17:53:34] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 1666 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1574, number_of_pending_tasks: 3513, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4844, initializing_shards: 91, number_of_data_ [17:53:35] PROBLEM - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch inactive shards 1666 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1574, number_of_pending_tasks: 3530, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4844, initializing_shards: 91, number_of_data_ [17:53:35] PROBLEM - ElasticSearch health check for shards on elastic1030 is CRITICAL: CRITICAL - elasticsearch inactive shards 1666 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1574, number_of_pending_tasks: 3530, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 1, active_shards: 4844, initializing_shards: 91, number_of_data_ [17:53:45] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch inactive shards 1648 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1555, number_of_pending_tasks: 3614, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, 
relocating_shards: 0, active_shards: 4861, initializing_shards: 93, number_of_data_ [17:53:45] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch inactive shards 1648 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1555, number_of_pending_tasks: 3614, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4861, initializing_shards: 93, number_of_data_ [17:53:45] PROBLEM - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch inactive shards 1648 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1555, number_of_pending_tasks: 3614, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4861, initializing_shards: 93, number_of_data_ [17:53:45] PROBLEM - ElasticSearch health check for shards on elastic1024 is CRITICAL: CRITICAL - elasticsearch inactive shards 1648 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1555, number_of_pending_tasks: 3614, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4861, initializing_shards: 93, number_of_data_ [17:53:45] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 1648 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1555, number_of_pending_tasks: 3614, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4861, initializing_shards: 93, number_of_data_ [17:53:46] PROBLEM - ElasticSearch health check for shards on elastic1010 is CRITICAL: CRITICAL - elasticsearch 
inactive shards 1648 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1555, number_of_pending_tasks: 3614, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4861, initializing_shards: 93, number_of_data_ [17:55:40] (03CR) 10BBlack: [C: 032] cache_mobile: limit 4xx to 1m like text [puppet] - 10https://gerrit.wikimedia.org/r/235251 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [17:55:40] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch inactive shards 1608 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1515, number_of_pending_tasks: 3756, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4901, initializing_shards: 93, number_of_data_ [17:55:41] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch inactive shards 1608 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1515, number_of_pending_tasks: 3763, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4901, initializing_shards: 93, number_of_data_ [17:55:41] PROBLEM - ElasticSearch health check for shards on elastic1020 is CRITICAL: CRITICAL - elasticsearch inactive shards 1589 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1496, number_of_pending_tasks: 3837, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2165, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 4920, initializing_shards: 93, number_of_data_ [17:55:41] chasemp: ^ [17:55:41] (03PS1) 10EBernhardson: Add boilerplate for cirrussearch connection limits 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 [17:55:41] who killed elasticsearch? [17:55:41] (03PS2) 10BBlack: cache_mobile: raise max conns to 1k like text [puppet] - 10https://gerrit.wikimedia.org/r/235252 (https://phabricator.wikimedia.org/T109286) [17:55:41] thanks we are aware [17:55:41] elasticsearch is recovering currently [17:55:42] (03CR) 10jenkins-bot: [V: 04-1] Add boilerplate for cirrussearch connection limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (owner: 10EBernhardson) [17:55:42] (03CR) 10BBlack: [C: 032] cache_mobile: raise max conns to 1k like text [puppet] - 10https://gerrit.wikimedia.org/r/235252 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [17:55:42] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1593795 (10Qgil) @Tflanagan-wmf, you're in! [17:59:30] (03PS1) 10Jcrespo: Repool es1010, pool es1017 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235276 (https://phabricator.wikimedia.org/T105843) [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150901T1800). Please do the needful. [18:06:31] PROBLEM - NFS read/writeable on labs instances on labstore1001 is CRITICAL: Connection refused [18:06:31] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: Connection refused [18:08:47] PROBLEM - NFS read/writeable on labs instances on labstore2001 is CRITICAL: Connection refused [18:12:47] (03CR) 10Krinkle: [C: 031] Remove home_pmtpa and svn client from bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/231142 (owner: 10Faidon Liambotis) [18:14:31] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1593923 (10MaxSem) >>! 
In T109286#1593392, @BBlack wrote: > and finally, mobile has this inexplicable thing (I think there's a ticket about it floating around somewhere, but can'... [18:17:00] PROBLEM - Persistent high iowait on labstore1002 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [60.0] [18:17:32] bblack: I wonder if we can go ahead and kill the mobile caches in beta. [18:19:01] RECOVERY - Persistent high iowait on labstore1002 is OK: OK: Less than 50.00% above the threshold [40.0] [18:19:18] * ostriches tries :p [18:20:01] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [18:34:23] (03PS1) 10Chad: Stop sending purges to deployment-cache-bits04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235282 [18:34:53] (03CR) 10Chad: [C: 032] Stop sending purges to deployment-cache-bits04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235282 (owner: 10Chad) [18:34:58] PROBLEM - NFS read/writeable on labs instances on labstore1003 is CRITICAL: Connection refused [18:34:58] (03PS2) 10OliverKeyes: Add boilerplate for cirrussearch connection limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (owner: 10EBernhardson) [18:35:00] (03Merged) 10jenkins-bot: Stop sending purges to deployment-cache-bits04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235282 (owner: 10Chad) [18:35:28] (03CR) 10jenkins-bot: [V: 04-1] Add boilerplate for cirrussearch connection limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (owner: 10EBernhardson) [18:37:17] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 612, number_of_pending_tasks: 5695, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5864, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi 
[18:37:17] RECOVERY - ElasticSearch health check for shards on elastic1003 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 612, number_of_pending_tasks: 5701, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5864, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:18] RECOVERY - ElasticSearch health check for shards on elastic1023 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 612, number_of_pending_tasks: 5730, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5864, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:18] RECOVERY - ElasticSearch health check for shards on elastic1021 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 612, number_of_pending_tasks: 5730, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5864, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:27] RECOVERY - ElasticSearch health check for shards on elastic1031 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 610, number_of_pending_tasks: 5779, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5866, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:37] RECOVERY - ElasticSearch health check for shards on elastic1006 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 609, 
number_of_pending_tasks: 5992, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5867, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:37] RECOVERY - ElasticSearch health check for shards on elastic1010 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 609, number_of_pending_tasks: 5992, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5867, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:47] RECOVERY - ElasticSearch health check for shards on elastic1025 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 607, number_of_pending_tasks: 6121, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5869, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:57] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 605, number_of_pending_tasks: 6249, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5871, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:37:57] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 605, number_of_pending_tasks: 6249, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5871, 
initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:38:07] RECOVERY - ElasticSearch health check for shards on elastic1012 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 604, number_of_pending_tasks: 6379, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5872, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:38:07] RECOVERY - ElasticSearch health check for shards on elastic1004 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 604, number_of_pending_tasks: 6379, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5872, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:38:16] RECOVERY - ElasticSearch health check for shards on elastic1014 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 603, number_of_pending_tasks: 6509, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5873, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassi [18:38:37] RECOVERY - ElasticSearch health check for shards on elastic1016 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:37] RECOVERY - ElasticSearch health check for shards on elastic1001 is OK: OK - elasticsearch status production-search-eqiad: 
status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:38] RECOVERY - ElasticSearch health check for shards on elastic1018 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:38] RECOVERY - ElasticSearch health check for shards on elastic1013 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:42] (03PS1) 10Chad: Decom bits caches from beta/staging [puppet] - 10https://gerrit.wikimedia.org/r/235284 [18:38:46] RECOVERY - ElasticSearch health check for shards on elastic1011 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:46] RECOVERY - ElasticSearch health check for shards on elastic1030 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 
0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:46] RECOVERY - ElasticSearch health check for shards on elastic1009 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:46] RECOVERY - ElasticSearch health check for shards on elastic1007 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:56] RECOVERY - ElasticSearch health check for shards on elastic1024 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:56] RECOVERY - ElasticSearch health check for shards on elastic1020 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, 
number_of_data_nodes: 31, delayed_unassigne [18:38:58] RECOVERY - ElasticSearch health check for shards on elastic1027 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:58] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:58] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:58] RECOVERY - ElasticSearch health check for shards on elastic1019 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:38:58] RECOVERY - ElasticSearch health check for shards on elastic1017 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, 
number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:39:07] RECOVERY - ElasticSearch health check for shards on elastic1028 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:39:07] RECOVERY - ElasticSearch health check for shards on elastic1026 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:39:07] RECOVERY - ElasticSearch health check for shards on elastic1022 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 601, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2166, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5875, initializing_shards: 36, number_of_data_nodes: 31, delayed_unassigne [18:39:24] (03PS4) 10Alex Monk: Close wikimania2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224770 (https://phabricator.wikimedia.org/T105675) (owner: 10Dereckson) [18:39:37] (03CR) 10Alex Monk: "Can you guys remove your -1s?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/224770 (https://phabricator.wikimedia.org/T105675) (owner: 10Dereckson) [18:41:49] I can't ssh to tin, hmm. [18:41:52] Or bast1001. [18:42:59] Hmm, choking on ipv6. [18:43:05] * ostriches bets it's comcast's fault [18:44:27] Hmm, can hit bast1001 when forcing -4, but not tin when doing the same [18:45:55] Hrm, working now [18:46:02] * ostriches blames server gremlins [18:47:02] ostriches: no, I don't think you can kill the beta mobile caches :) [18:47:13] Awww [18:47:41] note that even in production, we're not functionally killing anything about mobile. it's just about merging it functionally into text. [18:48:26] but there are some pretty difficult sticking points left in that process. If we just naively smashed them together at the ops level today without further work at the MW level, they would pollute each other re: desktop/mobile variants of pages, etc [18:48:38] Ah, gotcha. [18:49:18] bblack: On the subject: https://gerrit.wikimedia.org/r/#/c/235284/ - last remnants of bits-specific caches gone here. Beta's now serving them via the main text caches as well. [18:49:47] (03PS1) 10EBernhardson: Turn off CirrusSearch user test for phrase slop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235285 [18:49:50] cool [18:50:30] (03CR) 10BBlack: [C: 031] Decom bits caches from beta/staging [puppet] - 10https://gerrit.wikimedia.org/r/235284 (owner: 10Chad) [18:51:04] ^ does it need further review? [18:51:17] bblack: no [18:51:45] (03CR) 10BBlack: [C: 032] Decom bits caches from beta/staging [puppet] - 10https://gerrit.wikimedia.org/r/235284 (owner: 10Chad) [18:57:56] I hate that we have those nasty bits hardcoded there, but at least there's less of them now [19:08:34] 6operations, 7HTTPS, 5Patch-For-Review: sitemap.wikimedia.org uses invalid SSL certificate - https://phabricator.wikimedia.org/T110511#1594178 (10demon) I'd just decom it for now. Can always resurrect if T23765 or something else finds a use for it. 
[19:14:33] (03PS2) 10Andrew Bogott: Move labvir1001 and 1002 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/234996 (https://phabricator.wikimedia.org/T110886) [19:15:06] (03CR) 10Hashar: "Maybe the service user should stick to a /bin/false login. I found out I can use:" [puppet] - 10https://gerrit.wikimedia.org/r/234483 (owner: 10Hashar) [19:18:03] (03CR) 10Andrew Bogott: [C: 032] Move labvir1001 and 1002 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/234996 (https://phabricator.wikimedia.org/T110886) (owner: 10Andrew Bogott) [19:34:09] (03PS2) 10Andrew Bogott: Move labvirt1003 and 1006 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/234997 (https://phabricator.wikimedia.org/T110886) [19:36:04] (03CR) 10Andrew Bogott: [C: 032] Move labvirt1003 and 1006 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/234997 (https://phabricator.wikimedia.org/T110886) (owner: 10Andrew Bogott) [19:36:50] !log ori@tin Synchronized php-1.26wmf20/includes/skins/SkinTemplate.php: cc643a0934: Deprecate unconditional loading of mediawiki.ui.button on all pages (duration: 00m 13s) [19:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:49] Hi greg-g! Hope you're better today... jynus helped us deploy the schema this morning, so we booked a slot for the actual deploy tomorrow at 16:00 PST. Hope that's OK, I'll send a message to @Ops mailing list as per your suggestion yesterday :) thanks! [19:43:12] legoktm, does https://gerrit.wikimedia.org/r/#/c/235177/ seem sane to you, or should we add another explicit opt-out for labswiki itself? 
[19:43:56] (03PS1) 1020after4: symlinks for 1.26wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235317 [19:44:41] (03CR) 1020after4: [C: 032] symlinks for 1.26wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235317 (owner: 1020after4) [19:44:47] (03Merged) 10jenkins-bot: symlinks for 1.26wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235317 (owner: 1020after4) [19:45:42] (03PS2) 10Andrew Bogott: Make OpenStack Juno the new default, except for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/234998 (https://phabricator.wikimedia.org/T110886) [19:46:50] (03CR) 10Andrew Bogott: [C: 032] Make OpenStack Juno the new default, except for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/234998 (https://phabricator.wikimedia.org/T110886) (owner: 10Andrew Bogott) [19:52:52] !log removed tools20150901132642 from labstore vg on labstore1002 [19:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:27] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:54:46] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:55:27] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:56:34] twentyafterfour: did you do scap yet? [19:56:52] (03CR) 10Mattflaschen: "This can be reviewed now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: 10Mattflaschen) [19:58:27] * aude thinks not [19:59:39] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [20:00:24] (03PS1) 1020after4: delete 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235347 [20:00:47] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [20:01:03] aude: not yet, I'm about to [20:01:11] aude: should I wait? [20:01:38] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [20:02:24] twentyafterfour: go ahead [20:02:43] * aude had deployed something earlier, but realized i forgot to update the submodule [20:02:53] but it should go out with the train now, so ok [20:03:00] ok cool [20:03:12] !log twentyafterfour@tin Started scap: sync 1.26wmf21 [20:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:17] (03PS2) 10Andrew Bogott: Now that all virt nodes are running Juno, return everything to the scheduler pool. [puppet] - 10https://gerrit.wikimedia.org/r/234999 (https://phabricator.wikimedia.org/T110886) [20:03:34] !log unfreezing elasticsearch indices [20:04:04] (03PS1) 10Rush: elasticsearch: exclude 1001 from ferm for now [puppet] - 10https://gerrit.wikimedia.org/r/235348 [20:04:18] (03PS2) 10Rush: elasticsearch: exclude 1001 from ferm for now [puppet] - 10https://gerrit.wikimedia.org/r/235348 [20:05:46] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=510.64 Read Requests/Sec=0.00 Write Requests/Sec=450.45 KBytes Read/Sec=0.00 KBytes_Written/Sec=1801.82 [20:06:13] (03CR) 10Andrew Bogott: [C: 032] Now that all virt nodes are running Juno, return everything to the scheduler pool. 
[puppet] - 10https://gerrit.wikimedia.org/r/234999 (https://phabricator.wikimedia.org/T110886) (owner: 10Andrew Bogott) [20:07:17] (03PS3) 10Rush: elasticsearch: exclude 1001 from ferm for now [puppet] - 10https://gerrit.wikimedia.org/r/235348 [20:07:46] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.41 Read Requests/Sec=0.00 Write Requests/Sec=0.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=1.20 [20:09:18] 6operations, 10ops-ulsfo: troubleshoot ulsfo side of IC-313592 - https://phabricator.wikimedia.org/T111101#1594466 (10RobH) 3NEW a:3RobH [20:10:43] (03CR) 10Rush: [C: 032] elasticsearch: exclude 1001 from ferm for now [puppet] - 10https://gerrit.wikimedia.org/r/235348 (owner: 10Rush) [20:19:37] 6operations, 10Wikimedia-Mailing-lists: disable mailman list - https://phabricator.wikimedia.org/T111063#1594528 (10Dzahn) "mailman disabled. Archives should be available at current location, all mail should be moderated and the list should not be on the listinfo page." [20:19:55] 6operations, 10Wikimedia-Mailing-lists: disable mailman list - https://phabricator.wikimedia.org/T111063#1594530 (10Dzahn) 5Open>3Resolved [20:19:56] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1594531 (10Dzahn) [20:23:08] 6operations, 5Patch-For-Review: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1594537 (10chasemp) We attempted to enable ferm on the active master node today (and the only remaining node). I 'preseeded' the ferm configuration as referenced above and here is the puppet output: ``` r... 
[20:30:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [20:33:49] !log twentyafterfour@tin Finished scap: sync 1.26wmf21 (duration: 30m 37s) [20:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:37] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [20:40:27] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [20:45:03] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1594626 (10Dzahn) [20:45:40] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1450894 (10Dzahn) [20:45:47] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1594665 (10JohnLewis) [20:54:32] !log restarted nutcracker on mw1142 [20:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:16] (03PS3) 10EBernhardson: Add boilerplate for cirrussearch connection limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 [21:05:42] (03CR) 10jenkins-bot: [V: 04-1] Add boilerplate for cirrussearch connection limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (owner: 10EBernhardson) [21:06:16] (03PS4) 10EBernhardson: Enable
CirrusSearch per-user rate limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 [21:06:40] (03CR) 10jenkins-bot: [V: 04-1] Enable CirrusSearch per-user rate limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (owner: 10EBernhardson) [21:07:01] (03PS5) 10EBernhardson: Enable CirrusSearch per-user rate limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 [21:08:37] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [21:14:38] (03PS3) 10Dzahn: mailman: ferm, allow rsync from sodium for migration [puppet] - 10https://gerrit.wikimedia.org/r/235155 (https://phabricator.wikimedia.org/T110129) [21:15:41] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra encryption (TLS) - https://phabricator.wikimedia.org/T108953#1594758 (10Eevans) [21:15:46] 6operations, 10Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1594760 (10Dzahn) p:5High>3Normal [21:16:04] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1594763 (10Dzahn) p:5High>3Normal [21:16:19] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1594764 (10Dzahn) p:5High>3Normal [21:16:31] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1594765 (10Dzahn) p:5High>3Normal [21:17:25] 
6operations, 10Wikimedia-Mailing-lists: wikinews-l: no active listadmin - https://phabricator.wikimedia.org/T110956#1594766 (10Dzahn) p:5High>3Normal [21:18:23] (03CR) 10Alex Monk: [C: 032] Don't set wgEchoBundleEmailInterval if it won't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235177 (https://phabricator.wikimedia.org/T110985) (owner: 10Alex Monk) [21:18:47] (03Merged) 10jenkins-bot: Don't set wgEchoBundleEmailInterval if it won't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235177 (https://phabricator.wikimedia.org/T110985) (owner: 10Alex Monk) [21:19:16] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1594774 (10Dzahn) p:5High>3Normal [21:19:33] (03CR) 10Dzahn: [C: 032] mailman: ferm, allow rsync from sodium for migration [puppet] - 10https://gerrit.wikimedia.org/r/235155 (https://phabricator.wikimedia.org/T110129) (owner: 10Dzahn) [21:20:12] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/235177/ (duration: 00m 12s) [21:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:43] (03CR) 10Dzahn: "yes, that would be intended" [dns] - 10https://gerrit.wikimedia.org/r/197361 (owner: 10Dzahn) [21:28:16] (03PS1) 10Gergő Tisza: Disable MediaViewer thumbnail guessing on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235361 (https://phabricator.wikimedia.org/T69651) [21:28:18] (03PS4) 10Dzahn: park wikiartpedia domains [dns] - 10https://gerrit.wikimedia.org/r/197361 [21:30:05] (03PS1) 10Jforrester: Enable VisualEditor for NS_PROJECT on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235362 (https://phabricator.wikimedia.org/T100067) [21:41:46] (03PS5) 10Dzahn: Rewrite sitemap.wikimedia.org to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234256 (https://phabricator.wikimedia.org/T110511) (owner: 10Chmarkine) [21:42:17] 
6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1594884 (10Eevans) [21:42:53] (03CR) 10Dzahn: [C: 032] Rewrite sitemap.wikimedia.org to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234256 (https://phabricator.wikimedia.org/T110511) (owner: 10Chmarkine) [21:49:50] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1594920 (10Eevans) Any word here? Encryption is a dependency for scaling RESTBase into codfw, a quarterly goal for Services (T102306), and time is starting to run short. I'm happ... [21:51:03] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1594922 (10Eevans) [22:01:37] PROBLEM - torrus.wikimedia.org HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Torrus Top: Wikimedia not found on http://torrus.wikimedia.org:80/torrus - 838 bytes in 0.277 second response time [22:02:29] ok, hrmpf, i'll follow the docs how to fix that regular one ^ [22:03:37] RECOVERY - torrus.wikimedia.org HTTP on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2166 bytes in 0.282 second response time [22:04:18] or not :p [22:05:24] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1594988 (10Eevans) >>! In T95253#1536539, @Eevans wrote: >>>! In T95253#1535221, @fgiunchedi wrote: >> also multiple instances means we'll need to... 
[22:08:16] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [22:09:01] (03PS1) 10Alex Monk: Search Συγγραφέας namespace by default on elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235366 (https://phabricator.wikimedia.org/T110871) [22:10:41] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1595013 (10JohnLewis) p:5High>3Triage [22:15:34] (03PS1) 10Rush: Allow upstart to set ulimit [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/235368 [22:18:06] !log Maps: creating and populating admin table [22:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:19:03] JohnFLewis: so "High" to "Needs triage" is "raise" :) [22:19:33] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1595038 (10JohnLewis) p:5Triage>3High [22:19:45] apparently. why that happened, no idea :) [22:19:49] (03CR) 10RobH: [C: 04-2] "The patchset looks good, my vote is purely to block merge until an operations member can confirm (from any linked private tasks) that ther" [puppet] - 10https://gerrit.wikimedia.org/r/235047 (https://phabricator.wikimedia.org/T110754) (owner: 10John F. Lewis) [22:20:45] JohnFLewis: it's hard to tell when to close any kind of "add wiki docs" task [22:20:53] so open-ended, wiki is never done [22:21:00] (03PS1) 10Rush: Nutcracker: set a higher ulimit [puppet] - 10https://gerrit.wikimedia.org/r/235370 [22:21:14] indeed. [22:21:28] but any improvement is a massive improvement so I'll call it there [22:21:48] close all as duplicate of https://bugzilla.wikimedia.org/show_bug.cgi?id=1 [22:21:57] ok, yes it is [22:22:02] do it [22:22:16] I'll close it when lists.wikimedia.org's page (rename to Mailman perhaps? 
long annoying name :) ) is replaced [22:22:36] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1595045 (10RobH) a:3RobH I'll claim this to my assignment until our Operations meeting next week.... [22:24:03] 6operations, 7Monitoring: grafana.wikimedia.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1595047 (10greg) >>! In T110484#1579327, @akosiaris wrote: > I am assuming this has been going for a long time. Questions: > > * What kind of privacy issues does it create. As above, P... [22:27:22] (03PS2) 10Rush: Nutcracker: set a higher ulimit [puppet] - 10https://gerrit.wikimedia.org/r/235370 [22:27:40] JohnFLewis: done, i flipped the redirect and target around and moved the page [22:32:20] 6operations, 6Engineering-Community, 3ECT-September-2015: date/budget proposal for 2015 Ops Offsite - https://phabricator.wikimedia.org/T89023#1595087 (10Rfarrand) 5Open>3Resolved [22:32:56] (03PS5) 10Alex Monk: Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) (owner: 10TheDJ) [22:34:05] (03CR) 10Alex Monk: [C: 031] Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) (owner: 10TheDJ) [22:34:49] 6operations, 6Labs, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1595100 (10chasemp) Saw this again today on mw1142. As before it was accompanied by: `[2015-09-01 20:52:21.847] nc_proxy.c:330 client connections 935 exceed limit 93` http://graphit... [22:40:04] We no longer have https://github.com/wikimedia/operations-debs-txstatsd deployed, correct? [22:40:33] We use operations-debs-python-statsd now? or operations-debs-StatsD?
[22:41:54] or debs-statsite? [22:44:46] lol [22:45:01] repository naming fun [22:45:15] statsite has a decommission class in puppet, so I guess that one is being removed. [22:46:00] Hn.. maybe not [22:47:06] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1595147 (10chasemp) Any luck? [22:47:25] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1595148 (10chasemp) a:5chasemp>3mmodell [22:47:54] no use of txstatsd indeed, but I see puppet classes for statsite, statsd and python_statsd. [22:49:28] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1595152 (10RobH) 3NEW [22:49:57] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1595160 (10RobH) Additionally, inclusion in the ops group means sudo rights, and then this has to wait for next Monday. Inclusion in bastiononly just means a 3 day wait. [22:52:20] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1595169 (10RobH) So his onsite responsibilities include checking power usage and temp/humidity. This means librenms access, as well as being able to directly connect to the PDU fo... [22:52:32] Hmm, statsd.eqiad.wmnet resolves to graphite1001 which in puppet has only statsdlb installed. I figured statsdlb was only a load balancer. [22:52:42] * Krinkle starts wikitech page. [22:53:30] Aha. So it's statsite. Great. [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150901T2300). Please do the needful.
[23:00:04] tgr James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:20] ugh, I had a patch or two to schedule that I've forgotten to do [23:00:26] 6operations: Please add kharold@wikimedia.org to grants alies - https://phabricator.wikimedia.org/T111125#1595239 (10eross) 3NEW [23:00:38] will do these scheduled ones first [23:01:40] (03CR) 10Alex Monk: [C: 032] Disable MediaViewer thumbnail guessing on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235361 (https://phabricator.wikimedia.org/T69651) (owner: 10Gergő Tisza) [23:01:48] (03Merged) 10jenkins-bot: Disable MediaViewer thumbnail guessing on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235361 (https://phabricator.wikimedia.org/T69651) (owner: 10Gergő Tisza) [23:01:51] Krenair: i also snuck 2 config updates in at the last second [23:02:03] yep, got those [23:02:43] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/235361/ (duration: 00m 13s) [23:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:59] tgr: heya, please test the change above [23:06:34] 6operations, 7Mail: Please add kharold@wikimedia.org to grants alies - https://phabricator.wikimedia.org/T111125#1595272 (10Krenair) [23:06:49] Krenair: Me me me? 
;-) [23:07:04] yeah, I guess the previous change was only for beta [23:07:27] (03PS2) 10Alex Monk: Enable VisualEditor for NS_PROJECT on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235362 (https://phabricator.wikimedia.org/T100067) (owner: 10Jforrester) [23:07:32] (03CR) 10Alex Monk: [C: 032] Enable VisualEditor for NS_PROJECT on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235362 (https://phabricator.wikimedia.org/T100067) (owner: 10Jforrester) [23:07:41] (03Merged) 10jenkins-bot: Enable VisualEditor for NS_PROJECT on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235362 (https://phabricator.wikimedia.org/T100067) (owner: 10Jforrester) [23:08:09] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/235362/ (duration: 00m 14s) [23:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:23] James_F, looks like it's working now [23:10:46] Krenair: Confirmed. [23:11:42] ebernhardson, these changes are okay with your team, right? no phabricator tickets linked... [23:12:50] James_F: Any flash of unveed tab recently? [23:13:12] Krinkle: Not yet, but we've not been doing scary things recently. [23:16:12] wow, url overkill? [23:16:13] https://phabricator.wikimedia.org/diffusion/ODDY/browse/master/src/nc_core.c;37fb9a2b939821c6d704ba09b7d80bcc88961224$39-40?view=1 [23:16:27] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1595330 (10Dzahn) rsync running in screen ... [23:16:32] Krenair: confirmed, sorry for the delay [23:16:33] slashes, semicolons, query parameters and $-style hash tag [23:19:01] Krinkle: But you have to remember that It Works™.
:-) [23:21:16] 6operations, 10Wikimedia-Mailing-lists: announce scheduled downtime - https://phabricator.wikimedia.org/T110133#1595352 (10Dzahn) Agreed with John it should be: Wednesday, September 9th at 1400 UTC which is 7am PDT and 2pm BST, so well within European hours and works for both of us. Also not on a Friday or... [23:22:48] 6operations, 10Wikimedia-Mailing-lists, 7user-notice: announce scheduled downtime - https://phabricator.wikimedia.org/T110133#1595360 (10JohnLewis) 3PM BST and adding usernotice. [23:27:20] Krenair: yea they are ok, lemme see if i can dig up a ticket for each [23:28:50] (03PS6) 10EBernhardson: Enable CirrusSearch per-user rate limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (https://phabricator.wikimedia.org/T76497) [23:30:34] (03PS2) 10EBernhardson: Turn off CirrusSearch user test for phrase slop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235285 (https://phabricator.wikimedia.org/T109018) [23:30:41] Krenair: bugs added to both [23:30:56] (03PS1) 10Tim Landscheidt: Tools: Fix quoting in sql script [puppet] - 10https://gerrit.wikimedia.org/r/235378 (https://phabricator.wikimedia.org/T75595) [23:31:22] thanks [23:35:28] -> pm [23:40:06] (03CR) 10BBlack: [C: 031] "+1 in general, as this goes in the right direction and has to be better than before, but:" [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/235368 (owner: 10Rush) [23:40:51] (03CR) 10Alex Monk: [C: 032] Turn off CirrusSearch user test for phrase slop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235285 (https://phabricator.wikimedia.org/T109018) (owner: 10EBernhardson) [23:41:16] (03Merged) 10jenkins-bot: Turn off CirrusSearch user test for phrase slop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235285 (https://phabricator.wikimedia.org/T109018) (owner: 10EBernhardson) [23:41:46] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/235285/ (duration: 00m 14s) [23:41:51]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:15] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1595434 (10Dzahn) You said "read L2" above, but that's the volunteer NDA agreement. I believe instead he should sign L3. [23:48:10] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1595438 (10Dzahn) Also see T109640 where he already got the ops LDAP group and it's linked to his labs user name which is "uid=pt1979" [23:50:24] (03CR) 10Deskana: [C: 04-1] "Please hold off on this while I check something." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (https://phabricator.wikimedia.org/T76497) (owner: 10EBernhardson) [23:56:53] (03CR) 10Alex Monk: [C: 032] Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) (owner: 10TheDJ) [23:57:17] (03Merged) 10jenkins-bot: Disable webp for now, so we can enable outside of WMF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) (owner: 10TheDJ) [23:59:43] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/221731/ (duration: 00m 13s) [23:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master