[00:00:04] RoanKattouw, ^d: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150204T0000).
[00:01:35] nothing to swat?
[00:02:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[00:03:52] (CR) RobH: [C: 1] Give parsoid admins the ability to update/restart the RT testing service. [puppet] - https://gerrit.wikimedia.org/r/180221 (https://phabricator.wikimedia.org/T86804) (owner: Cscott)
[00:07:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[00:07:59] andrewbogott: could you check out that access request for parsoid admins @ T86804
[00:08:15] * andrewbogott reads
[00:08:50] thanks, the patch is linked there, it also came from an old RT thing
[00:10:56] (PS4) Andrew Bogott: Give parsoid admins the ability to update/restart the RT testing service. [puppet] - https://gerrit.wikimedia.org/r/180221 (https://phabricator.wikimedia.org/T86804) (owner: Cscott)
[00:11:41] mutante: doesn’t look like anyone objects; any reason for me not to merge right now?
[00:13:20] andrewbogott: given the upload date, i think we should. what i dont know is if it needs an approval
[00:13:58] (CR) Andrew Bogott: [C: 2] Give parsoid admins the ability to update/restart the RT testing service. [puppet] - https://gerrit.wikimedia.org/r/180221 (https://phabricator.wikimedia.org/T86804) (owner: Cscott)
[00:14:57] operations, Ops-Access-Requests: Give parsoid admins the ability to update/restart the RT testing service - https://phabricator.wikimedia.org/T86804#1013341 (Andrew) Open>Resolved a:Andrew Merged. Sorry for the delay.
[00:15:45] andrewbogott: :) thx
[00:18:55] Scrum-of-Scrums, operations, RESTBase, hardware-requests: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1013354 (RobH) Update from vendor: These have shipped and are due to arrive onsite @ eqiad on Thursday, 2015-02-05.
[00:29:46] operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1013366 (RobH) NEW a:RobH
[00:30:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[00:30:56] (PS1) RobH: changing dumps.w.o cert from sha1 to sha256 [puppet] - https://gerrit.wikimedia.org/r/188492
[00:33:05] operations, Datasets-General-or-Unknown: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1013374 (wpmirrordev) Well done.
[00:33:14] operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1013375 (RobH) So while I see no https traffic on the system, I do see regular http traffic, and reloading apache for the https change will affect those downloads. When there is no traffic, the follo...
[00:33:45] heh, if i reloaded dumps.w.o and someone was in the last few percent of a dump download, im pretty sure they'll hunt me down and killme.
[00:33:54] * robh makes a note to schedule that particular replacement.
[00:34:37] <^d> I was at 98% on the enwiki-full-history dump!
[00:35:30] operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1013376 (RobH) Also, Once this is completed, we need to revoke the SHA1 cert. So someone can do this via rapidssl, or I'll do so once the other ticket is live. (Anyone can push this change live when...
[00:35:57] ^d: yea thats my fear for some random user in the middle of fucking nowhere and they have shit internet
[00:36:01] but... on closer inspection..
[00:36:07] its crawlers and neon
[00:36:09] so fuck it, gonna push now.
[00:36:11] \o/
[00:36:14] there's rsyncd on it :)
[00:36:37] that shouldnt care about http(s) rehupping though right?
[00:36:42] no
[00:36:50] cool, didnt think so, but dumps are not normal
[00:36:54] we have a few "public" users who rsync from there
[00:36:55] so wanted to check, heh
[00:37:21] (CR) RobH: [C: 2] changing dumps.w.o cert from sha1 to sha256 [puppet] - https://gerrit.wikimedia.org/r/188492 (owner: RobH)
[00:37:42] operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1013378 (RobH) followed my own advice, and the one http connection that wasnt a crawler or neon just died, so pushing now.
[00:38:56] !log replacing dumps.w.o sha1 cert with sha256
[00:39:00] Logged the message, Master
[00:39:50] hrmm, didnt replace it in nginx...
[00:41:51] grrr
[00:45:34] operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1013380 (RobH) so the cert is in the filesystem on ms1001, but isnt serving it. ive restarted and reloaded nginx, so not sure whats up. I'll keep hacking at it, but usability isnt gone, just still s...
[00:46:47] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013381 (RobH)
[00:49:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[01:02:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:04:06] (CR) Dzahn: [C: 2] "this is just on uranium. it doesn't have base::firewall yet and will probably need more holes, but this is just fine as a prerequisite" [puppet] - https://gerrit.wikimedia.org/r/188415 (owner: John F. Lewis)
[01:06:37] (CR) Dzahn: "noop on uranium as expected" [puppet] - https://gerrit.wikimedia.org/r/188415 (owner: John F. Lewis)
[01:11:46] (CR) Dzahn: [C: -2] "the variable used here as srange isn't available yet" [puppet] - https://gerrit.wikimedia.org/r/188204 (owner: Dzahn)
[01:14:31] operations: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1013417 (Dzahn) meanwhile i have edited the list settings of the ops list and removed gerrit@wikimedia.org as a valid sender in sender filters, instead added it to discard...
[01:17:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[01:28:13] operations: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1013436 (Dzahn) >>! In T88388#1010940, @hashar wrote: > I guess the Gerrit user `novaadmin ` has been made to watch the operations/puppet.git...
[01:30:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:39:06] !log restarted elasticsearch on logstash1002; rolling restart of cluster part 2 of 3
[01:39:15] Logged the message, Master
[01:39:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[01:47:05] !log restarted elasticsearch on logstash1001; rolling restart part 3 of 3
[01:47:11] Logged the message, Master
[01:49:50] WMF-Legal, Wikimedia-General-or-Unknown, operations: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#1013479 (LuisV_WMF) Yes and yes.
[01:52:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:01:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:15:31] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 03s)
[02:15:40] Logged the message, Master
[02:16:38] !log LocalisationUpdate completed (1.25wmf14) at 2015-02-04 02:15:35+00:00
[02:16:41] Logged the message, Master
[02:21:38] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:26:48] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:29:13] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s)
[02:29:22] Logged the message, Master
[02:30:20] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-04 02:29:17+00:00
[02:30:23] Logged the message, Master
[02:39:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:48:56] !log tstarling Synchronized php-1.25wmf15/includes/specials/SpecialUserrights.php: Unbreak interwiki user rights granting (duration: 00m 05s)
[02:49:04] Logged the message, Master
[02:56:08] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[03:11:39] (PS1) Springle: use m1-master CNAME [puppet] - https://gerrit.wikimedia.org/r/188508
[03:32:17] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 40 failures
[03:34:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[03:46:36] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[03:48:30] operations: The certificate chains of newly installed SHA2 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1013572 (Chmarkine) NEW
[03:52:55] operations: The certificate chains of newly installed SHA2 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1013580 (Chmarkine)
[03:52:58] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013579 (Chmarkine)
[03:59:08] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[04:21:08] MediaWiki-extensions-WikimediaIncubator, Wikimedia-Language-setup, Wikimedia-Site-requests, operations: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915#1013617 (Glaisher) One of the proposed solution is to remove the 'minnan'.project.org DNS so this is in #operations s...
[04:21:56] Wikimedia-Language-setup, operations, Wikimedia-DNS: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915#1013621 (Glaisher)
[04:45:09] (PS1) Springle: Generate grants SQL scripts in /etc/mysql [puppet] - https://gerrit.wikimedia.org/r/188511
[04:47:40] (PS2) Springle: Generate grants SQL scripts in /etc/mysql [puppet] - https://gerrit.wikimedia.org/r/188511
[04:49:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Feb 4 04:48:42 UTC 2015 (duration 48m 41s)
[04:49:50] Logged the message, Master
[04:50:15] (PS3) Springle: Generate grants SQL scripts in /etc/mysql [puppet] - https://gerrit.wikimedia.org/r/188511
[04:50:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[04:54:18] (CR) Springle: [C: 2] Generate grants SQL scripts in /etc/mysql [puppet] - https://gerrit.wikimedia.org/r/188511 (owner: Springle)
[05:09:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[05:10:18] YuviPanda: where do we set https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep now?
[05:10:36] akosiaris: ping when available :)
[05:15:18] (PS1) Springle: depool db1057 [mediawiki-config] - https://gerrit.wikimedia.org/r/188513
[05:15:43] (CR) Springle: [C: 2] depool db1057 [mediawiki-config] - https://gerrit.wikimedia.org/r/188513 (owner: Springle)
[05:15:47] (Merged) jenkins-bot: depool db1057 [mediawiki-config] - https://gerrit.wikimedia.org/r/188513 (owner: Springle)
[05:16:36] !log springle Synchronized wmf-config/db-eqiad.php: depool db1057 (duration: 00m 06s)
[05:16:41] Logged the message, Master
[05:24:32] (CR) Florianschmidtwelzow: "Not every sysop want to have the translateadmin rights to make sure, that they don't break things, so i think an opt-in is the best soluti" [mediawiki-config] - https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) (owner: Florianschmidtwelzow)
[06:18:12] (PS6) KartikMistry: cxserver: Add Yandex support [puppet] - https://gerrit.wikimedia.org/r/186538
[06:28:16] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:17] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: puppet fail
[06:28:38] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: puppet fail
[06:29:06] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: puppet fail
[06:29:07] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:38] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:38] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:46] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:46] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:06] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:16] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:17] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:17] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:38] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:27] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:30] <_joe_> good morning passenger
[06:32:32] (PS7) KartikMistry: cxserver: Add Yandex support [puppet] - https://gerrit.wikimedia.org/r/186538
[06:35:29] (PS1) KartikMistry: cxserver: Enable English-Russian language pair [puppet] - https://gerrit.wikimedia.org/r/188517
[06:42:46] (PS19) Gage: Strongswan: IPsec Puppet module [puppet] - https://gerrit.wikimedia.org/r/181742
[06:45:56] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:18] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:47:17] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:48:07] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:48:07] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:48:07] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:48:17] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:49:16] (PS1) Yuvipanda: cache: Use new ip for deployment-mediawiki01 [puppet] - https://gerrit.wikimedia.org/r/188519
[06:49:57] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:53:02] (PS2) Yuvipanda: cache: Use new ip for deployment-mediawiki01 [puppet] - https://gerrit.wikimedia.org/r/188519
[06:53:17] (CR) Yuvipanda: [C: 2] cache: Use new ip for deployment-mediawiki01 [puppet] - https://gerrit.wikimedia.org/r/188519 (owner: Yuvipanda)
[06:53:25] (CR) Yuvipanda: [V: 2] cache: Use new ip for deployment-mediawiki01 [puppet] - https://gerrit.wikimedia.org/r/188519 (owner: Yuvipanda)
[07:16:56] operations: Puppet broken on silver.wikimedia.org - https://phabricator.wikimedia.org/T88513#1013801 (Joe) NEW
[07:22:04] kart_: ping
[07:24:40] akosiaris: pong.
[07:25:21] akosiaris: see two patches, but ping was about...
[07:26:02] akosiaris: how can I add different language pair for Beta and Production?
[07:26:10] (ie config.js)
[07:27:09] Using hieradata/labs/deployment-prep/common.yaml is the way, but looking at config.js, it seems scary.
[07:31:29] operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1013830 (Qgil) p:High>Normal
[07:31:37] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures
[07:32:35] (CR) Nikerabbit: cxserver: Add Yandex support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/186538 (owner: KartikMistry)
[07:35:58] (PS8) KartikMistry: cxserver: Add Yandex support [puppet] - https://gerrit.wikimedia.org/r/186538
[07:37:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[07:39:55] (CR) Alexandros Kosiaris: [C: 1] use m1-master CNAME [puppet] - https://gerrit.wikimedia.org/r/188508 (owner: Springle)
[07:43:34] (CR) Nikerabbit: cxserver: Enable English-Russian language pair (1 comment) [puppet] - https://gerrit.wikimedia.org/r/188517 (owner: KartikMistry)
[07:44:54] (PS2) KartikMistry: cxserver: Enable English-Russian language pair [puppet] - https://gerrit.wikimedia.org/r/188517
[07:45:51] (CR) KartikMistry: cxserver: Enable English-Russian language pair (1 comment) [puppet] - https://gerrit.wikimedia.org/r/188517 (owner: KartikMistry)
[07:49:38] !log Manual failover of Hadoop namenode from analytics1002 to analytics1001, as analytics1002 had Heap space errors
[07:49:44] Logged the message, Master
[07:52:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:53:25] operations: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1013855 (hashar) People interesting in receiving all notifications can just watch the project from the Gerrit settings. I don't think we need to spam any list with those c...
[07:54:32] (PS3) Nikerabbit: cxserver: Enable English to Russian MT Change-Id: I9e73433becde16701b1963a39613bf02c0a10f14 [puppet] - https://gerrit.wikimedia.org/r/188517 (owner: KartikMistry)
[07:54:54] (PS1) Alexandros Kosiaris: ircd: services started by upstart should not fork [puppet] - https://gerrit.wikimedia.org/r/188521
[07:56:32] (CR) Alexandros Kosiaris: [C: 2] ircd: services started by upstart should not fork [puppet] - https://gerrit.wikimedia.org/r/188521 (owner: Alexandros Kosiaris)
[08:00:17] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:03:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[08:06:56] (CR) Alexandros Kosiaris: "Adding some more info today:" [puppet] - https://gerrit.wikimedia.org/r/188188 (owner: Dzahn)
[08:14:46] RECOVERY - Disk space on stat1002 is OK: DISK OK
[08:15:01] hoo: ping?
[08:15:06] pong
[08:15:47] hoo: paravoid just noticed "Cache-Control: private, s-maxage=0, max-age=0, must-revalidate" on wikidata items. is this new?
[08:16:11] there has been a spike in page load time since non-wikipedias were bumped to wmf15 at 19:30 UTC
[08:16:26] that sounds nasty
[08:16:34] I'm not aware of we setting any such headers
[08:16:38] (PS1) Alexandros Kosiaris: ganglia_new: move ferm::service from module to role [puppet] - https://gerrit.wikimedia.org/r/188524
[08:17:14] ori: We didn't deploy any new code with wmf15
[08:17:23] so not a regression in Wikidata code
[08:19:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[08:20:02] (CR) Hashar: "Ubuntu provides a package for puppet-lint 1.1.0 ( http://packages.ubuntu.com/vivid/puppet-lint )." [puppet] - https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T88430) (owner: Hashar)
[08:20:15] ori: Do we have a ticket about that?
[08:20:56] not yet
[08:21:56] (CR) Alexandros Kosiaris: [C: 2] ganglia_new: move ferm::service from module to role [puppet] - https://gerrit.wikimedia.org/r/188524 (owner: Alexandros Kosiaris)
[08:25:05] akosiaris: is tere a plan to have a pop ib asia/far east ?
[08:25:12] Please open one, I'll try to get people to look at it then
[08:25:22] (Abandoned) Hashar: beta: Remove explict ssh grant for mwdeploy user [puppet] - https://gerrit.wikimedia.org/r/185949 (owner: Yuvipanda)
[08:25:25] Don't personally plan to work much for WMDE today, but that's not set in stone
[08:25:58] matanya: an actual well thought out and ready t o implement plan ? no. A will ? yes
[08:26:46] thanks akosiaris, now i can go and answer the requestor
[08:27:47] if this day comes, i have experiance with 4 DC's in that area, if one is interested in my opinion
[08:29:01] matanya: OK, thanks. That sounds nice!
[08:38:28] (PS2) Alexandros Kosiaris: misc-web-lb changes to support servermon.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/188389 (owner: RobH)
[08:38:38] ori: got to go... please open a bug, if you want someone to look into that today
[08:43:42] hothanks, will do, sorry
[08:48:17] s/hothanks/hoo: thanks/
[08:51:07] (PS1) Alexandros Kosiaris: Make corp LDAP mirror alerts paging [puppet] - https://gerrit.wikimedia.org/r/188528
[08:52:21] (PS1) Giuseppe Lavagetto: mediawiki: use lru pcre cache for all appservers [puppet] - https://gerrit.wikimedia.org/r/188529
[08:52:23] (PS1) Giuseppe Lavagetto: mediawiki: use lru pcre cache for all api appservers [puppet] - https://gerrit.wikimedia.org/r/188530
[08:52:25] (PS1) Giuseppe Lavagetto: mediawiki: use lru pcre cache for all mediawiki hhvm installations [puppet] - https://gerrit.wikimedia.org/r/188531
[08:57:07] <_joe_> !log installing the new hhvm package on all appservers, one at a time
[08:57:14] Logged the message, Master
[09:02:28] operations: Build a new HHVM package - https://phabricator.wikimedia.org/T86906#1013959 (Joe) The package had a bug, which I fixed in https://gerrit.wikimedia.org/r/#/c/188332/ The new version is being rolled out everywhere right now.
[09:06:16] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:19:09] Datasets-General-or-Unknown, operations: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1013961 (Joe) a:Joe
[09:19:52] Datasets-General-or-Unknown, operations: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1006147 (Joe) The dumps stopped after we rebooted the servers for protecting against the GHOST vulnerability. I'm investigating this.
[09:23:26] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[09:29:47] greetings
[09:36:40] (PS1) Ori.livneh: [Regression] Revert "Non wikipedias to 1.25wmf15" [mediawiki-config] - https://gerrit.wikimedia.org/r/188532
[09:36:58] paravoid: ^
[09:37:52] imgur because i was too lazy to open up graphite and generate a graph using absolute rather than relative time
[09:38:07] apologies to posterity, etc
[09:39:51] would you like to +1 or shall i go ahead?
[09:40:35] (CR) Faidon Liambotis: [C: 1] [Regression] Revert "Non wikipedias to 1.25wmf15" [mediawiki-config] - https://gerrit.wikimedia.org/r/188532 (owner: Ori.livneh)
[09:41:18] danke
[09:41:24] (PS2) Ori.livneh: [Regression] Revert "Non wikipedias to 1.25wmf15" [mediawiki-config] - https://gerrit.wikimedia.org/r/188532
[09:41:31] (CR) Ori.livneh: [C: 2] [Regression] Revert "Non wikipedias to 1.25wmf15" [mediawiki-config] - https://gerrit.wikimedia.org/r/188532 (owner: Ori.livneh)
[09:41:36] (Merged) jenkins-bot: [Regression] Revert "Non wikipedias to 1.25wmf15" [mediawiki-config] - https://gerrit.wikimedia.org/r/188532 (owner: Ori.livneh)
[09:42:27] !log ori Started scap: I78446aacb: [Regression] Revert "Non wikipedias to 1.25wmf15"
[09:42:29] !log ori scap aborted: I78446aacb: [Regression] Revert "Non wikipedias to 1.25wmf15" (duration: 00m 02s)
[09:42:32] Logged the message, Master
[09:42:35] Logged the message, Master
[09:42:41] aborted?
[09:42:48] forgot to merge
[09:42:49] !log ori Started scap: I78446aacb: [Regression] Revert "Non wikipedias to 1.25wmf15"
[09:43:17] 2nd one's ok
[09:50:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 612
[09:51:33] ori: why are you scapping for it?
[09:52:28] because i don't want to accidentally a file
[09:52:51] sync-wikiversions would have been sufficient, right?
[09:53:15] it shouldn't be long, now, anyway.
[09:55:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 659840 Threads: 2 Questions: 1939499 Slow queries: 4451 Opens: 14530 Flush tables: 2 Open tables: 64 Queries per second avg: 2.939 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:57:36] yeah, sync-wikiversions should've been fine
[10:02:57] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:03:14] hm
[10:03:17] PROBLEM - HHVM rendering on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:03:22] <_joe_> hey
[10:03:33] is that you?
[10:03:37] <_joe_> I can't find wmf-config/AdminSettings.php
[10:03:45] <_joe_> ori: no
[10:03:50] AdminSettings?
[10:03:57] <_joe_> yeah, dumps search for it
[10:04:03] <_joe_> is that wrong, I suppose
[10:04:20] ffed23d50f94981790789efefe8b6714b00f6853
[10:04:28] https://gerrit.wikimedia.org/r/#/c/145408/
[10:04:40] <_joe_> sigh sob
[10:04:42] <_joe_> ok
[10:05:04] (PS4) Nikerabbit: cxserver: Enable English to Russian MT [puppet] - https://gerrit.wikimedia.org/r/188517 (owner: KartikMistry)
[10:05:23] _joe_: I updated the usages of it in puppet a while back that Ariel merged
[10:05:35] <_joe_> Reedy: mmmh where is that merged?
[10:05:38] just finding it
[10:05:47] <_joe_> docs on wikitech is clearly wrong
[10:05:58] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.105 second response time
[10:06:06] https://github.com/wikimedia/operations-puppet/commit/4c0c2920f42f9e3b771c23d1ea8385a49a6565e9
[10:06:17] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 65949 bytes in 0.271 second response time
[10:06:41] !log restarted hung HHVM on mw1039
[10:06:47] Logged the message, Master
[10:06:54] !log start migrating graphite from tungsten to graphite1001 https://gerrit.wikimedia.org/r/#/c/188036/1 https://gerrit.wikimedia.org/r/#/c/188035/1 https://phabricator.wikimedia.org/T85909
[10:06:58] Logged the message, Master
[10:07:09] godog: :D
[10:07:18] I see no mention of AdminSettings in the puppet repo anymore
[10:07:33] ori: \o/ backfilling would be more painful than I expected tho, anyways
[10:07:59] (PS2) Filippo Giunchedi: graphite: move to graphite1001 [dns] - https://gerrit.wikimedia.org/r/188035 (https://phabricator.wikimedia.org/T85909)
[10:08:08] Reedy: operations/dumps is also a thing, and it hasn't been updated for a year or two now, I think
[10:08:14] Yeah
[10:08:16] (CR) Filippo Giunchedi: [C: 2 V: 2] graphite: move to graphite1001 [dns] - https://gerrit.wikimedia.org/r/188035 (https://phabricator.wikimedia.org/T85909) (owner: Filippo Giunchedi)
[10:08:20] According to github there's no mention of AdminSettings
[10:08:29] <_joe_> Reedy: u sure?
[10:08:31] godog: which instance were you having NFS troubles with the other day?
[10:08:44] reedy@ubuntu64-web-esxi:~/git/operations/puppet$ grep -R AdminSettings *
[10:08:45] grep: modules/admin/files/home/akosiaris/.my.cnf: Permission denied
[10:08:45] reedy@ubuntu64-web-esxi:~/git/operations/puppet$
[10:08:45] <_joe_> ok anyways, I'll fix this shit.
[10:08:49] YuviPanda: filippo-test-trusty
[10:08:55] https://github.com/wikimedia/operations-dumps/search?utf8=%E2%9C%93&q=AdminSettings
[10:09:39] <_joe_> godog: hey, gdash is quite important atm
[10:09:54] <_joe_> I guess ori and reedy where doing some revert due to perf issues
[10:10:34] yes, please don't break graphite right now
[10:10:47] reverting
[10:11:56] (PS1) Filippo Giunchedi: Revert "graphite: move to graphite1001" [dns] - https://gerrit.wikimedia.org/r/188533
[10:12:17] (CR) Filippo Giunchedi: [C: 2 V: 2] Revert "graphite: move to graphite1001" [dns] - https://gerrit.wikimedia.org/r/188533 (owner: Filippo Giunchedi)
[10:13:21] Reedy: it is a symlink and on purpose. Leave my poor .my.cnf alone
[10:13:59] godog: your NFS issue is fixed for now!
[10:14:10] now for me to file a bug and document the current mess
[10:14:20] fun fact
[10:14:22] root@labstore1001:/# mount | wc -l
[10:14:22] 16709
[10:14:24] !log ori Finished scap: I78446aacb: [Regression] Revert "Non wikipedias to 1.25wmf15" (duration: 31m 34s)
[10:14:29] Logged the message, Master
[10:14:30] I didn't even know you can have duplicate mounts
[10:14:36] YuviPanda: sweet, thanks!
[10:15:25] godog: until things get saner in labstore - if this happens again you just have to manually run manage-nfs-volumes on labstore1001
[10:16:12] YuviPanda: good to know, is that from us or third party?
[10:16:17] godog: from us.
[10:17:51] akosiaris: haha. I wasn't making a comment about that :P
[10:22:04] <_joe_> win 18
[10:22:39] * Reedy hands _joe_ some /////////
[10:23:52] he does it on purpose, he likes to narrate his gui movements
[10:24:02] he's keeping score, maybe.
[10:24:10] you win 18, you lose 1
[10:24:14] <_joe_> or, I am doing 3 things at the same time
[10:24:22] or 18 things
[10:24:23] <_joe_> no at the moment I'm losing 18
[10:24:31] <_joe_> as in years of joy
[10:24:53] <_joe_> because of a chain-wreck of permission conflicts
[10:25:11] paravoid: https://graphite.wikimedia.org/render/?title=navigationStart%20to%20loadEventEnd%20on%20desktop%20sites,%20last%20day&vtitle=milliseconds&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias(color(frontend.navtiming.totalPageLoadTime.desktop.overall.median,%22blue%22),%22Median%22)&target=alias(col
[10:25:57] i think the theory is confirmed
[10:25:59] :)
[10:26:02] yeah pretty much :)
[10:26:05] <_joe_> so, the wikitech says I should run a script as root. But then mediawiki tells me I'm naughty. It seems it ran as user datasets, but then I can't read the mw private files. So I try mwdeploy, but that can't write in the lock dir
[10:26:11] <_joe_> yeah
[10:26:51] * ori emails engineering@
[10:27:05] thanks :)
[10:28:10] (CR) Faidon Liambotis: [C: -2] "Icinga restarting failed services sounds... ugly. Isn't gitblit supposed to be retired soon?" [puppet] - https://gerrit.wikimedia.org/r/188480 (https://phabricator.wikimedia.org/T73974) (owner: Dzahn)
[10:29:07] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR
[10:31:00] (CR) Alexandros Kosiaris: [C: -1] "Some comments, and we are still waiting on the proxy support." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/186538 (owner: KartikMistry)
[10:35:43] (CR) Alexandros Kosiaris: [C: -2] "Having icinga restart any service is way too error prone and creates security holes. If we are not retiring gitblit anytime soon (which I " [puppet] - https://gerrit.wikimedia.org/r/188480 (https://phabricator.wikimedia.org/T73974) (owner: Dzahn)
[10:48:51] (PS1) Filippo Giunchedi: public entry point for restbase [dns] - https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194)
[10:51:04] paravoid: sent. thanks for your help. good night / morning.
[10:51:15] akosiaris: re 186538: Some comments, and we are still waiting on the proxy support.
[10:51:37] akosiaris: anything from Language team here ^^ ?
[10:56:22] I was planning to resume breaking graphite, good to go?
[10:56:33] in like 10-15mins
[11:00:25] (CR) Alexandros Kosiaris: [C: 2] "LGTM. There seems to be some bikeshedding in https://phabricator.wikimedia.org/T78194 about the name. I am fine with rest.wikimedia.org th" [dns] - https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194) (owner: Filippo Giunchedi)
[11:01:14] operations, Datasets-General-or-Unknown: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014085 (Joe) Dumps for bigwikis (ie all but enwiki and possibly dewiki) finally started and seem to have resumed operations. I won't guarantee they will work completely, but I'...
[11:01:50] kart_: not sure I follow..
[11:02:22] kart_: I see https://phabricator.wikimedia.org/T87587 open
[11:07:36] operations, Datasets-General-or-Unknown: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014089 (Joe) Monitor has restarted correctly as well, updated info should show up as soon as dumps start flowing again.
[11:10:17] akosiaris: gah, the last comment in https://gerrit.wikimedia.org/r/#/c/188537/ the hyperlink is mangled, do you see the same? (unrelated, but I remember we thought it was fixed)
[11:10:49] _joe_ paravoid I was planning to resume breaking graphite in 10, good to go you think?
[11:10:55] yeah [11:11:11] <_joe_> good for me [11:12:22] ack, thanks [11:13:39] godog: yes [11:14:16] it works (the second one) but it is mangled indeed. Some gerrit bug or something ? [11:15:10] akosiaris: thanks! [11:18:55] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use lru pcre cache for all appservers [puppet] - 10https://gerrit.wikimedia.org/r/188529 (owner: 10Giuseppe Lavagetto) [11:19:31] akosiaris: yeah very likely [11:22:32] (03PS1) 10Filippo Giunchedi: graphite: move to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/188539 (https://phabricator.wikimedia.org/T85909) [11:23:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: move to graphite1001 [dns] - 10https://gerrit.wikimedia.org/r/188539 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [11:23:48] !log start migrating graphite from tungsten to graphite1001 https://gerrit.wikimedia.org/r/#/c/188036/1 https://gerrit.wikimedia.org/r/#/c/188035/1 https://phabricator.wikimedia.org/T85909 [11:23:52] Logged the message, Master [11:28:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [11:28:57] godog: ^ [11:29:48] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:30:01] 3Datasets-General-or-Unknown, operations: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014113 (10Joe) Dumps for smallwikis have restarted as well. Right now the only process that is still blocked is for enwiki AFAICT. [11:30:15] YuviPanda: oh? I didn't merge anything recently, anyways ssh palladium 'sudo -u gitpuppet ssh strontium.eqiad.wmnet' does it [11:30:31] godog: you merged the last change, no? move to graphite1001 :) [11:30:53] YuviPanda: the puppet one not yet, no [11:31:04] oh? 
[11:31:05] ok [11:34:14] (03PS2) 10Filippo Giunchedi: graphite: move to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188036 (https://phabricator.wikimedia.org/T85909) [11:34:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: move to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188036 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [11:35:07] <_joe_> !log installing the new hhvm package on api, one at a time [11:35:12] Logged the message, Master [11:43:59] (03PS1) 10Yuvipanda: cache: Use new ip for deployment-mediawiki02 [puppet] - 10https://gerrit.wikimedia.org/r/188542 [11:50:59] !log bounce mwprof on tungsten to force picking up dns changes [11:51:05] Logged the message, Master [11:52:44] (03CR) 10Yuvipanda: [C: 032] cache: Use new ip for deployment-mediawiki02 [puppet] - 10https://gerrit.wikimedia.org/r/188542 (owner: 10Yuvipanda) [11:57:40] !log bounce diamond in batches in esams [11:57:44] Logged the message, Master [11:59:48] 3Datasets-General-or-Unknown, operations: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014151 (10matmarex) By the way, how do I file such reports in the future so that they get noticed even when they are not blocking WMF activities? I assume that pinging people dir... [12:00:26] 3Datasets-General-or-Unknown, operations: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014152 (10mark) >>! In T88209#1014151, @matmarex wrote: > By the way, how do I file such reports in the future so that they get noticed even when they are not blocking WMF activi... [12:00:52] !log bounce diamond in batches in ulsfo [12:00:54] Logged the message, Master [12:02:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[12:02:58] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:06:39] (03PS9) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 [12:10:21] !log bounce diamond in batches in codfw [12:10:33] (03CR) 10KartikMistry: cxserver: Add Yandex support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/186538 (owner: 10KartikMistry) [12:12:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:18:35] !log bounce udp2log on analytics1026 to pick up dns changes [12:19:14] <_joe_> godog: oh nice to see i'm not the only one playing whack-a-mole this morning [12:19:55] hehe indeed _joe_ dns changes are a messy way to failover [12:28:01] !log bounce txstatsd on ms-be* [12:28:07] Logged the message, Master [12:28:30] !log bounce txstatsd on ms-fe* [12:28:33] Logged the message, Master [12:33:47] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:38:44] (03PS2) 10Giuseppe Lavagetto: mediawiki: use lru pcre cache for all api appservers [puppet] - 10https://gerrit.wikimedia.org/r/188530 [12:38:53] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use lru pcre cache for all api appservers [puppet] - 10https://gerrit.wikimedia.org/r/188530 (owner: 10Giuseppe Lavagetto) [12:39:11] (03PS10) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 [12:39:29] akosiaris: is there an easy way to pause or stop bacula while it is running backups? it started on tungsten :( [12:39:49] it isn't a problem per se just wrong timing [12:41:15] godog: damn it's wednesday ? 
[12:41:26] so bacula jobs are not preemptable really [12:41:40] stop the bacula-fd on tunstgen to force fail the backup [12:41:44] <_joe_> !log installing the new HHVM package on jobrunners [12:41:48] Logged the message, Master [12:41:50] akosiaris: thanks, that works! [12:42:04] it is indeed wednesday [12:42:34] yeah, the relevance is Database backups are taken today [12:42:40] !log stop bacula-fd on tungsten, backups running during migration [12:42:43] Logged the message, Master [12:43:24] puppet will rerun it but no other jobs are scheduled so it should be no problem [12:45:17] sweet, yeah I disabled puppet temporarily too [12:58:44] also I realized analytics is special cased, akosiaris still there? I'd need 10.64.32.155/32 added where 10.64.0.18/32 shows up [12:59:00] I'd do it but I don't trust myself touching a fw in a "rush" [12:59:11] sigh [12:59:16] let me see [12:59:52] analytics-in4 I think, just two terms tho [13:00:06] godog: yeah, graphite and statsd [13:03:15] godog: done [13:04:52] akosiaris: \o/ thanks, appreciate it [13:33:31] damn you, you've firewalled off ssh everywhere [13:34:49] mark___: were you also trying to ssh from tin? :) [13:34:59] no, from home [13:35:27] oh? 
[13:36:34] (03PS2) 10Yuvipanda: beta: Remove mediawiki appserver role [puppet] - 10https://gerrit.wikimedia.org/r/185966 (https://phabricator.wikimedia.org/T87210) [13:39:11] (03PS3) 10Yuvipanda: beta: Remove mediawiki appserver role [puppet] - 10https://gerrit.wikimedia.org/r/185966 (https://phabricator.wikimedia.org/T87210) [13:39:27] (03CR) 10Yuvipanda: [C: 032] beta: Remove mediawiki appserver role [puppet] - 10https://gerrit.wikimedia.org/r/185966 (https://phabricator.wikimedia.org/T87210) (owner: 10Yuvipanda) [13:40:27] mark___: niah we didn't (yet) [13:40:36] (03PS11) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [13:40:44] just on lots of boxes ;p [13:45:46] mark: that would have made my heart skip several beats :) [13:46:17] akosiaris: think you'll have time today to help with the notrack stuff for labnet1001? [13:46:46] 3Beta-Cluster, operations: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1014331 (10yuvipanda) [13:46:47] 3Beta-Cluster: Remove beta specific mediawiki roles - https://phabricator.wikimedia.org/T87210#1014329 (10yuvipanda) 5Open>3Resolved Re-imaged mediawiki01 and 02, and everything seems peachy! :D [13:47:25] YuviPanda: probably. I am finishing up the evaluation of openstack vs ganeti, after that I am all yours [13:47:31] akosiaris: \o/ cool [13:48:07] akosiaris: Was my doc sufficient for your needs and/or did you want Horizon? [13:50:26] akosiaris: Also, after a long talk with the ceph people; they say "soon", and it's promising, but not ready for primetime yet. [13:50:33] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1014333 (10zeljkofilipin) Apologies for the late reply, I was traveling for the last 3 weeks. I am working on this now. [13:55:21] Coren: an Unbreak Now! 
issue with NFS https://phabricator.wikimedia.org/T88527 [13:59:18] YuviPanda: Looking at it now [13:59:52] (03PS1) 10Mark Bergsma: Disable LWP SSL hostname verification [puppet] - 10https://gerrit.wikimedia.org/r/188553 [14:00:04] (03PS1) 10Hoo man: Re-enable wgCentralAuthAutoMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188554 [14:10:56] !log bounce navtiming on hafnium to pick up dns changes [14:11:02] Logged the message, Master [14:14:10] !log bounce webperf-related services on hafnium too: ve, statsd-mw-js-deprecate, statsv, asset-check [14:14:13] Logged the message, Master [14:22:46] (03PS3) 10Steinsplitter: Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867) [14:23:12] Anybody doing anything in here in the past ten minutes or so? [14:24:20] marktraceur: I think you'd have to be more specific, what's wrong? [14:24:35] UploadWizard was failing for at least one person [14:25:18] Is failing* but only in non-debug mode [14:26:36] no clue, sorry [14:27:44] KK [14:27:46] Thanks :) [14:29:53] (03PS1) 10Steinsplitter: To allow dia-files (for flowcharts) we need to whitelist x-gzip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188557 (https://phabricator.wikimedia.org/T88242) [14:33:19] <_joe_> marktraceur: bad cache I'd say [14:33:29] <_joe_> marktraceur: we don't usually do anything, no [14:33:47] <_joe_> we just pretend to be busy, and get away with it by being grumpy [14:33:53] Heh [14:34:01] I'd rather be sure [14:42:17] Anyone mind if I stumble through the UW repo and touch a bunch of files? [14:43:45] <_joe_> marktraceur: what are you trying to achieve? [14:44:14] Hopefully we can fix the cache... [14:44:24] I have one file in mind, but after that I'll try more. 
[14:44:34] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Niah, https://phabricator.wikimedia.org/T88507 is the reason, it should be fixed there" [puppet] - 10https://gerrit.wikimedia.org/r/188553 (owner: 10Mark Bergsma) [14:44:40] :P [14:45:53] that's fine [14:45:57] but you're gonna do that ;) [14:46:08] 3operations: The certificate chains of newly installed SHA2 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1014582 (10akosiaris) This broke the RT mail gateway as evidenced in https://gerrit.wikimedia.org/r/#/c/188553/ [14:46:36] 3operations: The certificate chains of newly installed SHA2 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1014583 (10akosiaris) p:5Triage>3Unbreak! [14:46:46] _joe_: Thoughts? [14:46:58] I always wanted to do a UBN in phabricator [14:47:11] well, that is not true [14:47:18] <_joe_> marktraceur: we may just purge the cache? [14:47:19] I wanted to do since this morning :-) [14:47:31] _joe_: The whole cache? [14:47:42] <_joe_> no, that particular url [14:47:44] <_joe_> :) [14:48:02] <_joe_> if it's a ton of those, touching the file is obviously fine as well [14:48:23] Yeah, I think touch the file is the best bet [14:48:33] Because I'm getting the same error about the same file from three people [14:48:37] So [14:48:39] <_joe_> not sure if it's enough to purge the varnish cache [14:48:49] <_joe_> which file? [14:48:49] Are RL loads cached there? [14:49:04] <_joe_> RL loads? what is that? 
[14:49:08] extensions/UploadWizard/resources/controller/uw.controller.Upload.js but I already touched it [14:49:17] (03CR) 10Raimond Spekking: [C: 031] "Looks reasonable, but not tested by me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188557 (https://phabricator.wikimedia.org/T88242) (owner: 10Steinsplitter) [14:49:17] _joe_: ResourceLoader load.php requests [14:49:26] marktraceur: You also need to sync it :P [14:49:29] I'm going to!@ [14:49:38] <_joe_> hoo: I was about to tell him [14:49:38] Just telling _joe_ he needn't worry about it [14:50:19] !log marktraceur Synchronized php-1.25wmf15/extensions/UploadWizard/resources/controller/uw.controller.Upload.js: Touch an UploadWizard file to try and fix caching (duration: 00m 05s) [14:50:25] Logged the message, Master [14:53:51] Well, it didn't work. [14:53:53] Thanks though [14:56:03] 3operations, Datasets-General-or-Unknown: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014603 (10MZMcBride) I don't think {T85970} is directly related to this task, but while people are poking at dumps, I'll just mention it. [14:58:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150204T1500). Please do the needful. [15:00:27] marktraceur! do you know anything about our job queue? [15:01:50] Only a little! [15:01:54] YuviPanda: Why? [15:02:09] marktraceur: beta labs, I just re-imaged the job queue [15:02:12] and am wondering if it is working [15:02:15] Ah [15:02:16] Still? [15:02:22] hoo: yes. 
[15:02:23] There are maintenance scripts, I think, that you can use [15:02:32] uvipanda@deployment-bastion:/srv/deployment/jobrunner/jobrunner$ mwscript showJobs.php --wiki enwiki --group [15:02:32] EchoNotificationJob: 0 queued; 30 claimed (0 active, 30 abandoned); 0 delayed [15:02:32] webVideoTranscode: 2 queued; 0 claimed (0 active, 0 abandoned); 0 delayed [15:02:36] To create some jobs and see some jobs and delete some jobs [15:02:36] <_joe_> YuviPanda: I can take a look [15:02:37] it's been the same for a while now [15:02:46] <_joe_> if you fix dumps in my place [15:02:48] Videos take a long time [15:02:51] _joe_: :P [15:02:57] marktraceur: no, the first one. [15:02:58] just create a bunch of jobs and see how they do [15:03:03] marktraceur: we don't actually have any video transcode nodes [15:03:08] hoo: oh, how do I do that? [15:03:12] <_joe_> YuviPanda: deal? [15:03:14] _joe_: no :P [15:03:25] Ohh. [15:03:42] YuviPanda: I dunno, you could upload some chunked images or something [15:03:47] hmm [15:03:49] Then showJobs.php [15:03:50] no linksupdate? [15:03:56] Ooh, that's a good one [15:04:01] Why not just linksupdate? [15:04:04] recursive [15:04:07] I don't see linksupdate [15:04:09] YuviPanda: I believe jobs running is manually disabled on Wikimedia wikis. [15:04:10] in the joblist [15:04:10] at [15:04:20] There's a cronjob somewhere that runs a maintenance script to process jobs. [15:04:24] Yeah [15:04:29] So if you're emulating Wikimedia's config, you'll need to set that up. 
[15:04:38] The default config fires a job for every web request [15:05:02] err [15:05:07] so this is exactly like wikimedia config [15:05:11] because this *is* wikimedia config [15:05:18] (I'm using the same puppet roles, and this is beta labs) [15:05:21] Then run runJobs or whatever it is [15:05:50] yes, I just set up a new machine that's running the redis based jobrunner, and I can see that it is running (the process is) [15:05:58] I'm wondering how to verify that it's actually processing jobs [15:06:10] Just fire a couple and see how they do [15:06:11] and also why there's no linksUpdate, etc in showJobs [15:06:11] Sit there and watch it? :) [15:06:13] :P [15:06:22] hoo: how do I fire one? :) [15:06:26] YuviPanda: Which wiki have you been looking at? [15:06:27] Maybe your wiki is boring, and has no links [15:06:30] hoo: enwiki [15:06:42] Heh. [15:06:59] enwiki betalabs [15:07:02] none of this is in prod [15:07:30] http://en.wikipedia.beta.wmflabs.org/ wtf [15:07:44] lol [15:07:45] http://en.wikipedia.beta.wmflabs.org/wikik/Main_Page works [15:07:49] YuviPanda: maintenance/refreshLinks.php maybe? [15:08:01] Might be...not a job [15:08:28] yeah, I don't think that is. [15:09:00] YuviPanda: How's it doing? [15:09:55] better! [15:09:57] refreshLinks: 177 queued; 5 claimed (5 active, 0 abandoned); 0 delayed [15:09:58] cirrusSearchIncomingLinkCount: 0 queued; 0 claimed (0 active, 0 abandoned); 9 delayed [15:09:58] cirrusSearchLinksUpdatePrioritized: 22 queued; 0 claimed (0 active, 0 abandoned); 0 delayed [15:09:59] EchoNotificationJob: 0 queued; 30 claimed (0 active, 30 abandoned); 0 delayed [15:10:01] webVideoTranscode: 2 queued; 0 claimed (0 active, 0 abandoned); 0 delayed [15:10:06] and now [15:10:07] refreshLinks: 161 queued; 4 claimed (4 active, 0 abandoned); 0 delayed [15:10:08] yeah, I triggered these [15:10:14] so it is running! 
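(Note on the check above: comparing two `showJobs.php --group` snapshots is what showed the runner was alive, since `refreshLinks` went from 177 to 161 queued. A minimal sketch of that comparison, assuming the output keeps the exact line shape pasted in this log; the function names `parse_show_jobs` and `draining` are hypothetical helpers, not part of MediaWiki.)

```python
import re

# Matches lines such as:
#   refreshLinks: 177 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
# The format is taken from the output pasted in the log; it is an assumption
# that other versions of the script print the same shape.
JOB_LINE = re.compile(
    r"^(?P<type>\w+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse_show_jobs(text):
    """Return {job_type: {"queued": int, "claimed": int, ...}} for matching lines."""
    stats = {}
    for line in text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            fields = match.groupdict()
            job_type = fields.pop("type")
            stats[job_type] = {name: int(value) for name, value in fields.items()}
    return stats


def draining(before, after):
    """Job types present in both snapshots whose queued count went down."""
    return sorted(
        job_type
        for job_type in before
        if job_type in after
        and after[job_type]["queued"] < before[job_type]["queued"]
    )


if __name__ == "__main__":
    first = (
        "refreshLinks: 177 queued; 5 claimed (5 active, 0 abandoned); 0 delayed\n"
        "EchoNotificationJob: 0 queued; 30 claimed (0 active, 30 abandoned); 0 delayed\n"
    )
    second = (
        "refreshLinks: 161 queued; 4 claimed (4 active, 0 abandoned); 0 delayed\n"
        "EchoNotificationJob: 0 queued; 30 claimed (0 active, 30 abandoned); 0 delayed\n"
    )
    print(draining(parse_show_jobs(first), parse_show_jobs(second)))
```

(A shrinking queued count only suggests the runner is consuming jobs; the 30 abandoned `EchoNotificationJob` entries do not change between snapshots and would need a separate look.)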
[15:10:57] :) [15:11:25] 3Beta-Cluster, operations: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1014715 (10yuvipanda) [15:13:31] _joe_: How would one clear a varnish cache, assuming there is one? [15:13:41] Because there's still no change in what's happening for the people in -commonss. [15:14:24] 3operations: apache-fast-test non existent pybal config handling - https://phabricator.wikimedia.org/T58013#1014742 (10Jgreen) 5Open>3Resolved I can't reproduce this issue with the current version. It's not pretty but all bad argument error cases I tested are handled with reasonably informative die() message... [15:16:24] !log bounce diamond in batches in eqiad [15:16:30] Logged the message, Master [15:17:07] <_joe_> marktraceur: maybe someone else can help you, I'm trying to fix the enwiki dumps atm [15:17:12] KK [15:17:19] LFM to raid the Varnish cache. [15:17:38] <_joe_> marktraceur: it's matter of varnishadm ban some urls [15:17:43] <_joe_> but it's a complex mechanism [15:17:44] OK [15:18:11] <_joe_> and not painless [15:18:32] <_joe_> marktraceur: can you verify that the file you recieve with debug=true and without are different? [15:19:05] I mean, yes, because the latter is minified and combined into one request [15:19:20] It's hard, if not impossible, to compare the two [15:19:56] But neither of them, for me, have maybeStartProgressBar [15:19:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:20:32] <_joe_> marktraceur: ori had to revert -wmf15 this morning [15:20:39] <_joe_> for a strong perf regression [15:20:50] <_joe_> there is one unbreaknow! bug in phab about that [15:21:01] Oh [15:21:03] wait [15:21:05] he did? [15:21:08] Link! 
[15:21:10] So wmf15 isn't even the current...damn it [15:21:16] 3Datasets-General-or-Unknown, operations: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014780 (10Joe) After some sweat and cursing (and horrible local hacks, and invocations of the Old Ones) it seems dumps have restarted on enwiki as well. I'm monitoring it and may... [15:21:16] So I need to mess with wmf14 instead [15:21:26] That explains why touching the file didn't help [15:21:37] Also, is there a !log about that that I missed [15:21:59] m( [15:22:06] !log graphite move close to completion, updating dashboards [15:22:10] Logged the message, Master [15:22:13] Oh, no [15:22:14] <_joe_> godog: \o/ [15:22:18] He reverted *to* wmf15 [15:22:23] No [15:22:25] Apparently non-wikipedias were atwmf16 [15:22:26] to wmf14 [15:22:29] no [15:22:31] <_joe_> mmmh no idea [15:22:36] 10:14 logmsgbot: ori Finished scap: I78446aacb: [Regression] Revert "Non wikipedias to 1.25wmf15" (duration: 31m 34s) [15:22:45] Oh. [15:22:45] 3operations: The certificate chains of newly installed SHA256 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1014786 (10akosiaris) [15:22:45] typo [15:22:47] I can read [15:22:57] https://meta.wikimedia.org/wiki/Special:Version [15:22:59] It reverts the config patch to push them to wmf15 [15:23:03] OK, so back to tin with me [15:23:56] Anyone mind if I sync real quick here? [15:24:14] <_joe_> marktraceur: /win 34 [15:24:20] <_joe_> oh my [15:24:21] Good point. 
[15:24:27] <_joe_> this is most definitely not my day [15:24:52] (03PS1) 10Alexandros Kosiaris: Provision the RapidSSL_SHA256_CA_-_G3 CA [puppet] - 10https://gerrit.wikimedia.org/r/188562 (https://phabricator.wikimedia.org/T88507) [15:25:28] 3operations, Datasets-General-or-Unknown: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014826 (10akosiaris) Document, puppetize etc pretty pretty please :-) [15:25:46] K, syncing [15:25:56] !log marktraceur Synchronized php-1.25wmf14/extensions/UploadWizard/resources/controller/uw.controller.Upload.js: Touch an UploadWizard file to try and fix caching (duration: 00m 05s) [15:25:57] I'm off to a slow start and just getting coffee, but is there anything I need to get involved in quickly re: bits cache and wmf14/15 and whatever marktraceur is recently saying wasn't minified? [15:25:59] Logged the message, Master [15:26:16] bblack: Maybe not, one sec [15:26:46] (03PS1) 10Filippo Giunchedi: gdash: move from tungsten to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188563 (https://phabricator.wikimedia.org/T85909) [15:27:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: move from tungsten to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/188563 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [15:27:38] bblack: OK, yes apparently [15:28:02] bblack: The uw.controller.Upload module appears to be cached somewhere. I touched it and synced it, but apparently nothing happened [15:28:18] bblack: Wondering if there's something up with Varnish that could be causing this [15:28:48] well its job is to cache things. is it not supposed to be cached due to no-cache headers? or did we serve a bad version earlier and we want to take it back? 
[15:30:00] bblack: I think we served a bad version, because it has one version of one script and another version of the script I mentioned [15:30:10] Causing an incompatibility that crashes UploadWizard [15:30:13] (03CR) 10Alexandros Kosiaris: [C: 032] Provision the RapidSSL_SHA256_CA_-_G3 CA [puppet] - 10https://gerrit.wikimedia.org/r/188562 (https://phabricator.wikimedia.org/T88507) (owner: 10Alexandros Kosiaris) [15:31:00] (03PS2) 10Jgreen: apache-fast-test strip transaction metadata before comparing response size [puppet] - 10https://gerrit.wikimedia.org/r/188475 [15:31:39] so, I need to clear it based on a URL pattern of some kind... [15:31:53] ideas what that would look like? [15:32:34] marktraceur: ^ [15:32:57] (03CR) 10Jgreen: [C: 032 V: 031] apache-fast-test strip transaction metadata before comparing response size [puppet] - 10https://gerrit.wikimedia.org/r/188475 (owner: 10Jgreen) [15:33:05] bblack: Well, there would definitely be the string "uw.controller.Upload" in the query string. Also the strings "load.php" and "modules=" [15:33:20] have an example? [15:33:24] Uhh. [15:33:25] what is uploadwizard doing? [15:33:32] Can get one. [15:33:38] ori: Crashing for some people but not all. [15:34:49] it's not varnish [15:36:49] OK, well [15:36:54] I'll stop looking for a URL then [15:37:12] ori: Do you have an idea of what it might be? [15:37:12] 3operations, Datasets-General-or-Unknown: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1014917 (10Joe) 5Open>3Resolved [15:37:34] Because this has all the earmarks of an RL cache hiccup [15:38:25] the startup module for anons (which contains the manifest with module versions) is cached for 5 minute for anons [15:38:27] (03PS1) 10Alexandros Kosiaris: Actually provision the rapidssl_sha256_ca_G3 [puppet] - 10https://gerrit.wikimedia.org/r/188565 (https://phabricator.wikimedia.org/T88507) [15:38:33] did you wait 5 minutes after syncing? 
[15:38:42] Yeah, it's been a while [15:38:46] At least 12 minutes [15:38:50] what indication do you have that people are being served the wrong version? [15:39:17] ori: They get the version of mw.UploadWizard.js that calls maybeStartProgressBar but not the version of uw.controller.Upload.js that *has* it. [15:39:24] At least as far as I can determine [15:39:40] Also I get the old versions of both files, so I have no problems, but I don't get the latest patches for whatever reason. [15:39:42] I don't think uw.controller.Upload is the right pattern to look for in the URL, I'm not getting anything for that [15:39:50] (03CR) 10Alexandros Kosiaris: [C: 032] Actually provision the rapidssl_sha256_ca_G3 [puppet] - 10https://gerrit.wikimedia.org/r/188565 (https://phabricator.wikimedia.org/T88507) (owner: 10Alexandros Kosiaris) [15:39:54] bblack: I agree, but ori says it's not varnish [15:39:58] are you sure this isn't just a race condition caused by undeclared dependency between modules [15:40:03] Hm. [15:40:14] I was pretty sure it was a declared dependency, but I'll check. [15:40:32] Well, actually [15:40:33] causing the code that defines maybeStartProgressBar to sometime run before the code that calls it, and sometime after? [15:41:06] ori: maybeStartProgressBar is a method on an object that gets created earlier on in mw.UploadWizard.js, so if the controller file hadn't loaded we'd see errors before then [15:41:45] what do i need to do to reproduce? [15:41:54] ori: Go to UW on Commons and choose a file [15:41:57] also, you are aware that i rolled commons back from wmf15 to wmf14? [15:42:01] ori: https://commons.wikimedia.org/wiki/Special:UploadWizard?uselang=en [15:42:08] ori: I am, pretty sure it doesn't change anything [15:43:02] Uncaught TypeError: wizard.steps.file.maybeStartProgressBar is not a function [15:43:17] Oh, hm. I might have touched the wrong thing [15:43:18] Sec. 
[15:43:46] !log marktraceur Synchronized php-1.25wmf14/extensions/UploadWizard/resources/controller/uw.controller.Upload.js: Touch an UploadWizard file to try and fix caching (duration: 00m 07s) [15:44:43] ori: Want to refresh and try it again? That may have been good enough. [15:45:58] 3operations: enable HSTS for various fundraising servers - https://phabricator.wikimedia.org/T88570#1014967 (10Jgreen) 3NEW a:3Jgreen [15:46:26] 3operations: enable HSTS for various fundraising servers - https://phabricator.wikimedia.org/T88570#1014975 (10Jgreen) 5Open>3Resolved header set to 180 days [15:47:46] ori: No, rillke says it's still notworking [15:48:20] 3operations: The certificate chains of newly installed SHA256 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1014986 (10akosiaris) [15:49:49] (03CR) 10Glaisher: Standardize the name of interface editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186593 (https://phabricator.wikimedia.org/T85731) (owner: 10Glaisher) [15:50:14] so, I was under the general impression that we don't send purges to bits because the objects there are effectively versioned in their URLs, and we'd have to invalidate the text pages or whatever that ref those URLs to effect a quick change [15:50:55] manybubbles, marktraceur, ^d: Who wants to SWAT this morning? [15:51:14] anomie: I should totally do it but I have a meeting. I *promise* to do it tomorrow [15:51:15] so perhaps the issue isn't with whatever's being touched on bits (js), but with whatever's referencing that js [15:51:16] I think there's only one patch...I dunno, I'm on tin anyway, I can do it [15:51:27] marktraceur: ok! 
[15:51:48] <^d> jouncebot: next [15:51:48] In 0 hour(s) and 8 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150204T1600) [15:51:50] manybubbles, anomie: i don't think we should deploy [15:51:55] 3operations: enable HSTS for various fundraising servers - https://phabricator.wikimedia.org/T88570#1014990 (10Jgreen) also done for dash and reports [15:51:57] until we figure out the regression in wmf15 [15:51:57] marktraceur: ^^^^ [15:52:00] bblack: Maybe, but I double-checked the dependencies, they're sound, so...well, it could be a bug in ResourceLoader, but then we'd see issues everywhere [15:52:05] ori: I'm sort of in that boat, yeah [15:52:17] ori: It's a config change, not a deploy of a new MW version. [15:52:34] can we reproduce this at all? or do we have a header dump from someone with the prob? [15:52:44] bblack: I can't reproduce it, ori can [15:54:03] where does it go wrong? how far into uploadwizard should I be when something breaks? [15:54:19] bblack: Once you choose a file to upload, you should get the error [15:54:45] And no buttons will appear for the next step, so you won't get any farther [15:54:49] (03PS1) 10Filippo Giunchedi: webperf: handle missing 'duration' in schema [puppet] - 10https://gerrit.wikimedia.org/r/188567 (https://phabricator.wikimedia.org/T85909) [15:56:02] hmmm, works for me. and I just uploaded to commons on monday, so you'd think if that were due to some local cache in the browser I might still have it. [15:56:08] Yeah [15:57:18] bblack: Funny thing was, rillke said it worked fine for him with uselang=en-gb but not with uselang=en [15:57:34] But both worked for me [15:57:52] <_joe_> marktraceur: rillke and you are in the same geographic zone? [15:58:04] I think he's in Germany [15:58:17] Hm, are they all in Europe [15:59:21] OK, russavia's in Australia. 
[16:00:04] manybubbles, anomie, ^d, marktraceur, hoo: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150204T1600). Please do the needful. [16:00:12] <_joe_> marktraceur: I see a lot (well, relatively) of 503s for https://commons.wikimedia.org/wiki/Special:UploadWizard?debug=true [16:00:38] ...really [16:00:50] (03CR) 10Ori.livneh: [C: 032] webperf: handle missing 'duration' in schema [puppet] - 10https://gerrit.wikimedia.org/r/188567 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [16:00:59] should there be a lot of debug=true in general? [16:01:01] hoo: Did you see the discussion above about not deploying? [16:01:02] what's a lot? [16:01:11] <_joe_> bblack: no [16:01:12] bblack: No, but we've all been trying to find out what's wrong [16:01:28] <_joe_> it's a lot of people bypassing the cache because they want to help :P [16:01:37] Anything above like...ten different IPs requesting it would be weird, maybe [16:01:37] we once did have an issue like this where a mobile app was deployed with debug URLs in it for users right? [16:01:46] <_joe_> bblack: yeah [16:01:58] <_joe_> it may be the same [16:02:13] <_joe_> requests queueing on the servers [16:02:21] Can't think of where there would be a link to that. [16:02:27] <_joe_> the varnishes I mean [16:02:42] is it like 10 IPs, or like really lots? 
[16:03:00] <_joe_> like 10 [16:03:06] <_joe_> but doing a lot of requests :) [16:03:08] marktraceur: no [16:03:26] manybubbles: almost ready with ES bounce, had some fallout from upgrading graphite's hw [16:03:27] <_joe_> which probably time out from time to time [16:03:43] hoo: We're not going to go because of the wmf15 regression [16:04:00] that's a configuration change [16:04:01] meh [16:04:55] df67 says he's in Canada, so people with problems are all over the globe [16:09:07] (03PS1) 10Ori.livneh: navtiming: if sslNegotiation is present, log it but no other metrics [puppet] - 10https://gerrit.wikimedia.org/r/188569 [16:10:58] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1015052 (10akosiaris) [16:11:25] (03Abandoned) 10Alexandros Kosiaris: Disable LWP SSL hostname verification [puppet] - 10https://gerrit.wikimedia.org/r/188553 (owner: 10Mark Bergsma) [16:11:29] (03CR) 10Ori.livneh: [C: 032] navtiming: if sslNegotiation is present, log it but no other metrics [puppet] - 10https://gerrit.wikimedia.org/r/188569 (owner: 10Ori.livneh) [16:14:48] Oh. [16:15:03] bblack, _joe_, rillke linked to the debug=true version at the upload help page. [16:15:13] Because it's the only thing working for them. [16:15:42] Which, again, to me, screams "RL cache" [16:18:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [16:18:35] what's RL? [16:19:01] and I can easily wipe the bad URLs from the bits caches if it helps, but I need a working URL pattern to use [16:19:57] I'm not sure I've seen what the problem is in those specific terms. the bits URLs for JS have versions embedded in them, so are we saying that when we request one version we get another? or that the page that links these js objects contains URLs with the wrong versions in them? 
[16:20:18] (in which case it's not a bits-cache problem, it's probably a text-cache problem) [16:22:53] bblack: Maybe he's requesting the wrong version, not sure [16:25:28] This is probably a ResourceLoader bug...ping Krinkle|detached and RoanKattouw_away [16:29:03] (03PS1) 10Filippo Giunchedi: gdash: deprecate 75percentile and median [puppet] - 10https://gerrit.wikimedia.org/r/188573 [16:29:58] ori: who's a good candidate to CC to ^ besides you? [16:30:16] 3ops-eqiad, operations: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1015111 (10Cmjohnson) idrac fails to initialize. Past experience this has required a new system board. Ordered and awaiting arrival. WO5823532 Approved 399YBX1 WIKIMEDIA FOUNDATION, INC iDrac fails to initialize 2/3/2015... [16:31:30] 3ops-eqiad, operations: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1015112 (10Cmjohnson) to add to RobH's comment above. First steps were to drain flea power and remove power supplies. This failed hence the new main board. [16:32:34] 3operations: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1015114 (10Jgreen) 3NEW [16:32:46] 3operations, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1015121 (10Jgreen) [16:34:43] (03PS1) 10BBlack: Revert bits cache sizes to 2G everywhere [puppet] - 10https://gerrit.wikimedia.org/r/188574 [16:34:48] 3operations, hardware-requests, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1015123 (10Jgreen) [16:35:26] (03CR) 10BBlack: [C: 04-1] "Just staging this up in case we want to try it at some point while debugging current issues. 
Would require a slow process of cache restar" [puppet] - 10https://gerrit.wikimedia.org/r/188574 (owner: 10BBlack) [16:44:41] manybubbles: I'll go with elastic1001 in 10min [16:46:19] godog: cool [16:50:02] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1015150 (10Jgreen) also: *.frdev.wikimedia.org (wildcard cert) civicrm.wikimedia.org frdata.wikimedia.org fundraising.wikimedia.org payments-listener.wikimedia.org [16:52:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Use dh-exec to properly rename ini file for fastcgi [debs/hhvm] - 10https://gerrit.wikimedia.org/r/188332 (owner: 10Giuseppe Lavagetto) [16:52:43] (03PS2) 10Giuseppe Lavagetto: mediawiki: use lru pcre cache for all mediawiki hhvm installations [puppet] - 10https://gerrit.wikimedia.org/r/188531 [16:53:37] 3operations: The certificate chains of newly installed SHA256 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1015151 (10RobH) URGH, so even the replacement intermediary isnt really SHA256. I apologize, I should have caught this issue at the time of replacement! I'll go through and poin... [16:54:45] (03CR) 10BBlack: [C: 032] "Going forward with this just to get a question-mark out of the way in guessing about current js issues. Restarts will take a little bit o" [puppet] - 10https://gerrit.wikimedia.org/r/188574 (owner: 10BBlack) [16:54:51] (03PS1) 10RobH: setting dumps.w.o new intermediary cert [puppet] - 10https://gerrit.wikimedia.org/r/188579 [16:56:46] !log restart ES on elastic1001 [16:56:51] Logged the message, Master [16:57:22] (03CR) 10RobH: [C: 032] setting dumps.w.o new intermediary cert [puppet] - 10https://gerrit.wikimedia.org/r/188579 (owner: 10RobH) [16:57:35] bd808 anomie ^ (shall I take https://phabricator.wikimedia.org/T88354 too?) 
[16:58:30] !log replacing the intermediary cert on dumps.w.o (so nginx will flap on it shortly) [16:58:32] Logged the message, Master [17:05:22] 3operations: The certificate chains of newly installed SHA256 certificates are incomplete. - https://phabricator.wikimedia.org/T88507#1015172 (10RobH) it seems etherpad was fixed when rt was fixed (same system, and now pulling the cert with openssl shows proper intermediary.) [17:06:26] 3Analytics-Engineering, operations: Puppet Production role class for wikimetrics scheduler/queue - https://phabricator.wikimedia.org/T76791#1015174 (10kevinator) [17:10:45] so the broken uploadwizard issue is still stalled and unresolved? [17:11:38] Yeah, I'm waiting for someone who knows about resourceloader to poke their head up [17:12:46] I wish we had a way to reproduce this, or even some deeper details on the URLs involved and what's serving the wrong what to whom [17:15:51] bblack: Well, the last URL I got was https://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=20150204T101420Z [17:16:14] The version matches mine, but returns a different file, as far as I can tell [17:20:45] <^d> marktraceur: Stale files on some set of mw* hosts? [17:20:51] * ^d is guessing entirely [17:21:05] Maybe. [17:21:09] But I synced it... [17:21:28] Well, only that file. Maybe if I tried mw.UploadWizard.js too? [17:21:55] <^d> Can't hurt [17:22:31] !log marktraceur Synchronized php-1.25wmf14/extensions/UploadWizard/resources/mw.UploadWizard.js: Touch an UploadWizard file to try and fix caching (duration: 00m 07s) [17:22:37] Logged the message, Master [17:23:01] are we saying it's possible that people are getting two different sets of content for that URL, even though the URL is exactly the same including &version=20150204T101420Z ? [17:23:23] or that people are getting two different pages, which contain links to differently-versioned js URLs? 
[17:23:33] The version looks the same to me [17:23:44] But then, maybe it was a different URL that was causing trouble [17:25:04] I've restarted the two newly-very-large bits caches back to their old 2G size in eqiad btw [17:25:23] so we've effectively dumped half of our bits caching in eqiad now, over the past short while [17:25:30] bblack: found by ori: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Bits+caches+eqiad&h=cp1070.eqiad.wmnet&jr=&js=&v=255033&m=varnish.n_object&vl=N&ti=N+struct+object [17:25:45] also: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=&vl=&x=&n=&hreg%5B%5D=cp10%2869|70%29&mreg%5B%5D=cpu_system&gtype=line&glegend=show&aggregate=1&embed=1&_=1423039489644 [17:26:11] 3operations: Contigo Specials - https://phabricator.wikimedia.org/T88578#1015233 (10emailbot) [17:26:28] paravoid: the stats differential is because cp1069 has a 2G cache and cp1070 has a 47G cache [17:26:42] or did, but I've restarted it now as 2G just in case that's indirectly causing problems [17:27:04] yeah I remember the nice table [17:27:06] (03PS1) 10Aude: Set useLegacyChangesSubscription to true for Wikidata etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188584 [17:27:33] the same was true of cp1056 as well (also eqiad bits, but still precise), which is also now restarted back to 2G [17:27:46] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Bits+caches+eqiad&h=cp1056.eqiad.wmnet&jr=&js=&v=255033&m=varnish.n_object&vl=N&ti=N+struct+object [17:27:53] ^ it had been on the new size for more days [17:29:28] OK, touching all the JS files as a hail mary [17:30:33] find -name *.js -exec touch {} \ ; # muahahaha [17:30:44] like that?
[17:30:49] !log marktraceur Synchronized php-1.25wmf14/extensions/UploadWizard/: Touching pretty much everything in UploadWizard, maybe it will help (duration: 00m 07s) [17:30:49] Something like that [17:30:54] Logged the message, Master [17:31:59] 3hardware-requests, operations, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1015264 (10mark) Why long-term and not short-term? Virtualization is a project for this quarter. [17:33:01] 3operations: Contigo Specials - https://phabricator.wikimedia.org/T88578#1015265 (10chasemp) 5Open>3Invalid a:3chasemp [17:33:02] ok, as of now, all the bits varnishes that had >2G sizes running have all been restarted back to the 2G size [17:33:11] :) [17:33:21] i've deliberately kept it low in the past even if some servers had more mem [17:33:29] bits is intentionally meant to be a small cache [17:33:33] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1015268 (10RobH) So I was installing these, and Alex noticed that the intermediary certificate changed (stupid of me, I should have noticed.) The systems that were already replaced have been fixed, however, the NEW int... [17:33:35] and if it no longer is due to software changes, we should know about it [17:33:42] because we don't invalidate it properly? [17:33:54] invalidate what? [17:33:59] the bits cache objects? [17:34:02] bits doesn't have any invalidation by design [17:34:10] right, but it has versioned URLs [17:34:11] that's one of the points [17:34:12] yes [17:34:33] I guess my question is, are we relying on quick evictions from the tiny cache size to avoid deployment issues? [17:34:39] no [17:34:43] definitely not [17:35:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:35:44] when I upsized the others, I upsized those as well, but it had only taken effect on 2/4 eqiad and 2/4 esams (and 0/4 ulsfo) from restarts so far. 
I patched it back to 2G configuration and restarted those over the past hour or so, just in case it's related. [17:38:08] manybubbles: still going :| https://ganglia.wikimedia.org/latest/?c=Elasticsearch%20cluster%20eqiad&h=elastic1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [17:38:48] mark: in any case, what's the rationale for keeping it tiny if versioning should work fine? [17:40:40] what's with all the "no valid datapoints" alerts? graphite? [17:40:52] bblack: keeping it fast, basically [17:41:01] it used to be a tiny dataset [17:41:08] so if that changes for some reason we want to know about it [17:41:14] might be something wrong [17:41:22] it used to have like 99.9% cache hit rate too [17:41:28] not sure what it is now, probably not that anymore :P [17:42:15] (03PS1) 10Aaron Schulz: Revert "Revert "Use ProfilerSectionOnly to handle DB/filebackend entries and the like"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188586 [17:42:31] yeah I'm sure it's not, because when I massively upsized the caches, they continuously filled up with new objects (but didn't run long enough yet to actually finish filling the upsized cache) [17:42:48] (03CR) 10Aaron Schulz: [C: 04-2] "Needs wmf15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188586 (owner: 10Aaron Schulz) [17:42:53] I would've expected them to self-limit on objects shortly since there shouldn't be that many unique objects for bits [17:45:19] paravoid: likely, looking [17:45:32] 3wikidata-query-service, Wikidata, operations: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1015304 (10JanZerebecki) [17:45:32] so in any case you can increase the size at least a little bit [17:45:46] but it might make sense to not give it unlimited mem [17:46:04] well, for the moment I'm gonna leave them reverted at 2G just to eliminate one of many variables in whatever's going on with upload bits js [17:46:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 
7.14% of data above the critical threshold [500.0] [17:47:14] are those 5xx spikes you? [17:47:40] I don't think so, it's been a while since I last touched anything [17:48:13] well a while being 15m [17:48:29] hah [17:48:38] some of the earlier ones could be, but not the one at the leading edge [17:48:51] so we lost 5xx stats for the past ~48h it looks like? [17:49:05] well up until a few hours ago [17:49:11] yeah that's the graphite migration I guess [17:51:45] it is, I'll backfill the most important ones tomorrow [17:57:31] 3Wikimedia-General-or-Unknown, operations: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#1015370 (10Jgreen) DMARC/DKIM/SPF and other specific technologies aside, the long-standing trend is for organizations to take responsibility for how their domai... [18:05:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:08:26] !log bounce txstatsd on cache hosts in esams [18:08:32] Logged the message, Master [18:10:23] (03PS1) 10BBlack: re-enable compact_memory cron for jessie caches [puppet] - 10https://gerrit.wikimedia.org/r/188587 [18:11:26] (03CR) 10BBlack: [C: 032] re-enable compact_memory cron for jessie caches [puppet] - 10https://gerrit.wikimedia.org/r/188587 (owner: 10BBlack) [18:11:27] A bunch of folks are in #mediawiki-core trying to work on https://phabricator.wikimedia.org/T88528 (wmf15 perf regression). Help is welcome [18:11:44] (please) [18:11:49] 3RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1015398 (10Jdouglas) * My LDAP username is `jdouglas` * Here's my ssh key: ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDkf8FmZYJ48LRvKWNv93t582NN6ikgl586fsXnCptxignPYi+E8yRN8GPWVCCY+6qUBYjVlWrYzYTRhOD39... 
[18:12:13] 3RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1015399 (10Jdouglas) a:5Jdouglas>3Andrew [18:14:25] 3RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1015405 (10RobH) @Jdouglas, If this is your first time getting access to the cluster via ssh, you'll need to review the following: https://wikitech.wikimedia.org/wiki/Requesting_shell_access This in... [18:17:55] !log bounce txstatsd on cache hosts in ulsfo [18:18:02] Logged the message, Master [18:21:48] 3RESTBase, Services, Ops-Access-Requests: Shell access for @mobrovac - https://phabricator.wikimedia.org/T88465#1015423 (10Andrew) Sorry -- more steps that I didn't know about: If this is your first time getting access to the cluster via ssh, you'll need to review the following: https://wikitech.wikimedia.org... [18:23:35] 3RESTBase, Services, Ops-Access-Requests: Shell access for @mobrovac - https://phabricator.wikimedia.org/T88465#1015428 (10mobrovac) @Andrew check and check. Signed the doc last week :) [18:27:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:38] 3RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1015439 (10Jdouglas) @RobH reviewed, read, and signed. [18:28:34] RoanKattouw, Krinkle, cancel earlier ping, apparently UW started working again [18:29:13] marktraceur: Ooooh, did UW break due to the wmf15->14 rollback? 
[18:30:28] !log bounce txstatsd on cache hosts in eqiad [18:30:33] Logged the message, Master [18:30:50] Yeah [18:31:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [18:32:29] RoanKattouw: tgr says it might be affecting other JS stuff, too, so BOLO for JS errors today [18:34:44] 3Wikimedia-OTRS, operations, hardware-requests: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1015460 (10Jgreen) It's fine with me to wait until virtualization is available. I added the ticket mostly to make sure the CODFW server is in the plan. [18:39:37] I thought the 15->14 rollback was to try to fix the UW? [18:39:42] marktraceur: Yeah this is an expected bug with the way RL cache invalidation currently works [18:39:48] bblack: No it was because of a performance regression ori found [18:39:52] oh ok [18:41:05] RL semi-assumes that timestamps only ever increase [18:41:29] When you switch back to an older source tree, timestamps decrease, and while I think that should still mostly work, it clearly gets confused sometimes [18:41:48] (timestamps = file mtimes here) [18:42:12] RoanKattouw: Is the fix just to touch every file in wmf14? [18:43:20] I'm not quite sure, it might be [18:43:45] The thing is that my explanation of why deployments sometimes break doesn't explain this [18:44:10] I intuitively know that decreasing timestamps will probably cause bad things to happen, but I don't know precisely what bad things happen where [18:44:44] RoanKattouw: FWIW VE seems fine on meta.wikimedia.org, so hopefully there's no issues there. [18:47:43] gwicke: For Marco you asked that he have ‘the right to execute trebuchet from tin.’ That’s just ‘deployment’, right? Or do you have some other rights group in mind? Are there other users whose rights you’d like me to duplicate for him? 
[18:49:21] manybubbles: cluster still yellow and I belive elastic1001 is still recovering, I'll leave it be and see how long it takes [18:49:37] mobrovac: any idea? ^^ [18:50:27] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [18:50:27] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: puppet fail [18:50:28] andrewbogott: I think deployer rights, yes. and any special restbase groups that may be out there [18:50:46] godog: cool [18:51:06] bd808: the phrase ‘restbase’ doesn’t appear in any rights definition. So I’m going to go with deployer for now, and see what’s missing. [18:51:14] *nod* [18:51:37] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [18:52:18] andrewbogott: dunno about exact names, but what i'd need is to deploy (restbase|cassandra) and manage them (start/stop/etc) [18:52:21] does that help? [18:53:18] mobrovac: Unless you know of another user who already has the rights that you need… I don’t think we have anything that specific defined. [18:53:32] But deployers can /probably/ do all that. [18:54:13] (03PS1) 10Andrew Bogott: Add deployment rights for Marco Obrovac [puppet] - 10https://gerrit.wikimedia.org/r/188598 [18:54:30] if we can modify them along the way, let's go with the deployment rights and adjust them if needed? [18:54:34] andrewbogott: ^^ [18:54:59] yep. mobrovac can you review that patch ^^ and +1 if it looks OK to you? [18:57:17] 3Services, operations: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1015531 (10GWicke) [18:59:46] (03PS1) 10Andrew Bogott: Give James Douglas deployment rights. [puppet] - 10https://gerrit.wikimedia.org/r/188600 [19:00:04] Reedy, greg-g, legoktm: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150204T1900). 
[19:00:24] no jouncebot; bad jouncebot [19:00:28] (03CR) 10jenkins-bot: [V: 04-1] Give James Douglas deployment rights. [puppet] - 10https://gerrit.wikimedia.org/r/188600 (owner: 10Andrew Bogott) [19:00:43] (03CR) 10Mobrovac: [C: 031] Add deployment rights for Marco Obrovac [puppet] - 10https://gerrit.wikimedia.org/r/188598 (owner: 10Andrew Bogott) [19:00:49] :) [19:00:53] 3RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1015551 (10Andrew) https://gerrit.wikimedia.org/r/#/c/188600/ Please +1 if this looks correct, and I will merge. [19:01:02] greg-g: so I guess the GlobalUserPage deployment is postponed..? [19:02:11] (03PS2) 10Andrew Bogott: Give James Douglas deployment rights. [puppet] - 10https://gerrit.wikimedia.org/r/188600 [19:02:46] legoktm: yeah [19:03:36] (03CR) 10Andrew Bogott: [C: 032] Add deployment rights for Marco Obrovac [puppet] - 10https://gerrit.wikimedia.org/r/188598 (owner: 10Andrew Bogott) [19:03:50] (03CR) 10GWicke: [C: 04-1] "Andrew, both Marko and James also need (root) shell access to the restbase / cassandra cluster." [puppet] - 10https://gerrit.wikimedia.org/r/188600 (owner: 10Andrew Bogott) [19:04:44] gwicke: is the restbase cluster the /same/ as the cassandra cluster? [19:05:40] gwicke: and, I take it the ‘test cluster’ is in production, not in labs? [19:06:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [19:07:09] (03CR) 10Andrew Bogott: "Gwicke -- you asked that they get restbase and cassandra access 'on the test cluster'. What does 'test' mean in this context?" [puppet] - 10https://gerrit.wikimedia.org/r/188600 (owner: 10Andrew Bogott) [19:07:13] (03CR) 10Jdouglas: "GWicke, what tasks will we need root access for? Any chance we could limit ourselves to sudo access for just those commands? 
For risk re" [puppet] - 10https://gerrit.wikimedia.org/r/188600 (owner: 10Andrew Bogott) [19:07:20] gwicke: it would be asesome if we could take here rather than via gerrit [19:07:23] *talk [19:07:29] so, those 5xx CRITICALs.... [19:08:36] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:08:36] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:09:36] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:10:22] (03CR) 10GWicke: "@Andrew, praseodymium xenon and cerium." [puppet] - 10https://gerrit.wikimedia.org/r/188600 (owner: 10Andrew Bogott) [19:11:03] andrewbogott: here I am [19:11:14] (03PS1) 10John F. Lewis: ganglia: add more ferm rules for services [puppet] - 10https://gerrit.wikimedia.org/r/188603 [19:11:37] gwicke: cool. Is the access level that you need for these two guys similar to any other existing users? Or is it your intent that we create a new kind of access privs just for them? [19:11:54] andrewbogott: no, they should have the same access that I have to those boxes [19:12:52] andrewbogott: https://github.com/wikimedia/operations-puppet/blob/25e0075b7e6b692ba28e87716e8774f7952eacbb/modules/admin/data/data.yaml#L64 [19:13:07] cassandra-roots should do the trick [19:13:10] gwicke: ah, so when you say ‘test cluster' [19:13:18] there is only a test cluster? [19:13:22] yes [19:13:27] Because the name ‘cassandra-roots’ doesn’t really have a ‘test’ vibe to it :) [19:13:39] yeah [19:13:42] Does that cover the ‘restbase’ portion of your request as well? [19:13:46] HW for the prod cluster is underway [19:13:53] we might have to revisit the naming at some point [19:14:01] ok :) [19:14:10] currently it covers both restbase and cassandra, as they run on the same bare metal [19:14:24] cool, this makes more sense now. Thank you [19:14:47] thank you too! 
[19:17:02] (03PS3) 10Andrew Bogott: Give James Douglas deployment and cassandra rights. [puppet] - 10https://gerrit.wikimedia.org/r/188600 [19:18:29] (03PS1) 10Andrew Bogott: Add mobrovac to cassandra-roots. [puppet] - 10https://gerrit.wikimedia.org/r/188605 [19:19:03] (03Abandoned) 10John F. Lewis: base: move base::firewall to manifest [puppet] - 10https://gerrit.wikimedia.org/r/188423 (owner: 10John F. Lewis) [19:19:10] (03PS1) 10John F. Lewis: base: move base::firewall to manifest [puppet] - 10https://gerrit.wikimedia.org/r/188606 [19:19:47] (03CR) 10Mobrovac: [C: 031] Add mobrovac to cassandra-roots. [puppet] - 10https://gerrit.wikimedia.org/r/188605 (owner: 10Andrew Bogott) [19:20:47] (03CR) 10Andrew Bogott: [C: 032] base: move base::firewall to manifest [puppet] - 10https://gerrit.wikimedia.org/r/188606 (owner: 10John F. Lewis) [19:21:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:24:10] (03PS1) 10John F. Lewis: base: move syslogs/remote-syslogs to manifests [puppet] - 10https://gerrit.wikimedia.org/r/188610 [19:24:13] (03Abandoned) 10John F. Lewis: base: move syslogs/remote-syslogs to manifests [puppet] - 10https://gerrit.wikimedia.org/r/188419 (owner: 10John F. Lewis) [19:24:18] (03CR) 10jenkins-bot: [V: 04-1] base: move syslogs/remote-syslogs to manifests [puppet] - 10https://gerrit.wikimedia.org/r/188610 (owner: 10John F. Lewis) [19:27:36] awight: I'm getting "no data" from either of the urls in your email [19:27:43] aargh [19:27:58] (03Abandoned) 10John F. Lewis: base: move syslogs/remote-syslogs to manifests [puppet] - 10https://gerrit.wikimedia.org/r/188610 (owner: 10John F. 
Lewis) [19:28:05] greg-g: https://graphite.wikimedia.org/render/?title=navigationStart%20to%20loadEventEnd%20on%20desktop%20sites,%20last%20day&vtitle=milliseconds&from=-21day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias%28color%28frontend.navtiming.totalPageLoadTime.desktop.overall.median,%22blue%22%29,%22Median%22%29&target=alias%28color%28frontend.navtiming.totalPageLoadTime.desktop.overall.75perce [19:28:07] (03PS1) 10John F. Lewis: base: move syslogs/remote-syslogs to manifests [puppet] - 10https://gerrit.wikimedia.org/r/188611 [19:28:13] the link in Etherpad seems to work [19:28:34] what the flip, why isn't it working for me [19:28:57] ino! [19:29:05] doesn't work in iceweasel, does in chromium [19:29:11] wtf, anyway [19:29:14] \o/ [19:30:00] (03Abandoned) 10John F. Lewis: base: move instance-upstarts to manifest [puppet] - 10https://gerrit.wikimedia.org/r/188420 (owner: 10John F. Lewis) [19:30:09] (03PS1) 10John F. Lewis: base: move instance-upstarts to manifest [puppet] - 10https://gerrit.wikimedia.org/r/188612 [19:30:30] andrewbogott: got there eventually :D [19:31:11] JohnLewis: I don’t want to merge right before I go to lunch but will push those out when I get back. [19:32:00] that's alright as long as they get merged before they're impossible to merge like last time ;) [19:32:00] (03CR) 10John F. Lewis: "https://gerrit.wikimedia.org/r/#/c/188603/" [puppet] - 10https://gerrit.wikimedia.org/r/172434 (owner: 10John F. Lewis) [19:39:03] greg-g: wonder if https://gerrit.wikimedia.org/r/#/c/188584/ (config change) could be deployed [19:39:18] i can't stick around all day, but that is needed if we do switch test* to wmf16 [19:40:14] or if we are sure we are not today, then can do tomorrow [19:44:24] (03CR) 10GWicke: "LGTM too. How is the SSL termination & possibly single-layer Varnish going to be handled if it is all sharing an IP with Parsoid & others?" 
[dns] - 10https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194) (owner: 10Filippo Giunchedi) [20:06:38] (03CR) 10Dzahn: [C: 04-1] ganglia: add more ferm rules for services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/188603 (owner: 10John F. Lewis) [20:12:59] aude: hey, sorry, was out for lunch for a bit. what happens if we don't switch but that's deployed? [20:13:34] (03PS2) 10John F. Lewis: ganglia: add more ferm rules for services [puppet] - 10https://gerrit.wikimedia.org/r/188603 [20:14:01] greg-g: nothing [20:14:08] (03Abandoned) 10Dzahn: let icinga auto restart gitblit when it goes down [puppet] - 10https://gerrit.wikimedia.org/r/188480 (https://phabricator.wikimedia.org/T73974) (owner: 10Dzahn) [20:14:33] the setting doesn't exist yet (until wmf16) but no harm in having it htere already [20:15:11] aude: then go ahead, but we won't switch to wmf16 until next week, I think at this point we're holding [20:15:14] we'll have to add the table at some point and enable it on test* and then wikidata, but not today and don't know when [20:15:21] ok, sounds good [20:15:42] thanks [20:16:10] (03Abandoned) 10Dzahn: WIP: add port forwarding to ferm [puppet] - 10https://gerrit.wikimedia.org/r/185340 (https://phabricator.wikimedia.org/T84713) (owner: 10Dzahn) [20:16:36] (03CR) 10Aude: [C: 032] Set useLegacyChangesSubscription to true for Wikidata etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188584 (owner: 10Aude) [20:16:46] (03Merged) 10jenkins-bot: Set useLegacyChangesSubscription to true for Wikidata etc. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/188584 (owner: 10Aude) [20:17:45] (03Abandoned) 10Dzahn: monitoring service: parameter for event_handlers [puppet] - 10https://gerrit.wikimedia.org/r/188477 (owner: 10Dzahn) [20:18:18] !log aude Synchronized wmf-config/Wikibase.php: set useLegacyChangesSubscription to true for Wikidata (duration: 00m 07s) [20:18:25] Logged the message, Master [20:19:31] hey, so we have these in role classes in a bunch of places: @monitoring::group { 'redis_eqiad': [20:19:34] also https://phabricator.wikimedia.org/T88478 fyi, in case anyone reports the issue [20:19:49] probably nothing we can do about it [20:19:52] so that role has a neutral name but hardcoded dc name in it [20:20:12] what's better: just repeat it for codfw, or @monitoring::group { "redis_${site}" [20:20:57] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:13] <_joe_> mutante: not sure [20:27:10] _joe_: i think i just changed my mind from $site to just adding codfw .. and be done [20:27:20] btw, ori is going to revert the revert (aka: put wmf15 back on Commons) [20:28:06] yeah, I was just about to say. I haven't investigated this issue any further since my initial e-mail, but there is some plausible speculation that the issue was either ephemeral or has since been resolved, and folks are willing to monitor the graphs to see if the regression recurs. [20:28:18] That works for me, so I'm reverting the revert for now. [20:29:15] (03PS1) 10Ori.livneh: Revert "[Regression] Revert "Non wikipedias to 1.25wmf15"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188617 [20:29:25] wait... is "folks" not you, ori ? 
[20:29:30] (03PS2) 10Ori.livneh: Revert "[Regression] Revert "Non wikipedias to 1.25wmf15"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188617 [20:29:34] (03PS3) 10Dzahn: redisdb: add codfw monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/188274 (https://phabricator.wikimedia.org/T86898) [20:29:58] greg-g: it includes me. If bd808 can keep an eye on it once he's back from lunch that WFM. [20:30:03] * greg-g nods [20:30:05] ok, whew [20:30:14] <_joe_> mutante: please let's wait to make this right [20:30:26] <_joe_> in principle using $::site seems the correct choice [20:30:40] <_joe_> but I'm not sure at all of where it will be evaluated [20:30:50] <_joe_> probably on neon, if I'm not wrong [20:30:55] greg-g: btw, the silver lining to this whole escapade is that it shows the rolling release train as working well and doing what it should [20:31:22] \o/ [20:31:25] greg-g: could you +1 the patch? [20:31:25] _joe_: yes, i know what it creates on neon in icinga config [20:32:28] (03CR) 10Greg Grossmeier: [C: 031] "Godspeed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/188617 (owner: 10Ori.livneh) [20:32:36] thanks [20:32:40] added a bunch of those before, it will prevent icinga fail when the role is applied [20:32:43] (03CR) 10Ori.livneh: [C: 032] Revert "[Regression] Revert "Non wikipedias to 1.25wmf15"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188617 (owner: 10Ori.livneh) [20:32:47] (03Merged) 10jenkins-bot: Revert "[Regression] Revert "Non wikipedias to 1.25wmf15"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188617 (owner: 10Ori.livneh) [20:32:53] or it will complain about missing host group [20:33:29] <_joe_> mutante: ok, the good thing to do would be to do something slick and not copy-pasting the code, though [20:33:37] <_joe_> I'll think about it [20:33:38] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: I4fb67945b: Revert "[Regression] Revert "Non wikipedias to 1.25wmf15" [20:33:44] Logged the message, Master [20:34:04] yea, that was PS1 :p [20:34:05] 3operations: Remove old Tampa servers from racktables - https://phabricator.wikimedia.org/T87288#1015852 (10Cmjohnson) 5Open>3Resolved Moved remaining servers and switches to decom racks, removed tampa racks/rows and location from racktables. [20:34:08] greg-g: if things look good, would wmf16 still get cut today? [20:34:39] Reedy: are you able to shepherd wmf16, fingers crossed, soon? [20:34:52] have we got the go ahead? [20:35:39] not yet, but if an hour passes with no incident, it'd be fine by me [20:35:44] _joe_: ok, thanks, feel free to revert it to the earlier patch set .. and no rush to it [20:35:55] well, now we wait and watch and determine if wmf15's perf issue was transient or not [20:36:04] Reedy: if/when you do it, could you make sure that https://gerrit.wikimedia.org/r/#/c/188320/ gets cherry-picked to the new branch? 
[20:36:04] <_joe_> no no, I think you are doing the right thing there [20:36:34] i can also just merge that one now and later we can replace $site in ALL the places in a bunch of roles [20:36:41] either way [20:37:37] btw, here is the similar thing for mediawiki [20:37:47] https://gerrit.wikimedia.org/r/#/c/188275/ [20:39:49] ori: yeah [20:42:35] thanks [20:44:57] <_joe_> mutante: it will not work [20:45:03] <_joe_> I guess [20:45:08] <_joe_> but lemme recheck the code [20:45:33] _joe_: ok, thx [20:45:45] <_joe_> mutante: not tonight though [20:46:02] <_joe_> I have a pressing hw order to discuss with rob [20:46:06] greg-g: what is the right phab project for platform eng when platformeng is archived? [20:46:18] _joe_: no worries, yep [20:46:50] mutante: if you want MW Core team members, the MediaWiki Core Team project [20:47:06] mutante: if you want multimedia the multimedia project... and so on :) [20:47:06] greg-g: i want the ones in between operations and mw-core [20:47:15] greg-g: like the ones that write maintenance scripts [20:47:20] sounds like that's mwcore [20:47:20] :) [20:47:26] ok, thx [20:47:28] np [20:48:07] 3MediaWiki-Core-Team, operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1015891 (10Dzahn) 3NEW [20:50:56] (03PS3) 10John F. 
Lewis: ganglia: add more ferm rules for services [puppet] - 10https://gerrit.wikimedia.org/r/188603 [20:51:38] 3MediaWiki-Core-Team, operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1015900 (10Dzahn) [20:53:36] (03PS4) 10Dzahn: move mediawiki maintenance scripts to module [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) [20:53:44] (03CR) 10jenkins-bot: [V: 04-1] move mediawiki maintenance scripts to module [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [20:56:23] Reedy: ori bd808: I'm about to go into 2.5 hours of 1:1s [20:56:33] I deputize you three :) [20:56:35] greg-g: shame you're doing this weeks deploy [20:56:36] :P [20:57:05] greg-g: *nod* have fun managing [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150204T2100). Please do the needful. [21:04:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [21:05:05] greg-g, bd808, Reedy ^ [21:05:12] :( [21:05:12] I'm off for a doc apt sadly [21:05:14] that's been doing that all day [21:05:38] greg-g: if that's true, that's a reason to halt deployments [21:07:44] lots of -- Fatal error: request has exceeded memory limit in /srv/mediawiki/php-1.2 [21:07:45] 5wmf14/includes/db/DatabaseMysqli.php on line 183 -- in fatalmonitor [21:08:31] bd808: talking about fatals... [21:08:38] fatal.log still is very incomplete [21:09:38] hoo: yeah fatal.log is only for php5 [21:09:51] hhvm.log has the hhvm version but not backtraces [21:10:08] I know... but I really really really need backtraces, often [21:10:47] we thought at one point that we could get away without porting the php5 extension that makes fatal.log but that may be wrong [21:11:51] didn't MaxSem do some work towards that? 
[21:12:00] he looked at it yeah [21:12:19] there should be a phab tasks somewhere [21:13:10] well [21:13:31] you can configure hhvm to log stacktraces [21:15:32] mediaWikiLoadComplete data shows a jump but it started 1 hour before the wmf15 revert [21:15:35] https://graphite.wikimedia.org/render?from=-3hours&until=now&width=900&height=400&target=cactiStyle%28color%28alias%28frontend.navtiming.mediaWikiLoadComplete.desktop.overall.mean%2C%22mean%22%29%2C%22blue%22%29%29&target=cactiStyle%28color%28alias%28frontend.navtiming.mediaWikiLoadComplete.desktop.overall.99percentile%2C%2299%25%22%29%2C%22red%22%29%29&title=mediaWikiLoadComplete.desktop [21:16:04] the revert of the revert was ~20:33 [21:16:18] !log updated Parsoid to version dd4721f4 [21:16:19] Hmmm, database oom? [21:16:24] Logged the message, Master [21:16:30] I wonder if that corresponds to some of the spikes I saw on the flame graph [21:16:45] (rather, OOM in MW during DB actions, not that the DBs OOMd) [21:17:26] MaxSem: So why don't we do that? [21:18:02] love em philosophical questions [21:19:26] I thought there might be an actual reason why that didn't work for us [21:27:53] 3hardware-requests, ops-codfw, operations: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#1015972 (10RobH) tracking HP quote on https://rt.wikimedia.org/Ticket/Display.html?id=9183 [21:28:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:30:59] JohnLewis: uhm.. gmetad ..it's actually 8653 AND 8654 :p [21:31:18] mutante: srsly >.> [21:31:37] tcp 0 0 0.0.0.0:8653 [21:32:37] (03CR) 10Dzahn: ganglia: add more ferm rules for services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188603 (owner: 10John F. Lewis) [21:34:11] (03PS4) 10John F. 
Lewis: ganglia: add more ferm rules for services [puppet] - 10https://gerrit.wikimedia.org/r/188603 [21:34:14] mutante: there :p [21:34:26] (03PS1) 10RobH: setting mgmt entries for rbf2002 [dns] - 10https://gerrit.wikimedia.org/r/188669 [21:34:48] ah, _xml ? [21:34:54] (03CR) 10RobH: [C: 032] setting mgmt entries for rbf2002 [dns] - 10https://gerrit.wikimedia.org/r/188669 (owner: 10RobH) [21:35:35] It's only used for XML stuff and naming stuff off somewhat good is a nice idea :P [21:36:06] yea, it is [21:38:12] (03CR) 10Dzahn: [C: 032] "yep, confirmed ports and protos with netstat on uranium. srange is like in role/logging.pp. also, this will still be noop, base::firewall " [puppet] - 10https://gerrit.wikimedia.org/r/188603 (owner: 10John F. Lewis) [21:39:34] mutante: any more code/puppet blockers for adding base::firewall or is it just a 'give it some time to see if thats all'? [21:40:29] Anyone within earshot of James Douglas? [21:40:31] JohnLewis: well, it's not so much about waiting, but checking that pastebin [21:40:52] (03PS1) 10RobH: setting rbf2002 mac in install-server module [puppet] - 10https://gerrit.wikimedia.org/r/188671 [21:41:13] JohnLewis: like, we should be able to match netstat output to roles [21:41:30] JohnLewis: .. ferm::services inside roles i should say [21:42:04] (03CR) 10RobH: [C: 032] setting rbf2002 mac in install-server module [puppet] - 10https://gerrit.wikimedia.org/r/188671 (owner: 10RobH) [21:42:30] JohnLewis: but also addin some more eyes before we switch it, yea [21:42:35] 3RESTBase, Services, Ops-Access-Requests: Shell access for @mobrovac - https://phabricator.wikimedia.org/T88465#1015987 (10Andrew) 5Open>3Resolved merged! 
[21:42:36] 3RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1015989 (10Andrew) [21:42:46] mkay [21:44:00] Howdy, ops [21:50:04] earldouglas: hey, you are here for a shell, right [21:51:37] PROBLEM - puppet last run on thallium is CRITICAL: CRITICAL: Puppet has 1 failures [21:51:47] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Puppet has 1 failures [21:55:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [21:57:41] bd808: Reedy how's things? [21:57:49] got an ETA? :P [21:57:50] we're just talking in -core [21:57:54] ah... [21:58:19] greg-g: can we run the train tomorrow? [21:58:30] it's getting late for Reedy [21:58:42] but we don't want to hold until next week; too much change [21:59:03] yeah, dangit, just answered over there --> [21:59:26] answer: yes [21:59:58] bd808: Reedy so, looks like we're ok where we are for now, right? and do wmf16 tomorrow so we don't get behind [22:00:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [22:00:21] is the HTTP 5xx req/min on tungsten errors worrisome right now? [22:00:29] mutante: yep [22:00:31] (they keep coming and going) [22:00:56] greg-g: There have been a lot of HHVM OOM errors that I think correlate with the 5xx errors [22:01:03] no idea what's up there honestly [22:01:04] yuck [22:01:37] anything I should worry about/poke others about? 
[22:01:49] * greg-g has 10 minutes until next 1:1 starts [22:02:02] SpecialPage and DatabaseMysqli are in the hhvm output but no stack traces so hard to tell what specfically [22:02:19] :/ [22:02:32] that saddens me [22:02:33] the desktop timing graph is up again but jumped *before* the wmf15 re-revert [22:02:34] bd808, I suspect that might be bots hitting api with a list query and limit=max :P [22:02:47] MaxSem: It could be [22:03:00] if I were MW I would've OOM'd on this :P [22:03:53] (03CR) 10RobH: [C: 04-2] "There is no access-request task linked to this, thus my vote is -2 until it has one." [puppet] - 10https://gerrit.wikimedia.org/r/188605 (owner: 10Andrew Bogott) [22:04:15] * MaxSem wonders how much memory usage can be reduced by ditching manual output gzipping [22:05:50] cmjohnson1, Coren, can we catch up about NFS expansion? Is there a phab ticket for this? [22:06:00] MaxSem: If we can show that we don't need to control gzip in PHP-land I'd be all for turning that over to apache or hhvm's fcgi layer [22:06:30] the layers of code we have for it make me nervous that there was a really good reason though [22:06:54] MaxSem: We could provision you as an app server if you're MW [22:07:12] (03PS2) 10Andrew Bogott: Add mobrovac to cassandra-roots. [puppet] - 10https://gerrit.wikimedia.org/r/188605 [22:07:14] wingswednesday, I'm too lazy to be a server! 
[22:09:36] RECOVERY - puppet last run on thallium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:09:46] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:10:58] (03CR) 10RobH: "So my -2 still stands, since the ticket has granting sudo/root on a box to a bunch of folks, but has no notation that this request was rev" [puppet] - 10https://gerrit.wikimedia.org/r/188605 (owner: 10Andrew Bogott) [22:13:00] andrewbogott: https://phabricator.wikimedia.org/T84770 I think [22:15:28] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 2 failures [22:15:33] (03PS3) 10Andrew Bogott: Add mobrovac and jdouglas to cassandra-roots. [puppet] - 10https://gerrit.wikimedia.org/r/188605 [22:15:35] (03PS4) 10Andrew Bogott: Give James Douglas deployment rights. [puppet] - 10https://gerrit.wikimedia.org/r/188600 [22:15:57] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [22:15:57] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures [22:16:05] https://phabricator.wikimedia.org/T84770 is the original one [22:16:11] andrewbogott: ^^ [22:16:17] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 2 failures [22:16:27] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Puppet has 1 failures [22:16:27] PROBLEM - puppet last run on mw1033 is CRITICAL: CRITICAL: Puppet has 1 failures [22:16:41] (03CR) 10Andrew Bogott: [C: 032] Give James Douglas deployment rights. [puppet] - 10https://gerrit.wikimedia.org/r/188600 (owner: 10Andrew Bogott) [22:17:59] 3RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1016099 (10Andrew) [22:18:00] 3RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1016097 (10Andrew) 5Open>3Resolved merged! 
[22:19:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [22:20:46] 3operations: YOUR FINAL MAIL ACCOUNT NOTICE!!! - https://phabricator.wikimedia.org/T88609#1016104 (10emailbot) [22:21:24] (03PS1) 10Andrew Bogott: Revert "Give Cassandra access to smalyshev." [puppet] - 10https://gerrit.wikimedia.org/r/188678 [22:22:31] (03PS2) 10Andrew Bogott: Revert "Give Cassandra access to smalyshev." [puppet] - 10https://gerrit.wikimedia.org/r/188678 [22:23:28] (03CR) 10Andrew Bogott: [C: 032] Revert "Give Cassandra access to smalyshev." [puppet] - 10https://gerrit.wikimedia.org/r/188678 (owner: 10Andrew Bogott) [22:27:24] (03PS4) 10Andrew Bogott: Add smalyshev, mobrovac and jdouglas to cassandra-roots. [puppet] - 10https://gerrit.wikimedia.org/r/188605 [22:27:48] 3RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1016139 (10Andrew) James and Marco now have shell and deployment rights. All three cassandra-root privs are pending a discussion in the upcoming Ops meet... 
[22:32:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:32:47] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:33:07] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:33:17] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:33:17] RECOVERY - puppet last run on mw1033 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:33:26] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:33:47] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:34:17] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [22:35:16] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 138 seconds ago with 0 failures [22:36:11] (03PS1) 10BBlack: cp1063 backend disable for hw T84809 [puppet] - 10https://gerrit.wikimedia.org/r/188684 [22:36:52] (03CR) 10BBlack: [C: 032 V: 032] cp1063 backend disable for hw T84809 [puppet] - 10https://gerrit.wikimedia.org/r/188684 (owner: 10BBlack) [22:38:27] !log Elasticsearch wasn't initializing shards to elastic1001 after its restart. Didn't check why. Set allocation to primaries then back to all and that unstuck it. [22:38:33] Logged the message, Master [22:39:47] godog: ^^^^ [22:39:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [22:41:19] !log looks like elastics1001 doesn't have much free space left. I think that might have something to do with this.... 
[22:41:23] Logged the message, Master [22:44:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [22:49:09] !log not sure what happened but now space is freeing up on 1001. the disk was never in danger of filling up but it was full enough not to allocate more to it. Now that stuff is allocating elsewhere elasticsearch is clearing the used space. [22:49:14] Logged the message, Master [22:49:35] !log this is certainly a bug in Elasticsearch, but I imagine it's one solved in newer versions. i hope, more like. [22:49:38] Logged the message, Master [22:49:43] godog: ^^^^^^ for your logs [22:49:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [22:50:16] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 2 failures [22:51:05] grrr. stupid puppet [22:52:57] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:54:56] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:55:16] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 99 seconds ago with 0 failures [22:57:55] 3RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1016256 (10Jdouglas) In case it helps, I would gladly take **limited** sudo access! Of course, the complexity comes in determining the set of commands I... 
[22:59:27] PROBLEM - Host silver is DOWN: PING CRITICAL - Packet loss = 100% [23:01:06] RECOVERY - Host silver is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [23:04:46] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:44] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1016305 (10Tgr) [23:12:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [23:13:41] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1016321 (10Tgr) See also T75935 and T74328. [23:19:46] RECOVERY - Disk space on stat1002 is OK: DISK OK [23:19:50] 3operations, hardware-requests, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1016347 (10RobH) a:3RobH I'll be allocating WMF3298 (old name zinc), new hostname will be sterope. I'll put in the linked tickets for the on-site work and claim this one for the OS installation. 
[23:20:48] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [23:21:37] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:21:55] !log Manual failover of Hadoop namenode from analytics1001 to analytics1002, as analytics1001 had Heap space errors [23:21:59] Logged the message, Master [23:23:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:23:19] (03PS1) 10BBlack: temporarily depool cp1064 upload cache backend [puppet] - 10https://gerrit.wikimedia.org/r/188690 [23:23:38] (03CR) 10BBlack: [C: 032 V: 032] temporarily depool cp1064 upload cache backend [puppet] - 10https://gerrit.wikimedia.org/r/188690 (owner: 10BBlack) [23:24:27] (03PS1) 10RobH: setting mgmt dns for sterope [dns] - 10https://gerrit.wikimedia.org/r/188691 [23:25:57] (03CR) 10RobH: [C: 032] setting mgmt dns for sterope [dns] - 10https://gerrit.wikimedia.org/r/188691 (owner: 10RobH) [23:30:08] 3ops-codfw, operations: rename and setup base hardware settings for WMF3298 (zinc/sterope) - https://phabricator.wikimedia.org/T88624#1016355 (10RobH) 3NEW a:3Papaul [23:32:33] (03CR) 10Dzahn: [C: 031] "m1-master.eqiad.wmnet is an alias for db1001.eqiad.wmnet." [puppet] - 10https://gerrit.wikimedia.org/r/188508 (owner: 10Springle) [23:33:07] 3ops-codfw, operations, hardware-requests, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1016363 (10Aklapper) [23:34:26] PROBLEM - Varnish HTTP upload-backend on cp1064 is CRITICAL: Connection refused [23:36:47] PROBLEM - Memcached on virt1000 is CRITICAL: Connection refused [23:37:36] RECOVERY - Varnish HTTP upload-backend on cp1064 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.006 second response time [23:37:57] andrewbogott: is that intentional or bug? 
memcached on virt1000 [23:38:26] (03PS2) 10Springle: use m1-master CNAME [puppet] - 10https://gerrit.wikimedia.org/r/188508 [23:38:32] 3ops-codfw, operations, hardware-requests, Wikimedia-OTRS: CODFW OTRS server - https://phabricator.wikimedia.org/T88575#1016385 (10RobH) FYI: db2010 and db2030 are part of the m1-shard (misc db cluster). So we have codfw replication of the OTRS backend already. (So for testing you could just clone the local db... [23:38:47] RECOVERY - Memcached on virt1000 is OK: TCP OK - 0.000 second response time on port 11000 [23:38:48] !log starting memcached on virt1000 [23:38:56] Logged the message, Master [23:40:37] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1016386 (10Tgr) > I'm sure this is known already but it'd seem more logical for such errors to return a 400 to the client since it isn't strictly the server's fault, what... [23:42:29] (03PS1) 10BBlack: Revert "temporarily depool cp1064 upload cache backend" [puppet] - 10https://gerrit.wikimedia.org/r/188701 [23:42:48] (03CR) 10BBlack: [C: 032 V: 032] Revert "temporarily depool cp1064 upload cache backend" [puppet] - 10https://gerrit.wikimedia.org/r/188701 (owner: 10BBlack) [23:44:32] mutante: no intentional, I will check [23:45:27] ah, and it’s back. weird. [23:46:14] andrewbogott: well, i started it [23:46:20] oh [23:46:23] well, thanks! [23:46:25] it wasn't runnin, i started it, it ran again [23:46:41] that's all i have though [23:46:56] ah, here, oom killer did it [23:47:01] Feb 4 23:35:30 virt1000 kernel: [681459.927553] Out of memory: Kill process 19241 (memcached) score 55 or sacrifice child [23:47:13] Hm, yeah, I was trying to do a dump, that probably ate too much memory [23:47:21] :( Hard to migrate away from a box if it’s too busy to dump [23:47:27] hrmm.. yea [23:47:38] to move the db away from localhost? 
[23:47:50] (03PS1) 10Chad: WIP: Begin converting Elasticsearch configuration to use hiera [puppet] - 10https://gerrit.wikimedia.org/r/188702 [23:47:56] well, just trying to copy it to silver at the moment [23:48:17] springle: have any time to spare? [23:48:42] Based on that phab ticket I was planning to abandon the ‘proper db host for wikitech’ issue and just copy things to silver. [23:48:52] But I could use some advice about how to do so [23:49:13] Is there any more to it than marking the wiki as read only and then mysqldump -uuser -ppassword myDatabase | mysql -hremoteserver -uremoteuser -premoteserver ? [23:49:51] andrewbogott: yes, that won't work :) i tried it, but virt1000 is too slammed to finish inside a week [23:50:04] we should xtrabackup [23:50:17] ok, that I definitely don’t know how to do. Simple? [23:50:19] just dump to file instead of also importing at the same time? [23:50:30] not sure if that is a difference [23:50:43] mutante: that's what i tried [23:50:49] If anything dumping to a file should be harder work, since it’s not limited by network IO [23:50:55] the buffer pool is too small, the load too high, the data too big [23:50:57] 3operations: deploy services on rbf2001-2002 - https://phabricator.wikimedia.org/T88309#1016424 (10RobH) 5Open>3declined a:3RobH rejecting this ticket as https://phabricator.wikimedia.org/T86898 handles the service deployment [23:51:15] 3operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1016430 (10RobH) [23:51:16] 3operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1016429 (10RobH) [23:51:17] 3ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1016427 (10RobH) 5Open>3Resolved rbf2002 is now online with OS, awaiting puppet signing [23:52:36] andrewbogott: it's all here https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica ... 
but if you like i can start it :) there will be some waiting. this approach does not need db to be read-only [23:53:00] andrewbogott: silver is ready to go? [23:53:02] springle: If it’s not time consuming then I’m happy for you to do it. [23:53:12] springle: I think so - I messed with it a bit this afternoon. You can have a look to make sure... [23:53:24] apparmor/my.sql aren’t properly puppetized. But it’s working. [23:53:32] andrewbogott: why are we still using /a ? :) [23:53:35] There’s some cruft there from my recent failed attempt to migrate. [23:53:39] springle: are we? It should be in /srv [23:53:42] (which is the same as /a) [23:53:49] (03CR) 10Ejegg: [C: 04-1] "Don't merge this, still discussing what to do" [puppet] - 10https://gerrit.wikimedia.org/r/188395 (https://phabricator.wikimedia.org/T45250) (owner: 10Ejegg) [23:54:09] springle: you should feel free to erase all the data that’s there now. [23:54:14] I was just tinkering [23:54:29] /a is mounted. /srv is empty, and not a link [23:54:35] wtf? [23:54:37] will remount [23:54:45] wait -- [23:54:47] you’re on silver? [23:54:51] I disagree, can see /srv just fine [23:54:52] yes [23:54:56] hmm [23:55:03] ls /srv [23:55:03] backup deployment mediawiki mysql [23:55:27] andrewbogott: ah indeed, not empty [23:55:37] andrewbogott: but also not a link and part of / [23:55:51] nope [23:56:00] /dev/md2 on /srv type xfs (rw) [23:56:00] /dev/md2 on /a type xfs (rw) [23:56:00] 3operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1016439 (10RobH) Since the OS's are now installed, I've accepted the key for rbf2002 and it's doing its initial puppet run. [23:56:36] andrewbogott: oh ok. 
i only checked df -h [23:56:51] yeah, I don’t know why df doesn’t show /srv - I was confused by that as well, an hour ago :) [23:57:11] that's a horrible and tricky way to do mounts :) [23:57:54] springle: I’m not attached to the current setup - I just know that I’m going to need a big /a and a big /srv [23:58:03] And… taking the partitions as they were given [23:58:14] andrewbogott: what still uses /a ? just out of interest [23:58:26] (i won't change anything) [23:58:33] springle: The logic that replicates to wikitech-static does, and I was hoping to postpone rewriting it [23:58:42] ooohh [23:58:46] right fair enough [23:58:49] also the backup cron [23:58:55] * springle runs away from that