[00:00:01] greg-g: Evil. :-) [00:00:04] RoanKattouw, ^d: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150211T0000). [00:00:19] (03CR) 10MZMcBride: Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/49678 (owner: 10Ottomata) [00:00:26] I can't guarantee I'll be available every day at midnight :) [00:00:26] legoktm, poke [00:00:35] Krenair: o/ [00:00:39] (03CR) 10Alex Monk: [C: 032] Re-enable wgCentralAuthAutoMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188554 (owner: 10Hoo man) [00:00:46] (03Merged) 10jenkins-bot: Re-enable wgCentralAuthAutoMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188554 (owner: 10Hoo man) [00:01:43] Krenair: :) [00:02:02] James_F: it's my style of "first one's free" :) [00:02:19] * James_F laughs. [00:03:07] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/188554/ (duration: 00m 07s) [00:03:14] Logged the message, Master [00:03:17] James_F aren't you in a meeting? [00:03:31] Krenair: Yes. [00:03:34] :P [00:03:38] Meetings can have laptops. :-) [00:03:42] tgr, ping [00:03:58] Just I can't guarantee my availability if you're doing a production deployment to verify that you didn't just take the site down. :-) [00:04:49] Krenair: here [00:04:50] ok legoktm? [00:05:22] * legoktm is checking lgos [00:05:24] logs* [00:06:11] Krenair: doesn't look to be working properly...hmm.. [00:08:31] Krenair: nvm, looks good now :) [00:08:36] ok, good [00:14:20] !log krenair Synchronized php-1.25wmf16/extensions/UploadWizard/resources/mw.ApiUploadFormDataHandler.js: https://gerrit.wikimedia.org/r/#/c/189860/ (duration: 00m 05s) [00:14:21] tgr, please test [00:14:27] Logged the message, Master [00:14:41] Krenair: will take a while [00:15:11] hm, a js fix in UploadWizard [00:15:25] ok [00:15:37] only effects 100+M uploads [00:24:31] !log krenair Synchronized php-1.25wmf16/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/189867/ (duration: 00m 06s) [00:24:39] Logged the message, Master [00:29:17] Krenair: works, thanks! [00:29:47] (for the record the VE fix appears to be working as well) [00:29:51] ok, thanks tgr [00:32:38] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1029620 (10Tgr) 5Open>3Resolved a:3Tgr Backported to wmf16; this should be fixed on Commons n... [00:34:49] twentyafterfour, after the MW deployment train can you update https://www.mediawiki.org/wiki/MediaWiki_1.25/Roadmap please? thanks [00:35:40] also, you standing in for Reedy on those now? [00:35:46] (yep) [00:36:05] he might be eating dinner, he's in MO (central US timezone) [00:36:12] ok [00:36:18] * greg-g goes afk towards home [00:36:56] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1029638 (10Tgr) [00:37:32] (03CR) 10Alex Monk: "Want to put this up for a SWAT deploy, Mjbmr? What about https://gerrit.wikimedia.org/r/#/c/188928/2 ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188887 (https://phabricator.wikimedia.org/T60655) (owner: 10Mjbmr) [00:38:37] Krenair: yes, though looks like someone already updated it? 
[00:38:46] I did it :) [00:39:03] looks like FlorianSW has been updating it a lot from the history [00:40:02] but would be good if someone involved in the deployment can do it [00:42:22] Nothing like quintuple book keeping >_< [00:45:15] aka sulfurous busywork [00:45:33] * twentyafterfour adds it to the list of annoying things that I'm gonna have to automate [00:48:20] twentyafterfour: please name your automation project "ReedyBot" [00:48:57] bd808: can we call him captain reed? [00:49:10] or CptReedy [00:49:14] +1 [01:00:04] andrewbogott: Respected human, time to deploy Wikitech (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150211T0100). Please do the needful. [01:00:21] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1029665 (10bd808) Input from one logging customer: ``` [14:44] < bd808> hoo: how important is debug message capture loss to you as a consum... [01:01:10] jouncebot, are you off by a day? [01:02:16] hm, yep. Sorry jouncebot that was yesterday. [01:15:05] greg-g, ping. you around? me and hoo want to deploy something [01:16:17] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 1 failures [01:17:18] Krenair: whatcha need to be deploying? [01:17:34] https://gerrit.wikimedia.org/r/#/c/189876/ https://gerrit.wikimedia.org/r/#/c/189877/ [01:17:48] fix global group membership change logging [01:17:56] and cross-wiki UR logging [01:18:52] andrewbogott, you have a window... [01:19:05] It is Wednesday the 11th of February, 01:19 [01:19:06] Krenair: it’s a calendar mistake, should’ve been yesterday. [01:19:27] (03PS1) 10Hoo man: role::mariadb: Remove references to undefined $shard [puppet] - 10https://gerrit.wikimedia.org/r/189878 [01:19:40] ok, not jouncebot's fault :p [01:25:14] (03PS1) 10Andrew Bogott: Allow silver to query its own server status. [puppet] - 10https://gerrit.wikimedia.org/r/189881 [01:27:18] (03CR) 10Andrew Bogott: [C: 032] Allow silver to query its own server status. [puppet] - 10https://gerrit.wikimedia.org/r/189881 (owner: 10Andrew Bogott) [01:28:13] (03PS2) 10Springle: role::mariadb: Remove references to undefined $shard [puppet] - 10https://gerrit.wikimedia.org/r/189878 (owner: 10Hoo man) [01:29:18] (03CR) 10Springle: [C: 032] role::mariadb: Remove references to undefined $shard [puppet] - 10https://gerrit.wikimedia.org/r/189878 (owner: 10Hoo man) [01:34:26] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:35:48] Krenair: greg-g is commuting home. I'm familiar with simple deployments but I'm not experienced with deploying a hot-fix ...should probably wait for greg [01:36:08] I can do the deployment fine [01:36:40] just wanting greg's +1? [01:36:50] yeah [01:36:56] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1029705 (10bd808) I took a look at GELF as a transport option and found it lacking. The Monolog GELF formatter works around GELF's lack of supp... [01:40:19] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1029707 (10chasemp) >>! In T88732#1028273, @chasemp wrote: > Thought about this a bit and actually have had this same conversation in the past.... 
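The T88732 thread above keeps circling one design: have MediaWiki write only to a syslog daemon on localhost and let that daemon relay onward, so a logstash outage can no longer block or crash the application. A minimal sketch of that relay pattern, assuming a hypothetical logstash address, port, and tag (none of the real endpoints appear in this log):

```
# Sketch of the localhost-relay idea from T88732. Logstash host, port and
# the "mediawiki" tag are assumptions for illustration, not production values.

# App side: write to the local syslog socket only (datagram, effectively
# non-blocking) instead of talking to the logging cluster directly.
logger -t mediawiki 'channel=exception msg="something broke"'

# Relay side: rsyslog forwards matching traffic onward. The queue directives
# apply to the action that follows them, so a downstream outage is buffered
# to disk instead of stalling the sender.
cat <<'EOF' | sudo tee /etc/rsyslog.d/30-mediawiki-relay.conf
$ActionQueueType LinkedList
$ActionQueueFileName mwrelay
$ActionResumeRetryCount -1
:syslogtag, startswith, "mediawiki" @@logstash1001.eqiad.wmnet:10514
EOF
sudo service rsyslog restart
```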
[01:41:48] hoo: https://gerrit.wikimedia.org/r/#/c/189879/ https://gerrit.wikimedia.org/r/#/c/189888/ [01:42:19] no one else is deploying and I'd rather avoid loss of log entries... shall I? [01:42:56] I once deployed a CentralAuth fix on a Sunday evening (for a similar reason, probably even nastier) [01:43:00] Go ahead! [01:43:06] ok [01:43:34] twentyafterfour, want a walk through deployng a fix tomorrow? [01:43:53] MaxSem: yes, that would be great [01:44:22] !log puppet disabled on lanbdsb1001 labsdb1002. needs restart [01:44:29] Logged the message, Master [01:44:33] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1029711 (10bd808) As I see the debate thus far, the discussion has settled down to whether or not to employ rsyslog on the localhost as an inte... [01:47:39] (03PS1) 10Andrew Bogott: Switch to the 2.4-style 'Require host' syntax [puppet] - 10https://gerrit.wikimedia.org/r/189889 [01:52:04] (03PS2) 10Andrew Bogott: Switch to the 2.4-style 'Require host' syntax [puppet] - 10https://gerrit.wikimedia.org/r/189889 [01:52:28] !log krenair Synchronized php-1.25wmf16/extensions/CentralAuth/includes/CentralAuthGroupMembershipProxy.php: https://gerrit.wikimedia.org/r/#/c/189888/ - fix lack of global group membership change logging (duration: 00m 05s) [01:52:35] Logged the message, Master [01:52:38] hoo, can you test that? [01:52:52] I could but I'd have to add system admin into my global groups via the db, so... maybe best if you do it :p [01:53:15] (03CR) 10Andrew Bogott: [C: 032] Switch to the 2.4-style 'Require host' syntax [puppet] - 10https://gerrit.wikimedia.org/r/189889 (owner: 10Andrew Bogott) [01:53:15] Sure [01:53:16] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1029718 (10chasemp) Seems reasonable to me :). We can always adjust as you say. [01:54:41] springle, is it possible to get a list of updates made against a given table recently? do you log that stuff or...? [01:56:30] Krenair: we have the binary log for production, and nothing for labsdb [01:56:41] centralauth.global_user_groups specifically [01:56:44] !log krenair Synchronized php-1.25wmf16/includes/UserRightsProxy.php: https://gerrit.wikimedia.org/r/#/c/189879/ - same thing for interwiki user rights logs (duration: 00m 07s) [01:56:46] binary logs go back about a week [01:56:47] hoo, ^ [01:56:48] Logged the message, Master [01:56:58] Krenair: Shall I test as well? [01:57:05] yes [01:57:44] springle, that's good, we need back to about 20:28 UTC yesterday [01:58:57] Krenair: ticket? is it desperate or sometime today ok? [02:03:19] we don't have a ticket [02:03:29] should make one... [02:03:34] probably sometime today is OK [02:04:31] If you really want all binlogs of all wikis scanned... that will eat a bit of time, I guess [02:04:58] Unless we have some crazy automation of some kind to pull that all together nicely (and fast) [02:10:08] we should probably check local user rights as well :/ [02:10:55] Ok... 
open a ticket telling from when to when those are needed (I guess intial wmf16 deploy until the fix was synced out) and tell Sean which table(s) we're looking for [02:12:36] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s) [02:12:43] Logged the message, Master [02:13:43] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-11 02:12:40+00:00 [02:13:48] Logged the message, Master [02:16:07] how can i get a random number within puppet? [02:16:22] Call out to Ruby? [02:17:12] I will write a ticket with details of the entire issue, hoo [02:17:28] nice [02:17:31] hoo: i see this in an example /etc/puppet/modules/puppet/bin/randomnum.pl' ..hrmm [02:17:44] then they do stuff like $timeoffset1 = generate('/usr/bin/env', '/etc/puppet/modules/puppet/bin/randomnum.pl', "$fqdn", "30") [02:17:48] that's kind of what i want too [02:25:59] mh... seems rather complicated [02:26:51] !log l10nupdate Synchronized php-1.25wmf16/cache/l10n: (no message) (duration: 00m 02s) [02:26:56] Logged the message, Master [02:27:59] !log LocalisationUpdate completed (1.25wmf16) at 2015-02-11 02:26:55+00:00 [02:28:02] Logged the message, Master [02:28:21] irb(main):001:0> Random.rand(0..60) [02:28:21] => 2 [02:28:33] hoo: fqdn_rand() it seems [02:28:47] just using <%- Random.rand(0..60) -%> in a template would work [02:29:09] or that [02:29:12] https://docs.puppetlabs.com/references/latest/function.html#fqdnrand [02:29:21] except i need several numbers on one node [02:29:27] but i can if i change the seed [02:31:53] hoo, springle: https://phabricator.wikimedia.org/T89205 [02:32:50] (03PS1) 10Ori.livneh: vbench: log console messages; abort when target crashes [puppet] - 10https://gerrit.wikimedia.org/r/189891 [02:33:23] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: log console messages; abort when target crashes [puppet] - 10https://gerrit.wikimedia.org/r/189891 (owner: 10Ori.livneh) [02:33:40] Thanks :) [02:39:32] RoanKattouw: https://developers.google.com/google-apps/spreadsheets/#adding_a_list_row [02:40:08] git review and waaaaaaiting and waiting [02:40:52] (03PS1) 10Dzahn: randomize times the planet cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/189893 (https://phabricator.wikimedia.org/T89174) [02:43:51] (03PS2) 10Dzahn: randomize times the planet cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/189893 (https://phabricator.wikimedia.org/T89174) [02:47:05] (03CR) 10Dzahn: [C: 032] randomize times the planet cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/189893 (https://phabricator.wikimedia.org/T89174) (owner: 10Dzahn) [02:48:29] hoo: worked. minute changed '0' to '16' .. minute changed '0' to '4' .. etc :) [02:48:50] !log Manually logged a missing global rights log change entry on meta "Ajraddatz changed global group membership for Benoit Rochon from (none) to OTRS-member with the following comment: [[Special:Diff/11227486|request]]". See also T89205 [02:48:57] Logged the message, Master [02:49:09] Nice :) [02:50:42] 3operations: investigate etherpad service interrruptions / possible migrate service - https://phabricator.wikimedia.org/T89174#1029820 (10Dzahn) I don't have any evidence that planet updates really affected Etherpad, but it made me look at the cronjob define again and i randomized the minute they run with the pa... [03:01:36] Krenair, hoo: for centralauth.global_user_groups, are we expecting quite a small number of updates? 
like <10 [03:02:04] probably [03:02:18] Unless someone's massively exploited it, yes [03:03:11] * someone has [03:07:46] Krenair: the ticket is public. where do you want the dumped queries put? [03:10:21] (03PS1) 10Ori.livneh: vbench: on chrome crash, try to continue rather than bailing [puppet] - 10https://gerrit.wikimedia.org/r/189895 [03:10:53] springle: Should be ok to have them public if they're only affecting that table [03:11:00] ok [03:11:06] or the user_groups one [03:11:39] (03CR) 10Ori.livneh: [C: 032] vbench: on chrome crash, try to continue rather than bailing [puppet] - 10https://gerrit.wikimedia.org/r/189895 (owner: 10Ori.livneh) [03:14:42] yeah, user_groups are replicated to labs entirely AFAIK [03:20:58] !log restarting labsdb1001 https://lists.wikimedia.org/pipermail/labs-l/2015-February/003354.html [03:21:06] Logged the message, Master [03:30:31] (03CR) 10Mattflaschen: "You added them because you were trying to explain why "It's also believed to not leak private data"." [puppet] - 10https://gerrit.wikimedia.org/r/49678 (owner: 10Ottomata) [03:51:58] (03CR) 10MZMcBride: "I still think opt-in is unnecessary. We should trust local admins to use (or not use) tools appropriately. But this is a step in the right" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) (owner: 10Florianschmidtwelzow) [04:02:41] (03PS1) 10Dzahn: fix all 'variable not enclosed by {}' [puppet] - 10https://gerrit.wikimedia.org/r/189898 [04:03:28] (03CR) 10jenkins-bot: [V: 04-1] fix all 'variable not enclosed by {}' [puppet] - 10https://gerrit.wikimedia.org/r/189898 (owner: 10Dzahn) [04:06:59] (03CR) 10Glaisher: [C: 031] "I was hoping that the other one could be merged before this so that we won't run into annoying merge conflict but oh well." [puppet] - 10https://gerrit.wikimedia.org/r/189195 (https://phabricator.wikimedia.org/T88776) (owner: 10John F. Lewis) [04:09:15] (03CR) 10Glaisher: "Where is the community discussion for this? It might also be useful to create a task for this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189661 (owner: 10Cenarium) [04:11:02] springle, oh! we get User vs. UserRightsProxy etc. for free in the binlogs? [04:11:03] nice [04:12:33] Krenair: it stores the comments verbatim [04:12:45] right, but MW puts them there in the first place [04:12:49] yep [04:12:58] Thank you to whoever wrote that. [04:13:05] :) [04:13:06] And thanks springle for going through those logs :) [04:13:17] yw [04:14:11] springle, do you have the timestamp for that zhwiki entry? [04:14:59] and that earliest centralauth entry as well [04:16:01] (03PS2) 10Dzahn: fix all 'variable not enclosed by {}' [puppet] - 10https://gerrit.wikimedia.org/r/189898 [04:16:33] Krenair: that was hard to include in the grep, but let me find those specific ones... [04:21:58] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1029890 (10BBlack) I pulled that log line from looking directly at /var/log/nginx/wikimedia.org.err... [04:22:31] Krenair: first centralauth entry 2015-02-10 23:48:15, zhwiki entry 2015-02-11 01:45:04 [04:22:39] great, thanks [04:27:42] Krenair: made a mistake trying to speed up S3. posted the updated results there, 3 hits [04:28:13] right. can we get the timestamps of those UserRightsProxy ones?
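For the record, the recovery technique springle is applying here: the masters keep about a week of binary logs, so a time-bounded scan can pull back every write to the affected table along with the verbatim comments MediaWiki stored. A sketch of such a scan; the binlog file name and datadir are placeholders, and the window start comes from the "20:28 UTC yesterday" estimate above:

```
# Scan a master's binary log for writes to one table in the affected
# window (T89205). File name and datadir are placeholders.
mysqlbinlog \
    --start-datetime='2015-02-10 20:28:00' \
    --stop-datetime='2015-02-11 02:00:00' \
    --database=centralauth \
    /srv/sqldata/db1033-bin.000142 \
  | grep -B 2 -i 'global_user_groups'

# If the events are row-based rather than statement-based, decode them to
# readable pseudo-SQL first:
#   mysqlbinlog --base64-output=DECODE-ROWS --verbose ...
```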
[04:32:24] springle, actually, those testwiki entries came after we deployed the fix [04:32:33] so they were logged fine [04:32:44] just the anwiktionary one needed [04:33:44] argh, no, sorry, that came after as well. I just looked springle [04:33:50] Krenair: ah well posted them all [04:33:52] :) [04:33:58] Sorry :/ [04:34:06] np [04:35:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Feb 11 04:34:30 UTC 2015 (duration 34m 29s) [04:35:40] Logged the message, Master [04:44:42] (03CR) 10Dzahn: "chris, you can safely abandon this. IPv6 has been added to dumps in a different patch (and used the other interface eth2)" [puppet] - 10https://gerrit.wikimedia.org/r/183061 (owner: 10Cmjohnson) [04:54:19] !log restarting labsdb1002 https://lists.wikimedia.org/pipermail/labs-l/2015-February/003354.html [04:54:22] Logged the message, Master [04:55:53] PROBLEM - puppet last run on labsdb1002 is CRITICAL: CRITICAL: Puppet last ran 5 hours ago [04:57:02] RECOVERY - puppet last run on labsdb1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:25:44] (03PS1) 10Springle: incorrect private variable reference, mysql_repl_pass [puppet] - 10https://gerrit.wikimedia.org/r/189906 [05:26:49] (03CR) 10Springle: [C: 032] incorrect private variable reference, mysql_repl_pass [puppet] - 10https://gerrit.wikimedia.org/r/189906 (owner: 10Springle) [05:34:04] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: Puppet has 1 failures [05:36:21] (03CR) 10Cenarium: "The community discussion was provided in the bug request for the previous commit, T59073. It's at https://en.wikipedia.org/wiki/Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189661 (owner: 10Cenarium) [05:40:26] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: Puppet has 1 failures [05:51:04] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:57:05] RECOVERY - puppet last run on elastic1009 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:09:31] (03PS1) 10Springle: clean up basic monitoring checks for all mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/189907 [06:12:08] (03CR) 10Springle: [C: 032] clean up basic monitoring checks for all mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/189907 (owner: 10Springle) [06:28:25] (03PS1) 10Springle: repool db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189908 [06:28:43] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:45] (03CR) 10Springle: [C: 032] repool db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189908 (owner: 10Springle) [06:28:54] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:24] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:33] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:46] !log springle Synchronized wmf-config/db-eqiad.php: repool db1057 (duration: 00m 05s) [06:29:53] Logged the message, Master [06:29:55] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:04] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:14] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:14] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet 
last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:36] 500s from strontium. but it seems fine now [06:33:04] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:02] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:46:42] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:52] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:31] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:47:52] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:47:52] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:01] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:22] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:56:49] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1030066 (10Krassotkin) In my opinion, progress bar still work incorrectly. It displays the download... [07:02:28] (03CR) 10Vogone: "I do not agree, it is very easy to break things with the translate extension, even unintentionally and when doing it in good faith, and fi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) (owner: 10Florianschmidtwelzow) [07:19:28] 3ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Provide proxy details to use for Yandex - https://phabricator.wikimedia.org/T89117#1027739 (10Arrbee) [07:21:07] 3ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Separate config for Beta and Production for CXServer - https://phabricator.wikimedia.org/T88793#1030079 (10Arrbee) [08:05:19] (03CR) 10Gerardduenas: "@Matanya there are three active editors in the wiki. The active administrator is one of them. 
The case is the administrator doesn't know h" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187915 (https://phabricator.wikimedia.org/T85713) (owner: 10Glaisher) [08:09:16] 3operations: Our custom php packages need to create some conf.d links - https://phabricator.wikimedia.org/T89157#1030112 (10Joe) a:5GLavagetto>3Joe [08:33:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [08:36:54] that's a planned maint [08:48:42] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1030142 (10Tgr) >>! In T89018#1030066, @Krassotkin wrote: > In my opinion, progress bar still work... [08:53:41] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030149 (10Nemo_bis) Please don't add this discussion to #HTTPS-by-default. I watch that project to read about HTTPS by default, not about Phabricator processes, and there is no ignore flag in Pha... [08:58:54] (03CR) 10Matanya: [C: 031] "on little nitpick." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/189898 (owner: 10Dzahn) [08:59:59] (03CR) 10Matanya: "Ok, i see the logic, but i feel uncomfortable with it for some reason." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187915 (https://phabricator.wikimedia.org/T85713) (owner: 10Glaisher) [09:03:01] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1030165 (10Joe) Some caveats: - Whenever moving a server, we need to change the IP in a few places: # puppet/hieradata/eqiad.yml (adding a label like "shard_N"... [09:08:32] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1030171 (10faidon) What I understood from IRC yesterday is: * mc1001-mc1006 will stay in the existing rack (A5) * mc1007-mc1012 will move to C8 * mc1013-mc1018 will... [09:11:07] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1030174 (10akosiaris) @BBlack, I have a different theory regarding the nginx 1.1.x vs 1.6.x behavio... [09:11:23] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1030175 (10Joe) @faidon yes I meant that, I'd just like to reduce the number of potential issues while moving the servers. [09:19:15] good "I can't remember what time it is" [09:23:08] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1030182 (10Gilles) @akosiaris that sounds consistent with what we've been experiencing when reprodu... 
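A rough way to reproduce the T89018 symptom by hand, since the thread hinges on how the nginx SSL terminators handle large chunk uploads: push one oversized UploadWizard-style chunk at the API and watch whether the connection is cut. Every value below is an illustrative placeholder, not a complete working API call (a real one also needs an edit token and a filename):

```
# Hand-rolled repro sketch for T89018: one large multipart chunk POSTed
# through the SSL terminator. All values are placeholders.
dd if=/dev/urandom of=/tmp/chunk.bin bs=1M count=64

curl -v -X POST \
    -F 'action=upload' \
    -F 'stash=1' \
    -F 'offset=0' \
    -F 'filesize=67108864' \
    -F 'token=PLACEHOLDER' \
    -F 'chunk=@/tmp/chunk.bin' \
    'https://commons.wikimedia.org/w/api.php?format=json'
# A terminator-side abort shows up as the transfer dying mid-upload with no
# matching error in the MediaWiki logs, which is what the task describes.
```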
[09:23:09] (03PS1) 10KartikMistry: WIP: Give apertium-admins access to kartik [puppet] - 10https://gerrit.wikimedia.org/r/189915 [09:23:26] 3operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030183 (10KartikMistry) 3NEW a:3happy5214 [09:23:44] (03PS1) 10QChris: Revert "Temporarily keep 40 instead of 31 days of webrequest data" [puppet] - 10https://gerrit.wikimedia.org/r/189916 [09:23:48] Feel free to take this up, Ops! :) [09:24:24] (03CR) 10jenkins-bot: [V: 04-1] WIP: Give apertium-admins access to kartik [puppet] - 10https://gerrit.wikimedia.org/r/189915 (owner: 10KartikMistry) [09:24:25] kart_: ? this ? being T89222 ? [09:24:32] yes. [09:24:38] hello akosiaris :) [09:24:43] hey [09:24:51] akosiaris: hope that I've done right. [09:25:13] jenkins is not approving though [09:25:27] I 'll triage it [09:25:59] who is Alexander Jones ? [09:26:07] kart_: you assigned the ticket to him [09:26:16] blah [09:26:22] I assume you wanted to assign it to me ? [09:26:39] 3operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030200 (10KartikMistry) a:5happy5214>3akosiaris [09:26:49] He must be happy now ;) [09:27:01] 3operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030202 (10akosiaris) p:5Triage>3Normal [09:27:33] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Restore asw2-a5-eqiad redundant power - https://phabricator.wikimedia.org/T88792#1030203 (10faidon) p:5Unbreak!>3High [09:29:46] 3operations, ops-eqiad: cr1-eqiad power supply fan failure - https://phabricator.wikimedia.org/T89224#1030206 (10faidon) 3NEW [09:30:48] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030225 (10Qgil) [09:30:49] 3operations, ops-eqiad: cp1070 hardware failure - https://phabricator.wikimedia.org/T88889#1030224 (10faidon) [09:31:03] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#987167 (10Qgil) [09:31:15] 3operations, ops-eqiad: Rack Setup new diskshelf for labstore1001 - https://phabricator.wikimedia.org/T88802#1030227 (10faidon) p:5Triage>3Normal [09:31:25] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR alexandros kosiaris known, tracked in RT #9194 [09:31:36] 3operations, ops-eqiad: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1030240 (10faidon) p:5Triage>3Normal [09:32:09] (03PS2) 10KartikMistry: WIP: Give apertium-admins access to kartik [puppet] - 10https://gerrit.wikimedia.org/r/189915 [09:32:40] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1030243 (10faidon) What is left to be done here? [09:34:30] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1030248 (10faidon) [09:35:32] 3operations, ops-eqiad: mc1016 mgmt not working - https://phabricator.wikimedia.org/T82259#1030251 (10faidon) p:5Normal>3High Bumping priority because of being a dependency in the split memcache task, which is prio: high. 
[09:35:35] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#915542 (10faidon) [09:35:37] 3operations, ops-eqiad: mc1016 mgmt not working - https://phabricator.wikimedia.org/T82259#1030255 (10faidon) 5stalled>3Open [09:38:15] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030264 (10Qgil) >>! In T87262#1028378, @Krenair wrote: > This broke the policy at https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects#New_projects that all project creations... [09:41:03] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1030272 (10faidon) [09:44:01] 3operations: Puppet broken on silver.wikimedia.org - https://phabricator.wikimedia.org/T88513#1030277 (10akosiaris) 5Open>3Resolved Since puppet is now running I am resolving this. The DB move should be tracked in a different task [09:45:15] <_joe_> akosiaris: someone solved that and didn't see the ticket I'd say [09:45:53] _joe_: probably [09:46:20] _joe_: so... [09:46:21] https://phabricator.wikimedia.org/T84819 [09:46:45] root@mw1041:~# dmesg |grep -c CMCI [09:46:45] 300 [09:47:18] so is it broken or is it not [09:48:17] <_joe_> well, hhvm seems to be working well, but lemme take a better look [09:49:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [09:49:40] 3hardware-requests, ops-codfw, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1030294 (10faidon) p:5Normal>3High [09:50:39] 3operations, ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1030297 (10faidon) p:5Normal>3High [09:50:51] 3operations, ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#976931 (10faidon) So, what's left to be done here? [09:51:17] <_joe_> paravoid: I expressedly asked to have the memcached hosts and the redis hosts as soon as possible [09:51:51] 3operations, ops-codfw: Set up pdu's - https://phabricator.wikimedia.org/T84416#1030303 (10faidon) a:5Cmjohnson>3Papaul @Papaul, what's left here? 
[09:51:53] <_joe_> and racking the remaining appservers can wait after that [09:52:12] _joe_: last update on the ticket says "all appservers are racked" [09:52:26] <_joe_> oh, ok [09:52:40] so if there's something silly like mgmt remaining, it needs to be prioritized as high [09:52:52] after rbd/rbf (rbf was prio: normal btw) [09:53:01] rdb even [09:53:17] https://phabricator.wikimedia.org/maniphest/?statuses=open%2Cstalled&allProjects=PHID-PROJ-heihjeaiasruuvneirzh#R [09:53:22] you can drag-n-drop here [09:53:30] even within the same priority, apparently :) [09:55:22] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 620 [09:58:17] !log restarting Jenkins to upgrade the Credentials plugin [09:58:22] Logged the message, Master [09:59:48] 3operations: Set up cr1-eqord & cr1-eqdfw - https://phabricator.wikimedia.org/T89227#1030334 (10faidon) 3NEW [10:00:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 140642 Threads: 2 Questions: 2413040 Slow queries: 1007 Opens: 2859 Flush tables: 2 Open tables: 64 Queries per second avg: 17.157 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:00:52] 3operations, ops-codfw: Please rack & connect the Tampa MX80s in row D - https://phabricator.wikimedia.org/T84658#1030343 (10faidon) p:5Normal>3Low [10:01:04] 3operations, ops-codfw: Please rack & connect the Tampa MX80s in row D - https://phabricator.wikimedia.org/T84658#929941 (10faidon) [10:01:05] 3operations: Set up cr1-eqord & cr1-eqdfw - https://phabricator.wikimedia.org/T89227#1030347 (10faidon) [10:01:12] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [10:02:02] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [10:02:17] 3operations, ops-codfw: Please rack & connect the Tampa MX80s in row D - https://phabricator.wikimedia.org/T84658#929941 (10faidon) >>! In T84658#929975, @mark wrote: > Oh yeah, besides serial management, connect management ethernet too (to the rack management switch) Is this done too? Also, has this: > We'll... [10:04:01] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 138, initializing_shards: 0, number_of_data_nodes: 3 [10:05:16] 3operations, ops-codfw: Where to put the netapp (nas1) in codfw - https://phabricator.wikimedia.org/T84796#1030353 (10faidon) 5Open>3Resolved [10:06:20] 3operations, ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1030362 (10faidon) [10:06:23] 3operations, ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#976946 (10faidon) What's left here? 
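The SLOW_SLAVE alert on db1008 above is keyed off the lag the replica reports about itself, so the same numbers can be read back by hand to confirm a recovery (host name as in the alert):

```
# What the check_mysql SLOW_SLAVE alert is measuring: replica-reported lag.
mysql -h db1008.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' \
    | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# Healthy output: both threads "Yes" and Seconds_Behind_Master near 0,
# matching the RECOVERY line above.
```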
[10:09:41] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 47 threshold =0.1% breach: status: yellow, number_of_nodes: 2, unassigned_shards: 46, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 1, number_of_data_nodes: 2 [10:09:56] 3operations, ops-codfw: codw pfw* serial connections problem - https://phabricator.wikimedia.org/T84737#1030366 (10faidon) As far as I remember from back then, this was debugged in the end as faulty pins on both of the serial ports of the SRXes. Multiple reports on the web confirmed that this was a hardware desi... [10:10:22] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 47 threshold =0.1% breach: status: yellow, number_of_nodes: 2, unassigned_shards: 46, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 1, number_of_data_nodes: 2 [10:10:29] what's going on with logstash? [10:10:32] anyone looking? [10:10:37] godog maybe? [10:14:21] nope, taking a look [10:14:47] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1030391 (10faidon) [10:16:02] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 138, initializing_shards: 0, number_of_data_nodes: 2 [10:16:02] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 138, initializing_shards: 0, number_of_data_nodes: 2 [10:16:35] akosiaris: yeah clinic doctor. zeljkof is missing in the WMF-NDA Phabricator group. Any specfic procedure to follow to have him added in please ? 
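The flapping logstash checks above are thin wrappers around Elasticsearch's cluster health endpoint; the raw view (node address taken straight from the alert text) shows the same shard counts the bot keeps quoting:

```
# The same data the icinga check fetches, straight from the health endpoint.
curl -s 'http://10.64.32.136:9200/_cluster/health?pretty'
# status green  -> every shard allocated
# status yellow -> some replica shards unassigned ("inactive shards" above)
# status red    -> primary shards missing, i.e. data unavailable
```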
[10:16:41] mhh these two recovered by themselves [10:16:56] 3operations: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1030401 (10faidon) [10:18:17] hashar: yeah, he should answer on this https://phabricator.wikimedia.org/T87597 [10:18:36] missing shell user name, RSA/DSA keys etc [10:18:49] I see he signed the https://phabricator.wikimedia.org/L3 which is nice [10:18:51] !log restart elasticsearch on logstash1003, OOM [10:19:00] Logged the message, Master [10:19:00] but the rest is still missing [10:19:22] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 44 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 42, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 94, initializing_shards: 2, number_of_data_nodes: 3 [10:19:22] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 41, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 95, initializing_shards: 2, number_of_data_nodes: 3 [10:21:15] 3operations, ops-eqiad: Replace asw-c5-eqiad or asw-c8-eqiad with EX4550 - https://phabricator.wikimedia.org/T82509#1030421 (10faidon) [10:21:17] 3operations, ops-eqiad: Replace asw-c5-eqiad or asw-c8-eqiad with EX4550 - https://phabricator.wikimedia.org/T82509#901912 (10faidon) [10:21:21] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 135, initializing_shards: 2, number_of_data_nodes: 3 [10:21:32] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 135, initializing_shards: 2, number_of_data_nodes: 3 [10:21:32] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 135, initializing_shards: 2, number_of_data_nodes: 3 [10:24:29] 3operations: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1030431 (10hashar) Random possibility: logstash has a plugin to act as a SNMP trap receiver http://logstash.net/docs/1.4.2/inputs/snmptrap For BGP peelings there must be a Nagios plugin handling it. IIRC there is a standard MIB that list... [10:26:22] hashar: we know how to do it, it's just that we haven't had the time too (although BGP peerings *are* monitored nowadays) :) [10:26:40] that's how it works usually, these are mostly "we should do that" more than "wonder how to do that" [10:27:19] 3operations, ops-ulsfo: fan reversed on asw1-ulsfo - https://phabricator.wikimedia.org/T83978#1030437 (10faidon) >>! In T83978#1004360, @Gage wrote: > Erroneous part is in SF office awaiting return shipment. Has this been done yet? 
Is there a procurement ticket tracking this? [10:30:53] paravoid: hey! Yeah sorry for starting the obvious (use snmp for monitoring). I forgot you used to work in a networking organization and must know about it already :D [10:31:02] 3operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030453 (10faidon) Logs shouldn't be local nor should be inspected on the machines themselves. These should be sent to fluorine, to which you should have access already. [10:31:09] maybe openNMS has build in support for ton of Juniper mibs already [10:31:40] akosiaris: thx :) [10:31:52] zeljkof: seems you could use your shell access to be granted which is https://phabricator.wikimedia.org/T87597 [10:32:06] zeljkof: I guess we can then get you added to the Phabricator WMF-NDA group [10:32:36] hashar: ok, will work on shell access today [10:32:46] akosiaris: should we fill a Task to get zeljkof added to the wmf-nda group? [10:33:42] hashar: yes, please do [10:33:54] and I 'll do the actual adding [10:37:46] zeljkof: your turn. Just create a task to get you added to WMF-NDA group :) Should be filled against project #Ops-Access-Requests [10:39:25] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030470 (10faidon) >>! In T87262#1030264, @Qgil wrote: > Theoretically it is possible to have private projects, but there should be a reason for that. In this case, no reason has been presented so... [10:41:25] 3Ops-Access-Requests: Please add me to WMF-NDA group - https://phabricator.wikimedia.org/T89230#1030476 (10zeljkofilipin) 3NEW [10:41:42] akosiaris: ^ [10:47:08] 3Ops-Access-Requests: Please add me to WMF-NDA group - https://phabricator.wikimedia.org/T89230#1030490 (10akosiaris) 5Open>3Resolved p:5Triage>3Normal a:3akosiaris [10:47:25] zeljkof: ^ [10:47:38] akosiaris: thanks :) [10:48:03] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030497 (10Krenair) >>! In T87262#1030470, @faidon wrote: >>>! In T87262#1030264, @Qgil wrote: >> Theoretically it is possible to have private projects, but there should be a reason for that. In t... [10:48:26] paravoid: blah. I was checking cxserver logs on sca1001. [10:48:29] akosiaris: ^ [10:51:42] 3operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030501 (10KartikMistry) Sure. apertium-admin request still stands like cxserver-admin. Alex, can you give me insight on how we can move logs to fluorine? [10:55:36] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1030507 (10hashar) Related, @zeljkofilipin has been added to the WMF-NDA Phabricator group ( T89230 ). [11:02:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [11:04:49] 3ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Provide proxy details to use for Yandex - https://phabricator.wikimedia.org/T89117#1030525 (10akosiaris) 5Open>3Resolved Hello, we will be using url-downloader.wikimedia.org, TCP port 8080 in production. In... 
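To sanity-check the proxy details just posted on T89117 from one of the service hosts, a plain curl through the named proxy is enough; the target URL below is only an example, not necessarily the endpoint cxserver will use:

```
# Verify outbound reachability via the proxy from T89117.
curl -sI -x http://url-downloader.wikimedia.org:8080 https://translate.yandex.net/ | head -1
```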
[11:08:41] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030535 (10Qgil) [11:16:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:17:31] 3operations, Project-Creators, Phabricator: Create projects for Ops goals - https://phabricator.wikimedia.org/T87262#1030544 (10faidon) [11:18:50] 3operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030548 (10faidon) This isn't something that should be solved on a case-by-case basis and certainly shouldn't be solved after the fact with ops involvement. We have two pieces of infrastructure right now: flu... [11:45:20] (03PS1) 10QChris: Mark udp2log jobs that are duplicated already on Hive [puppet] - 10https://gerrit.wikimedia.org/r/189926 [12:05:21] (03PS14) 10KartikMistry: WIP: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [12:20:50] (03CR) 10QChris: Correcting docs and thresholds for eventlogging alarms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/189588 (owner: 10Nuria) [12:46:33] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:52] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.525 second response time [12:54:56] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1030703 (10Gilles) Should we also change the 500 happening when people request a larger (or equal) size than the original to a 400? [12:55:12] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1030705 (10Gilles) p:5Triage>3Normal [12:55:32] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 27 failures [13:07:21] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1030766 (10Gilles) [13:31:13] 3operations, Datasets-General-or-Unknown: Dumps (or dump progress page) stuck since 28 Jan - https://phabricator.wikimedia.org/T88209#1030833 (10ezachte) Please see https://phabricator.wikimedia.org/T85970 for stats on wikidata dumps. Nutshell: run time almost doubled in 6 months June 2014: 195 hrs November 20... [13:33:22] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:41:12] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. 
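On the T89222 point above, that service logs belong on the central log host rather than on the service machines themselves: MediaWiki-style logging reaches fluorine over udp2log. A sketch of that path, with the port number and file layout as assumptions:

```
# Sketch of shipping a service log line to the central udp2log collector
# instead of keeping it local (T89222). Port 8420 and the log path are
# assumptions for illustration.
echo "cxserver $(date -u +%FT%TZ) sample log line" | nc -u -w1 fluorine.eqiad.wmnet 8420

# Reading it back on the collector:
#   ssh fluorine.eqiad.wmnet tail -f /a/mw-log/cxserver.log
```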
[13:46:22] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 20 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 18, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 118, initializing_shards: 2, number_of_data_nodes: 3 [13:46:32] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 20 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 18, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 118, initializing_shards: 2, number_of_data_nodes: 3 [13:46:41] godog: ^ ? [13:47:32] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 2, number_of_data_nodes: 3 [13:47:41] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 2, number_of_data_nodes: 3 [13:47:41] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 2, number_of_data_nodes: 3 [13:47:57] hands-off management :P recovery without action! [14:14:59] (03PS1) 10Rush: phab link phd to /etc/init.d for service management [puppet] - 10https://gerrit.wikimedia.org/r/189951 [14:15:52] (03CR) 10jenkins-bot: [V: 04-1] phab link phd to /etc/init.d for service management [puppet] - 10https://gerrit.wikimedia.org/r/189951 (owner: 10Rush) [14:17:02] (03PS2) 10Rush: phab link phd to /etc/init.d for service management [puppet] - 10https://gerrit.wikimedia.org/r/189951 [14:19:17] (03PS3) 10Rush: phab link phd to /etc/init.d for service management [puppet] - 10https://gerrit.wikimedia.org/r/189951 [14:19:41] (03CR) 10Rush: [C: 032] phab link phd to /etc/init.d for service management [puppet] - 10https://gerrit.wikimedia.org/r/189951 (owner: 10Rush) [14:20:15] (03CR) 10Rush: [V: 032] phab link phd to /etc/init.d for service management [puppet] - 10https://gerrit.wikimedia.org/r/189951 (owner: 10Rush) [14:26:53] 3operations, Wikimedia-Git-or-Gerrit: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1031006 (10Dereckson) >>! In T37611#1025228, @Dzahn wrote: > unless we still add the iptables rule on ytterbium itself (that's different from my abandoned patch above which expected we'd have t... [14:27:33] (03PS1) 10Rush: phab local.json should trigger a phd restart [puppet] - 10https://gerrit.wikimedia.org/r/189953 [14:29:03] (03CR) 10Rush: [C: 032] phab local.json should trigger a phd restart [puppet] - 10https://gerrit.wikimedia.org/r/189953 (owner: 10Rush) [14:33:42] chasemp == Rush right? 
[14:33:47] yes [14:34:00] too many different names/nicks etc :] [14:34:05] maybe even rush === chasemp [14:34:10] yes :) [14:34:15] I was not wise at the time [14:35:23] well Gerrit considers my real name is "Hashar" [14:35:39] have to write down all that Rush is Chase [14:36:16] ottomata: hey [14:36:29] it's all verbs [14:36:29] ottomata: welcome back :) [14:37:55] hiya! [14:37:57] thanks! [14:38:04] how's it goiiiin? (so many emails!) [14:43:07] bonjour andrew! [14:46:10] (03PS1) 10Rush: phab fail back to mysql search for now [puppet] - 10https://gerrit.wikimedia.org/r/189959 [14:50:53] good mornin! [14:53:19] hei hei ottomata welcome back [14:56:09] heyoooo [15:00:04] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150211T1500). [15:00:13] (03CR) 10Rush: [C: 032] phab fail back to mysql search for now [puppet] - 10https://gerrit.wikimedia.org/r/189959 (owner: 10Rush) [15:02:28] "fail back" [15:02:51] it's the more fun version of retreat [15:03:01] :D [15:18:15] anomie: I'll do the SWAT today - I'll build the submodule updates for it [15:18:25] manybubbles: Ok! [15:19:26] manybubbles: Note it'll have to be a full scap, since it's i18n changes. :/ [15:19:43] k. I haven't scap-ed in months. I assume it'll be ok [15:20:49] manybubbles: 30-40 minutes, judging by recent SAL entries. [15:20:59] <^d> 17m yesterday [15:21:18] That was no changes though. [15:21:35] <^d> There were i18n picked up, had to rebuild :) [15:21:43] 34m for the one before that "syncing ZeroBanner i18n" [15:25:23] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago [15:26:03] 3operations, Analytics-Cluster: Increase and monitor Hadoop NameNode heapsize - https://phabricator.wikimedia.org/T89245#1031236 (10Ottomata) [15:26:22] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:32:05] 3operations, Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1031238 (10Ottomata) [15:33:02] 3operations, Wikimedia-Git-or-Gerrit: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1031239 (10faidon) p:5Normal>3Low The idea was to use the REDIRECT target. However, this won't work for IPv6 unless we upgrade to a more recent kernel (3.7+ I believe). Honestly... I'm not... [15:33:42] PROBLEM - Disk space on cp1064 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%): [15:45:13] 3operations, ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1031266 (10Papaul) what left: update racktable wiried mgt & data network wired power setup mgt settings [15:45:44] 3operations, ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1031267 (10Papaul) what left: update racktable wiried mgt & data network wired power setup mgt settings [15:46:42] 3operations, ops-codfw: codw pfw* serial connections problem - https://phabricator.wikimedia.org/T84737#1031268 (10Papaul) ok will coordinate with Rob or Chris to do that [15:46:51] RECOVERY - Disk space on cp1064 is OK: DISK OK [15:49:30] 3operations, ops-codfw: Set up pdu's - https://phabricator.wikimedia.org/T84416#1031285 (10Papaul) I supposed to set up root login information; that was done i don't know if Chris has anything to do on this task. if not i will close the task. @chris [15:50:42] chasemp: how is phab update going? 
I was wondering if I could start my SWAT window a few minutes early [15:50:56] go for it [15:51:01] thanks [15:51:06] I'm just swapping out search backend and reindexing etc [15:51:13] been tracking the performance with springle a bit [15:51:16] but we should be good [15:51:18] cool [15:51:34] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1031300 (10hashar) From our 1/1 make sure to use a dedicated key to connect to the wikimedia production cluster and configure your ssh client to only use that one: ``` Host *.wikimedia.org *.wmnet Id... [15:51:36] I've just +2ed my submodule updates. I'll wait for jenkins to verify them which will probably take 10 minutes anyway [15:52:22] (03PS3) 10Nuria: Correcting docs and thresholds for eventlogging alarms [puppet] - 10https://gerrit.wikimedia.org/r/189588 [15:53:05] (03CR) 10Nuria: Correcting docs and thresholds for eventlogging alarms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/189588 (owner: 10Nuria) [15:53:10] (03PS2) 10Ottomata: Revert "Temporarily keep 40 instead of 31 days of webrequest data" [puppet] - 10https://gerrit.wikimedia.org/r/189916 (owner: 10QChris) [15:53:39] 3operations, Wikimedia-Git-or-Gerrit: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1031305 (10fgiunchedi) FWIW I think the port is only part of the story, what's useful is tooling that will DTRT and setup other things like the commit hook. I've been using such a script for so... [15:55:26] manybubbles: myself and ^d eventually understood what was wrong with restarting ES! instructions on wikitech weren't correct, I'm going to bounce elastic1003 now with es-tool fast-restart (instructions fixed tho) [15:55:27] (03CR) 10Ottomata: [C: 032 V: 032] Revert "Temporarily keep 40 instead of 31 days of webrequest data" [puppet] - 10https://gerrit.wikimedia.org/r/189916 (owner: 10QChris) [15:55:52] godog: sorry! thanks [15:56:18] (03PS2) 10Giuseppe Lavagetto: base: add the service_unit init wrapper [puppet] - 10https://gerrit.wikimedia.org/r/189753 [15:56:36] manybubbles: not to worry! [15:57:42] anomie: I'll just scap both the i18n changes at the same time [15:58:33] !log restart elasticsearch on elastic1003 [15:58:57] morebots: ? [15:58:57] I am a logbot running on tools-exec-11. [15:58:57] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [15:58:57] To log a message, type !log . [15:59:36] !log restart elasticsearch on elastic1003 [15:59:56] Y U NO LOG [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150211T1600). [16:00:22] manybubbles: Good idea [16:01:31] <^d> manybubbles: My fault. The updated instructions basically needed you to wait for green before enabling non-primary replication. That condition never happens. 
[16:01:48] ^d: ah, yeah, that doesn't really happen :) [16:01:52] <^d> Also explains why disk started filling up :) [16:02:06] I believe the instructions on elasticsearch's site were similarly busted for a while [16:02:11] <^d> (had old replicas it couldn't let go of, took hold of new primaries) [16:02:14] someone fixed them and I reviewed the pull request and +1ed it [16:02:19] yeah [16:02:23] ^d: I've stuck a lame curl in there, should be enough [16:02:37] <^d> That's what they used to do so no big deal :) [16:02:44] <^d> es-tool was just meant to make it easier hehe [16:03:09] indeed, I should have used that in the first place [16:03:38] <^d> manybubbles: So the only bug, and I can't even really call it one, is that "restarting a node with non-primary replication disabled results in disk slowly filling up" [16:03:53] <^d> Because it's sorta expected behavior from the allocation [16:04:02] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [16:04:39] !log manybubbles Started scap: SWAT i18n changes for ZeroBanner [16:04:59] ^d: thats bad-ish [16:05:01] (03PS2) 10Ottomata: Mark udp2log jobs that are duplicated already on Hive [puppet] - 10https://gerrit.wikimedia.org/r/189926 (owner: 10QChris) [16:05:54] (03CR) 10Ottomata: [C: 032] Fix bad symlinks for kafka-common [debs/kafka] - 10https://gerrit.wikimedia.org/r/187648 (owner: 10Mattrobenolt) [16:05:59] (03CR) 10Ottomata: [V: 032] Fix bad symlinks for kafka-common [debs/kafka] - 10https://gerrit.wikimedia.org/r/187648 (owner: 10Mattrobenolt) [16:06:08] 5xx is me, it was a short spike [16:06:48] (03CR) 10Ottomata: [C: 032] Mark udp2log jobs that are duplicated already on Hive [puppet] - 10https://gerrit.wikimedia.org/r/189926 (owner: 10QChris) [16:07:27] <^d> manybubbles: Bad enough to throw a bug over the wall? Not bad enough to bother fixing it. Easy enough to avoid: "don't do that" [16:07:36] too late to sneak another patch into swat? [16:11:22] manybubbles: scap cries when deployment is done, right? [16:11:36] kart_: it should [16:11:41] ebernhardson: eh - probably not [16:11:42] It better [16:11:48] its not done yet :) [16:12:06] * bd808 guesses 19 more minutes [16:12:15] maybe only 11 [16:12:23] * manybubbles grin [16:13:01] manybubbles: ok i'll add them, they are already prepped submodule bumps against core. [16:13:25] dbbot-wm: prepped submodule updates against core are pretty much perfect [16:14:31] A new branch scap has been clocking in around 30 minutes. The "no-op" scap Chad did yesterday was something like 17 minutes. When we figure out how to do something different for l10n caches it will be much faster. [16:15:02] If we switched to a diff mechanism that didn't recreate the diffs per-host it would be faster still [16:15:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:16:06] how are the logs doing this fine morning? [16:16:16] :) [16:16:44] There's a patch for core. I should make sure ^d and twentyafterfour are cc'd on it [16:17:10] manybubbles: added to schedule. https://gerrit.wikimedia.org/r/189983 and https://gerrit.wikimedia.org/r/189984 [16:17:13] thanks [16:17:32] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out.
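The "diff mechanism that didn't recreate the diffs per-host" idea, purely as an illustration: compute one delta on the deploy host and ship the same blob everywhere, instead of letting every target recompute it. This treats the l10n cache as a flat dict; scap's real CDB handling is more involved, and none of these names come from scap.

    import json

    def l10n_delta(old, new):
        # computed once on the deploy host instead of once per target
        changed = {k: v for k, v in new.items() if old.get(k) != v}
        removed = [k for k in old if k not in new]
        return {"changed": changed, "removed": removed}

    def apply_delta(cache, delta):
        # cheap to apply on each of the hundreds of target hosts
        cache.update(delta["changed"])
        for k in delta["removed"]:
            cache.pop(k, None)
        return cache

    old = {"en:ok": "OK", "en:save": "Save"}
    new = {"en:ok": "OK", "en:save": "Save page", "en:cancel": "Cancel"}
    blob = json.dumps(l10n_delta(old, new))  # one blob, shipped to every host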
[16:17:32] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [16:17:49] stupid sad logstash cluster [16:18:07] getting hugged to death? [16:18:07] kart_: I've never done the cxserver - that one might take me a bit of reading before I can do the deploy there [16:18:20] ebernhardson: just eating itself to death [16:18:44] ebernhardson: it needs moar rams! [16:19:08] Lots of JVM OOMs and nasty GC pauses [16:19:12] ebernhardson: your patches are funny - I just synced out some update/revert for the same thing. in wmf15 I believe. It looked like a noop so I synced it [16:19:17] at least thats easy to solve :) until the heap gets too big [16:19:33] ebernhardson: 30GB is the max [16:19:36] manybubbles: heh, krenair was doing swat deploy i should probably have double checked [16:19:38] and that is pretty huge [16:19:42] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [16:19:46] greg-g: Here's my proposal for getting mediawiki logs back into logstash -- https://phabricator.wikimedia.org/T88732#1029711 [16:20:17] sync-common @5% [16:21:07] manybubbles: I can do that. [16:21:25] manybubbles: just waiting for SWAT to finish :) [16:21:43] manybubbles: let me know when to go ahead. [16:21:45] kart_: oh, I meant to add another separate window, not in SWAT :) [16:21:45] kart_: k. got it. scap is scapping now and I have two other patches to push next. should be another 15 or 20 [16:21:51] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 138, initializing_shards: 0, number_of_data_nodes: 2 [16:21:51] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 138, initializing_shards: 0, number_of_data_nodes: 2 [16:21:55] if you can do it during scap its probably ok too [16:22:03] like, if that is known safe [16:22:15] greg-g: facepalm. [16:22:47] oh [16:22:59] manybubbles, are you swatting? [16:22:59] <^d> greg-g: Lots of yelling from CentralNotice actually [16:23:01] kart_: :) [16:23:04] yurikR: yeah yeah [16:23:05] ^d: :( [16:23:11] <^d> filing a bug [16:23:50] ^d: ?
I'm looking at logs - maybe grepping it out or something [16:24:06] <^d> `fatalmonitor` on fluorine [16:24:52] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 49 threshold =0.1% breach: status: yellow, number_of_nodes: 2, unassigned_shards: 47, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 89, initializing_shards: 2, number_of_data_nodes: 2 [16:24:53] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 49 threshold =0.1% breach: status: yellow, number_of_nodes: 2, unassigned_shards: 47, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 89, initializing_shards: 2, number_of_data_nodes: 2 [16:24:54] <^d> !phab T89258 [16:25:00] <^d> Meh, filed that [16:25:03] :) [16:25:31] <^d> And I'm going to fix the TorBlock one [16:25:34] <^d> That's getting old [16:25:38] ^d: manybubbles: that inactive shard error occurred a few hours ago but self fixed magically [16:25:56] hashar: that is logstash being unhappy [16:26:07] its cluster is pretty sad [16:26:16] cheer it up! [16:26:21] * bd808 takes a look [16:27:11] <^d> bd808 fixed this on Jan 16? [16:27:17] <^d> Why the hell is that still showing up then? [16:27:40] Which one? TorBlock? [16:27:51] <^d> yeah [16:27:53] * bd808 made a lot of index patches [16:28:03] maybe I missed something? [16:28:04] those are part of wmf16 [16:28:12] so the torblock stuff will go away today ;) [16:28:44] <^d> Ahh ok [16:28:57] !log Elasticsearch dead on logstash1002; restarting [16:30:07] This dance is beyond old [16:30:32] someday we will win the procurement lottery for the logstash cluster [16:30:47] (03PS1) 10Giuseppe Lavagetto: dhcpd: mc1018 and mc1017 are ubuntu precise [puppet] - 10https://gerrit.wikimedia.org/r/189985 [16:30:58] elastic1003 bounced \o/ [16:31:00] <^d> bd808: What's the hold up right now? Do we have specs? [16:31:26] ^d: yeah. just waiting to some to the top of the queue for getting a quote I think [16:31:33] *come to the top [16:32:01] (03PS2) 10Giuseppe Lavagetto: dhcpd: mc1018 and mc1017 are ubuntu precise [puppet] - 10https://gerrit.wikimedia.org/r/189985 [16:32:02] <_joe_> please note that we're building the new DC now [16:32:15] <_joe_> so there are quite a few things in procurement right now [16:32:19] It is especially sad that we can't even deal with the non-MediaWiki traffic now [16:32:19] <_joe_> and restbase, and... [16:32:39] _joe_: *nod* I know it's a busy time [16:32:46] just venting ;) [16:32:52] <_joe_> bd808: I'd say this is an opportunity to chase people with a machete so that they stop logging useless shit [16:33:04] (03PS1) 10Rush: phab update labs instance(s) search to mysql [puppet] - 10https://gerrit.wikimedia.org/r/189990 [16:33:06] (03PS1) 10Rush: phab better header docs for manifests files [puppet] - 10https://gerrit.wikimedia.org/r/189991 [16:33:08] <^d> ...what about stealing just 1 ES box? [16:33:17] <^d> We could live, and it's the exact specs you need. [16:33:46] bandaids will get wet and fall off. we need sutures [16:34:01] or at least super glue [16:34:14] (03CR) 10Giuseppe Lavagetto: [C: 032] dhcpd: mc1018 and mc1017 are ubuntu precise [puppet] - 10https://gerrit.wikimedia.org/r/189985 (owner: 10Giuseppe Lavagetto) [16:34:20] <^d> But we bleed out until the ambulance arrives with them?
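For the record, the heap pressure behind this restart dance is visible over the Elasticsearch nodes stats API; a small sketch, with the hostname as a placeholder:

    import requests

    stats = requests.get("http://logstash1002:9200/_nodes/stats/jvm").json()
    for node_id, node in stats["nodes"].items():
        heap = node["jvm"]["mem"]["heap_used_percent"]
        print("%s heap %d%%" % (node["name"], heap))
        if heap > 90:
            print("  -> OOM/GC-pause territory; expect a bounce soon")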
[16:34:26] (03PS2) 10Rush: phab update labs instance(s) search to mysql [puppet] - 10https://gerrit.wikimedia.org/r/189990 [16:34:26] * ^d is enjoying this metaphor [16:34:39] meh. we have fluorine [16:35:07] to fix the bleeding? eep [16:35:30] <^d> [[w:Biological_aspects_of_fluorine#Medical_applications]]? [16:35:35] (03CR) 10Rush: [C: 032] phab update labs instance(s) search to mysql [puppet] - 10https://gerrit.wikimedia.org/r/189990 (owner: 10Rush) [16:35:46] (03PS2) 10Rush: phab better header docs for manifests files [puppet] - 10https://gerrit.wikimedia.org/r/189991 [16:35:51] as an alternate live log source. nobody used logstash to look at logs anyway ;) [16:36:23] and now logstash1001 is braindead too [16:36:28] frack [16:36:34] !log manybubbles Finished scap: SWAT i18n changes for ZeroBanner (duration: 31m 54s) [16:36:51] (03CR) 10Rush: [C: 032] phab better header docs for manifests files [puppet] - 10https://gerrit.wikimedia.org/r/189991 (owner: 10Rush) [16:37:05] my guess was close [16:37:39] ebernhardson: just +2ed your patches. jenkins should merge soon [16:37:45] 3operations, ops-eqiad: cp1070 hardware failure - https://phabricator.wikimedia.org/T88889#1031444 (10Cmjohnson) a:3Cmjohnson Taking this until fixed [16:37:54] excellent [16:38:04] sigh no wikitech no !log no party [16:38:29] scap looks to have completed just fine - no new exceptions.... [16:38:39] !log Restarted elasticsearch on logstash1001; OOM [16:39:31] manybubbles: I'll go ahead. [16:39:37] kart_: k. go nuts [16:40:02] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 3, number_of_data_nodes: 3 [16:40:02] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 3, number_of_data_nodes: 3 [16:40:02] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 3, number_of_data_nodes: 3 [16:40:41] so, !log isn't working here either? [16:40:48] morebots: whatup yo? [16:40:48] I am a logbot running on tools-exec-11. [16:40:48] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:40:48] To log a message, type !log . [16:40:51] !log tst [16:41:13] greg-g: wikitech problems afaik, see -labs [16:43:19] !log updated cxserver to 84ad472 [16:43:27] Hope that works :) [16:43:30] nope. [16:45:04] (03PS1) 10Rush: phab move logmail script to /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/189993 [16:46:32] kart_: apparently morebots is smart and queues it up. there's a wikitech wiki issue... we'll see if it works or not later :) [16:46:37] (03CR) 10Rush: [C: 032] phab move logmail script to /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/189993 (owner: 10Rush) [16:47:40] greg-g: nice to know. Thanks. [16:48:53] kart_: ok - I'm ready to deploy my last set of changes. 
are you done? [16:49:08] 3operations: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#980209 (10RobH) [16:49:11] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1031457 (10RobH) 5Open>3Resolved a:3RobH This should have been resolved, as this task is resolved. It is a blocking task for the installation of the systems (T86958) which I can now move... [16:50:21] (03PS3) 10Rush: Add documentation link to 'create bug by email' text. [puppet] - 10https://gerrit.wikimedia.org/r/189326 (https://phabricator.wikimedia.org/T865) (owner: 10Merlijn van Deen) [16:50:32] (03CR) 10Rush: [C: 032 V: 032] Add documentation link to 'create bug by email' text. [puppet] - 10https://gerrit.wikimedia.org/r/189326 (https://phabricator.wikimedia.org/T865) (owner: 10Merlijn van Deen) [16:52:00] manybubbles: done [16:52:06] thanks! [16:52:17] oh any service deployment expert? [16:52:21] (03PS3) 10Rush: Observe the remote IP reported by X_FORWARDED_FOR header from proxy server [puppet] - 10https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 1020after4) [16:52:29] kartik@sca1001:/srv/deployment/cxserver/deploy still says old commit. [16:52:34] any issues? [16:53:15] ^d: or akosiaris ^^ [16:53:39] <^d> I know zilch about sca* [16:53:53] need service restart? [16:53:53] !log manybubbles Synchronized php-1.25wmf16/extensions/Echo/: SWAT update (duration: 00m 06s) [16:53:56] ebernhardson: ^^^^^^^^ [16:54:08] kart_: trebuchet restarts don't work, you have to manually do that with dsh [16:54:41] manybubbles: ok testing. thanks [16:54:47] (03PS1) 10Cmjohnson: Updating dns entries for codfwe pdu's [dns] - 10https://gerrit.wikimedia.org/r/189998 [16:55:04] kart_: reason is https://phabricator.wikimedia.org/T63882 [16:55:29] manybubbles: working right in wmf16 [16:55:38] ebernhardson: ok - syncing wmf15 then [16:56:59] !log manybubbles Synchronized php-1.25wmf15/extensions/Echo/: SWAT update flow (duration: 00m 06s) [16:56:59] ebernhardson: ^^^^ [16:57:10] gwicke: reading.. [16:57:14] (03PS1) 10Giuseppe Lavagetto: memcached: add mc1017 to the mediawiki pool as shard 17 [puppet] - 10https://gerrit.wikimedia.org/r/189999 [16:57:16] (03PS1) 10Giuseppe Lavagetto: memcached: add mc1018 to the mediawiki pool as shard 18 [puppet] - 10https://gerrit.wikimedia.org/r/190000 [16:57:32] <_joe_> 190000 [16:57:47] kart_: need any help ? [16:58:03] kart_: is this your first deploy / restart? [16:58:05] (03CR) 10Rush: [C: 04-1] Observe the remote IP reported by X_FORWARDED_FOR header from proxy server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 1020after4) [16:58:05] akosiaris: cxserver on sca1001 is old [16:58:19] akosiaris: while sca1002 is uptodate [16:58:26] gwicke: 2nd. [16:58:29] !log restart elasticsearch on elastic1004 [16:58:30] manybubbles: all looks great. thanks for deploying. [16:58:36] cool! [16:58:42] * manybubbles is done with SWAT [16:58:58] kart_: do you have the rights to do the dsh restart with sudo? [16:58:58] <_joe_> paravoid, cmjohnson if you want to take a look ^^ [16:59:18] gwicke: yes. 
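The drift kart_ is chasing (sca1001 still on the old commit while sca1002 updated) can be caught by comparing HEADs across the target group; a sketch, with hostnames and repo path as assumptions rather than anything trebuchet provides:

    import subprocess

    HOSTS = ["sca1001.eqiad.wmnet", "sca1002.eqiad.wmnet"]  # assumed names
    REPO = "/srv/deployment/cxserver/deploy"

    def head(host):
        out = subprocess.check_output(
            ["ssh", host, "git", "-C", REPO, "rev-parse", "--short", "HEAD"])
        return out.strip().decode()

    heads = {h: head(h) for h in HOSTS}
    print(heads)
    if len(set(heads.values())) > 1:
        print("hosts disagree -- redeploy or restart the stragglers")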
[16:59:31] 'dsh -g sca sudo service cxserver restart' or the like [16:59:37] not sure about the group [16:59:48] oh, also dsh is broken from tin [16:59:53] kart_: I see sca1002 at 8ab3d56 and sca1001 at e85e8df7 [16:59:53] you have to run that from bast1001 [17:00:12] lemme check what it hasn't updated [17:00:15] yes [17:00:49] akosiaris: thanks. [17:00:50] (03PS2) 10Rush: Make Gerrit only comment for published drafts that add new task references [puppet] - 10https://gerrit.wikimedia.org/r/182751 (https://phabricator.wikimedia.org/T77961) (owner: 10QChris) [17:01:01] akosiaris: ping me when something I can do. [17:01:02] (03CR) 10Rush: [C: 031] "I get it, seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/182751 (https://phabricator.wikimedia.org/T77961) (owner: 10QChris) [17:01:04] (03PS1) 10Giuseppe Lavagetto: sessions: add redis server on mc1017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190001 [17:01:06] (03PS1) 10Giuseppe Lavagetto: sessions: add redis server on mc1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190002 [17:01:24] (03CR) 10Rush: [V: 032] "I get it, seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/182751 (https://phabricator.wikimedia.org/T77961) (owner: 10QChris) [17:01:41] (03CR) 10Rush: [C: 032] "I get it, seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/182751 (https://phabricator.wikimedia.org/T77961) (owner: 10QChris) [17:02:01] (03PS1) 10Chad: Remove StrategyWiki extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 [17:02:06] <^d> hehe [17:02:14] manybubbles: almost a full hour swat, nice [17:02:17] manybubbles: thank you :) [17:02:34] godog: its cool - it was mostly me letting the progress bars roll by in the background [17:02:51] wrong g [17:02:52] :) [17:03:27] kart_: I think we are OK. I just did another git deploy from tin [17:03:45] (03CR) 10Chad: "https://strategy.wikimedia.org/w/index.php?title=Special%3ASearch&profile=all&search=insource%3Aactivity+insource%3A%2F%5C%7B%5C%7B%5C%23a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 (owner: 10Chad) [17:05:04] kart_: btw, cxserver's logging sucks [17:05:12] (03CR) 10Chad: "Crap, it's a tag not a pfunc :( https://strategy.wikimedia.org/w/index.php?title=Special%3ASearch&profile=all&search=insource%3Aactivity+i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 (owner: 10Chad) [17:05:28] akosiaris: ah. [17:05:33] akosiaris: file a bug. [17:05:48] kart_: yeah, I will. Phab project ? [17:05:50] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1031495 (10fgiunchedi) name wise let's go with `restbase` unless someone has better ideas [17:05:51] akosiaris: format? [17:05:59] kart_: yeah [17:06:09] it misses timestamp, severity and other nice stuff [17:06:23] akosiaris: ^ https://phabricator.wikimedia.org/T88805#1031495 [17:06:50] <^d> greg-g: Is breaking 2 pages (one of which is actually called /Testing) on a closed wiki an acceptable fallout for undeploying a whole extension from prod? [17:06:55] akosiaris: https://phabricator.wikimedia.org/tag/contenttranslation-cxserver/ [17:07:12] godog: better ideas for naming ? I am not into bikeshedding thank you [17:07:17] restbase sounds fine [17:07:22] (03PS1) 10RobH: setting mgmt entries for dbproxy1003-1011 [dns] - 10https://gerrit.wikimedia.org/r/190004 [17:07:35] ^d: what's the other one? 
:) [17:07:54] <^d> https://strategy.wikimedia.org/wiki/User:Werdna/testing, https://strategy.wikimedia.org/wiki/List_of_proposals/By_incoming_links [17:08:01] what I do? [17:08:29] <^d> I'm thinking of undeploying your strategy extension [17:08:39] <^d> It would break those 2 pages on strategywiki, which is long closed [17:08:49] ask Philippe? [17:08:58] I can approve [17:09:09] https://strategy.wikimedia.org/wiki/List_of_proposals/By_incoming_links hmm let me remember [17:09:10] (03PS2) 10Giuseppe Lavagetto: memcached: add mc1017 to the mediawiki pool as shard 17 [puppet] - 10https://gerrit.wikimedia.org/r/189999 [17:09:11] Just copy a static copy to that page? [17:09:18] <^d> That works too [17:09:33] akosiaris: what was the issue? [17:09:55] akosiaris: yeah for bikeshed^Wnaming :) restbase it is [17:09:59] <^d> hoo: Static copy the list, delete werdna's test page :p [17:10:00] This page was never really used [17:10:15] +2 :D [17:10:20] yeah, I would say make the code static [17:10:22] that's a good idea [17:10:24] So I +1 deprecation [17:10:49] (03CR) 10Giuseppe Lavagetto: [C: 032] memcached: add mc1017 to the mediawiki pool as shard 17 [puppet] - 10https://gerrit.wikimedia.org/r/189999 (owner: 10Giuseppe Lavagetto) [17:10:53] kart_: I'd say git deploy. I redid the sync and it worked [17:11:07] <^d> Ok, now who can edit? [17:11:10] <^d> A steward? [17:11:19] Yes [17:11:33] But you could as well copy the page on Meta [17:11:34] * hoo can [17:11:47] <_joe_> !log disabling puppet on the mw* hosts, and progressively merging mc1017 in the memcached cluster [17:12:30] _joe_: Out of interest... how are the sessions distributed in the end? Evenly? [17:12:46] (03CR) 10RobH: [C: 032] "i hate how these are not ordered properly like codfw mgmt, but thats a refactoring for a later date." [dns] - 10https://gerrit.wikimedia.org/r/190004 (owner: 10RobH) [17:12:49] <_joe_> hoo: not properly, but it's not so terrible [17:12:55] <^d> hoo: Can you just delete that test page? Then we've only got the one page to fix [17:13:44] (03CR) 10QChris: [C: 031] Correcting docs and thresholds for eventlogging alarms [puppet] - 10https://gerrit.wikimedia.org/r/189588 (owner: 10Nuria) [17:14:52] PROBLEM - nutcracker port on mw1201 is CRITICAL: Connection refused [17:15:03] PROBLEM - nutcracker process on mw1201 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [17:16:12] <_joe_> that's me sorry [17:16:22] PROBLEM - nutcracker port on mw1017 is CRITICAL: Connection refused [17:16:37] <_joe_> on both, but puppet is disabled everywhere else [17:16:55] 3operations, ops-esams: setup the 2 new esams ms-be systems - https://phabricator.wikimedia.org/T86784#1031586 (10faidon) p:5Triage>3Low [17:16:58] (03PS2) 10Nemo bis: Remove StrategyWiki extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 (owner: 10Chad) [17:17:09] 3operations, ops-esams: setup the 2 new esams ms-be systems - https://phabricator.wikimedia.org/T86784#976440 (10faidon) No, they're not racked yet.
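The fields akosiaris is asking for, a timestamp and a severity on every line, look the same in any logging framework. cxserver itself is a Node service, so this Python sketch only shows the shape of the fix:

    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        level=logging.INFO)
    log = logging.getLogger("cxserver")
    log.warning("segmentation failed for page %s", "Foo")
    # -> 2015-02-11 17:06:09,123 WARNING cxserver: segmentation failed for page Foo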
[17:17:12] PROBLEM - nutcracker process on mw1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (nutcracker), command name nutcracker [17:17:12] PROBLEM - nutcracker port on terbium is CRITICAL: Connection refused [17:17:22] PROBLEM - nutcracker process on terbium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [17:18:44] <_joe_> gee gerrit is slow today [17:18:47] (03PS1) 10Giuseppe Lavagetto: memcached: avoid unnecessary escaping [puppet] - 10https://gerrit.wikimedia.org/r/190007 [17:18:50] (03CR) 10Nemo bis: [C: 031] "Ok to go, this ranking page was never really used. (Says the overactive strategywiki editor.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 (owner: 10Chad) [17:19:04] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] memcached: avoid unnecessary escaping [puppet] - 10https://gerrit.wikimedia.org/r/190007 (owner: 10Giuseppe Lavagetto) [17:20:21] PROBLEM - nutcracker port on tmh1001 is CRITICAL: Connection refused [17:20:22] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 4.28 ms [17:20:31] PROBLEM - nutcracker process on tmh1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker [17:20:52] PROBLEM - nutcracker port on tmh1002 is CRITICAL: Connection refused [17:21:32] PROBLEM - nutcracker process on tmh1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker [17:21:32] Nemo_bis: ^d: Inlined all the stuff now [17:21:38] <^d> \o/ [17:22:14] (03PS2) 10Alexandros Kosiaris: Grant access to milimetric to tin for deployment [puppet] - 10https://gerrit.wikimedia.org/r/189483 (https://phabricator.wikimedia.org/T88769) [17:22:27] <_joe_> ook I'm gonna revert both changes :/ [17:22:32] <^d> https://strategy.wikimedia.org/w/index.php?title=Special%3ASearch&profile=all&search=insource%3Aactivity+insource%3A%2F%5C%3Cactivity.%2B%2F+local%3A&fulltext=Search [17:22:37] (03Abandoned) 10Cmjohnson: Updating dns entries for codfwe pdu's [dns] - 10https://gerrit.wikimedia.org/r/189998 (owner: 10Cmjohnson) [17:22:37] <^d> Nemo_bis, hoo: ^ :) [17:22:51] <_joe_> I trusted "ordered_yaml" to DTRT [17:22:55] <_joe_> and it doesn't [17:22:59] :) [17:23:26] (03CR) 10Alexandros Kosiaris: [C: 032] Grant access to milimetric to tin for deployment [puppet] - 10https://gerrit.wikimedia.org/r/189483 (https://phabricator.wikimedia.org/T88769) (owner: 10Alexandros Kosiaris) [17:23:39] Thanks, Alex [17:23:43] (03CR) 10Chad: [C: 032] Remove StrategyWiki extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 (owner: 10Chad) [17:23:47] (03Merged) 10jenkins-bot: Remove StrategyWiki extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190003 (owner: 10Chad) [17:23:51] <_joe_> akosiaris: did you merge that? 
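On hoo's question about how sessions redistribute when mc1017 joins the pool: nutcracker pools are typically configured with ketama-style consistent hashing, so going from 16 to 17 shards should move only about 1/17 of the keys rather than reshuffling everything. A toy ring, not nutcracker's actual implementation:

    import hashlib

    def ring(shards, points=100):
        # each shard gets many points on the ring for smoother distribution
        r = []
        for s in shards:
            for i in range(points):
                h = int(hashlib.md5(("%s-%d" % (s, i)).encode()).hexdigest(), 16)
                r.append((h, s))
        return sorted(r)

    def lookup(r, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        for point, shard in r:
            if h <= point:
                return shard
        return r[0][1]  # wrap around

    old = ring(["mc10%02d" % i for i in range(1, 17)])
    new = ring(["mc10%02d" % i for i in range(1, 18)])  # + mc1017
    keys = ["session:%d" % i for i in range(10000)]
    moved = sum(lookup(old, k) != lookup(new, k) for k in keys)
    print("%.1f%% of keys moved" % (100.0 * moved / len(keys)))  # roughly 1/17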
[17:24:32] !log demon Synchronized wmf-config/CommonSettings.php: strategywiki ext is no more (duration: 00m 05s) [17:24:36] (03PS1) 10Giuseppe Lavagetto: memcached: revert adding mc1017 [puppet] - 10https://gerrit.wikimedia.org/r/190009 [17:24:40] <^d> undeploying extensions makes me feel all warm and fuzzy [17:24:47] (03PS2) 10Giuseppe Lavagetto: memcached: revert adding mc1017 [puppet] - 10https://gerrit.wikimedia.org/r/190009 [17:25:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "sadly so :/" [puppet] - 10https://gerrit.wikimedia.org/r/190009 (owner: 10Giuseppe Lavagetto) [17:25:05] ^d: is this the same as ActiveStrategy on https://strategy.wikimedia.org/wiki/Special:Version [17:25:11] <^d> Yes [17:25:27] Probably yes, because it disappeared [17:25:27] <^d> Which is now gone :p [17:25:31] Yep [17:25:36] For a moment it was listed without l10n [17:26:07] 3operations: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#1031653 (10RobH) [17:26:08] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1031651 (10RobH) 5Resolved>3Open Reopening: dbproxy1007-dbproxy1011 do not respond to mgmt. They had previous dns entries for the asset tags, but the mgmt network ports seem unreachable fo... [17:26:12] RECOVERY - nutcracker port on mw1017 is OK: TCP OK - 0.000 second response time on port 11212 [17:26:18] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1031654 (10RobH) a:5RobH>3Cmjohnson [17:26:49] What was the name of the extension used for ranking [17:26:52] RECOVERY - nutcracker process on mw1017 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [17:27:43] <^d> Check extension-list? [17:27:43] RECOVERY - nutcracker port on tmh1001 is OK: TCP OK - 0.000 second response time on port 11212 [17:27:52] <^d> We have 146 extensions & skins deployed [17:28:01] RECOVERY - nutcracker process on tmh1001 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [17:28:13] RECOVERY - nutcracker process on terbium is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [17:28:45] ^d: /me assumes you're ok re that strategy wiki question [17:28:49] * greg-g goes into next 1:1 [17:28:50] <^d> yes [17:28:54] :) [17:29:02] <^d> deleted the test page, just inlined the data on the other [17:29:02] RECOVERY - nutcracker process on tmh1002 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [17:29:12] RECOVERY - nutcracker port on terbium is OK: TCP OK - 0.000 second response time on port 11212 [17:29:12] my Wednesday calendar looks like Rob.la's Tuesday [17:29:13] Good thing I always relied on a static list :P https://strategy.wikimedia.org/wiki/Favorites/Nemo [17:29:22] RECOVERY - nutcracker port on tmh1002 is OK: TCP OK - 0.000 second response time on port 11212 [17:29:50] (03PS1) 10RobH: dbproxy1003-1006 mac entries and netboot update [puppet] - 10https://gerrit.wikimedia.org/r/190010 [17:30:22] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [17:30:51] ^d: err https://strategy.wikimedia.org/wiki/List_of_proposals [17:31:06] (03PS1) 10Filippo Giunchedi: es-tool: flush stdout when talking to the user [puppet] - 10https://gerrit.wikimedia.org/r/190012 [17:31:11] <^d> Crap! [17:31:15] <^d> How did my search miss that? [17:31:50] Poor tokenization? 
https://strategy.wikimedia.org/w/index.php?title=Template:Proposal_Dashboard&action=edit [17:31:53] It's supposed not to matter [17:32:03] <^d> bleh, ffs [17:32:20] Still, I maintain my point [17:32:33] The only ranking which was widely used on that wiki is one which was totally broken [17:33:09] (03CR) 10RobH: [C: 032] dbproxy1003-1006 mac entries and netboot update [puppet] - 10https://gerrit.wikimedia.org/r/190010 (owner: 10RobH) [17:33:19] IIRC https://www.mediawiki.org/wiki/Extension:CommunityVoice which was nice but was unable to make the ranks [17:33:21] _joe_: yeah, why ? [17:33:23] hi all... are there login issues with wikitech? I'm having trouble logging in, quite sure my password is right [17:33:28] <^d> Oh yeah I remember that thing [17:33:41] <_joe_> akosiaris: np just I needed to puppet-merge quickly [17:33:43] <^d> Long undeployed [17:33:45] <_joe_> I already did btw [17:33:47] (03PS2) 10Alexandros Kosiaris: Grant access to nuria to tin for deployment [puppet] - 10https://gerrit.wikimedia.org/r/189481 (https://phabricator.wikimedia.org/T88760) [17:34:11] AndyRussG: yeah same here. There is a ticket about it [17:34:38] AndyRussG: https://phabricator.wikimedia.org/T88300 [17:34:39] akosiaris: ah OK! thanks... :) I was worried for a second [17:34:40] <_joe_> !log reenabling puppet on mw* hosts, after aborted change [17:35:00] andrewbogott: https://phabricator.wikimedia.org/T88300 btw... People can't log in to wikitech [17:35:16] (03CR) 10Alexandros Kosiaris: [C: 032] Grant access to nuria to tin for deployment [puppet] - 10https://gerrit.wikimedia.org/r/189481 (https://phabricator.wikimedia.org/T88760) (owner: 10Alexandros Kosiaris) [17:35:22] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [17:36:46] akosiaris: looking [17:37:04] hoo: can you add a link to http://web.archive.org/web/20130525190404/https://strategy.wikimedia.org/wiki/List_of_proposals on the page itself?
[17:38:05] 3operations: access request - https://phabricator.wikimedia.org/T89264#1031731 (10leila) 3NEW [17:38:11] (03CR) 10Chad: [C: 031] es-tool: flush stdout when talking to the user [puppet] - 10https://gerrit.wikimedia.org/r/190012 (owner: 10Filippo Giunchedi) [17:38:11] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1031736 (10akosiaris) 5Open>3Resolved [17:38:20] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1031737 (10akosiaris) 5Open>3Resolved [17:38:29] (03PS4) 10Chad: Make `es-tool ban-node` handle both IP addressses and hostnames [puppet] - 10https://gerrit.wikimedia.org/r/180210 [17:40:55] (03PS1) 10Rush: phab change security drop down text [puppet] - 10https://gerrit.wikimedia.org/r/190013 [17:41:43] (03PS2) 10Filippo Giunchedi: es-tool: flush stdout when talking to the user [puppet] - 10https://gerrit.wikimedia.org/r/190012 [17:41:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] es-tool: flush stdout when talking to the user [puppet] - 10https://gerrit.wikimedia.org/r/190012 (owner: 10Filippo Giunchedi) [17:41:54] (03PS2) 10Rush: phab change security drop down text [puppet] - 10https://gerrit.wikimedia.org/r/190013 [17:42:02] (03CR) 10Rush: [C: 032 V: 032] phab change security drop down text [puppet] - 10https://gerrit.wikimedia.org/r/190013 (owner: 10Rush) [17:42:02] RECOVERY - nutcracker process on mw1201 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:42:16] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#1031747 (10chasemp) [17:42:23] 3operations, ops-codfw: Set up pdu's - https://phabricator.wikimedia.org/T84416#1031749 (10Dzahn) a:5Papaul>3Cmjohnson @cmjohnson ^ do we need to set passwords? (and have them on iron?) [17:43:02] RECOVERY - nutcracker port on mw1201 is OK: TCP OK - 0.000 second response time on port 11212 [17:43:12] RECOVERY - Varnish HTTP bits on cp1070 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.005 second response time [17:43:22] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:43:48] akosiaris: what was that link again? [17:45:58] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#1031766 (10faidon) So, where do e.g. Parsoid security bugs should be filled under? How about e.g. Heartbleed-type of issues? This "MediaWiki security bug" makes... [17:48:29] andrewbogott: https://phabricator.wikimedia.org/T88300 [17:48:42] akosiaris: thanks — found and resolved. [17:48:51] 3operations: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#1031769 (10faidon) p:5Triage>3Normal a:5mark>3BBlack [17:49:04] btw, I could use your help on optimizing puppet performance in labs, if in fact there is such a thing as ‘optimizing puppet performance’ [17:49:41] !log logging test [17:49:46] Logged the message, Master [17:49:52] aye morebots \o/ [17:50:03] optimizing puppet performance ? [17:50:14] well, um [17:50:21] well... I am not sure what you mean but I 'd be glad to help [17:50:26] virt1000 OOMs sometimes, when too many puppet hosts hit it at once. [17:50:36] Wondering if there’s anything to try short of distributed puppetmasters. 
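Short of distributed puppetmasters, one cheap lever is splaying agent start times so a few hundred clients don't all hit virt1000 in the same minute. This is the idea behind puppet's own splay/fqdn_rand, sketched here with hypothetical hostnames:

    import hashlib

    def splay_minute(fqdn, interval=30):
        # stable pseudo-random offset in [0, interval), derived from the name
        return int(hashlib.md5(fqdn.encode()).hexdigest(), 16) % interval

    for host in ["tools-exec-11.eqiad.wmflabs", "deployment-bastion.eqiad.wmflabs"]:
        print(host, "runs puppet at minute", splay_minute(host))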
[17:50:55] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#1031780 (10Parent5446) I believe the security drop-down is an indication of the severity of the bug, not the project with which it is associated. So any bug tha... [17:51:11] To avoid an OOM ? either give the box more memory or remove something that is consuming memory [17:51:22] heh, ok :) [17:51:28] for example does virt1000 have VMs ? [17:51:32] nope [17:51:37] damn [17:51:39] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#1031785 (10chasemp) Updated the text and https://www.mediawiki.org/wiki/Phabricator/Security can someone verify and see if more is required here? [17:51:40] Just a few tiny openstack services. [17:52:05] akosiaris: it is serving puppet via apache though. That’s different from production puppet, right? [17:52:27] Throughput of event logging NavigationTiming events - CRITICAL: 6.67% of data under the critical threshold [1.0] - what is the call to action here ?:p [17:53:32] andrewbogott: no [17:53:40] production puppet is also apache+passenger [17:53:45] oh, ok. [17:53:49] Hm [17:54:07] (03PS1) 10RobH: dbproxy1003-1011 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/190016 [17:54:18] So, I guess ‘more memory’ then [17:54:52] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:54:53] !log running puppet on ruthenium (last was 2 days ago but also not admin disabled..) [17:55:02] Logged the message, Master [17:55:14] and it has a 9.5G virtual memory and 400M resident memory [17:55:25] (03CR) 10RobH: [C: 032] dbproxy1003-1011 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/190016 (owner: 10RobH) [17:55:28] perhaps it is part of the problem ? [17:55:50] akosiaris: yeah, it probably is. [17:55:52] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: puppet fail [17:55:55] andrewbogott: what's going to happen to virt1000 now. [17:55:57] But the openstack services need a db. [17:56:05] although I see it only serves puppet and openstack [17:56:06] mutante: ? [17:56:07] are we recycling it for something else? [17:56:15] after wikitech moved to silver i mean [17:56:16] mutante: it does lots of things. Just, one fewer now. [17:56:17] oh and wikitech ? [17:56:57] akosiaris: I just moved wikitech off virt1000 in hopes of stopping this problem. [17:56:59] But, no dice. [17:57:01] oh, it was just the web interface part, right, got it [17:57:06] (03PS1) 10Rush: phab link to mw Security page in description for field [puppet] - 10https://gerrit.wikimedia.org/r/190017 [17:57:18] I mean, old wikitech data is still in the db on virt1000, but dropping unused tables won’t help with ram will it? [17:57:28] andrewbogott: i was asking because i saw it in icinga for puppet run [17:57:49] mutante: yeah. I thought moving wikitech off would give us enough headroom but seems not. [17:58:06] andrewbogott: unused tables ? no it will not [17:58:21] yeah [17:58:51] are you planning to move the entire db to a db server? [17:59:19] wasn’t planning to, but...
[17:59:26] (03CR) 10Rush: [C: 032] phab link to mw Security page in description for field [puppet] - 10https://gerrit.wikimedia.org/r/190017 (owner: 10Rush) [17:59:27] !log ran puppet on virt1000 - finished just fine, not sure why icinga said fail [17:59:32] Logged the message, Master [17:59:38] cmjohnson, robh: Maybe an easy question, maybe not… https://phabricator.wikimedia.org/T89266 [18:00:00] ..... it may be easier to just replace it entirely but lets see how old it is [18:00:13] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:00:16] replacing it entirely sounds like a lot of trouble :) [18:00:39] moving the db to a db host might make sense though [18:00:45] yeah [18:00:50] And also I can make sean do that :) [18:01:04] andrewbogott: its an old R610, checking the memory usage now [18:01:20] seems its all 4GB dimms presently [18:01:42] andrewbogott: Sean could help with some tuning as well.. [18:01:54] not sure if it would help much though [18:01:56] and its very low memory =[ [18:02:00] akosiaris: I think he did already, that’s what got us the last couple of weeks worth of uptime :( [18:02:03] andrewbogott: there are empty slots though yes [18:02:12] oh ok then [18:02:18] robh: those 4GB dimms could be replaced with 16s? [18:02:24] (03PS1) 10Rush: Revert "phab link to mw Security page in description for field" [puppet] - 10https://gerrit.wikimedia.org/r/190018 [18:02:35] DIMM DDR3 Synchronous 1333 MHz [18:02:37] andrewbogott: or just add more [18:02:38] (03CR) 10Rush: [C: 032 V: 032] Revert "phab link to mw Security page in description for field" [puppet] - 10https://gerrit.wikimedia.org/r/190018 (owner: 10Rush) [18:02:41] like i said, it has spare slots [18:02:51] andrewbogott: now, purchasing ram for an out of warranty system is kind of a losing idea [18:02:56] but, we likely have spare ram [18:02:57] since this is old [18:03:03] cmjohnson will have to comment [18:03:08] ok [18:03:21] andrewbogott: could you create a phab task detailing why you want it (obvious but please anyhow) and link? [18:03:30] i'll update with the memory specs so chris can then check onsite if he has them [18:03:44] put it in both ops-eqiad and hardware-requests projects i'd say. [18:04:03] i think he may be mid mc1018 reinstall and such [18:04:05] robh: wait, how would that task differ from the task I just made? [18:04:18] robh: yeah I will take a look as soon as I have a chance [18:04:58] andrewbogott: whats the task you just made? [18:05:13] until you pinged me i wasnt paying attention in here, im working on stuff ;D [18:05:28] https://phabricator.wikimedia.org/T89266 [18:05:36] 3operations: access request for researcher to analytics-users in Hadoop - https://phabricator.wikimedia.org/T89264#1031836 (10Aklapper) [18:05:38] which is how you knew what my question was in the first place :) [18:05:51] sorry, task thrashing and still installing half a dozen systems durn it! ;D [18:05:54] yea lemme steal it [18:05:59] np thanks [18:08:33] 3Labs, hardware-requests, ops-eqiad, operations: Can virt1000 take more ram? - https://phabricator.wikimedia.org/T89266#1031856 (10RobH) virt1000 is an R610 with a total of 16GB ram, installed via 4 * DIMM DDR3 Synchronous 1333 MHz 4GB sticks. I show the system has 12 dimm slots, and 4 of them are filled. If w... 
ok i updated and assigned to chris [18:08:50] for memory check for onsite spares, including the spec [18:09:01] (trying to make it easy on him we just threw a shit ton of onsite work at him today ;) [18:11:12] 3Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1031869 (10Dzahn) Can we amend the policy to say "this should reflect users in admin.yaml .. and NDAed volunteers" or similar and then add him... [18:12:05] (03CR) 1020after4: Observe the remote IP reported by X_FORWARDED_FOR header from proxy server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 1020after4) [18:12:51] 3operations, ops-eqiad: cp1070 hardware failure - https://phabricator.wikimedia.org/T88889#1031908 (10Cmjohnson) 5Open>3Resolved Replaced both CPU's and system board [18:13:17] 3Labs, hardware-requests, ops-eqiad, operations: Can virt1000 take more ram? - https://phabricator.wikimedia.org/T89266#1031910 (10RobH) fyi: determining what memory banks are in use: sudo lshw -class memory (or just pull the -class and following for a full hardware output, but its a bit overwhelming.) [18:16:28] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1031914 (10Tgr) >>! In T88412#1030703, @Gilles wrote: > Should we also change the 500 happening when people request a larger (or equal) size than the original to a 400? I... [18:17:07] (03CR) 10Rush: "eh? https://secure.phabricator.com/T7114" [puppet] - 10https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 1020after4) [18:19:18] (03CR) 1020after4: "ok, sorry, I assumed upstream had their shit together ;)" [puppet] - 10https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 1020after4) [18:19:24] <^d> springle: db1033 ok? [18:19:37] <^d> It's spamming dbperformance.log and slow queries getting logged by MW [18:19:47] 3Labs, hardware-requests, ops-eqiad, operations: Can virt1000 take more ram? - https://phabricator.wikimedia.org/T89266#1031919 (10coren) Honestly, I'm a little worried to know that those services manage to explode 16G of ram and suspect there is something broken that more memory is more likely to hide than fix. [18:20:00] !log restarted Elasticsearch on logstash1003; preventative, other nodes restarted today [18:20:01] 3Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1031920 (10chasemp) Nooooooo....well maybe. We can't effectively use #operations as an ACL object if it has non-ops people I think. It would... [18:20:08] Logged the message, Master [18:20:21] 3Labs, hardware-requests, ops-eqiad, operations: Can virt1000 take more ram? - https://phabricator.wikimedia.org/T89266#1031921 (10Andrew) Supporting 400 puppet clients? It doesn't surprise me that that uses a lot of ram. [18:20:52] <^d> eg: 2015-02-11 18:20:37 mw1254 eswiki: LoadBalancer::reallyOpenConnection: 11+ connections made (master=db1033) [18:21:45] twentyafterfour, greg-g: has the wmf17 branch been cut yet?
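Back-of-envelope for those two virt1000 comments (Coren runs the same division a few lines further down):

    # 16 GB of RAM shared across 400 puppet clients, if they all hit at once
    print(16 * 1024 / 400.0)  # ~41 MB per client -- the "40M per connection"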
[18:22:02] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 46 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 42, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 92, initializing_shards: 4, number_of_data_nodes: 3 [18:22:11] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 43 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 39, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 95, initializing_shards: 4, number_of_data_nodes: 3 [18:22:13] kaldari: I was just typing into #wikimedia-releng to ask how I go about doing that [18:22:31] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 42 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 38, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3 [18:22:34] it doesn't seem to be documented as part of the deployment docs [18:22:46] twentyafterfour: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys [18:22:49] 3Labs, hardware-requests, ops-eqiad, operations: Can virt1000 take more ram? - https://phabricator.wikimedia.org/T89266#1031925 (10coren) That's 40M per connection even if they were all simultaneous - probably a lot more given that we stagger much of it - that's a //lot// even for cruddy code like puppet. [18:23:47] twentyafterfour: php make-wmf-branch 1.25wmf17 master [18:23:51] <^d> Those docs are wrong [18:23:54] <^d> Don't use master [18:24:00] <^d> 1.25wmf17 1.25wmf16 [18:24:12] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 133, initializing_shards: 3, number_of_data_nodes: 3 [18:24:13] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 135, initializing_shards: 3, number_of_data_nodes: 3 [18:24:33] they were right a year ago. Sam told me that using the prior branch didn't matter [18:24:42] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 137, initializing_shards: 1, number_of_data_nodes: 3 [18:25:13] oh? [18:25:20] <^d> Oh, meh, probably works either way [18:25:25] <^d> Yeah master is fine [18:25:40] <^d> --help on script is kinda misleading. 
I wrote confusing code here [18:25:45] lol [18:25:48] uhm [18:26:32] Here's an exact checklist I followed for 1.23wmf20 -- https://github.com/bd808/wmf-kanban/issues/71 [18:27:15] (03CR) 10Rush: "going to bow out so this isn't on my dashboard for now :) not much we can do until patched" [puppet] - 10https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 1020after4) [18:28:27] ok running make-wmf-branch [18:29:16] <^d> twentyafterfour: Oh protip, do that in like your homedir somewhere on cluster. Way faster than cloning everything to your local machine [18:29:38] +1 [18:29:49] ^d: yeah I'm running it on tin, isn't that where I'm supposed to do this stuff? [18:30:00] <^d> You can do this on tin, yeah [18:30:15] <^d> The make-wmf-branch can be done /anywhere/, including your local machine [18:30:22] <^d> Just wanted to save you the bandwidth :p [18:30:26] and as for homedir, it runs in /tmp/ mostly, probably would be better if it didn't use a hard-coded tmp location [18:30:39] +1 as well [18:30:50] (file boogz!) [18:30:51] :) [18:30:58] I think I patched it locally when I was using it [18:31:02] (in case two instances of the script try to run at the same time, by the same or two different people) [18:31:06] <^d> It's a funny little script [18:31:24] There's a bug somewhere to move it into Jenkins [18:31:25] ~/tmp would be one step better. mktmp would also work [18:31:45] <^d> It's all php which is doubly silly [18:32:00] * ^d went through this phase of writing cli tools in php [18:32:09] I don't mind cli php [18:32:23] better than bash [18:32:24] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1031952 (10GWicke) @cmjohnson: How is this going? Do you think the first three boxes will be ready by the end of this week? [18:32:40] ^d: twentyafterfour and I used to work at the place that was king of php cli tools [18:32:46] all tools were php for like...a decade [18:32:49] :) [18:32:54] * ^d re-nags on db1033 [18:32:56] sounds like WMF :) [18:33:38] php may not be the best language for many things but it isn't horrible for quick and dirty stuff [18:33:40] * aude waves [18:34:01] ok make-wmf-branch failed ... [18:34:24] <^d> twentyafterfour: failed on what? [18:34:45] er hang on [18:34:50] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1031955 (10GWicke) >>! In T88805#1031495, @fgiunchedi wrote: > name wise let's go with `restbase` unless someone has better ideas I would actually call them something that's descriptive of the... [18:35:03] there's also a dry mode afaik for testing the script [18:35:08] w/o pushing to gerrit [18:35:14] (03PS4) 10Ottomata: Correcting docs and thresholds for eventlogging alarms [puppet] - 10https://gerrit.wikimedia.org/r/189588 (owner: 10Nuria) [18:36:19] (03CR) 10Ottomata: [C: 032] Correcting docs and thresholds for eventlogging alarms [puppet] - 10https://gerrit.wikimedia.org/r/189588 (owner: 10Nuria) [18:37:53] (03PS1) 10RobH: adding other dbproxy systems to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/190024 [18:40:03] 3RESTBase, operations: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1031981 (10GWicke) @akosiaris, @fgiunchedi: Thanks for setting up the domain. For RESTBase, I believe the main missing bits are: - set up an LVS for RESTBase - Point rest.wikimedia.org to it Anything else I'm forg...
[18:40:12] (03CR) 10RobH: [C: 032] adding other dbproxy systems to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/190024 (owner: 10RobH) [18:42:41] 3RESTBase, Services, operations, Scrum-of-Scrums: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1031989 (10GWicke) [18:44:39] 3operations, Phabricator: Mysql search issues flagged by Phabricator setup - https://phabricator.wikimedia.org/T89274#1031992 (10chasemp) 3NEW [18:46:11] twentyafterfour leaves me hanging... ;) [18:46:51] greg-g: huh? [18:46:58] 18:34 < ^d> twentyafterfour: failed on what? [18:46:58] 18:34 < twentyaft> er hang on [18:47:00] :) [18:47:41] sorry, it's complicated ... I was bugging ^d in private to spare the channel noise [18:47:48] 3operations, Phabricator: Mysql search issues flagged by Phabricator setup - https://phabricator.wikimedia.org/T89274#1031999 (10chasemp) p:5Triage>3High [18:48:48] 3RESTBase, operations: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1032007 (10GWicke) @fgiunchedi, the test hosts are all running 2.1.2 now. Would you like to test the deb there? [18:49:02] twentyafterfour: s'ok :) [18:52:42] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#1032022 (10JanZerebecki) >>! In T76564#950388, @Qgil wrote: > I have changed "MediaWiki security bug" for "Software security bug" because in addition to MediaWi... [18:56:41] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [18:58:11] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [18:59:52] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [19:00:04] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150211T1900). [19:01:00] (03CR) 10Jackmcbarn: [C: 04-1] "This is 100% intentional, not unintended at all, and happened even before the commit that was mentioned. This commit would essentially tur" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189661 (owner: 10Cenarium) [19:01:49] 3operations: zirconium: more space for /srv - https://phabricator.wikimedia.org/T89004#1032049 (10Dzahn) [19:03:59] greg-g: you can stop hanging now. It's branching properly now [19:04:55] 3operations: zirconium: more space for /srv - https://phabricator.wikimedia.org/T89004#1032052 (10Dzahn) I take that back. Nothing needs to be resized here. Actually there were tons of free extents on the physical volume. (1TB disk but just a dozen Gs used in logical volumes) making a new lv for the entire /srv... 
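On the hard-coded /tmp checkout in make-wmf-branch discussed above: a mktemp-style unique directory per run is what keeps two concurrent branch cuts from trampling each other. make-wmf-branch itself is PHP; the Python equivalent of the fix, as a sketch:

    import shutil
    import tempfile

    workdir = tempfile.mkdtemp(prefix="make-wmf-branch-")  # unique per run
    try:
        print("cloning into", workdir)
        # ... clone core + extensions and cut the branch here ...
    finally:
        shutil.rmtree(workdir)  # clean up even if the cut fails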
[19:09:16] !log powerdown graphite1002 T88992 [19:09:23] Logged the message, Master [19:10:27] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 137, initializing_shards: 0, number_of_data_nodes: 3 [19:10:28] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 137, initializing_shards: 0, number_of_data_nodes: 3 [19:10:28] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 137, initializing_shards: 0, number_of_data_nodes: 3 [19:11:03] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1032059 (10Gilles) 5Open>3Resolved Great! I missed that. Sounds like this is completely done, then. [19:12:14] gah. what happened there to logstash? [19:13:37] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 90 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 86, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 45, initializing_shards: 4, number_of_data_nodes: 3 [19:13:38] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 90 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 86, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 45, initializing_shards: 4, number_of_data_nodes: 3 [19:13:38] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 90 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 86, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 45, initializing_shards: 4, number_of_data_nodes: 3 [19:14:11] 3operations: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#1032067 (10RobH) Installation is proceeding on all but dbproxy1008, which has an issue with mgmt detailed on the blocking ticket about racking and setup. [19:14:41] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1032069 (10RobH) update from irc: chris fixed all but dbproxy1008, which is still unresponsive to mgmt (so ive installed all but dbproxy1008) [19:14:57] 1001 and 1003 dropped out of the cluster and then rejoined almost immediately.
I think 1002 had a nasty gc pause that kept it from answering the heartbeat requests [19:15:25] !log moved docroots on zirconium to new logical volume for /srv [19:15:32] Logged the message, Master [19:16:20] oh even worse 1002 OOMd but remained master [19:16:27] (03PS1) 10Cmjohnson: Adding dhcpd for restbase1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/190030 [19:18:29] bd808, hmm .. should i turn off logging to logstash .. parsoid load seems to have spiked and that seems correlated to logstash issues. [19:18:31] 3operations: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#1032076 (10RobH) @springle All the installs EXCEPT dbproxy1008 are complete. Puppet keys are NOT signed. I'm keeping this task assigned to me until dbproxy1008 is done. You should feel free to claim the rema... [19:18:32] !log restarted elasticsearch on logstash1002 after OOM [19:18:36] Logged the message, Master [19:18:45] parsoid logs to logstash1003 [19:18:58] subbu: if you can that would be great. Or at least turn down the verbosity [19:18:59] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 123, initializing_shards: 3, number_of_data_nodes: 3 [19:19:08] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 125, initializing_shards: 3, number_of_data_nodes: 3 [19:19:08] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 7, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 125, initializing_shards: 3, number_of_data_nodes: 3 [19:19:26] and we should fix parsoid to spread the joy across all the nodes instead of pounding one [19:19:26] we are logging at warn level to logstash. [19:19:37] yes, we should. [19:20:14] What I did for MW was make it pick a host from the list of 3 on each PHP request. Not sure what the parsoid equivalent would be [19:21:30] mostly we need hardware/RAM for elasticsearch. Until that comes in all changes will just be bandaids [19:23:18] cxserver doesn't seem to have updated after deploy earlier today. Kartik said he restarted the service. Any chance the code wasn't properly deployed? [19:23:20] bd808, so, am i turning off logging to logstash temporarily till these are resolved today? [19:23:47] subbu: If you can live with that, yes please [19:23:54] ok. [19:25:38] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [19:25:41] The ocg volume is actually higher than parsoid. That might be what's making 1002 sad (I think ocg is pinned to 1002) [19:25:58] mutante: progress with the BZ static dump thing? 
[19:26:20] JohnFLewis: yes, progress, because now i have space for it [19:26:32] Sweet :) [19:26:37] JohnFLewis: i moved the entire /srv to a separate logical vol [19:26:39] cscott, ^^ [19:27:04] kk [19:28:08] PROBLEM - Disk space on einsteinium is CRITICAL: DISK CRITICAL - free space: / 1658 MB (3% inode=97%): [19:28:25] 3operations: zirconium: more space for /srv - https://phabricator.wikimedia.org/T89004#1032117 (10Dzahn) made a new 10G lv for /srv/ rsynced old /srv/ over and mounted new /srv/ added to /etc/fstab /dev/mapper/zirconium-srv on /srv type ext3 (rw,errors=remount-ro) /dev/mapper/zirconium-srv 9.9G 1.1G 8... [19:28:58] 3operations: zirconium: more space for /srv - https://phabricator.wikimedia.org/T89004#1032118 (10Dzahn) a:3Dzahn [19:29:06] 3operations: zirconium: more space for /srv - https://phabricator.wikimedia.org/T89004#1032128 (10Dzahn) 5Open>3Resolved [19:29:58] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [19:30:53] bd808|LUNCH: i don't understand how logstash can be making parsoid unhappy [19:30:58] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [19:31:12] cscott: other way around possibly [19:31:17] logstash should be UDP, parsoid shouldn't be able to tell anything about the logstash status [19:31:43] i think it is correct that both parsoid and ocg are pinned to specific logstash servers -- i thought that was intentional [19:32:08] bd808|LUNCH, parsoid load has spiked around the same time logstash issues cropped up. [19:32:38] so, i am turning off logging to see if it fixes it and we'll investigate it separately as to why that is happening. [19:32:43] subbu: i think you've got cause and effect backwards [19:32:58] i think there's something that caused parsoid load to spike, which caused it to log more, which caused logstash to be unhappy. [19:33:13] but i agree that turning off logging is a useful temporary diagnostic [19:33:21] It was when MW was pinned to one as well. The path forward is to spread the load more evenly. but today it doesn't matter much because logstash is generally sad [19:33:31] (and i'm double-checking the gelf-stream code to make sure that we're using udp as I expect) [19:34:06] cscott: That order is what subbu is responding to. He saw me fighting all morning to keep logstash alive and offered to turn down the input volume [19:34:09] !log repooled cp1070 (eqiad bits) in pybal [19:34:19] Logged the message, Master [19:34:33] (03PS1) 10BBlack: repool cp1070 for x-dc T88889 [puppet] - 10https://gerrit.wikimedia.org/r/190034 [19:34:43] cscott, yes .. if it doesn't fix it, we know load isn't being caused by some weird logging issue we have. [19:34:53] (03CR) 10BBlack: [C: 032 V: 032] repool cp1070 for x-dc T88889 [puppet] - 10https://gerrit.wikimedia.org/r/190034 (owner: 10BBlack) [19:35:34] subbu: i don't think that logging is directly related, i bet it's something else causing load which is just incidentally producing extra logging. but we'll see. [19:36:11] !log temporarily turn off logging to logstash till logstash issues are resolved. [19:36:15] Logged the message, Master [19:36:38] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
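(The flapping checks above all poll the same elasticsearch endpoint. For anyone following along, the state they summarize can be pulled by hand from any of the logstash nodes; a sketch, using an IP from the check output and assuming the _cat API available in the elasticsearch 1.x line:)

    # the same endpoint icinga polls; shows status plus the shard counters quoted above
    curl -s 'http://10.64.32.136:9200/_cluster/health?pretty'
    # break the unassigned/initializing totals down to individual shards
    curl -s 'http://10.64.32.136:9200/_cat/shards?v' | grep -v STARTED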
[19:37:02] cscott, we should check that gelf-stream is really using udp as it claims. [19:37:41] just did so. https://github.com/mhart/gelfling/blob/master/gelfling.js -- there's actually no code there to do tcp. [19:37:59] so udp is all gelf-stream can do. (there's no networking code in gelf-stream, it just passes stuff on to gelfling) [19:38:02] hmm .. ok. [19:38:20] <^d> bd808|LUNCH: I was about to ask the next logical step for "spread things evenly" which is "lvs" [19:38:27] <^d> But then I realized we'd had this discussion twice before [19:38:38] subbu: https://github.com/mhart/gelf-stream/blob/master/gelf-stream.js if you want to verify that parenthetical for yourself. [19:38:39] twentyafterfour: whew, I almost got to: http://2.bp.blogspot.com/-2F7daSyAU2Y/ThcSOU9oo3I/AAAAAAAAA8Q/yRV-i7t8Acs/s1600/giveup.jpg [19:39:20] greg-g: nice, I need that on my wall [19:39:45] cscott, --> #parsoid [19:42:37] (03PS2) 10Cmjohnson: Adding dhcpd for restbase1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/190030 [19:43:48] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 131, initializing_shards: 2, number_of_data_nodes: 3 [19:43:48] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 132, initializing_shards: 2, number_of_data_nodes: 3 [19:43:57] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 132, initializing_shards: 2, number_of_data_nodes: 3 [19:46:26] !log all eqiad-upload-https -> cp1064 [19:46:33] Logged the message, Master [19:48:40] what's the preferred way to apply security fixes from previous branch onto my newly cut release branch? [19:49:33] twentyafterfour: You can either get them from the old branch, or get them from my home on tin [19:49:57] so just manually apply the patches one by one? [19:50:18] twentyafterfour: Yes, and they need to be in order. Let me forward the email I sent reedy. [19:51:29] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd for restbase1003/1004 [puppet] - 10https://gerrit.wikimedia.org/r/190030 (owner: 10Cmjohnson) [19:52:06] (03PS1) 10Dzahn: mv TransparencyReport docroot to /srv/org/wikimedia/ [puppet] - 10https://gerrit.wikimedia.org/r/190036 [19:53:24] <^d> twentyafterfour: And remember those don't get pushed. [19:54:26] what about other patches like this: Merge "Update VisualEditor submodule" into wmf/1.25wmf16 [19:54:38] RECOVERY - Disk space on einsteinium is OK: DISK OK [19:55:11] I guess those are merged upstream [19:55:13] twentyafterfour: not those [19:55:15] nevermind [19:56:14] 3Multimedia, operations, MediaWiki-extensions-UploadWizard: Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs - https://phabricator.wikimedia.org/T89018#1032305 (10BBlack) >>! In T89018#1030174, @akosiaris wrote: > @BBlack, I have a different theory re... 
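(cscott's source-reading settles the UDP question, but the same claim can also be checked from outside the process on a parsoid host. A sketch, assuming the conventional GELF port 12201 and the interface name; substitute whatever the config actually uses:)

    # should show a steady stream of UDP datagrams towards the logstash host
    tcpdump -n -i eth0 'udp and dst port 12201'
    # if gelf-stream were somehow falling back to TCP, this would catch it
    tcpdump -n -i eth0 'tcp and dst port 12201'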
[20:00:08] PROBLEM - Disk space on einsteinium is CRITICAL: DISK CRITICAL - free space: / 1097 MB (2% inode=97%): [20:01:34] 3operations, Phabricator: Mysql search issues flagged by Phabricator setup - https://phabricator.wikimedia.org/T89274#1032330 (10Chad) * +1 to changing the boolean syntax to AND instead of OR. Nobody expects OR by default. * +1 to lowering min word length to 3, as long as it doesn't have insane performance impli... [20:02:08] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [20:03:09] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 135, initializing_shards: 0, number_of_data_nodes: 3 [20:03:25] SMalyshev: noticed einsteinium is running out of disk. is there stuff in your home that can be deleted? [20:03:41] mutante: not currently [20:04:00] I'm running a test which needs a lot of space... [20:04:09] mutante: is it causing any issues? [20:05:02] SMalyshev: icinga is reporting it to us as a problem because of low disk, there is like 1.3G free, as long as you are aware it's fine [20:05:14] i would just acknowledge it for now then, because it's a test host [20:05:46] your home is about 40G of the 46G it has [20:06:27] ACKNOWLEDGEMENT - Disk space on einsteinium is CRITICAL: DISK CRITICAL - free space: / 1314 MB (2% inode=97%): daniel_zahn SMalyshev running tests [20:07:28] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [20:07:38] mutante: yeah, I know, I'm running neo4j test, it needs a lot of space [20:08:06] csteipp: there are more than 5 patches in your home dir (under currentsecuritypatches)...and the naming is inconsistent. which ones apply? [20:08:33] I guess it's easier to just merge the tip of the previous branch? [20:09:04] !log eqiad-upload-https -> back to even weighting [20:09:06] SMalyshev: ok.
if it's not enough it could be extended [20:09:13] Logged the message, Master [20:10:37] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 2, unassigned_shards: 45, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 90, initializing_shards: 0, number_of_data_nodes: 2 [20:11:28] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 2, unassigned_shards: 45, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 90, initializing_shards: 0, number_of_data_nodes: 2 [20:14:18] twentyafterfour: They all apply [20:15:33] (03PS1) 10Filippo Giunchedi: restbase: allocate production addresses [dns] - 10https://gerrit.wikimedia.org/r/190038 [20:16:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: allocate production addresses [dns] - 10https://gerrit.wikimedia.org/r/190038 (owner: 10Filippo Giunchedi) [20:18:41] 3operations: 13/02/2015 - https://phabricator.wikimedia.org/T89286#1032385 (10emailbot) [20:24:12] which bot was it that once had the trigger "!change" [20:24:18] !change 12345 mutante [20:24:20] that one^ [20:24:41] i want to resurrect it [20:24:42] mutante: what did it do? [20:25:15] T13|mobile: poked the user to review the relevant Gerrit change [20:25:19] T13|mobile: convert a gerrit change id into the full URL and then nicely ask the person to look at it and review [20:25:35] Keyword 'nicely' ;) [20:25:40] wm-bot? http://meta.wikimedia.org/wiki/Wm-bot [20:25:44] wm-bot can do it. [20:25:46] so that the humans dont have to be :) *g* [20:26:05] all i know is once that trigger worked here , then it stopped [20:26:35] I'll set it up once i get home to computer. [20:26:42] I trust: petan|w.*wikimedia/Petrb (2admin), .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@mediawiki/Catrope (2admin), .*@wikimedia/RobH (2admin), .*@wikimedia/Ryan-lane (2admin), petan!.*@wikimedia/Petrb (2admin), .*@wikimedia/Krinkle (2admin), [20:26:42] @trusted [20:26:51] Unless someone else does it. [20:26:56] T13|mobile: :) thx [20:27:12] It's here, just needs the macros added I would guess [20:27:22] Yes [20:28:21] Something like "!change is https://gerrit.wikimedia.org/r/#/c/$1/" [20:28:46] !change is https://gerrit.wikimedia.org/r/#/c/$1/ [20:28:46] Unable to modify db, access denied, link to database isn't valid [20:28:59] ah. it's broken I guess [20:29:24] Is that the link? [20:29:48] I just can't look up link structure from mobile. [20:30:06] That would link to a gerrit review which I think is what mutante wants [20:31:48] !change act pokes $2* to look at https://gerrit.wikimedia.org/r/#/c/$1/ for $infobot_nick. [20:31:49] Unable to modify db, access denied, link to database isn't valid [20:31:59] !infobot-on [20:32:06] Infobot was already enabled [20:32:06] @infobot-on [20:32:06] T13|mobile: Invalid arguments [20:32:26] Ohh... [20:32:38] "< wm-bot> Unable to modify db, access denied, link to database isn't valid" [20:32:48] Will have to wait until I get home to fix DB.
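(The patch workflow csteipp confirms above comes down to something like the following sketch. The directory name is from the conversation; the absolute paths, file extension, and reliance on lexical ordering are assumptions, and the forwarded email has the authoritative list and order.)

    cd /srv/mediawiki-staging/php-1.25wmf17      # the freshly cut branch (path assumed)
    # apply in order; consistent file naming is what makes the ordering work
    for p in /home/csteipp/currentsecuritypatches/*.patch; do
        git am "$p" || { echo "failed on $p" >&2; break; }
    done
    # per ^d earlier: these commits stay local and must not be pushed to gerrit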
[20:33:21] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [20:33:21] @info [20:33:21] T13|mobile: Invalid arguments [20:34:10] !change 777777 [20:34:42] It's actually already in DB but bugged out [20:38:49] mutante: actually I wouldn't mind extending it a bit if possible [20:39:31] looks like neo4j is more space-hungry than I expected [20:41:09] SMalyshev: after lunch break ok? was about to get food [20:41:21] mutante: sure, no rush [20:41:30] great [20:41:35] thanks! [20:41:45] 3operations, Analytics-Cluster: Install hadoop-lzo on cluster - https://phabricator.wikimedia.org/T89290#1032492 (10Ottomata) 3NEW a:3Ottomata [20:42:39] running out of deployment window time here... is it ok to deploy late? I'm still trying to make sense of these security patches [20:44:35] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [20:44:35] @info [20:44:35] Technical_13: Invalid arguments [20:45:10] !bot [20:45:56] Seen is now enabled in the channel [20:45:56] @seen-on [20:46:09] 3operations: create more disk space on einsteinium - https://phabricator.wikimedia.org/T89291#1032501 (10Dzahn) 3NEW [20:47:06] @infobot-snapshot previousdb [20:47:06] Snapshot snapshots/#wikimedia-operations/previousdb was created for current database as of 2/11/2015 8:47:06 PM [20:47:06] Technical_13: Unknown identifier (previousdb) [20:47:30] There are 1 files: previousdb [20:47:30] @infobot-snapshot-ls [20:47:30] Technical_13: Invalid arguments [20:47:59] You can't configure this channel to share local db, because this channel is using shared db with another channel, thus the local db is locked [20:47:59] @infobot-share-on [20:47:59] Technical_13: Invalid arguments [20:48:20] Good to know... What db are we sharing? [20:48:22] 3operations, ops-eqiad: relocate/wire/setup dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86957#1032515 (10Cmjohnson) dbproxy1008 idrac is not coming up. I get an idrac6 communication failure and it reboots. I recommend we scrap this out-of-warranty server and utilize another spare [20:48:42] @infobot-link [20:48:42] Technical_13: Invalid arguments [20:48:51] not telling.. okay.. [20:49:26] twentyafterfour: Yup. You are in control until you give it up [20:49:58] The first couple of times took me way way longer than Sam's deploys ever did [20:50:35] Infobot disabled [20:50:35] @infobot-off [20:50:35] Technical_13: Invalid arguments [20:50:48] Infobot enabled [20:50:48] @infobot-on [20:50:48] Technical_13: Invalid arguments [20:50:53] !bang [20:50:58] !change test [20:51:36] Shared infobot was disabled [20:51:36] @infobot-share-off [20:51:36] Technical_13: Invalid arguments [20:51:54] @infobot-recovery previousdb [20:51:54] Technical_13: Unknown identifier (previousdb) [20:51:54] Snapshot snapshots/#wikimedia-operations/previousdb was loaded and previous database was permanently deleted [20:52:00] !change [20:52:00] https://gerrit.wikimedia.org/r/ [20:52:04] bingo. [20:52:19] mutante: It should now work [20:52:36] !change 12345|mutante [20:52:36] mutante: https://gerrit.wikimedia.org/r/12345 [20:52:55] <_joe_> another bot? [20:53:12] _joe_: another? [20:53:32] <_joe_> dbbot-wm is new right? [20:53:32] I believe wm-bot has been here for quite some time (years possibly). [20:53:41] Oh, no idea. [20:56:16] I think dbbot was something Krinkle made to report on slave lag [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150211T2100). Please do the needful.
[21:05:46] bd808: _joe_: It's not new. [21:07:15] 3operations: create more disk space on einsteinium - https://phabricator.wikimedia.org/T89291#1032546 (10Smalyshev) If it could be bumped to 100G that'd be great. If it doesn't have that then whatever is possible is fine. [21:13:05] (03PS10) 10BryanDavis: logstash: parse json encoded hhvm fatal errors [puppet] - 10https://gerrit.wikimedia.org/r/179759 [21:16:31] <^d> godog: wmf-utils repo? [21:16:36] <^d> Is that meant to be a thing, or...? [21:18:12] <_joe_> Krinkle: ok, it's the first time I see it chatting since I am around here, so not a chatty bot for sure :) [21:18:25] _joe_: It only talks when spoken to. [21:22:35] 3operations, Continuous-Integration: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1032598 (10hashar) I have poked the internal OPS list about creating Debian packages for python software that have conflicting or missing dependencies. Pasted at P284 follow up on OPS list. [21:22:38] 3operations, Continuous-Integration: [upstream] Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1032600 (10hashar) I have poked the internal OPS list about creating Debian packages for python software that have conflicting or missing dependencies. Pasted at P284 follow up on OPS list. [21:31:35] (03PS1) 10Ori.livneh: vbench: add 500ms delay before starting profile; don't wait for init event [puppet] - 10https://gerrit.wikimedia.org/r/190091 [21:31:47] @seen wm-bot [21:31:47] mutante: I am right here [21:32:39] (03CR) 10Catrope: [C: 031] vbench: add 500ms delay before starting profile; don't wait for init event [puppet] - 10https://gerrit.wikimedia.org/r/190091 (owner: 10Ori.livneh) [21:33:35] (03PS2) 10Ori.livneh: vbench: add 500ms delay before starting profile; don't wait for init event [puppet] - 10https://gerrit.wikimedia.org/r/190091 [21:33:44] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: add 500ms delay before starting profile; don't wait for init event [puppet] - 10https://gerrit.wikimedia.org/r/190091 (owner: 10Ori.livneh) [21:33:47] bd808, i am going to re-enable logstash with my parsoid deploy now. [21:33:48] !change 190036 ori [21:33:48] https://gerrit.wikimedia.org/r/190036 [21:33:49] (test) [21:34:27] (03PS1) 10Andrew Bogott: When backup up images, actually back up the image dir. [puppet] - 10https://gerrit.wikimedia.org/r/190097 [21:34:40] _joe_: no, actually it's the old bot and one of it features has been fixed [21:35:38] T13|mobile: thanks for that, only difference seems | instead of space a [21:35:54] (03CR) 10Ori.livneh: [C: 031] "LGTM. I assume you'll remove the current docroot manually after applying this patch." [puppet] - 10https://gerrit.wikimedia.org/r/190036 (owner: 10Dzahn) [21:36:16] If you want a space, I can do that. [21:36:23] (03PS2) 10Ori.livneh: When backing up images, actually back up the image dir. [puppet] - 10https://gerrit.wikimedia.org/r/190097 (owner: 10Andrew Bogott) [21:36:24] 3operations, Analytics-Kanban, Analytics-Cluster: Increase and monitor Hadoop NameNode heapsize - https://phabricator.wikimedia.org/T89245#1032675 (10kevinator) [21:37:13] !change [21:37:14] https://gerrit.wikimedia.org/r/ [21:37:24] !ch-ch-ch-changes [21:37:32] !change del [21:37:33] Successfully removed change [21:38:02] !change act pokes $2 to check out https://gerrit.wikimedia.org/r/$1 upon request of $infobot_nick. 
[21:38:03] Key was added [21:38:13] !change fooBar mutante [21:38:13] * wm-bot pokes mutante to check out https://gerrit.wikimedia.org/r/fooBar upon request of Technical_13. [21:38:23] ori: https://33.media.tumblr.com/9a824cf2d17eaf76ec6e088c157995fc/tumblr_nhuxa68O5x1qaetdco1_500.gif [21:38:38] :) [21:38:39] mutante: good? [21:38:40] !log deployed parsoid version 4fc3b43d [21:38:47] Logged the message, Master [21:39:33] 3operations, Analytics-Kanban, Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1032686 (10ggellerman) [21:39:44] !change 189611 akosiaris [21:39:44] * wm-bot pokes akosiaris to check out https://gerrit.wikimedia.org/r/189611 upon request of mutante. [21:39:53] Technical_13: woo!:) yes [21:39:57] thx [21:40:27] yw [21:41:04] if no nick specified, look up the one in the topic after "Ops duty: " :) [21:41:18] na, not sure about that [21:41:48] I can't set defaults for arguments. [21:41:50] 3Multimedia: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1032690 (10Tgr) [21:42:08] !change noName [21:42:08] * wm-bot pokes $2 to check out https://gerrit.wikimedia.org/r/noName upon request of Technical_13. [21:42:40] it's fine like this:) [21:43:20] <_joe_> /ignore wm-bot [21:43:23] <_joe_> oops sorry [21:43:26] <_joe_> ;) [21:43:28] (03PS4) 10Dzahn: add chromium-admins to visual editor role [puppet] - 10https://gerrit.wikimedia.org/r/189611 (https://phabricator.wikimedia.org/T89038) [21:43:51] <_joe_> I get enough interrupts from this channel, srsly [21:44:03] it's the polite way to "poke" :p [21:44:06] <_joe_> oh come on, bot nonsense. please no. [21:44:06] (03PS1) 1020after4: Remove 1.25wmf10 and 1.25wmf11 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190098 [21:44:08] (03PS1) 1020after4: Add 1.25wmf17 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190099 [21:44:10] (03PS1) 1020after4: Wikipedias to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190100 [21:44:12] (03PS1) 1020after4: Group0 to 1.25wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190101 [21:44:26] <_joe_> mutante: the polite way to poke is to speak person-to-person, IMO :P [21:44:32] better than "around?" imho [21:46:27] twentyafterfour, you st [21:46:44] ill need deployment help? [21:47:12] MaxSem: I think I got it figured out [21:47:19] he's been pointing out doc errors to me. :) [21:47:35] I'm about to push all of these changes (see gerrit-wm notices above) [21:47:37] :) [21:48:49] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
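(The working trigger above is wm-bot's own "act" macro syntax, not shell, but the substitution it performs is easy to mirror; a toy illustration only, with $1 the change number and $2 the nick to poke:)

    # throwaway shell re-implementation of the !change expansion, for illustration
    change() {
        printf '* wm-bot pokes %s to check out https://gerrit.wikimedia.org/r/%s upon request of %s.\n' \
            "$2" "$1" "$USER"
    }
    change 189611 akosiaris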
[21:49:31] (03CR) 1020after4: [C: 032] Remove 1.25wmf10 and 1.25wmf11 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190098 (owner: 1020after4) [21:49:54] (03CR) 1020after4: [C: 032] Add 1.25wmf17 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190099 (owner: 1020after4) [21:50:22] (03Merged) 10jenkins-bot: Remove 1.25wmf10 and 1.25wmf11 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190098 (owner: 1020after4) [21:50:24] (03Merged) 10jenkins-bot: Add 1.25wmf17 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190099 (owner: 1020after4) [21:50:28] (03CR) 1020after4: [C: 032] Wikipedias to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190100 (owner: 1020after4) [21:50:33] (03Merged) 10jenkins-bot: Wikipedias to 1.25wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190100 (owner: 1020after4) [21:50:44] (03CR) 1020after4: [C: 032] Group0 to 1.25wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190101 (owner: 1020after4) [21:50:52] (03Merged) 10jenkins-bot: Group0 to 1.25wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190101 (owner: 1020after4) [21:52:28] (03CR) 10Dzahn: [C: 032] mv TransparencyReport docroot to /srv/org/wikimedia/ [puppet] - 10https://gerrit.wikimedia.org/r/190036 (owner: 10Dzahn) [21:54:47] (03CR) 10Andrew Bogott: [C: 032] When backing up images, actually back up the image dir. [puppet] - 10https://gerrit.wikimedia.org/r/190097 (owner: 10Andrew Bogott) [21:54:58] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [21:58:40] (03CR) 10Dzahn: "@ori yes, i did. rsynced content to new place, applied puppet, deleted old files.. done" [puppet] - 10https://gerrit.wikimedia.org/r/190036 (owner: 10Dzahn) [21:58:52] !log twentyafterfour Started scap: testwiki to php-1.25wmf17 and rebuild l10n cache [21:58:57] Logged the message, Master [22:01:15] (03PS1) 10Ottomata: Mirror CDH trusty packages from Cloudera in apt [puppet] - 10https://gerrit.wikimedia.org/r/190103 [22:05:34] !log updated wikitech-static to wmf/1.25wmf15 [22:05:39] Logged the message, Master [22:06:06] (03CR) 10Ottomata: [C: 032 V: 032] Mirror CDH trusty packages from Cloudera in apt [puppet] - 10https://gerrit.wikimedia.org/r/190103 (owner: 10Ottomata) [22:06:18] twentyafterfour: remember https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf17 [22:06:58] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [22:08:44] Nemo_bis: how is that generated? [22:09:10] twentyafterfour: make-deploy-notes [22:09:37] part of the release tools [22:10:08] <^d> 10 No such file or directory in /srv/mediawiki/php-1.25wmf16/includes/resourceloader/ResourceLoaderImage.php on line 348 [22:10:32] (03PS1) 10Ottomata: Mirror CDH 5.3.1 trusty packages from Cloudera in apt [puppet] - 10https://gerrit.wikimedia.org/r/190105 [22:11:18] (03PS2) 10Ottomata: Mirror CDH 5.3.1 trusty packages from Cloudera in apt [puppet] - 10https://gerrit.wikimedia.org/r/190105 [22:11:44] ^d: woot [22:12:10] (03CR) 10Ottomata: [C: 032] Mirror CDH 5.3.1 trusty packages from Cloudera in apt [puppet] - 10https://gerrit.wikimedia.org/r/190105 (owner: 10Ottomata) [22:12:11] !log deactivated ocg1003 in pybal [22:12:14] Logged the message, Master [22:12:22] is that production? 
i'm fairly sure these lines don't execute in production [22:12:33] <^d> Yes [22:12:39] <^d> Copied from hhvm.log [22:12:53] well, why is $wgSVGConverter not rsvg, then? [22:13:03] We're seeing upload breakage in master [22:13:17] Need to revert a commit, could probably wait until after the deploy is done [22:16:06] 3operations, Analytics: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#1032804 (10Ottomata) Oof, we might need a new machine for hive server. http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hive_install.html analytics1027 only has... [22:19:34] bblack, hi, how complex do you think it will be to tag everything? [22:30:00] 3operations: investigate etherpad service interrruptions / possible migrate service - https://phabricator.wikimedia.org/T89174#1032861 (10Dzahn) [22:30:23] 3operations: investigate etherpad service interrruptions / possible migrate service - https://phabricator.wikimedia.org/T89174#1032863 (10Dzahn) p:5Triage>3Low [22:30:57] yurikR: I don't know [22:31:01] it's varnish :) [22:31:30] bblack, great answer :) [22:31:34] any chance lack of tagging everything everywhere is the reason for the ticket about some log entries not having expected x-cs? [22:31:48] yurikR: there's almost no way I'm getting to it this week in any case [22:32:07] 3operations: create more disk space on einsteinium - https://phabricator.wikimedia.org/T89291#1032868 (10Dzahn) are you root on that box? can it be anywhere or should it be your home? [22:44:37] akosiaris: yt? [22:45:01] !log apt-get upgrading zirconium [22:45:04] or anybody, i suppose, i'm trying to figure out why I don't see new CDH packages even though they are mirrored in apt [22:45:10] Logged the message, Master [22:45:12] they are here: [22:45:12] http://apt.wikimedia.org/wikimedia/pool/thirdparty/h/hadoop/ [22:45:19] the 5.3.1 packages [22:45:37] but, they aren't showing up on in labs or prod trusty boxes even after an apt-get update [22:45:45] maybe the "thirdparty" part isnt in sources.list ? [22:46:02] deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdparty [22:46:04] s'ok, ja? [22:46:17] looks like it, nod [22:46:56] did they get imported with reprepro? [22:47:22] !log twentyafterfour Finished scap: testwiki to php-1.25wmf17 and rebuild l10n cache (duration: 48m 29s) [22:47:25] Logged the message, Master [22:47:35] uh, that was long [22:47:45] mutante: ja, [22:47:46] i did [22:48:00] sudo -i; cd /srv/wikimedia; reprepro update [22:48:37] oh, are the sources in order of priority? [22:48:44] because previously they were mirrored into main [22:48:48] the old versions are there [22:49:07] oh, that might be it , yea [22:49:19] (03PS1) 10Faidon Liambotis: reprepro: add cloudera source to trusty too [puppet] - 10https://gerrit.wikimedia.org/r/190109 [22:49:22] ^ [22:49:27] hmmMMM [22:49:36] ahhhh [22:49:38] hm. [22:49:48] also, you should remove them from main [22:49:48] ahhhh [22:50:08] can do. is there a way to do them all at once? [22:50:13] or do I have to figure out all the names and do it? 
[22:50:15] nope :) [22:50:17] haha, ok [22:50:25] reprepro remove $(fancy scripting TBD) [22:51:29] (03CR) 10Ottomata: [C: 032] reprepro: add cloudera source to trusty too [puppet] - 10https://gerrit.wikimedia.org/r/190109 (owner: 10Faidon Liambotis) [22:51:55] thanks paravoid [22:56:49] 3operations: create more disk space on einsteinium - https://phabricator.wikimedia.org/T89291#1032919 (10Dzahn) i created a new 100G logical volume, formatted it with ext4 and mounted it into /home/smalyshev/morespace. sound good? [22:57:43] 3operations: create more disk space on einsteinium - https://phabricator.wikimedia.org/T89291#1032923 (10Dzahn) a:3Dzahn [22:57:44] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.25wmf16 [22:57:50] Logged the message, Master [22:58:09] 3operations: create more disk space on einsteinium - https://phabricator.wikimedia.org/T89291#1032926 (10Smalyshev) 5Open>3Resolved Yep I think that's good, thanks! [22:59:59] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf17 [23:00:03] Logged the message, Master [23:00:18] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:01:28] RECOVERY - Disk space on einsteinium is OK: DISK OK [23:02:19] !log twentyafterfour Purged l10n cache for 1.25wmf15 [23:02:24] Logged the message, Master [23:05:39] ^demon|lunch: twentyafterfour when running make-wmf-branch, the arguments are "newBranch" (e.g. 1.25wmf17) and master? [23:05:45] per https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment_v2 [23:06:10] aude: yes [23:06:17] ok [23:06:25] at least that's how I did it [23:06:26] working on a patch for the script [23:06:27] ok [23:14:19] 3Release-Engineering, Wikimedia-General-or-Unknown, operations, WMF-Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1032950 (10Jaredzimmerman-WMF) @technical13 your point is taken, as we refine the wording we can think about the relationship between the donation action and the page. I'm... [23:16:13] (03PS2) 10Dzahn: phab: direct_comments_allowed for Domains tickets [puppet] - 10https://gerrit.wikimedia.org/r/189140 (https://phabricator.wikimedia.org/T88842) [23:22:25] 3operations, Analytics-Kanban, Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1032967 (10Ottomata) Today I practiced this in Vagrant and in Labs. I'd like to do it one more time in labs. My preliminary procedure will be this: http://www.cloudera.c... [23:22:33] <^demon|lunch> jouncebot: next [23:22:33] In 0 hour(s) and 37 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150212T0000) [23:24:16] (03CR) 10Dzahn: "yea, so, the project in phabricator is "Domains", capitalized, but operations-puppet-pplint-HEAD hates it if i do that, see PS above. Now " [puppet] - 10https://gerrit.wikimedia.org/r/189140 (https://phabricator.wikimedia.org/T88842) (owner: 10Dzahn) [23:25:32] (03PS3) 10Dzahn: fix all 'variable not enclosed by {}' [puppet] - 10https://gerrit.wikimedia.org/r/189898 [23:26:12] (03CR) 10Dzahn: fix all 'variable not enclosed by {}' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/189898 (owner: 10Dzahn) [23:28:04] (03Abandoned) 10Dzahn: Revert "map ipv6 on dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/183061 (owner: 10Cmjohnson)
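(Closing the loop on the "fancy scripting TBD" from the reprepro exchange above: one plausible, untested shape for the cleanup. The component flag and the package names are guesses from the pool listing; verify with a plain reprepro list before removing anything.)

    cd /srv/wikimedia
    # what did the old cloudera import leave in the main component?
    reprepro -C main listmatched trusty-wikimedia 'hadoop*'
    # then remove the stale entries from main so the thirdparty versions win
    reprepro -C main remove trusty-wikimedia hadoop hadoop-hdfs hadoop-yarn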