[00:22:20] (PS3) Rush: shell for proposed admin module [operations/puppet] - https://gerrit.wikimedia.org/r/120724
[00:22:22] (PS1) Rush: ops under new admin [operations/puppet] - https://gerrit.wikimedia.org/r/120972
[00:29:26] (CR) Rush: "in regards to my plan to roll this out I'll reference mr tyson" [operations/puppet] - https://gerrit.wikimedia.org/r/120724 (owner: Rush)
[00:30:53] @info db1047
[00:30:53] Krinkle: [db1047: ?] 10.64.16.36
[00:31:32] Ah, right, it's not interpreting db-secondary ( dbbot-wm )
[00:31:35] http://noc.wikimedia.org/conf/highlight.php?file=db-secondary.php
[00:57:24] (PS1) Rush: puppet repo local linter [operations/puppet] - https://gerrit.wikimedia.org/r/120976
[01:01:24] (CR) Rush: "seems to be pretty half one way, half the other for newline on keyword for role declaration. either way would be great to standardize." [operations/puppet] - https://gerrit.wikimedia.org/r/120956 (owner: Dzahn)
[01:51:32] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[02:06:17] !log LocalisationUpdate failed (1.23wmf18) at 2014-03-26 02:06:17+00:00
[02:06:31] Logged the message, Master
[02:06:59] !log LocalisationUpdate failed (1.23wmf19) at 2014-03-26 02:06:59+00:00
[02:07:05] Logged the message, Master
[02:23:29] RIP.
[02:32:36] (CR) Springle: Add a MariaDB module. (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/119930 (owner: Springle)
[02:39:34] paravoid, ok, so your argument is not is not with the usage rate, but with desire to get simplification RFC in? Please write your response on the talk page and outline your view of the RFC
[03:00:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Mar 26 03:00:41 UTC 2014 (duration 0m 40s)
[03:00:50] Logged the message, Master
[03:27:08] (PS11) Springle: Add a MariaDB module. [operations/puppet] - https://gerrit.wikimedia.org/r/119930
[03:37:13] (PS12) Springle: Add a MariaDB module. [operations/puppet] - https://gerrit.wikimedia.org/r/119930
[03:55:58] (CR) Springle: [C: 2] "Made most of the recommended changes. Trialing it on db1044 now but still open to ideas. Please shout." [operations/puppet] - https://gerrit.wikimedia.org/r/119930 (owner: Springle)
[04:14:09] (PS1) Springle: need standard for salt::minion grain-ensure [operations/puppet] - https://gerrit.wikimedia.org/r/120988
[04:16:21] (CR) Springle: [C: 2] need standard for salt::minion grain-ensure [operations/puppet] - https://gerrit.wikimedia.org/r/120988 (owner: Springle)
[04:26:13] (PS1) ArielGlenn: system roles for snapshot xml dumps (they have it for everything else) [operations/puppet] - https://gerrit.wikimedia.org/r/120989
[04:28:15] (CR) ArielGlenn: [C: 2] system roles for snapshot xml dumps (they have it for everything else) [operations/puppet] - https://gerrit.wikimedia.org/r/120989 (owner: ArielGlenn)
[04:34:56] (PS1) Springle: Puppetize dbstore100[12] [operations/puppet] - https://gerrit.wikimedia.org/r/120990
[04:36:44] (CR) Springle: [C: 2] Puppetize dbstore100[12] [operations/puppet] - https://gerrit.wikimedia.org/r/120990 (owner: Springle)
[04:39:38] (PS1) ArielGlenn: add system role for elastic search boxes [operations/puppet] - https://gerrit.wikimedia.org/r/120991
[04:42:12] (CR) ArielGlenn: [C: 2] add system role for elastic search boxes [operations/puppet] - https://gerrit.wikimedia.org/r/120991 (owner: ArielGlenn)
[04:51:45] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[04:55:08] (PS1) Springle: Include group/user blocks for testing tarballs in place of debs [operations/puppet] - https://gerrit.wikimedia.org/r/120992
[04:55:57] (PS1) ArielGlenn: system role for the osm dbs [operations/puppet] - https://gerrit.wikimedia.org/r/120993
[04:56:40] (CR) Springle: [C: 2] Include group/user blocks for testing tarballs in place of debs [operations/puppet] - https://gerrit.wikimedia.org/r/120992 (owner: Springle)
[04:56:47] bah humbug
[04:56:54] (PS2) ArielGlenn: system role for the osm dbs [operations/puppet] - https://gerrit.wikimedia.org/r/120993
[04:58:28] (CR) ArielGlenn: [C: 2] system role for the osm dbs [operations/puppet] - https://gerrit.wikimedia.org/r/120993 (owner: ArielGlenn)
[05:22:39] (PS1) ArielGlenn: system role for nova compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/120994
[05:26:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[05:26:42] (CR) ArielGlenn: [C: 2] system role for nova compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/120994 (owner: ArielGlenn)
[06:07:18] (CR) ArielGlenn: "I added snapshot roles in the module, so that entry should be removed from here." [operations/puppet] - https://gerrit.wikimedia.org/r/120956 (owner: Dzahn)
[06:23:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[07:52:45] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[08:09:12] apergos: i got confirmation from asaf bartov he doesn't need an account on stat1, so i only remove his acount there? or any other step needed?
[08:10:50] (PS1) ArielGlenn: download.wikimedia.org to misc web cluster with dataset1001 backend [operations/puppet] - https://gerrit.wikimedia.org/r/120998
[08:10:59] disable it there, yes
[08:11:43] (CR) ArielGlenn: [C: -1] "do not merge til thur mar 27 10 am utc" [operations/puppet] - https://gerrit.wikimedia.org/r/120998 (owner: ArielGlenn)
[08:12:08] maybe ottomata has a different idea of how he wants to manage the move
[08:12:32] I was expecting that we just disable the accounts that are no longer needed and that at the end of the day the enabled ones will get moved
[08:15:01] (PS1) ArielGlenn: download.wikimedia.org moved to misc-web [operations/dns] - https://gerrit.wikimedia.org/r/120999
[08:15:31] (CR) ArielGlenn: [C: -1] "do not merge til thur mar 27 10am utc" [operations/dns] - https://gerrit.wikimedia.org/r/120999 (owner: ArielGlenn)
[08:16:30] apergos: yes, but he still needs his shell account (i.e admins.pp) but not access to that machine
[08:17:00] so just removing the include accounts::abartov will do.
[08:18:13] (PS1) ArielGlenn: dumps.wikimedia.org moved to dataset1001 [operations/dns] - https://gerrit.wikimedia.org/r/121000
[08:18:54] heh, actually that won't do anything by itself
[08:18:59] but I see what you mean
[08:19:06] I meant diable on the particular host
[08:19:15] rather than remove on the hot
[08:19:16] host
[08:19:58] yes, that is my concern
[08:20:03] not very doable in puppet though, needs someone to run commands on the host itself
[08:20:20] removing on puppet, will just stop tracking it, not remove it
[08:20:24] yep
[08:20:56] so you can do by hand, or we should write logic to purge per machine <-- pain
[08:20:59] on the host itself we should go through and remove the auth keys file for everyone once their class is no longer on the host, I was thinking
[08:21:19] yeah don't bother triyng to add that
[08:21:30] i'll push, so you can see what i mean
[08:22:12] oh I know what it does (or doesn't)
[08:22:36] i rewrote the admin.pp module at one point out of sheer frustration but it got nixed in favor of an ldap based approach to be done later
[08:23:05] (PS1) Matanya: access: remove abartov from stat1 [operations/puppet] - https://gerrit.wikimedia.org/r/121002
[08:23:10] apergos: ^
[08:23:32] (CR) ArielGlenn: [C: -1] "do not merge til thur mar 27 10 am utc" [operations/dns] - https://gerrit.wikimedia.org/r/121000 (owner: ArielGlenn)
[08:24:01] this indicates the removal, but doesn't do it until you remove it manually
[08:24:06] yeah that's about as good as you can do, that way the commeent is in there
[08:24:27] so we can cull for lal the revokd ones o that ticket at the end and toss their auth key files all at once
[08:24:34] *for all the revoked
[08:24:54] that is what i thought, and you only merge once you actully have removed the key
[08:24:56] if ottomata wants the user actually removed we can do that too at that pint
[08:25:24] or even he can do it, since he's managing the move, cut out the middle man
[08:31:57] (CR) Nikerabbit: "I will no longer point out typos in commit messages of merged commits, as it hasn't produced the wanted effect: fixing typos in future com" [operations/puppet] - https://gerrit.wikimedia.org/r/119930 (owner: Springle)
[10:10:04] !log restarted gmetad on nickel
[10:10:11] Logged the message, Master
[10:38:55] (PS1) ArielGlenn: add cert to the downloads wikimedia conf file [operations/puppet] - https://gerrit.wikimedia.org/r/121024
[10:45:07] (CR) ArielGlenn: [C: 2] add cert to the downloads wikimedia conf file [operations/puppet] - https://gerrit.wikimedia.org/r/121024 (owner: ArielGlenn)
[10:47:01] anyone happen to know what is entailed in backporting an extension for deploy?
[10:53:45] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[10:55:34] hashar, do you happen to know what is entailed in backporting an extension for deploy?
[11:40:59] dan-nl: I don't think we have any doc
[11:41:12] but basically fetch the wmf branch deployed on the cluster
[11:41:28] the extension is a submodule there, you have to update it to the desired commit and submit for review
[11:41:52] off for lunch
[11:41:55] thanks
[11:46:39] dan-nl: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_submodule
[11:46:50] most of that applies
[11:47:07] thanks
[12:18:00] !log gerrit.wikimedia.org interface throws 503
[12:18:04] Does someone know why after merging https://gerrit.wikimedia.org/r/109237 http://www.pywikipedia.org/ is redirected correctly, but http://pywikipedia.org/ not? (https://bugzilla.wikimedia.org/58803)
[12:18:06] Logged the message, Master
[12:21:09] Nemo_bis: gerrit.wikimedia.org works for me.
[12:22:48] I restarted gerrit just before nemo logged that
[12:22:53] it seems to be responsive now
[12:23:12] !log restarted gerrit (slow or unresponsive, nothing obvious wrong)
[12:23:17] Logged the message, Master
[12:23:51] scfc_de: it was a lottery
[12:23:54] thanks apergos
[12:24:13] sure
[12:28:58] I think Gloria filed a bug some time ago to track Gerrit unsteadiness.
[12:32:03] !log rebuilding cirrus search indexes for all wikipedias with cirrus in preparation a change requiring it in the release on Thursday
[12:32:09] Logged the message, Master
[12:37:54] (PS2) coren: Tools: Rename references to local-admin to tools.admin [operations/puppet] - https://gerrit.wikimedia.org/r/120241 (owner: Tim Landscheidt)
[12:42:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[12:47:23] (CR) coren: [C: 2] Tools: Rename references to local-admin to tools.admin [operations/puppet] - https://gerrit.wikimedia.org/r/120241 (owner: Tim Landscheidt)
[12:50:30] hey...i have a problem resetting my password and i need some help from a sysadmin (sending the resetting email manually would be really helpful)
[12:51:46] firstroad: i fear no one can help you here
[12:52:18] and who can help me... i was told that i may find a sysadmin in this room
[12:52:41] yes, but sysadmins don't help random users requesting reeset
[12:53:08] so what do you suggest?
[12:53:12] what if you are impersonating? they can't really know if you are the account owner
[12:53:31] matanya: mhm, I guess it would be sufficient if a sysadmin would just tell firstroad which email address is linked to his account
[12:53:35] what is your problem in detail ?
[12:54:02] Vogone: that is a privacy breach
[12:54:33] after requesting to reset my password no email arrives in my email address
[12:54:49] did you check spam firstroad ?
[12:54:52] i checked spam and trash folders, i checked all my others email addresses but still nothing
[12:55:33] are you sure the mail you provided is valid?
[12:55:48] i just want somebody to check the account with username "firstroad" to see if something is wrong and send a password resetting email to the linked email address
[12:55:49] matanya: really? I mean, it's just an email address which is getting revealed anyway if you use wikimail
[12:56:17] no Vogone. email isn't shared
[12:56:33] yes i have checked like thousands of times
[12:56:42] if you send a mail through wikimail the one who recieves it can see it
[12:57:03] right so? it isn't public
[12:57:47] I didn't say it was but revealing it to a single person who probably owns this account isn't "public" either :)
[12:58:27] and it would be pointless to ask for it unless you own the account
[12:58:53] firstroad: the best i can offer is mail ops at ops-requests@rt.wikimedia.org with your request and hope for the best
[12:59:10] yup, I'd say so as well
[12:59:22] thank you all for your help
[13:01:20] (CR) Ottomata: "This won't really do much, and we don't need to actively 'remove' accounts from stat1. We are just finding accounts that we don't plan to" [operations/puppet] - https://gerrit.wikimedia.org/r/121002 (owner: Matanya)
[13:03:15] (CR) Matanya: "The point here is to have a valid puppet change to show the account changed status. I have done this after speaking to Ariel. Once we have" [operations/puppet] - https://gerrit.wikimedia.org/r/121002 (owner: Matanya)
[13:05:06] apergos ^
[13:05:18] why do we need to 'remove' people from stat1 now?
[13:05:36] this effort was mainly just to get a list of people we aren't going to include on stat1003
[13:05:54] the WIP list is being kept here: http://etherpad.wikimedia.org/p/stat1_accounts
[13:05:58] ottomata: it's up to you what to do; when I initially compiled the list it was 'who doesn't need to be here'
[13:06:33] manybubbles: why not throttle the shwiki bot causing the job queue to over load? there is a policy, and it seems like it is violating it
[13:06:49] remving cruft makes it easier to maintain the current host too
[13:07:01] naww, the current host is gonna be scrapped
[13:07:01] +1 on this apergos
[13:07:04] its tampa
[13:07:18] it is going to be shipped, isn't it ?
[13:07:29] not that I know of, if it is it will be wiped
[13:07:43] yep, we wipe before shipping
[13:07:53] stat1003 will be a new node in eqiad that is a replacement for stat1
[13:08:12] I will migrate over data and selected user accounts from stat1
[13:08:31] ok, so you perfer no puppet changes ?
[13:09:00] kinda, i don't see a need to try to clean up stat1, that was one of the reasons we waited for stat1003 to start doing this audit
[13:09:15] i mean, they don't hurt…its just more to review and merge :p
[13:09:20] sure, i'll drop this change
[13:09:27] ok, danke!
[13:09:40] (Abandoned) Matanya: access: remove abartov from stat1 [operations/puppet] - https://gerrit.wikimedia.org/r/121002 (owner: Matanya)
[13:10:00] actually I started this audit, a month and a half ago. however I did it to hlep you out; so however you want to handle the results, it's up to you
[13:10:02] enjoy
[13:10:14] that's true apergos, sorry I didn't mean to step on toes
[13:10:34] you're not (and I don't plan to either)
[13:10:45] i guess i should rephrase that as 'push' this audit forward
[13:10:50] sure
[13:11:02] thanks much to the both of you, btw
[13:11:30] I would suggest though that if there are any folks on the list who no longer work with/for us that their accounts actually become goe
[13:11:32] gone
[13:11:45] oh yes yes
[13:11:49] k
[13:12:02] if we discover that someone's full shell access should be removed (erosen, giovanni, frank s)
[13:12:08] then we should set their keys to absent
[13:12:11] yep
[13:12:15] that's good, we shoudl def do those changes
[13:12:35] i just don't think we should bother actively doing anything directly to stat1
[13:12:47] ok, good luck with the move
[13:12:50] ok danke
[13:13:05] (PS1) Hashar: beta: drop en/de wikivoyage databases [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121047
[13:14:03] (PS2) Ottomata: Adding CNAME archiva.wikimedia.org -> titanium.wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/120843
[13:14:15] (CR) Ottomata: [C: 2 V: 2] Adding CNAME archiva.wikimedia.org -> titanium.wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/120843 (owner: Ottomata)
[13:14:26] (CR) Hashar: [C: 2] "beta only" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121047 (owner: Hashar)
[13:14:41] (Merged) jenkins-bot: beta: drop en/de wikivoyage databases [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121047 (owner: Hashar)
[13:27:22] (PS1) ArielGlenn: update conf file for download.wm.o to support https [operations/puppet] - https://gerrit.wikimedia.org/r/121050
[13:29:41] (CR) ArielGlenn: [C: 2] update conf file for download.wm.o to support https [operations/puppet] - https://gerrit.wikimedia.org/r/121050 (owner: ArielGlenn)
[13:30:56] (PS1) Hashar: beta: configure db1 on eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121051
[13:31:21] (CR) Hashar: [C: 2] beta: configure db1 on eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121051 (owner: Hashar)
[13:31:28] (Merged) jenkins-bot: beta: configure db1 on eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121051 (owner: Hashar)
[13:54:45] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC
[13:59:45] (PS1) Hashar: beta: point udp2log in eqiad to its bastion [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121056
[14:02:48] (CR) Manybubbles: [C: 1] "Looks good to me. Would +2 if this didn't require kid gloves for deployment. Feel free to +2 and deploy when you can deploy it." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121056 (owner: Hashar)
[14:03:13] (CR) Hashar: [C: 2] "deploying. thank you!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121056 (owner: Hashar)
[14:03:20] (Merged) jenkins-bot: beta: point udp2log in eqiad to its bastion [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121056 (owner: Hashar)
[14:08:53] !log hashar synchronized wmf-config/CommonSettings.php 'beta: vary udp2log destination by $wmfDatacenter {{gerrit|121056}}'
[14:08:59] Logged the message, Master
[14:14:41] (PS1) Hashar: beta: complete redis config for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121060
[14:14:50] (CR) Hashar: [C: 2] beta: complete redis config for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121060 (owner: Hashar)
[14:14:58] (Merged) jenkins-bot: beta: complete redis config for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121060 (owner: Hashar)
[14:18:01] ottomata: whenever you are ready to move the last one
[14:18:21] oooo, yeah, ok, lemme see
[14:22:03] (PS1) Hashar: beta: CirrusSearch servers in eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121063
[14:22:25] ok, i'm going to shutdown an25,
[14:22:34] i'm a little concerned as to what is going to happen when the IP changes
[14:22:58] (CR) Manybubbles: [C: 1] beta: CirrusSearch servers in eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121063 (owner: Hashar)
[14:22:59] i think it will be ok, as long as we are sure that the dns has changed and is not cached everywhere else before we turn zookeeper back on with the new ip
[14:23:35] !log stopping zookeeper and analytics1025 and shutting down, preparing to move it to Row D
[14:23:40] Logged the message, Master
[14:25:32] hmm, whoa that was weirder than I expected, cmjohnson1, hold on, I have to look at some things
[14:26:04] ok..let me know
[14:27:35] (PS1) Hashar: beta: fill in parsoid cache IP for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121067
[14:28:20] (CR) Hashar: [C: 2] beta: fill in parsoid cache IP for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121067 (owner: Hashar)
[14:28:28] (Merged) jenkins-bot: beta: fill in parsoid cache IP for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121067 (owner: Hashar)
[14:33:22] (CR) Hashar: [C: 2] beta: CirrusSearch servers in eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121063 (owner: Hashar)
[14:34:22] (Merged) jenkins-bot: beta: CirrusSearch servers in eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121063 (owner: Hashar)
[14:38:31] (PS4) Hashar: beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788
[14:41:39] (PS5) Hashar: beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788
[14:42:56] ok great, cmjohnson1, got it
[14:43:03] analytics1025 shoudl be shutting down
[14:43:06] i just learned something!
[14:43:24] zookeepers cache IP addresses of peers when they start up!
[14:43:45] PROBLEM - Host analytics1025 is DOWN: PING CRITICAL - Packet loss = 100%
[14:43:46] when we previously moved analytics1023 into its new row as part of the previous reshuffle, I didn't restart all of the other nodes
[14:43:56] i think the whole zk cluster has been in a weird state since then
[14:44:06] still operational, but confused about where anatlyics1023 was
[14:44:37] when I turned of an25 a few minutes ago, it started to get really upset, as now there were even fewer nodes hanging out
[14:44:45] anyway, analytics1025 can be moved now
[14:47:27] cool
[14:52:03] ottomata: are you all done migrating the Analytics project? Or at least clear of pmtpa?
[14:52:59] andrewbogott: yes!
[14:53:04] link me page one more time and I will update it
[14:53:15] https://wikitech.wikimedia.org/wiki/Labs_Eqiad_Migration_Progress
[14:53:16] thanks!
[14:53:20] we have deleted all of our pmtpa instances! (as of right now :), there were 2 left)
[14:58:15] RECOVERY - Puppet freshness on tantalum is OK: puppet ran at Wed Mar 26 14:58:12 UTC 2014
[15:02:03] (PS2) Rush: ops under new admin [operations/puppet] - https://gerrit.wikimedia.org/r/120972
[15:03:05] PROBLEM - DPKG on tantalum is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:03:58] ottomata: two more cleared status for stat1
[15:04:20] (PS1) Cmjohnson: updating dns for analytics1025 [operations/dns] - https://gerrit.wikimedia.org/r/121079
[15:05:46] (CR) Cmjohnson: [C: 2] updating dns for analytics1025 [operations/dns] - https://gerrit.wikimedia.org/r/121079 (owner: Cmjohnson)
[15:06:05] RECOVERY - DPKG on tantalum is OK: All packages OK
[15:06:13] awesome, thanks matanya
[15:06:28] !log dist-upgrade and reboot tantalum
[15:06:33] i've had some people respond to me in that thread that i've sent out that I've moved around in the etherpad as well
[15:06:34] Logged the message, Master
[15:07:12] ottomata: booting now
[15:07:17] woot
[15:08:01] ottomata: once you have everything stable I would like to power an1022 and 1024 down to move them within the rack
[15:08:05] PROBLEM - Host tantalum is DOWN: PING CRITICAL - Packet loss = 100%
[15:08:14] good, 16 left
[15:08:54] cmjohnson1: ?
[15:09:09] (PS6) Hashar: beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788
[15:09:15] RECOVERY - Host tantalum is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[15:10:20] it's purely cosmetic for me...i have a bunch of holes in the rack now from where I removed these 4. If I can move these 4 than I can keep the an boxes together and have room to add more servers.
[15:10:39] (CR) Hashar: [C: 2] beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788 (owner: Hashar)
[15:10:40] ^ a little of that robh ocd kicks in here
[15:10:41] oh ok
[15:10:46] we can do those one at a time, sure
[15:10:47] (Merged) jenkins-bot: beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788 (owner: Hashar)
[15:10:53] will let you know
[15:10:54] gimme a few
[15:11:03] ok..cool. thx
[15:11:03] its not cosmetic
[15:11:09] filling rack capacity is totally legit
[15:11:20] it just also happens to be cosmetic in ADDITION to that ;]
[15:11:23] we can fill it with not moving any of it but then it's messy
[15:11:48] The OCD make the best DC techs
[15:12:21] robh: are you getting pingged by ocd ? :P
[15:12:22] though getting in and out of the cage turning everything three times is time consuming
[15:12:33] no, he pinged me specifically
[15:12:51] you might might want to look at ticket 7132 ?
[15:12:55] (plus im always watching)
[15:13:08] big brother rob
[15:13:09] matanya: I just forwarded it to info and closed it
[15:13:21] we dont handle that stuff usually, info helps users with passowrds and the like
[15:13:40] basiclly he can't do anything then
[15:14:04] too bad. the weird issue is i good several of those myself the last few days
[15:14:10] *got
[15:14:21] you were unable to login and didnt get your password reset email?
[15:14:36] not me, users
[15:14:52] one on my tlak page
[15:14:57] two in mt mail
[15:15:00] if there is a rash of them thats diffferent, and not indicated on said ticket ;]
[15:15:08] cmjohnson1: how long should it take for all ns servers to change dns?
[15:15:19] i'm getting conflicting resolution on successive digs
[15:15:27] yes, agreed, but i didn't want to raise an issue i'm not sure exists
[15:15:47] i guess we can see if any of those email addresses are assigned to the account
[15:16:02] if they arent, then there is nothign i can do, but if they are then maybe we can do something.
[15:16:17] urgh, i have to recall how to query that.... have not done it in years.
[15:16:56] robh: https://wikitech.wikimedia.org/wiki/Reset_password
[15:17:29] im inclined to go with the first part, we dont help them ;p
[15:17:53] hrmm, home wiki i guess doesnt matter if its unified
[15:17:57] as long as its sul wiki
[15:18:14] matanya: oh, this is to set it
[15:18:22] im not going to set emails
[15:18:25] ottomata: i am as well
[15:18:28] (CR) Hashar: Make syslog-ng basepath a parameter (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/119256 (owner: Hashar)
[15:18:52] robh: i wouldn't ask if i wasn't suspecting there is an issue
[15:19:08] though i might be imagening
[15:19:11] yea, im just saying i want to query the email not set a new one
[15:19:28] i dont know this dude, so im no comfortable setting an email account, as it can be used to steal the account
[15:19:32] or it is just a coincedence, who knows
[15:19:39] yes
[15:19:41] heh
[15:20:05] Reedy might remember
[15:20:12] i was about to ping him, hahaha
[15:20:20] or roan who is away ;]
[15:20:29] gmta
[15:20:51] If it turns out one of those email addresses is indeed on the account
[15:21:03] then it would be nice to be able to follow a password reset through our system and ensure it goes out
[15:21:28] if one of those isnt, then there is nothign we can do, but i wouldnt mind confirming.
[15:21:36] (PS5) Hashar: Make syslog-ng basepath a parameter [operations/puppet] - https://gerrit.wikimedia.org/r/119256
[15:21:39] makes sense
[15:22:09] all of my notes on this are for setting the stuff not querying
[15:22:23] similar (but outdated comapred to) that wikitech page
[15:22:33] my notes are pre-SUL =]
[15:23:08] with big notes of 'Rob, you aren't supposed to go into the db's directly, use the php wrapper'
[15:23:46] rephrase it to :never ever touch the DB, by all means!!!
[15:25:23] jusst in time RoanKattouw ^
[15:26:13] matanya: What's up?
[15:26:16] (PS1) Aude: Update Wikidata OAuth grants [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121086
[15:26:33] anomie: manybubbles can we deploy that ^
[15:26:44] hi there, robh is trying to remeber how to find out what is a user email attached to an account
[15:27:01] aude: for the swat deploy?
[15:27:02] We have someone complaining they arent getting password resets they put in
[15:27:07] yeah
[15:27:09] OK
[15:27:11] i can do or wahtever
[15:27:12] what*
[15:27:13] so i want to ensure they have the email listed to the account
[15:27:17] OK
[15:27:33] https://rt.wikimedia.org/Ticket/Display.html?id=7128
[15:27:44] sure. let me put on my SWAT geat
[15:28:00] (PS2) Aude: Update Wikidata OAuth grants [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121086
[15:28:03] ok
[15:28:16] * aude thought we already made that change, but apparently not
[15:28:17] robh: Hmm is this about a mailman passwd as opposed to an MW passwd?
[15:28:28] oh, wrong link
[15:28:29] sorry
[15:28:34] No worries
[15:28:34] https://rt.wikimedia.org/Ticket/Display.html?id=7132
[15:28:39] too many RT tickets open!
[15:28:41] You're on RT duty, I know how it is :)
[15:28:48] greg-g: going to do a SWAT deploy then
[15:28:54] firefox is slowing down my entire laptop its so overloaded
[15:28:55] heh
[15:29:14] someday i'll just move to chrome, but it bothers me on fundamental level to abandon firefox.
[15:29:17] oh, that is right, it is early for him
[15:29:23] anyone doing a deploy now?
[15:29:36] Psh, I have 596 tabs open in Firefox and I'm running this chat client in Firefox, and my laptop is fine :P
[15:29:38] aude: can you stick the patch set on the deployments page under the right SWAT deploy day?
[15:29:45] Also, I should do some spring cleaning re those 596 tabs
[15:29:46] manybubbles: ok
[15:29:50] * manybubbles has the conch
[15:29:57] (CR) Manybubbles: [C: 2] Update Wikidata OAuth grants [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121086 (owner: Aude)
[15:30:10] (Merged) jenkins-bot: Update Wikidata OAuth grants [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121086 (owner: Aude)
[15:30:16] So usually users being unable to login is a user issue and not ops domain. However matanya points out he has had a few of these recently. Ideally I'd like to confirm its set to his email address. If it is not, there is nothing I am willing to do for him. If it is, I'd like to see if we are sending out the password resets.
[15:30:32] robh: So, this user didn't state which wiki. Did it originally come into info-en@wm.o ? I don't know a lot about RT so I don't understand the output here
[15:30:47] i dunno anythign not on ticket, i was avoiding trying to talk to him
[15:31:04] cuz chances are we wont help him
[15:31:08] Right
[15:31:17] does it matter with SUL?
[15:31:17] ottomata: you should be good to go now
[15:31:23] I'll check enwiki and elwiki
[15:31:27] dns is resolving correctly
[15:32:04] RoanKattouw: yep, was about to suggest elwiki
[15:32:05] !log manybubbles synchronized wmf-config/CommonSettings.php 'SWAT deploy for Wikidata'
[15:32:11] Logged the message, Master
[15:32:20] you doing in php console or what? (just curious on how its done so i can do in future)
[15:32:23] aude: synced
[15:32:24] thanks
[15:32:29] please verify it looks lok
[15:32:31] ok
[15:32:35] manybubbles: Ha, you beat me to responding
[15:32:37] i've asked magnus
[15:32:41] can check
[15:32:43] hm, cmjohnson1, i still get occasional .36.125s
[15:32:48] anomie: I gets pings I do things
[15:33:26] * anomie was in the middle of replying to something else when pinged
[15:33:50] totally my fault that i forgot that config change
[15:34:06] ottomata: odd
[15:34:11] was sure we already did it
[15:34:14] aude: eh? if everything is working its all good
[15:34:26] that is what that time is for
[15:34:34] shall find out from magnus, but think it's good
[15:34:52] cool
[15:34:59] * manybubbles puts down the conch
[15:35:03] :)
[15:35:20] robh: I have results, will contact in private
[15:38:22] * yurik is looking for Reedy ...
[15:43:22] (CR) Hashar: "> My point is that you should not run jshint without configuration" [operations/puppet] - https://gerrit.wikimedia.org/r/119750 (owner: Hashar)
[15:43:53] outcome robh ?
[15:44:02] ottomata: pretty consistent now
[15:45:11] ja looking better
[15:45:22] (PS1) John F. Lewis: Add importsources to enwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121090
[15:46:05] (PS1) Rush: jenkins admin user breakout [operations/puppet] - https://gerrit.wikimedia.org/r/121091
[15:46:29] ok cmjohnson1, that is looking good, zookeeper is back up on an25
[15:46:40] can you give me 20 minutes before we move the other ones?
[15:46:49] yep [15:50:10] (03PS1) 10coren: Tool Labs: add user-requested packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/121093 [15:55:00] !log Jenkins: gallium swapping :/ [15:55:06] Logged the message, Master [16:00:23] (03CR) 10coren: [C: 032] "Package add." [operations/puppet] - 10https://gerrit.wikimedia.org/r/121093 (owner: 10coren) [16:06:17] oh wait, cmjohnson1, which two do you want to do? [16:06:33] ah 22 and 24 [16:06:33] hm [16:06:36] ok [16:06:38] yes [16:06:41] 22 will be interesting [16:06:44] that is a kafka broker [16:06:48] let's do 24 first [16:06:50] ready? [16:06:55] yes [16:07:22] should only take me 1 minute to move and the rest is boot time [16:09:32] ok, cmjohnson1, an24 shutting down now [16:09:38] see it [16:10:08] anybody wants to update apache conf? (patch by Reedy) https://gerrit.wikimedia.org/r/#/c/119985/ [16:10:42] !log Jenkins got slightly overloaded for unknown reason. Will restart Zuul to clean some leaked file descriptors. [16:10:47] Logged the message, Master [16:11:35] PROBLEM - Host analytics1024 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:24] !log taking analytics1024 offline so it can be re-racked [16:12:31] Logged the message, Master [16:14:05] RECOVERY - Host analytics1024 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [16:15:53] !log restarting Zuul [16:15:59] Logged the message, Master [16:17:59] !log Jenkins: restarted gallium jenkins slave [16:18:04] Logged the message, Master [16:18:06] cmjohnson1: is it back up and racked? [16:18:54] !log Jenkins clearing /tmp on integration-slave1001 and 1002 [16:19:00] Logged the message, Master [16:19:02] ottomata yep [16:20:11] great, looking fine [16:20:17] ok oh boy, kafka! [16:21:13] you ready for an22 cmjohnson1? 
[16:21:18] if you are [16:21:20] k [16:21:21] yeah [16:21:35] !log initiating controlled shutdown of kafka broker analytics1022 for reracking [16:21:40] Logged the message, Master [16:22:22] ok cmjohnson1, it is shutting down now [16:23:05] * cmjohnson1 sees it off [16:23:35] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:55] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 25.0 [16:25:00] shhhhh its ok [16:25:15] * robh visulizes ottomata smothering icinga with a pillow [16:25:21] shhh, go to sleeeeeep [16:25:24] just let it happen [16:25:30] heh [16:26:34] ottomata: booting [16:27:32] (03PS1) 10GWicke: Increase the number of Parsoid job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/121100 [16:29:15] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:30:39] ottomata: ^ [16:30:48] yup, watching it [16:30:50] looking fine [16:30:54] the replicas are catching back up [16:30:58] that'll take a couple of minutes [16:30:59] manybubbles: ping [16:31:05] gwicke: here [16:31:43] https://gerrit.wikimedia.org/r/121100 ups the number of parsoid job runners slightly [16:32:22] I'm not sure if you have +2 on puppet [16:32:28] and can deploy this [16:32:42] (03CR) 10Manybubbles: [C: 031] "So long as it is safe for the parsoid cluster I think it should be safe for the job runners." [operations/puppet] - 10https://gerrit.wikimedia.org/r/121100 (owner: 10GWicke) [16:32:59] nop! [16:33:02] nope! [16:33:06] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [16:33:06] but I'll certainly +1 it [16:33:14] k, thx [16:33:22] I don't know if anyone on ops is really into the job queue [16:33:26] we'll need to find a root to deploy this then [16:33:27] and can deploy it [16:33:37] * gwicke looks around [16:34:09] can try to catch RobH later as he's on RT duty [16:34:13] ? 
[16:34:22] or now [16:34:25] whats up? [16:34:26] oh, lowercase [16:34:36] heh [16:34:49] depending on which client sets my nick it confuses folks [16:34:51] https://gerrit.wikimedia.org/r/121100 is a small change to process parsoid jobs more quickly [16:34:59] znc is set to robh, but limechat to RobH [16:35:16] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [16:35:35] thos job runners are just doing HTTP requests to the parsoid service, so no load issue on the job runner host [16:35:37] gwicke: when i merge this do i need to force update on parsoid hosts or let it just auto propogate? [16:35:49] the job queue needs to be restarted afaik [16:35:53] it's a shell loop [16:36:12] i've not done that, lemme take a look [16:36:43] oh, just the parsoid restart? [16:36:47] ive done that a billion times =] [16:36:51] gwicke: ^? [16:36:57] no, this is the job queue [16:37:04] ahh, ok, then i have no idea how to restart that [16:37:14] where does that script run from these days? [16:37:17] it's just something making HTTP requests to the Parsoid cluster to update / prime the caches [16:37:30] rdb1001-1004? [16:37:40] https://wikitech.wikimedia.org/wiki/Job_queue ? [16:37:41] I guess puppet should know [16:38:14] node /^mw10(0[1-9]|1[0-6])\.eqiad\.wmnet$/ { [16:38:39] mw1001-1016 [16:38:56] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [16:39:50] hrmm, i'm going to merge and puppet run on one and see if it rekicks them automatically [16:40:29] (03CR) 10RobH: [C: 032] "Seems a reasonable enough incremental change, and as I am on RT duty and Gabriel asked about this, I'm merging." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/121100 (owner: 10GWicke) [16:40:30] there is a pid file in the job runner manifest, so maybe it's restarted automatically [16:40:40] gwicke: this is result of that email thread on jobs i assume? [16:40:52] the job queue backing up that is [16:40:58] yes, the parsoid part of it [16:41:09] it does nothing to other job types of course [16:41:36] IIRC LinksUpdate is backed up on shwiki, this change won't do anything for that [16:41:50] merged, running puppet on mw1001 to confirm it rehups the job script [16:42:23] also checking to see if the role jobrunning thing now works for this per daniels role salt change [16:42:31] cuz that would make it nicer than salting a regex [16:42:50] gwicke: it refreshes and restarts it [16:42:54] so yay =] [16:42:54] cool [16:45:39] I'm keeping an eye on the PHP API load [16:45:56] that has been the limiting factor so far [16:48:10] thanks RobH [16:48:31] (03PS1) 10Jgreen: create /srv/deployment on ocg server [operations/puppet] - 10https://gerrit.wikimedia.org/r/121104 [16:48:49] RobH, are all job runners restarted already? [16:49:04] nope, my salt grain command is just now running [16:49:14] for my part I've asked the bot making tons of changes to slow down [16:49:16] k [16:50:23] I think in the longer term we'll need to either make the API more efficient or get more hardware for it [16:50:34] (the PHP API) [16:51:05] gwicke: ok, they all ran and finished [16:51:22] RobH: awesome, thanks! [16:51:28] quite welcome [16:51:46] used daniels role change to salt [16:51:49] was much better [16:52:52] there are nice features in salt; I'm looking forward to it working for non-roots too ;) [16:54:21] is gerrit/jenkins dead? [16:54:46] oh nm. 
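(Editor's aside: a minimal sketch, not part of the channel log, checking that the Puppet node pattern quoted at 16:38:14 covers exactly mw1001 through mw1016 as stated at 16:38:39. The helper name matching_hosts and the probed range 1000-1020 are illustrative assumptions, not anything used in the puppet repo.)

```python
import re

# Pattern copied from the `node` line quoted in the log at 16:38:14.
JOB_RUNNER_NODE = re.compile(r"^mw10(0[1-9]|1[0-6])\.eqiad\.wmnet$")


def matching_hosts(first=1000, last=1020):
    """Return the mwNNNN.eqiad.wmnet names in [first, last] that the pattern matches.

    Hypothetical helper for illustration only; the probed range is an assumption.
    """
    names = (f"mw{n}.eqiad.wmnet" for n in range(first, last + 1))
    return [name for name in names if JOB_RUNNER_NODE.match(name)]


hosts = matching_hosts()
print(hosts[0], hosts[-1], len(hosts))
```

The alternation 0[1-9]|1[0-6] excludes mw1000 and stops at mw1016, which agrees with the "mw1001-1016" summary given in the channel.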
[16:54:56] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC [16:55:05] (03CR) 10Jgreen: [C: 032 V: 031] create /srv/deployment on ocg server [operations/puppet] - 10https://gerrit.wikimedia.org/r/121104 (owner: 10Jgreen) [16:58:06] RECOVERY - Redis on tantalum is OK: TCP OK - 0.001 second response time on port 6379 [17:00:50] PHP API load increased a bit from around 55 to 65%, but seems to be stable there [17:05:16] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [17:05:46] not a bad bump up in load [17:07:14] though mw1001 took the spike first from my test [17:07:16] and its up there. [17:08:11] you can see where its load went down and up again about 8 minutes before the rest (and it did it again when i ran puppet on the rest and it via role group) [17:08:16] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1550.97042408 [17:08:33] the load on the job runners should not really change in the longer term [17:08:58] the parsoid jobs are just doing a handful of HTTP requests per second to the Parsoid cluster [17:09:40] the main change in load is on the PHP API cluster which is in turn used by Parsoid; so far we have been careful to keep the number of requests to the Parsoid cluster low enough to avoid overloading the PHP API cluster [17:11:54] load there is back to below 60%, so looks good: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [17:12:18] cmjohnson1: all is well! resolved the ticket :) thank you! 
[17:17:51] gwicke: i've thought we need to rebalance the api and general application clusters for a few months now [17:17:59] we tweaked it last a LONG time ago [17:18:32] but i mostly thought we had too many idle in general apache [17:18:41] and wanted to throw more at jobrunners and like, not api specific, but still =] [17:19:02] we do have lots of general mw servers at very low load levels [17:19:25] (not going above 40% over week long timeline) [17:19:46] yeah, it sounds like it would make sense to repurpose some of them to be API servers or job runners [17:19:53] both yep [17:20:04] the load balancing in the API cluster is memory-based [17:20:09] lets see if there is a ticket, if not i'll make one. [17:20:38] heh, slightly related https://rt.wikimedia.org/Ticket/Display.html?id=6832 [17:20:49] old ticket i made for the fact api cluster load itself doesnt seem evenly distributed. [17:20:59] mostly because there are some very expensive queries in the API that eat up a lot of memory, recently created a security bug for that [17:21:08] i'm mostly curious, how come in the graph gwicke posted from ganglia the load looks to linearly increase from low to high as the server # increases? (e.g. 
all green at the begining and all red at the end) [17:21:24] ebernhardson: weights in load balancing [17:21:29] the older servers are usually slower and older [17:21:40] but, they should be weighted better so they are all even, even older versus new [17:21:45] the weights are inverse to the number of cpus [17:21:46] RobH: ah, that makes sense [17:21:56] but proportional to the memory in those boxes [17:22:08] but they also need tweaking, the ticket i put in awhile back and just linked points out that while the weights and apache configs are tweaked to box spec [17:22:10] the reason is that API outages were mainly memory driven so far [17:22:18] it seems to need additional tweaking as its not quite balanced imo [17:22:36] but, now we also need more =] [17:22:55] give folks an api cluster and they'll start using it, how dare them! [17:23:04] also IIRC one set of servers has HT enabled while the other doesn't, which also confuses the load numbers [17:23:15] gwicke: i think it was a test to see if HT was good or bad [17:23:25] that was never followed up upon. [17:23:34] ht good or bad for this role that is. [17:23:53] I see [17:24:30] Roan rebalanced it during https://wikitech.wikimedia.org/wiki/Incident_documentation/20140313-API-Parsoid, but then changed it back to be memory-based [17:25:00] "Roan had already attempted to address that by rebalancing the API appserver pool weights but I pointed out that the original weights were correct because a) the "24-core" servers are actually 12-core with HT, b) the weights are adjusted based on memory, not CPU. We quickly reverted that." 
[17:25:14] by Faidon [17:25:24] /cc paravoid [17:25:56] (03PS3) 10Rush: ops under new admin [operations/puppet] - 10https://gerrit.wikimedia.org/r/120972 [17:25:58] (03PS2) 10Rush: jenkins admin user breakout [operations/puppet] - 10https://gerrit.wikimedia.org/r/121091 [17:26:00] (03PS4) 10Rush: shell for proposed admin module [operations/puppet] - 10https://gerrit.wikimedia.org/r/120724 [17:26:02] (03PS1) 10Rush: breaking out parsoid admin role [operations/puppet] - 10https://gerrit.wikimedia.org/r/121109 [17:26:19] sorry I did not know it was going to be spam central when I did the native dependencies on each update [17:27:24] don't let that keep you in any way [17:27:51] (03PS1) 10Ottomata: Adding extra property configs to nginx simple-proxy.erb, upping client_max_body_size for archiva proxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/121110 [17:29:08] heh [17:29:12] gwicke: gtk [17:29:21] chasemp: be proud of gerrit spamming the channel =] [17:30:10] chasemp: you may not have yet experienced the Reedy-Spam-Hour in here.... [17:30:53] basically, a ton of lined up config changes get reviewed/merged by him in prep for deployment. It's a sight to behold. [17:31:07] nice [17:32:58] You've clearly never seen what #mediawiki-visualeditor looks like on normal Wednesdays [17:33:06] I spend most of the day rebasing and merging things [17:33:48] I'm still trying to work out how best to consume the channels to be honest [17:33:56] my irc days pre-wmf are well in the past [17:35:31] that was my reaction when i started here [17:35:39] 'wait, irc, really?' [17:36:15] now i wish we used it instead of hangouts. [17:37:06] * RobH is running to corner store for bandaids, back shortly [17:37:22] (i cut myself yesterday right on the pad of thumb, have to keep something over it) [17:41:46] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1719: active_shards: 5080: relocating_shards: 1: initializing_shards: 6: unassigned_shards: 0 [17:41:55] +1 on medicating [17:42:46] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1725: active_shards: 5086: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:51:28] RobH, I added a comment re load balancing to https://rt.wikimedia.org/Ticket/Display.html?id=6832 [17:52:38] (03CR) 10Ottomata: [C: 032 V: 032] Adding extra property configs to nginx simple-proxy.erb, upping client_max_body_size for archiva proxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/121110 (owner: 10Ottomata) [17:52:52] gwicke: its appreciated [18:00:14] hey akosiaris, around? [18:05:05] ori, i heard he was on vaca this week [18:05:42] ah, bummer [18:06:41] ottomata: since you do packaging stuff, could i ask you to take a quick glance at ? i think it may be something you could do super quickly [18:08:01] ori, sounds pretty easy, will check it out [18:08:29] hm, where is alex's packaing? [18:08:32] cool, it's one of those niggling things that could unblock a lot of thing [18:08:42] i don't know :/ maybe labs? i think he has a project with his name [18:08:54] oh its not in a repo? 
[18:09:22] ls [18:09:48] ottomata: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000af.eqiad.wmflabs [18:09:54] dunno [18:14:34] hmm, ori, i'm a little relucant to do that if it isn't in a repo somewhere [18:14:34] hm [18:14:49] (03PS1) 10Ottomata: Fixing simple-proxy configs [operations/puppet] - 10https://gerrit.wikimedia.org/r/121119 [18:15:27] (03CR) 10Ottomata: [C: 032 V: 032] Fixing simple-proxy configs [operations/puppet] - 10https://gerrit.wikimedia.org/r/121119 (owner: 10Ottomata) [18:22:06] (03PS1) 10Ottomata: Enabling git-fat on kraken deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/121120 [18:22:33] (03CR) 10Ottomata: [C: 032 V: 032] Enabling git-fat on kraken deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/121120 (owner: 10Ottomata) [18:24:32] (03PS1) 10Odder: Crats should add users to gwtoolset group on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121122 [18:24:57] (03CR) 10Andrew Bogott: [C: 032] "As long as you've tested this (the directory creation) on labs, I can merge any time." [operations/puppet] - 10https://gerrit.wikimedia.org/r/119256 (owner: 10Hashar) [18:31:57] (03CR) 10Steinsplitter: [C: 031] "ok" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121122 (owner: 10Odder) [18:44:46] (03CR) 10Chad: "Need to make sure the cron (if it exists) is disabled on hume. Otherwise let's get this in for terbium." [operations/puppet] - 10https://gerrit.wikimedia.org/r/74591 (owner: 10Reedy) [18:50:06] greg-g: (As a volunteer) Is it ok with you if I push https://gerrit.wikimedia.org/r/121122 (trivial) [18:52:16] hoo: I'm fine. I'll +1 [18:52:35] yay :) [18:54:27] (03CR) 10Hoo man: [C: 032] Crats should add users to gwtoolset group on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121122 (owner: 10Odder) [18:54:42] ... wait, are you deploying it now? 
[18:54:48] oh, yes [18:54:55] (03CR) 10Hoo man: Crats should add users to gwtoolset group on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121122 (owner: 10Odder) [18:55:03] revoked the -2 [18:55:05] * +2 [18:55:06] (03CR) 10Reedy: "I think this is the last thing on hume atm. So it could just be decommissioned..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/74591 (owner: 10Reedy) [18:55:27] thanks, please wait until a SWAT window or such, I thought you were asking for my opinion on the matter, my apologies [18:56:25] ? [18:57:11] twkozlowski: Misunderstanding [18:57:22] greg-g is not ok with deploying it now, but I though he is :/ [18:57:45] Yeah, that patch isn't really urgent :-)) [18:58:20] right :) [18:59:50] greg-g: Ok, so can you schedule that one so that I can do it... or let someone else do it [19:01:20] hoo: tomorrow's morning SWAT window? [19:01:50] that's SF morning I guess [19:02:15] ok, that's fine with me :P [19:05:26] hoo: please add to calendar (if you haven't already), with your name next to it since you know about it [19:05:51] will do :) [19:06:33] * hoo dislikes having to log in to wikitech per hand every time -.- [19:06:44] I mean it's not SUL [19:08:22] greg-g: Done... 
let's hope I remember that [19:09:03] :) [19:09:18] hoo: LastPass or KeePassX ftw :) [19:09:27] tick the remember me box [19:09:30] that too [19:09:34] :) [19:09:51] mh, that would even work as I allow cookies for wikimedia.org :P [19:13:20] (03PS1) 10Cmjohnson: adding dns entries for new HPM misc servers include stat1003 [operations/dns] - 10https://gerrit.wikimedia.org/r/121132 [19:16:28] (03CR) 10Cmjohnson: [C: 032] adding dns entries for new HPM misc servers include stat1003 [operations/dns] - 10https://gerrit.wikimedia.org/r/121132 (owner: 10Cmjohnson) [19:19:12] (03PS1) 10Rillke: No longer force recentchangestext as content message [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121133 [19:22:26] (03CR) 10Steinsplitter: [C: 031] No longer force recentchangestext as content message [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121133 (owner: 10Rillke) [19:23:38] (03PS1) 10Cmjohnson: fixing typo and white space [operations/dns] - 10https://gerrit.wikimedia.org/r/121134 [19:24:02] (03CR) 10Cmjohnson: [C: 032] fixing typo and white space [operations/dns] - 10https://gerrit.wikimedia.org/r/121134 (owner: 10Cmjohnson) [19:25:57] (03CR) 10Steinsplitter: "can we merge this _now_, pleas? on commons the template was marked for translation." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121133 (owner: 10Rillke) [19:30:20] (03CR) 10Greg Grossmeier: "Merging in this repository implies an immediate deployment (that someone must do by hand). This can probably be done in today's SWAT deplo" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121133 (owner: 10Rillke) [19:32:36] greg-g: I can take that as well... trivial [19:33:06] tonight's SWAT?! [19:33:25] (03PS1) 10Ori.livneh: beta cluster: hhvm => true [operations/puppet] - 10https://gerrit.wikimedia.org/r/121136 [19:34:10] after copying the pages from meta... we have seen that don't work on commons... and that we need this change... 
:/ [19:36:23] (03CR) 10Chad: [C: 031] beta cluster: hhvm => true [operations/puppet] - 10https://gerrit.wikimedia.org/r/121136 (owner: 10Ori.livneh) [19:36:39] (03CR) 10Ori.livneh: [C: 032] beta cluster: hhvm => true [operations/puppet] - 10https://gerrit.wikimedia.org/r/121136 (owner: 10Ori.livneh) [19:37:03] hoo: whenever, if people review it and are ok with it and the SWATer on duty is ok with it then it can go out (pending on-wiki consensus etc) [19:37:13] (03CR) 10Ori.livneh: [V: 032] beta cluster: hhvm => true [operations/puppet] - 10https://gerrit.wikimedia.org/r/121136 (owner: 10Ori.livneh) [19:37:18] whenever == whichever SWAT window [19:37:33] don't think they have a consensus for this [19:37:45] Steinsplitter: ^ [19:37:59] then my "don't take this as approval" comment was worthwhile :) [19:38:06] greg-g: we have done the modifications on wiki, and we can't revert it. [19:38:36] Steinsplitter: ok, can you link to that and any associated discussion? [19:38:49] also, might be useful to coordinate this next time ;) [19:38:51] greg-g: the community has started to translate it, unfortunately we saw too late that it does not work... :/ [19:38:57] gotcha [19:39:04] totally understand unforeseen issue [19:39:05] s [19:39:57] I'm going to be afk for a bit (late lunch) [19:41:30] (03CR) 10Rillke: [C: 031] "I hope the crats will be responsive and not too bureaucratic." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121122 (owner: 10Odder) [19:41:36] greg-g: it is commons, we can't ask the community for every change [19:41:49] haha [19:41:52] yes. [19:42:05] I wonder how to respond to Rillke's comment [19:42:14] i'll go leave a notice at the village pump: it is broken because of the mediawiki config :P [19:42:19] i'm going to dinner now, see ya later [19:43:26] (03CR) 10Rillke: "To clarify, for translation, we'd like to use https://commons.wikimedia.org/wiki/Template:Recentchangestext in future."
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121133 (owner: 10Rillke) [19:43:56] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1721: active_shards: 5086: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 0 [19:44:56] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1725: active_shards: 5090: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [19:48:27] (03CR) 10Odder: [C: 031] No longer force recentchangestext as content message [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121133 (owner: 10Rillke) [19:54:46] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1721: active_shards: 5086: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0 [19:55:46] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1722: active_shards: 5089: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [19:55:56] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC [19:56:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [19:57:27] manybubbles: hey [19:57:35] hey, I see that [19:57:58] paravoid: uhg, I'm going to file a bug against the monitoring.... 
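(Editor's aside: a minimal sketch, not part of the channel log, showing how the "key: value" text in the ElasticSearch alerts above could be split into fields. The sample string is the elastic1008 alert from 19:43:56; the function names parse_health and red_with_nothing_unassigned are hypothetical and do not describe how the actual Icinga check works or what the bug report proposes.)

```python
def parse_health(check_output):
    """Turn the 'key: value: key: value' text after 'is running.' into a dict.

    Returns an empty dict when the marker text is not present.
    """
    _, _, fields = check_output.partition("is running. ")
    tokens = [token.strip() for token in fields.split(":")]
    pairs = zip(tokens[0::2], tokens[1::2])
    return {key: (int(value) if value.isdigit() else value) for key, value in pairs}


def red_with_nothing_unassigned(health):
    """Hypothetical predicate: status is red, yet no shard is unassigned and some are initializing."""
    return (
        health.get("status") == "red"
        and health.get("unassigned_shards") == 0
        and health.get("initializing_shards", 0) > 0
    )


# Alert text copied from the 19:43:56 PROBLEM message for elastic1008.
sample = (
    "CRITICAL - elasticsearch (production-search-eqiad) is running. "
    "status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: "
    "active_primary_shards: 1721: active_shards: 5086: relocating_shards: 0: "
    "initializing_shards: 4: unassigned_shards: 0"
)
health = parse_health(sample)
print(health["status"], health["initializing_shards"], health["unassigned_shards"])
```

Each of the alerts quoted above fits this shape: all 16 nodes present, zero unassigned shards, a handful initializing, and a recovery one minute later.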
[19:58:08] sok [19:59:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=63137 [20:01:56] Eloquence: are you there? they need community consensus to repair the watchlist o_O? can that be right? [20:02:11] um, I meant RC [20:03:24] Steinsplitter: I'm pretty sure we can get this done tonight [20:03:31] let's wait for Greg to retrun [20:03:33] * return [20:04:06] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:33] sure. when i wake up tomorrow, i don't want to find ten PMs on my talk page asking "why have you deleted the RCtext" :P [20:07:06] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 299375 bytes in 7.695 second response time [20:12:56] PROBLEM - Puppet freshness on tantalum is CRITICAL: Last successful Puppet run was Wed 26 Mar 2014 05:12:06 PM UTC [20:13:00] !log It seems Zuul is held up by something. Loads of jobs are 'queued', yet Jenkins is operating fine with all executors idling, doing nothing. [20:13:05] Logged the message, Master [20:15:25] (03CR) 10Spage: [C: 032] "Deploying now" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120170 (owner: 10Spage) [20:15:55] spagewmf: wooo! :) [20:16:02] release it to the world [20:16:04] :) [20:16:06] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:16] indeed [20:20:31] greg-g : zuul has a big queue in gate-and-submit. Krinkle : I tried clicking the ↑ to move my oh-so-important job to the top, and nothing happened :) [20:20:43] :/ [20:20:48] hashar: ^ [20:22:17] greg-g: FYI: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=107508&oldid=107499 if it's not ok with you, feel free to revert...
[20:22:25] spagewmf: That is not a button [20:22:35] and explain it to the commons people :P [20:22:38] greg-g: I'm inclined to merge the change and find out if it breaks rather than wait for "operations-mw-config-tests" [20:23:00] spagewmf: looking [20:23:11] spagewmf: Zuul / Jenkins / Gerrit went wild a few hours ago [20:23:29] !log Zuul / Jenkins overloaded for some reason again. Investigating. [20:23:30] hashar: This happens all the time, Zuul has dozens of things it says are 'queued' in Jenkins, and when looking in Jenkins, everything is idle. [20:23:35] Logged the message, Master [20:23:42] Krinkle: nice [20:23:56] * hashar blames jenkins [20:24:42] Krinkle: the :) denotes humor! Though a queue-jumping facility could be useful [20:25:26] spagewmf: Well, the jobs are supposed to run in parallel. mw-config doesn't depend on anything, there is no reason for it to not be "on top" already. Nothing is keeping it from running right now. [20:25:52] Except something is, and in that case, such a facility would be broken as well (namely, Jenkins overloaded) [20:25:59] thx [20:26:25] I'm really looking forward to ditching Jenkins OMG [20:27:06] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 294333 bytes in 8.135 second response time [20:27:40] spagewmf: previous versions were checked right? either way, yeah, don't let it block you too long [20:28:51] I hear Hudson is an awesome Jenkins replacement from Oracle/Apache [20:29:15] spagewmf: :P [20:30:32] Krinkle: it fixed itself magically :( [20:31:32] hmm maybe not [20:37:03] hashar: you (and aude) have changes in operations/mediawiki-config.git that aren't on tin. AIUI, "if more than the changes you're making show up, STOP and find out why and make sure they're ok to merge." [20:39:18] spagewmf: most probably beta related [20:39:23] let me look [20:39:54] spagewmf: the wikidata one I have no clue what it is [20:39:57] hashar: yes, yours are.
aude's is in CommonSettings.php, which I'm not touching [20:40:05] spagewmf: Link? [20:40:19] hoo (hey hi!) https://gerrit.wikimedia.org/r/#/c/121086/ [20:40:26] hoo: http://paste.openstack.org/show/74392/ [20:41:08] !log hashar updated /a/common to {{Gerrit|I186e3ade7}}: Update Wikidata OAuth grants [20:41:14] Logged the message, Master [20:41:32] spagewmf: I have merged my changes, advancing local master on tin.eqiad.wmnet to the commit before aude [20:41:51] spagewmf: I usually rebase on tin when merging changes to beta,but sometime forget about it [20:42:00] ok, the Wikidata change is fine :P [20:42:11] hashar thx. Should I wait any longer for zuul to "operations-mw-config-tests queued" on gerrit 120170 ? [20:42:26] !log I have done that change. I merged fast forward to the commit before {{Gerrit|I186e3ade7}}: Update Wikidata OAuth grants [20:42:33] Logged the message, Master [20:42:40] !log I have NOT done that change. [20:42:43] :-( [20:42:45] Logged the message, Master [20:43:44] !log out of despair, restarting Zuul [20:43:49] Logged the message, Master [20:44:12] hoo "fine" as in you want me to sync out that CommonSettings.php change? it's confusing to have changes in/a/common on tin that aren't on the cluster. [20:45:14] spagewmf: I know that ... should be fine to sync out [20:45:22] not sure why manybubbles approved that change [20:45:38] hoo: what'd I do? [20:46:04] manybubbles: You approved https://gerrit.wikimedia.org/r/121086 apparently w/o syncing it [20:46:37] │11:32:01 +logmsgbot | !log manybubbles synchronized wmf-config/CommonSettings.php 'SWAT deploy for Wikidata' │ bd808 [20:46:44] I think I synced it [20:46:50] maybe I didn't? [20:47:02] mh. maybe you forgot to git pull? [20:47:07] might be [20:47:17] should be safe to sync that, though. especially with you here [20:48:10] greg-g: ^ OK by you? [20:49:15] (03CR) 10Spage: [V: 032] "Zuul stalled not processing "operations-mw-config-tests" so forcing it." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120170 (owner: 10Spage) [20:49:55] other option is to back it out [20:50:14] which is also ok with me, it's not urgent [20:50:23] I can sync it myself in the SWAT later today if needed [20:50:29] but that's of course more work for both of us [20:51:18] so Zuul is waiting for Jenkins to complete builds. But apparently the build never got triggered :-/ [20:51:31] hoo greg-g not around but doing it now seems fine. So I'll sync it now. [20:51:39] :) [20:52:01] !log replacing virt0 LDAP's certificate (trouble ahead) [20:52:12] oauth things? I guess [20:52:35] * greg-g didn't see it until just now [20:52:36] greg-g: Yep [20:52:40] * greg-g doesn't care too mcuh if it's not an issue [20:54:31] !log spage synchronized wmf-config/CommonSettings.php 'Update Wikidata OAuth grants (for hoo/aude , early SWAT)' [20:55:15] hoo , aude test away [20:55:38] !log restarted Zuul [20:55:40] test away? [20:56:39] !log spage synchronized wmf-config/InitialiseSettings.php 'Enable Popups (Hovercards) everywhere' [20:56:57] jorm: ping [20:57:00] hoo I mean can you test that the OAuth grants config change didn't break anything [20:57:33] frankly, I can't [20:59:06] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 8] _ssl.c:504: EOF occurred in violation of protocol [20:59:55] (03PS1) 10BBlack: lvs300[1234] dns for esams private subnet [operations/dns] - 10https://gerrit.wikimedia.org/r/121235 [21:00:06] (03PS1) 10BBlack: private1-esams dhcp/preseed stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/121236 [21:00:51] !log spage synchronized wmf-config/InitialiseSettings-labs.php 'Enable Popups (Hovercards) everywhere' [21:00:54] away to get food, I don't expect anything bad to happen OAuth on Wikidata looks ok as far as I can tell [21:08:10] spagewmf: hoo|away greg-g these were done in swat [21:08:47] (03CR) 10Mark Bergsma: [C: 031] lvs300[1234] dns for esams private subnet [operations/dns] - 
10https://gerrit.wikimedia.org/r/121235 (owner: 10BBlack) [21:08:56] aude the OAuth change wasn't on tin /a/common so unless it was pushed from another host it wasn't [21:09:54] 15:32 <+logmsgbot> !log manybubbles synchronized wmf-config/CommonSettings.php 'SWAT deploy for Wikidata' [21:10:07] no idea [21:10:19] I must have screwed it up [21:10:38] I probably git diffed it to check where we were but didn't git rebase to get it [21:10:41] must be ok now [21:10:47] preilly: pong [21:11:13] jorm: may I PM [21:11:19] of course. [21:13:56] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1730: active_shards: 5113: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0 [21:13:59] bleh [21:14:04] sok [21:14:56] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1731: active_shards: 5116: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:15:49] <^d> manybubbles: Panic! At the cluster [21:15:55] panic! 
[21:16:02] I filed a bug for this [21:16:09] <^d> I saw [21:16:10] fire in the disco [21:16:16] it's the monitoring freaking out because we're rebuilding an index [21:16:50] !log finished reindexing all cirrus wikis [21:16:54] yay, no more of that [21:17:28] (CR) Mark Bergsma: [C: 1] private1-esams dhcp/preseed stuff [operations/puppet] - https://gerrit.wikimedia.org/r/121236 (owner: BBlack) [21:18:35] * ^d just wants his change to merge so beta search looks nicer again [21:24:06] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [21:24:06] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 8] _ssl.c:504: EOF occurred in violation of protocol [21:24:25] that would be me [21:25:06] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.035 second response time on port 389 [21:27:19] stupid java [21:31:19] paravoid: Are you fixing the cert problem? Sorry I didn't respond before :/ [21:31:29] I am trying to [21:31:45] ok, thank you. [21:32:06] There's a bug with pdns such that it needs to be restarted anytime opendj restarts. please nudge me if/when that needs doing. 
[21:32:26] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 8] _ssl.c:504: EOF occurred in violation of protocol [21:32:44] (PS1) Ottomata: Fixing git-deploy git-fat support [operations/puppet] - https://gerrit.wikimedia.org/r/121248 [21:33:05] (PS2) Rush: breaking out parsoid admin role [operations/puppet] - https://gerrit.wikimedia.org/r/121109 [21:33:07] (PS1) Rush: old 'mortals' now 'deployment' role per comment in admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/121249 [21:40:04] (CR) BryanDavis: [C: 1] Fixing git-deploy git-fat support [operations/puppet] - https://gerrit.wikimedia.org/r/121248 (owner: Ottomata) [21:40:20] (CR) Ottomata: [C: 2 V: 2] Fixing git-deploy git-fat support [operations/puppet] - https://gerrit.wikimedia.org/r/121248 (owner: Ottomata) [21:40:53] stream of 2014-03-26 21:40:17 mw1197 enwiki: [659b7cd8] /w/api.php?hidebots=1&days=7&limit=50&hidewikidata=1&action=feedrecentchanges&feedformat=atom Exception from line 2008 of /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiBase.php: Internal error in ApiFormatFeedWrapper::execute: Invalid feed class/item [21:41:00] in exception.log, known issue? [21:43:06] don't think so [21:43:18] class/item [21:43:19] mh [21:43:52] (PS1) Ottomata: Building new version 0.1.1 [operations/debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/121252 [21:50:33] I filed bug 63150 "Invalid feed class/item" exceptions from ApiFormatFeedWrapper. 
If I visit that URL I get a nice-looking atom feed, so maybe not serious [21:51:17] works for me locally (with Wikibase) so doesn't seem serious [21:52:26] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed [21:56:26] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed [22:03:22] !log stopping puppet agent on palladium for just a sec, trying to figure out why git-fat is not running properly [22:12:09] ottomata: have you tried the more flattering git-fullfigured [22:12:30] RobH I have not! [22:12:38] maybe I should go with git-bigboned [22:12:42] if you don't talk nice to the servers, why should they do what you want? [22:12:51] yeah true [22:13:06] that's why windows machines on FAT never worked well :p [22:13:41] yes, that's why windows machines never work well... [22:39:45] I can't login at wikitech.wikimedia.org... I tried to change my password but when I log in with temporary password I get "There was either an authentication database error or you are not allowed to update your external account." [22:39:52] anyone can help me with this? [22:41:05] ottomata: ^ [22:42:51] i am having the same problem as jgonera ^^ [22:43:16] same problem [22:49:01] MaxSem: manybubbles mobile web is hoping to get https://gerrit.wikimedia.org/r/#/c/121241/ swat deployed today, though we're having authentication problems to wikitech to add it to the calendar. can we still get it out during the window? 
[22:50:56] Krenair says he also has the same issue but is banned from the channel and can't write it himself ;) [22:51:13] hrmm [22:51:20] my logged in wikitech is fine [22:51:30] but i cannot login to a new session in another browser [22:51:42] awjr, this PM is not my window [22:51:52] neither it is manybubbles's [22:51:58] apparently some folks have problem with sudo on labs instances also [22:52:04] MaxSem: you and manybubbles are on the calendar [22:52:10] maybe unrelated (maybe same issue as was on tool labs) [22:52:12] I'm on AM [22:52:20] I can deploy tomorrow [22:52:20] oh damn it UTC/PDT [22:52:23] hrmm, if its labs as well [22:52:24] aude: look at 23:00 UTC :) [22:52:33] er awjr [22:52:33] ok [22:52:35] MaxSem: i failed at timezones again [22:52:35] someone is working on it! [22:53:02] sorry MaxSem manybubbles, ignore me :) [22:53:14] greg got us on the calendar anyway [22:53:20] aude is not in the sudoers file. This incident will be reported. [22:53:24] :) [22:53:27] :p [22:54:26] Thank you paravoid [22:54:52] your irc client was stuck and you were flapping [22:55:02] joining, excess flood, joining again, like a hundred times [22:55:04] hence the ban [22:55:15] yeah [22:55:22] happened in a few other channels too [22:55:34] aude, obligatory xkcd: http://www.xkcd.com/838/ :P [22:55:42] have since changed some settings in my bouncer so that shouldn't happen again... 
hopefully [22:55:48] heh [22:55:49] hehehe MaxSem [22:56:24] So folks are aware of the issue with sudo in labs and wikitech logins [22:56:36] it is being actively worked/discussed [22:56:47] (i don't have more info yet, will update when i do) [22:56:53] RobH: thanks [22:56:56] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Last successful Puppet run was Fri 21 Mar 2014 01:17:26 AM UTC [22:58:04] thanks RobH [22:58:34] figured very little info is better than none ;] [22:59:00] has to do with certificates in use in the ldap services [22:59:33] and our more recent migration to cease use of wildcard certificates outside of the varnish cluster [23:00:15] (PS1) Faidon Liambotis: Update ldapconfig CA to RapidSSL [operations/puppet] - https://gerrit.wikimedia.org/r/121273 [23:04:32] (CR) Faidon Liambotis: [C: 2] Update ldapconfig CA to RapidSSL [operations/puppet] - https://gerrit.wikimedia.org/r/121273 (owner: Faidon Liambotis) [23:04:34] greg-g: ping [23:05:25] Can I go for https://gerrit.wikimedia.org/r/121133 ? [23:06:06] hey [23:06:13] I'm here for SWAT stuff, give me a sec to catch up [23:06:42] ori: If you prefer you can also do https://gerrit.wikimedia.org/r/121133 :P [23:06:52] ori, also https://gerrit.wikimedia.org/r/#/c/121241/ please [23:07:16] hoo: I'd prefer one of the SWAT team members do all SWAT deploys [23:07:24] could you guys add them to https://wikitech.wikimedia.org/wiki/Deployments , per https://wikitech.wikimedia.org/wiki/SWAT_deploys#Guidelines ? [23:07:25] heh, less work [23:07:35] ori: Already done [23:07:37] ori: wikitech login is broken right now :/ [23:07:46] blame ldap [23:09:27] it's partially back [23:09:37] but missing some flags for users, still being worked on [23:10:12] (CR) Hoo man: "What you can do is set up a Varnish instance locally /on Vagrant / ... 
and then manually add your things (that should be easier to get rig" [operations/puppet] - https://gerrit.wikimedia.org/r/120617 (owner: Gilles) [23:12:10] (CR) Hoo man: "I don't agree that the retaping done here is counter productive to the readability of the diff as I changed many lines anyway (and the old" [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man) [23:12:21] * hoo wonders whether retaping is a word :P [23:13:52] OK, so which patch(es) are you looking to have deployed again? [23:13:52] Wikitech and labs sudo stuff should now be working (is my understanding) [23:13:56] PROBLEM - Puppet freshness on tantalum is CRITICAL: Last successful Puppet run was Wed 26 Mar 2014 05:12:06 PM UTC [23:14:00] you dithered a bit earlier [23:14:11] oh, you updated the wiki [23:14:20] commons people really want https://gerrit.wikimedia.org/r/121133 [23:14:40] trivial one and they already implemented all the stuff on wiki [23:15:16] RobH: not yet but maybe puppet has to run [23:15:28] re: sudo [23:15:33] ahh, prolly. [23:15:46] the ldap cert break is fixed so wikitech logins work, but yea. [23:15:54] aude: https://gerrit.wikimedia.org/r/120535 :P [23:15:56] still getting more info [23:16:11] * hoo wants a |1| [23:16:27] heh, so yea, not entirely fixed [23:16:29] :) [23:16:49] hoo: need to verify the script, even though we ran it on wikidata / test wikidata recently [23:17:37] Feel free to do so... I already did that but more eyes are always good... you might want to indicate that on the change, though [23:20:04] greg-g: what are your thoughts wrt to https://gerrit.wikimedia.org/r/#/c/121133/ ? [23:20:06] good to go? [23:20:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_]. [23:21:02] MaxSem: can you cherry-pick https://gerrit.wikimedia.org/r/#/c/121241/ and update the submodule? 
[23:21:13] sure, 1 sec [23:22:55] ori: honestly, I don't know :) I asked ^d but he wasn't sure either [23:22:58] (PS2) Ori.livneh: No longer force recentchangestext as content message [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121133 (owner: Rillke) [23:23:04] (CR) Ori.livneh: [C: 2] "The people involved with this patch are credible representatives of community consensus, so I am going to merge this, with the proviso tha" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121133 (owner: Rillke) [23:23:14] greg-g: see my review comment [23:23:25] good enough for me [23:23:47] (Merged) jenkins-bot: No longer force recentchangestext as content message [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121133 (owner: Rillke) [23:24:12] so yea, labs auth issues are still totally happening. [23:24:14] !log ori updated /a/common to {{Gerrit|I5b565f47b}}: No longer force recentchangestext as content message [23:24:20] Logged the message, Master [23:24:25] folks know in ops, and are working on it [23:24:33] (just generic channel update ;) [23:24:51] ori: Oh btw... any idea why that git hook only fires sometimes? [23:24:53] !log ori synchronized wmf-config/InitialiseSettings.php 'I5b565f47b: No longer force recentchangestext as content message' [23:24:59] Logged the message, Master [23:25:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_]. [23:25:08] I looked into it very briefly but couldn't see a reason [23:25:20] hoo: it has to be very careful not to disclose security patches by accident [23:25:30] Thx for merging [23:25:35] that's why it of doesn't fire at all? 
[23:25:41] It never ever fired for me :P [23:25:49] hoo: i'd need to look more closely, let me focus on the deployments atm [23:25:53] Steinsplitter: yw [23:25:57] Sure [23:28:16] PROBLEM - Certificate expiration on virt1000 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed [23:28:42] !log swapping virt1000 LDAP certificate [23:28:44] (PS2) Ori.livneh: Crats should add users to gwtoolset group on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121122 (owner: Odder) [23:28:47] Logged the message, Master [23:29:03] hoo: you added https://gerrit.wikimedia.org/r/#/c/121122/ to the earlier swat window [23:29:06] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:29:20] ori: Earlier? [23:29:32] That one is not urgent so I scheduled it for tomorrow [23:29:39] at least that's what I planned to do [23:29:48] * ori fails timezones [23:29:56] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [23:30:02] you can also do it now, I don't care [23:30:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_]. [23:30:07] might as well [23:30:22] is someone looking at silicon? i'd like to rule out it being at all related [23:30:25] i don't even know what it does [23:30:28] * ori looks at ganglia [23:30:48] "fundraising eqiad" [23:31:03] ori, https://gerrit.wikimedia.org/r/121278 [23:31:28] Jeff_Green, mwalker, K4-713, et al -- are you aware of the silicon alerts? ^ [23:31:59] oh that's fun [23:31:59] ori: We are now. :/ [23:32:03] yes. RAID [23:32:15] failed disk [23:32:29] redundant array of fail [23:32:58] Jeff_Green: I assume it's not a reason to hold back SWAT deployments (MobileFrontend update, etc.), right? 
[23:33:05] so going to continue unless you tell me not to [23:33:12] nope, go ahead [23:33:15] thanks [23:33:32] Jeff_Green: Should we be stopping queue jobs and banners and things, do you think? [23:34:10] I think it's ok [23:34:42] i mean, the RAID system is doing what it's supposed to at least, we're just not redundant for the moment [23:35:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_]. [23:36:03] Jeff_Green: Did the other drive come from the same batch? :p [23:36:17] might wanna ack that alert so we aren't all spammed :) [23:36:54] K4-713: probably [23:37:01] eep [23:37:09] I'll just be... crossing my fingers. [23:37:13] (CR) Ori.livneh: [C: 2] "Caveat: I did not read the discussion in full. I trust Hoo, Odder, Steinsplitter and Rillke to be good representatives of on-wiki consensu" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121122 (owner: Odder) [23:37:21] (Merged) jenkins-bot: Crats should add users to gwtoolset group on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/121122 (owner: Odder) [23:37:38] K4-713: imo it doesn't make a difference [23:37:46] !log ori updated /a/common to {{Gerrit|I310a42c51}}: Crats should add users to gwtoolset group on Commons [23:37:51] Logged the message, Master [23:38:02] yay icinga is full of fail too [23:38:22] !log ori synchronized wmf-config/InitialiseSettings.php 'I310a42c51: Crats should add users to gwtoolset group on Commons' [23:38:22] i get an internal server error at https://icinga-admin.wikimedia.org/icinga [23:38:27] Logged the message, Master [23:39:19] Jeff_Green: I would have phrased it thus: "PROBLEM - https://icinga-admin.wikimedia.org/icinga is CRITICAL: CRITICAL Internal server error" ;) [23:39:57] ori: i double-dare you to SMS that to the entire ops list [23:40:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. 
md3 status=[U_]. [23:40:18] * ori chickens out [23:41:38] !log ori synchronized php-1.23wmf19/extensions/MobileFrontend/includes/specials/SpecialMobileWatchlist.php 'If18397782: Fix the watchlist header' [23:41:44] Logged the message, Master [23:41:48] MaxSem: ^ [23:41:55] jdlrobson, ^ [23:42:00] thanks ori [23:42:29] thanks ori and MaxSem [23:42:40] np! [23:42:47] I think we're done, right? [23:43:01] yup looks good to me! [23:44:24] so icinga is barfing 500s and not logging any errors. wtf [23:45:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_]. [23:50:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_]. [23:54:21] (PS1) Faidon Liambotis: Revert "Update ldapconfig CA to RapidSSL" [operations/puppet] - https://gerrit.wikimedia.org/r/121280 [23:54:52] (CR) Faidon Liambotis: [C: 2] Revert "Update ldapconfig CA to RapidSSL" [operations/puppet] - https://gerrit.wikimedia.org/r/121280 (owner: Faidon Liambotis) [23:54:58] (CR) Faidon Liambotis: [V: 2] Revert "Update ldapconfig CA to RapidSSL" [operations/puppet] - https://gerrit.wikimedia.org/r/121280 (owner: Faidon Liambotis) [23:55:06] PROBLEM - check_raid on silicon is CRITICAL: CRITICAL md0 status=[UU]. md1 status=[UU]. md2 status=[UU]. md3 status=[U_].