[01:50:39] PROBLEM - MySQL Recent Restart on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:40] PROBLEM - MySQL Processlist on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:50] PROBLEM - DPKG on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:59] PROBLEM - check if dhclient is running on db1046 is CRITICAL: Timeout while attempting connection
[01:51:00] PROBLEM - RAID on db1046 is CRITICAL: Timeout while attempting connection
[01:51:00] PROBLEM - SSH on db1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:51:01] PROBLEM - MySQL disk space on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:01] PROBLEM - puppet disabled on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:01] PROBLEM - Disk space on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:02] PROBLEM - MySQL Idle Transactions on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:02] PROBLEM - check configured eth on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:02] PROBLEM - mysqld processes on db1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:52:19] * springle pokes db1046
[02:13:31] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3790 MB (3% inode=99%):
[02:20:42] !log LocalisationUpdate completed (1.24wmf1) at 2014-04-28 02:20:39+00:00
[02:20:50] Logged the message, Master
[02:21:30] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3434 MB (3% inode=99%):
[02:21:48] D'oh!
[02:22:21] !log powercycle db1046 unresponsive
[02:22:28] Logged the message, Master
[02:22:53] Ah, I was wondering why I couldn't connect to the mgmt interface.
[02:23:10] Simple answer: because you were.
:-)
[02:23:22] heh
[02:23:30] RECOVERY - Disk space on virt0 is OK: DISK OK
[02:30:07] !log LocalisationUpdate completed (1.24wmf2) at 2014-04-28 02:30:04+00:00
[02:30:14] Logged the message, Master
[03:00:04] !log starting online schema change, bug 64411, page_props.pp_sortkey
[03:00:11] Logged the message, Master
[03:12:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 28 03:12:11 UTC 2014 (duration 12m 10s)
[03:12:22] Logged the message, Master
[03:23:39] springle: I'm around, by the way, if there's anything you need
[03:27:56] ori: cool thanks :) I noticed on email list you said you can cross reference db1048 uuid's with log files, which would be wise. Fine to begin that against db1048 if/when you like.
[03:28:24] it won't affect me fighting with db1046 and db1047 repl
[03:28:46] cool, will do. i'll need to modify a script so it'll take me a few to get started; will let you know once i have the results.
[03:29:53] excellent
[03:30:23] i'm still a little worried about the duplicate ids
[03:30:53] springle: which duplicate ids?
[03:31:01] i ran joins on uuid for all tables and found no gaps, but i wonder if any new ids from the new consumer were overwritten
[03:31:57] springle: you'll have to slow down for me a little :) how would you look for gaps?
[03:32:50] ori: recall I mentioned in email that running the duplicate consumers in parallel for a while meant some auto-inc ids were used on db1048 as well as in the db1047 dump? theoretically shouldn't be an issue, but would be nice to verify from external uuid's
[03:34:08] ori: looking for gaps meant LEFT JOIN .. WHERE .. IS NULL on old and new consumer data
[03:36:00] ok, that makes sense.
[03:36:44] do you use the auto-inc primary key for anything?
[03:36:50] (just wondering)
[03:37:14] no, it's not significant (and not part of the event record). it's an artifact of storing in the database.
i would have used the uuid as the primary key but asher advised against it iirc
[03:37:25] yeah, for index size
[03:37:42] well, and write overhead i guess
[03:38:13] uuid == PK would help for any future INSERT IGNORE migrations though :)
[03:38:27] well, there's a unique key on uuid, it's just not the primary key
[03:39:02] it isn't unique
[03:39:03] hm, is it not unique?
[03:39:06] not formally
[03:39:08] yes, i just noticed that
[03:39:13] normal key, and nullable
[03:39:43] that's odd. i could have sworn.. let me look at the source code again, it's been a while
[03:39:44] hence my last email to you :) i assumed it was properly unique only based uuid-ness
[03:40:34] # Every table gets an integer auto-increment primary key column `id`
[03:40:35] # and an indexed CHAR(32) column, `uuid`. (UUIDs could be stored as
[03:40:35] # binary in a CHAR(16) column, but at the cost of readability.)
[03:40:36] columns = [
[03:40:39] sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
[03:40:40] # To keep INSERTs fast, the index on `uuid` is not unique.
[03:40:42] sqlalchemy.Column('uuid', sqlalchemy.CHAR(32), index=True)
[03:40:44] ]
[03:40:51] :D
[03:40:55] speeeeed!
[03:41:22] which actually I wouldn't be too concerned about knowing the relatively small consumer write load
[03:41:42] is that actually legit, or is 2013 ori full of shit?
[03:41:52] * ori doesn't trust older self
[03:42:25] i would have probably gone with CHAR(16) in hindsight; UUIDs are many things but readable is not one
[03:42:54] unique keys are a little slower to update in some cases
[03:43:15] not crazy. but propabbly also not a big deal in this case
[03:43:51] CHAR(16) vs CHAR(32) is a legit consideration
[03:44:08] i may quote you in the file header
[03:44:14] :)
[03:44:19] "not crazy." -- sean pringle, 27-April-2014
[03:50:12] actually, make them BINARY(16).
table charset is utf8
[03:52:22] ori: pros and cons http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
[03:54:15] imo the best combination here would be binary(16) uuid as primary key and lose the auto-inc id. this would still have some drawbacks, but the overhead of a single 16-byte binary PK vs auto-inc + secondary index, with just a single consumer, is lesser
[03:56:17] yes, that seems right
[03:57:38] and if readability is a concern the uuid could be duplicated and stored as a redundant but unindexed char(32) column
[03:58:00] or binary
[03:58:09] yep
[04:00:22] it's compelling -- not for whatever performance improvement it would confer, but because it would make QAing the data easier
[04:03:36] i think i'm going to write up the considerations in a TODO file or a bugzilla bug, and then CC analytics. i'm a bit reluctant to mess around with the data model
[04:06:36] yep. plus any performance concern is also very storage engine specific, and we're talking innodb mostly. if it was tokudb or aria engine, bottlenecks will be elsewhere
[04:06:53] losing auto-inc happens to make it more portable too, if that matters
[04:08:30] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 334 seconds
[04:08:32] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 334 seconds
[04:16:30] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 92 seconds
[04:16:32] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 90 seconds
[04:34:57] (CR) KartikMistry: "We have two instances maintained by team that are using browsertests." [operations/puppet] - https://gerrit.wikimedia.org/r/129687 (owner: Hashar)
[06:47:33] apergos: morning
[06:48:05] would an icinga check on all hosts for cpu/memory usage be useful ?
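For reference, the gap check springle describes earlier (LEFT JOIN .. WHERE .. IS NULL on old and new consumer data, keyed on uuid) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual EventLogging schema or the queries that were run: the table names `old_consumer` and `new_consumer`, the helpers `to_bin`, `find_gaps` and `build_demo`, and the in-memory SQLite database are all hypothetical, chosen so the example runs standalone. The UUID is stored as 16 raw bytes, along the lines of the BINARY(16) primary key suggested in the discussion.

```python
import sqlite3
import uuid


def to_bin(hex_uuid: str) -> bytes:
    """Pack a 32-character hex UUID (the CHAR(32) form) into 16 raw bytes."""
    return uuid.UUID(hex=hex_uuid).bytes


def find_gaps(conn, old_table, new_table):
    """Return hex UUIDs present in old_table but missing from new_table.

    This is the LEFT JOIN .. WHERE .. IS NULL anti-join. The table names
    are hypothetical and interpolated directly into the SQL, so only pass
    trusted identifiers.
    """
    query = (
        f"SELECT o.uuid FROM {old_table} AS o "
        f"LEFT JOIN {new_table} AS n ON n.uuid = o.uuid "
        "WHERE n.uuid IS NULL"
    )
    return sorted(uuid.UUID(bytes=row[0]).hex for row in conn.execute(query))


def build_demo():
    """Create two demo tables where the new consumer missed one event."""
    conn = sqlite3.connect(":memory:")
    for table in ("old_consumer", "new_consumer"):
        # 16-byte UUID as the primary key, no auto-increment id (assumption).
        conn.execute(f"CREATE TABLE {table} (uuid BLOB PRIMARY KEY, payload TEXT)")
    events = [uuid.uuid4().hex for _ in range(5)]
    conn.executemany(
        "INSERT INTO old_consumer VALUES (?, ?)",
        [(to_bin(e), "event") for e in events],
    )
    conn.executemany(
        "INSERT INTO new_consumer VALUES (?, ?)",
        [(to_bin(e), "event") for e in events[:-1]],
    )
    return conn, events


if __name__ == "__main__":
    demo_conn, demo_events = build_demo()
    print("missing from new consumer:", find_gaps(demo_conn, "old_consumer", "new_consumer"))
```

Running the check in both directions (old against new, then new against old) would show rows missing on either side; an empty result both ways is the "no gaps" outcome mentioned in the log.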
[07:04:27] (PS1) Nemo bis: Use an actually generic address as $wmgNotificationSender default [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130037 (https://bugzilla.wikimedia.org/58261)
[07:06:01] (CR) Nemo bis: "Followup: I214b0f5fd82d357a32849bee2e072f33577f8ef6" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/59717 (https://bugzilla.wikimedia.org/46670) (owner: Lwelling)
[07:06:51] (CR) Legoktm: [C: 1] Use an actually generic address as $wmgNotificationSender default [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130037 (https://bugzilla.wikimedia.org/58261) (owner: Nemo bis)
[07:29:55] (PS1) Dzahn: disable Siebrand's shell account [operations/puppet] - https://gerrit.wikimedia.org/r/130038
[07:31:56] (CR) Dzahn: [C: 2] disable Siebrand's shell account [operations/puppet] - https://gerrit.wikimedia.org/r/130038 (owner: Dzahn)
[07:32:28] morning mutante
[07:34:13] <_joe_> hi matanya
[07:34:22] hi _joe_ :)
[07:34:49] <_joe_> matanya: http://puppet-transition-helper.wmflabs.org/html/ (today it will improve, still here are reasonable results
[07:34:54] <_joe_> matanya: bottom line
[07:35:08] <_joe_> matanya: a lot of templates to fix, and just a few other things
[07:36:04] * matanya is looking
[07:36:25] 500 Internal Server Error
[07:36:57] _joe_: i guess that needs fixing too :P
[07:37:12] <_joe_> matanya: that's a 404
[07:37:26] <_joe_> but that is because we need to fix the private/labs repo
[07:37:32] <_joe_> I'll do that today
[07:37:45] k, thanks
[07:41:48] _joe_: to verify the output of the tool: what is the content of /etc/icinga/puppet_hosts.cfg on neon?
[07:44:17] <_joe_> matanya: icinga results on 2.7 are wrong for sure
[07:44:32] <_joe_> matanya: I had to disable --storedconfigs on 2.7 or it would fail
[07:45:07] <_joe_> so we have no external resources collected in puppet 2.7
[07:45:25] just wondering :) can you check please ?
[07:47:55] (PS8) Giuseppe Lavagetto: Substituting the check_graphite script. [operations/puppet] - https://gerrit.wikimedia.org/r/125726
[07:48:35] <_joe_> matanya: yes gimme 5 minutes
[07:49:14] <_joe_> matanya: what should I check in particular?
[07:49:39] the content of the file. should be : /usr/local/bin/naggen', '--stdout', '--type', 'hostextinfo
[07:50:01] <_joe_> no.
[07:50:10] <_joe_> look at http://puppet-transition-helper.wmflabs.org/compiled/puppet_catalogs_2.7/neon.wikimedia.org.warnings
[07:50:26] <_joe_> the first line is the error, we don't have naggen
[07:50:35] <_joe_> that's why compilation fails
[07:50:42] <_joe_> ok, another dependency
[07:51:04] so we should add naggen?
[07:51:09] <_joe_> yes
[07:51:35] to your tool, or to our puppet repo? who has the missing deps?
[07:52:30] <_joe_> to my 'tool'
[07:52:42] <_joe_> I just guess naggen is not in $PATH
[07:52:48] <_joe_> I'll check in a few
[07:53:18] modules/puppetmaster/files/naggen
[07:53:36] (CR) Giuseppe Lavagetto: [C: 2] "Let's merge (and fix anything later)." [operations/puppet] - https://gerrit.wikimedia.org/r/125726 (owner: Giuseppe Lavagetto)
[07:56:55] (PS1) Dzahn: disable account 'akhanna' [operations/puppet] - https://gerrit.wikimedia.org/r/130039
[07:57:29] morning hashar do you need help with https://bugzilla.wikimedia.org/show_bug.cgi?id=63934 ?
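_joe_'s guess in the exchange above is that catalog compilation fails because naggen is not in $PATH for the tool. A quick way to check for that kind of missing dependency could look like the sketch below. It is written under assumptions: `missing_commands` is a hypothetical helper, and the command names passed to it are examples only, not the tool's real dependency list.

```python
import os
import shutil


def missing_commands(commands, path=None):
    """Return the commands that cannot be found as executables on PATH.

    `path` defaults to the current process's PATH; pass an explicit
    os.pathsep-separated string to test a different environment.
    """
    if path is None:
        path = os.environ.get("PATH", os.defpath)
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


if __name__ == "__main__":
    # "naggen" is the dependency named in the log; "puppet" is just an example.
    for cmd in missing_commands(["naggen", "puppet"]):
        print(f"missing from PATH: {cmd}")
```

Passing an explicit `path` makes it possible to check the environment the compiler actually runs with, which may differ from an interactive shell.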
[07:58:03] matanya: I gotta figure out a solution for a the dependent tickets :}
[07:58:12] hold on brb
[08:00:07] (CR) Dzahn: [C: 2] disable account 'akhanna' [operations/puppet] - https://gerrit.wikimedia.org/r/130039 (owner: Dzahn)
[08:05:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: usage: check_graphite [-h] [-U URL] [-T TIMEOUT]
[08:06:37] _joe_: ^
[08:08:08] <_joe_> matanya: yes I know
[08:09:42] <_joe_> matanya: wait for the next puppet run at least, the file on disk has changed but the checkcommands def has not been reloaded
[08:11:08] <_joe_> matanya: if that does not recover in ~ 10 minutes, then I'll worry about it
[08:11:14] (CR) Hashar: "@KartikMistry Yup that is what this patch is about. Your two instances were unreachable from the Jenkins slave so I added the iptables ru" [operations/puppet] - https://gerrit.wikimedia.org/r/129687 (owner: Hashar)
[08:11:18] ok, sure
[08:24:11] (PS1) Matanya: appserver: no more hardy boxes [operations/puppet] - https://gerrit.wikimedia.org/r/130042
[08:44:33] (PS3) Dzahn: rm old wikibugs - replaced by pywikibugs [operations/puppet] - https://gerrit.wikimedia.org/r/129694
[08:45:33] (CR) Dzahn: [C: 2] rm old wikibugs - replaced by pywikibugs [operations/puppet] - https://gerrit.wikimedia.org/r/129694 (owner: Dzahn)
[08:49:35] !log reloading db1046 from fresh m2 dump
[08:49:42] Logged the message, Master
[08:51:33] (CR) Dzahn: "what is the motivation? are all ferm rules supposed to be in roles? is that considered config?"
[operations/puppet] - https://gerrit.wikimedia.org/r/129965 (owner: Matanya)
[08:54:03] (PS1) Odder: Add Library of Congress to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130043 (https://bugzilla.wikimedia.org/64487)
[08:55:01] (PS2) Odder: Add Library of Congress to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130043 (https://bugzilla.wikimedia.org/64487)
[08:57:25] (CR) Zfilipin: [C: 1] contint/beta: set natfix for the labs shared proxy [operations/puppet] - https://gerrit.wikimedia.org/r/129687 (owner: Hashar)
[09:11:05] (PS5) Dzahn: turn ircecho into a parameterized class [operations/puppet] - https://gerrit.wikimedia.org/r/129676
[09:16:37] (CR) Dzahn: "Alex, there is just the single bot (icinga-wm) left now" [operations/puppet] - https://gerrit.wikimedia.org/r/129676 (owner: Dzahn)
[09:18:30] (PS1) Springle: Include standard for the new mariadb roles. Remove duplicate includes added previously to node definitions. [operations/puppet] - https://gerrit.wikimedia.org/r/130046
[09:21:08] _joe_: ^
[09:21:16] <_joe_> springle: on it
[09:21:23] thank you
[09:21:58] <_joe_> (btw, including standard twice should not be an issue)
[09:22:08] ah cool, did wonder
[09:22:37] (CR) Giuseppe Lavagetto: [C: 2] Include standard for the new mariadb roles. Remove duplicate includes added previously to node definitions. [operations/puppet] - https://gerrit.wikimedia.org/r/130046 (owner: Springle)
[09:26:58] (PS1) Hashar: contint: get phantomJS on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/130049
[09:27:18] (CR) Matanya: "I think ferm rules are specific to wmf, and as such should be in role classes." [operations/puppet] - https://gerrit.wikimedia.org/r/129965 (owner: Matanya)
[09:27:30] https://www.wikimania.org/ has wrong cert?
[09:28:14] Nemo_bis: open a ticket in rt
[09:28:26] you should poke RobH
[09:28:56] Nemo_bis: that will be related to wikimania.org moving from WMCH to WMF
[09:29:03] there is a months old ticket about that
[09:29:19] #5587
[09:29:38] so Wikimania CH used to own that domain and it returns a wikimedia.ch cert
[09:30:13] in fact, you'll have to ask Manuel Schneider
[09:30:23] Admin Email:info@wikimedia.ch
[09:30:43] and also CC RobH about a new cert
[09:33:38] last status we have was that the decision what happens with wikimania.org went to the WMCH board
[09:34:32] (PS2) Hashar: contint: get phantomJS on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/130049
[09:35:16] (PS3) Hashar: contint: get phantomJS on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/130049
[09:35:18] (PS3) Hashar: contint: get composer on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/124305
[09:35:54] (CR) Hashar: [V: 2] "Rebased. Still deployed on contint puppetmaster" [operations/puppet] - https://gerrit.wikimedia.org/r/124305 (owner: Hashar)
[09:36:25] (CR) Hashar: [C: 1 V: 2] "Rebased on top of https://gerrit.wikimedia.org/r/#/c/124305/ which brings composer. That is to avoid a potential conflict." [operations/puppet] - https://gerrit.wikimedia.org/r/130049 (owner: Hashar)
[09:42:11] (CR) Zfilipin: [C: 1] contint: get phantomJS on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/130049 (owner: Hashar)
[09:46:48] (CR) KartikMistry: [C: 1] "As per Hashar's latest comments :)" [operations/puppet] - https://gerrit.wikimedia.org/r/129687 (owner: Hashar)
[09:47:46] (PS1) Giuseppe Lavagetto: Fix the monitor_graphite_threshold check. [operations/puppet] - https://gerrit.wikimedia.org/r/130052
[09:48:36] (CR) Giuseppe Lavagetto: [C: 2] Fix the monitor_graphite_threshold check.
[operations/puppet] - https://gerrit.wikimedia.org/r/130052 (owner: Giuseppe Lavagetto)
[09:58:32] mutante: ah right, I had forgotten
[10:02:58] sent
[10:05:27] (CR) Alexandros Kosiaris: [C: 2] "Matanya is correct. Monitoring, backup and firewalling are very specific to WMF production and as such they are better positioned in role" [operations/puppet] - https://gerrit.wikimedia.org/r/129965 (owner: Matanya)
[10:06:31] Oh, RT gives AutoReply now, that's very helpful :)
[10:12:14] (PS1) Aude: Set $wgPagePropsHaveSortkey to false [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130055 (https://bugzilla.wikimedia.org/64411)
[10:13:22] (CR) Dzahn: "" This package is a dummy transitional package. It can be safely removed." Depends: fonts-vlgothic" [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[10:16:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1% data above the threshold [250.0]
[10:17:18] <_joe_> and there it is.
[10:17:57] :)
[10:18:19] <_joe_> now I should write some docs :)
[10:18:48] <_joe_> one thing I notoriously suck at.
[10:31:02] Nemo_bis: blame/thank apergos :)
[10:31:34] hashar: i'm gzipping all the jenkins console logs (re: change 125991)
[10:31:54] mutante: dont do it as root! :-D
[10:32:25] hashar: arr, ..ok
[10:32:30] as jenkins..
grmbl
[10:32:34] mutante: the command seems fine
[10:32:42] but it traverse a million of files/directories
[10:32:53] I am not sure how much time it takes nor whether it is a good idea :/
[10:33:07] that's why i wanted to watch it running
[10:33:14] sorry about the user
[10:36:26] hashar: the are all owned by jenkins:jenkins
[10:37:13] the initial run will just be long, the diff every 24h shouldn't be that bad
[10:37:37] I wish Jenkins can compress them automatically
[10:37:53] I need to discard old builds as well
[10:39:13] runs it exactly as in the cron job, with nice -n 19 etc
[10:39:43] <_joe_> mutante: nice -n 19? cpu intensive?
[10:40:37] it doesnt look bad in top
[10:40:42] maybe 2%
[10:40:59] <_joe_> so I'd use ionice instead :)
[10:42:05] <_joe_> (or, if you're fancy and your kernel supports it, cgroups and blkio limits :) )
[10:44:04] ok, fair, is that worth it for a mostly one-time thing ?
[10:44:11] and https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=gallium.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[10:44:39] just tests what was suggested on gerrit
[10:45:23] <_joe_> mutante: well 10% iowait is not nice, but still not killing the server
[10:46:32] thanks ariel :)
[10:46:49] 10% iowait is low
[10:46:59] ionice -c 3 ?
[10:47:10] ah I was preceded
[10:48:14] yw
[10:48:56] <_joe_> eggia'
[10:51:00] continues with ionice -c3
[10:56:50] (CR) Springle: [C: 1] "Seems fine, obviously, however the schema change has been done. Is the field also to be populated by a batch job?"
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130055 (https://bugzilla.wikimedia.org/64411) (owner: Aude)
[11:00:31] !log completed schema change, bug 64411, page_props.pp_sortkey
[11:00:37] Logged the message, Master
[11:05:20] (PS4) Dzahn: Add jkrauska [operations/puppet] - https://gerrit.wikimedia.org/r/127134 (owner: Jkrauska)
[11:11:05] (PS5) Dzahn: add shell account for jkrauska [operations/puppet] - https://gerrit.wikimedia.org/r/127134 (owner: Jkrauska)
[11:12:48] (CR) Reedy: [C: -1] "http://packages.ubuntu.com/precise/ttf-vlgothic" [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[11:16:53] (PS2) Reedy: Add ttf-vlgothic to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002)
[11:30:05] (PS3) Reedy: Add fonts-japanese-gothic to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002)
[11:30:29] fixing the edit summary too works
[11:30:37] s/works/helps/
[11:30:54] Reedy: package 'fonts-japanese-gothic' as it is purely virtual
[11:31:12] i think fonts-vlgothic
[11:46:13] !log Running deleteEqualMessages.php on bpywiki (bug 43917)
[11:46:20] Logged the message, Master
[11:47:47] (CR) Aude: "I don't think populating the column is required for having the setting = true."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130055 (https://bugzilla.wikimedia.org/64411) (owner: Aude)
[11:55:09] (Abandoned) Aude: Set $wgPagePropsHaveSortkey to false [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130055 (https://bugzilla.wikimedia.org/64411) (owner: Aude)
[12:07:17] (PS6) Dzahn: add shell account for jkrauska [operations/puppet] - https://gerrit.wikimedia.org/r/127134 (owner: Jkrauska)
[12:09:36] (CR) Dzahn: [C: 2] add shell account for jkrauska [operations/puppet] - https://gerrit.wikimedia.org/r/127134 (owner: Jkrauska)
[12:26:14] (PS1) Dzahn: add 'rhenium' (netflow box) to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/130060
[12:27:10] (PS2) Dzahn: add 'rhenium' (netflow box) to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/130060
[12:28:14] (CR) Matanya: [C: 1] "I think we should add ferm on the host too." [operations/puppet] - https://gerrit.wikimedia.org/r/130060 (owner: Dzahn)
[12:29:04] !log Running deleteEqualMessages.php on afwiki (bug 43917)
[12:29:07] !log Running deleteEqualMessages.php on cvwiki (bug 43917)
[12:29:11] Logged the message, Master
[12:29:18] Logged the message, Master
[12:31:33] (PS1) Matanya: archiva: add ferm rule [operations/puppet] - https://gerrit.wikimedia.org/r/130061
[12:32:44] (PS1) Dzahn: apply role::pmacct on node rhenium [operations/puppet] - https://gerrit.wikimedia.org/r/130062
[12:33:53] (CR) Alexandros Kosiaris: [C: -1] "One minor comment. Otherwise LGTM" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/129676 (owner: Dzahn)
[12:35:52] mutante / matanya / Nemo_bis : uhh, not sure what we would do about the cert error, its not our server.
(wikimania.org)
[12:36:04] yet
[12:36:04] (you guys pinged me about it so it was in my bouncer, heh)
[12:36:14] thats what folks been telling me now for over half a decade
[12:36:22] 'wikimania.org is going to transfer to wikimedia'
[12:36:23] ;]
[12:36:44] where is manuel s ?
[12:36:58] RobH: yes, something changed on their side but no indication yet that it actually moved to us :p
[12:37:16] i hope it does transfer, would be nice
[12:37:25] but _once_ and if they finally decide to move ... they should ask about certs _before_ just switching
[12:37:30] yea
[12:37:32] (dont take my doubt as an indication of disapproval, i want the domain, heh_
[12:37:43] mutante: its wikimania.org, it has no certs
[12:37:48] it'd go to our main varnish cluster
[12:37:51] no cert to buy =]
[12:37:58] sorry, main ssl cluster, eqiad
[12:38:04] (CR) Matanya: [C: 1] "as a side note: I think the ferm rule should move from modules/pmacct/manifests/configs.pp, but yes, this seems right." [operations/puppet] - https://gerrit.wikimedia.org/r/130062 (owner: Dzahn)
[12:38:06] (only ulsfo has varnish handling ssl)
[12:39:09] RobH: wouldn't we still need to get star.wikimania.org ?
[12:39:38] RobH: I read an old and very interesting doc yesterday about ssl optimization, interested to read ?
[12:40:08] hrmm, i guess so yea
[12:40:18] have to add it to unified cert
[12:40:23] or its own small cert, bleh
[12:40:36] matanya: I'll book mark it and read it later, right now im ssl'd out =]
[12:41:16] (CR) Dzahn: "fonts-japanese-gothic is virtual. and we tried to remove all the virtual ones, or? fonts-vlgothic is the "regular" one" [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[12:44:29] (CR) Alexandros Kosiaris: [C: -1] "There seems also to be rsync::server included from archiva::gitfat so at least a ferm::service/rule will be needed for that.
Not sure this" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/130061 (owner: Matanya)
[12:45:46] (PS6) Dzahn: turn ircecho into a parameterized class [operations/puppet] - https://gerrit.wikimedia.org/r/129676
[12:47:13] (PS2) Matanya: archiva: add ferm rule [operations/puppet] - https://gerrit.wikimedia.org/r/130061
[12:48:04] <_joe_> !log restarted apache on wikitech-static
[12:48:13] Logged the message, Master
[12:53:15] (CR) Dzahn: [C: 2] turn ircecho into a parameterized class [operations/puppet] - https://gerrit.wikimedia.org/r/129676 (owner: Dzahn)
[12:55:19] RobH: https://insouciant.org/tech/ssl-performance-case-study/
[12:55:44] cool, thx =]
[12:59:53] (PS4) Dzahn: Add fonts-vlgothic to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[13:01:29] (CR) Dzahn: [C: -1] "eh, wth, bug 127623 ?:), hold on" [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[13:02:15] (PS5) Dzahn: Add fonts-vlgothic to imagescalers [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[13:03:23] (CR) Dzahn: "- no puppet3 regression, using parameters in role" [operations/puppet] - https://gerrit.wikimedia.org/r/129676 (owner: Dzahn)
[13:04:55] (CR) Dzahn: [C: 2] ""VL Gothic is beautiful Japanese free Gothic TrueType font, developed by Project Vine"" [operations/puppet] - https://gerrit.wikimedia.org/r/127623 (https://bugzilla.wikimedia.org/64002) (owner: Reedy)
[13:22:35] (CR) Ottomata: "The rsync server is meant to be open to the public. It allows users of repositories that use artifacts hosted here to run 'git fat pull' " [operations/puppet] - https://gerrit.wikimedia.org/r/130061 (owner: Matanya)
[13:24:31] ottomata: on port 873 ?
^
[13:26:21] (CR) Ottomata: archiva: add ferm rule (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/130061 (owner: Matanya)
[13:27:11] matanya: yup
[13:27:20] sure, fixing
[13:28:01] (PS3) Matanya: archiva: add ferm rule [operations/puppet] - https://gerrit.wikimedia.org/r/130061
[13:30:33] (PS1) Matanya: titinium: add firewall to the host [operations/puppet] - https://gerrit.wikimedia.org/r/130066
[13:30:45] matanya: titanium
[13:31:07] <_joe_> titinium in italian sounds funny
[13:31:25] <_joe_> well, in english as well, I guess
[13:31:42] (PS2) Matanya: titanium: add firewall to the host [operations/puppet] - https://gerrit.wikimedia.org/r/130066
[13:32:57] thanks Reedy fixed
[13:34:55] (CR) Matanya: [C: -1] "This depends on https://gerrit.wikimedia.org/r/#/c/130061/2" [operations/puppet] - https://gerrit.wikimedia.org/r/130066 (owner: Matanya)
[13:35:00] (PS1) Odder: Add University of Neuchâtel to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130067 (https://bugzilla.wikimedia.org/64535)
[13:38:56] (CR) Krinkle: "Inline (what is /usr/share/dblist?)" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/65634 (owner: Petrb)
[13:39:37] (PS1) Andrew Bogott: Change six UIDs to match ldap [operations/puppet] - https://gerrit.wikimedia.org/r/130069
[13:42:29] (CR) Dzahn: "can't find user abaso in LDAP, all others confirmed though" [operations/puppet] - https://gerrit.wikimedia.org/r/130069 (owner: Andrew Bogott)
[13:43:08] (PS1) Odder: Additional two Swiss domains to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130070 (https://bugzilla.wikimedia.org/64536)
[13:46:24] (CR) Krinkle: improved sql script (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/67826 (owner: Petrb)
[13:48:00] (CR) Andrew Bogott: "Adam's labs login is dr0ptp4kt for some reason"
[operations/puppet] - https://gerrit.wikimedia.org/r/130069 (owner: Andrew Bogott)
[13:48:58] (PS6) Hoo man: Make labs' sql command work with -v and remove cruft [operations/puppet] - https://gerrit.wikimedia.org/r/113755
[13:49:10] Krinkle: ^ that might be interesting for you :P
[13:49:14] (CR) Dzahn: "ugh, i see, so how are we resolving those? different users with same UID, meh. renaming labs users? meh. .." [operations/puppet] - https://gerrit.wikimedia.org/r/130069 (owner: Andrew Bogott)
[13:50:43] (PS1) Krinkle: toollabs/sql: Remove unused 'list', remove duplicate 'commons' [operations/puppet] - https://gerrit.wikimedia.org/r/130071
[13:51:14] (PS2) Andrew Bogott: Change five UIDs to match ldap [operations/puppet] - https://gerrit.wikimedia.org/r/130069
[13:51:29] (CR) Krinkle: "Fixed in I795051ab034bb4c." [operations/puppet] - https://gerrit.wikimedia.org/r/65634 (owner: Petrb)
[13:51:36] (CR) Krinkle: "Fixed in I795051ab034bb4c." [operations/puppet] - https://gerrit.wikimedia.org/r/65634 (owner: Petrb)
[13:51:41] (CR) Krinkle: "Fixed in I795051ab034bb4c." [operations/puppet] - https://gerrit.wikimedia.org/r/67826 (owner: Petrb)
[13:52:11] (CR) Hoo man: [C: -1] "Mostly redundant with https://gerrit.wikimedia.org/r/113755 (the _p additions aren't needed)" [operations/puppet] - https://gerrit.wikimedia.org/r/130071 (owner: Krinkle)
[13:52:54] (CR) Krinkle: new tool for easy sql replica access (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/65634 (owner: Petrb)
[13:53:28] (CR) Dzahn: [C: 1] Change five UIDs to match ldap [operations/puppet] - https://gerrit.wikimedia.org/r/130069 (owner: Andrew Bogott)
[13:53:35] andrewbogott: heheee
[13:54:11] (CR) Krinkle: "That change is a more major refactor. This is a lot more trivial and therefore easier to review and merge.
Yours can trivially be rebased " [operations/puppet] - https://gerrit.wikimedia.org/r/130071 (owner: Krinkle)
[13:55:16] (CR) Hoo man: "I don't see a reason for this change, thus... it would be nice if someone could finally review my code... that would be time spend better " [operations/puppet] - https://gerrit.wikimedia.org/r/130071 (owner: Krinkle)
[13:57:01] (PS1) Krinkle: toollabs/sql: Add handling for connecting to "meta_p" [operations/puppet] - https://gerrit.wikimedia.org/r/130073
[13:58:27] (PS2) Krinkle: toollabs/sql: Support connecting to "meta_p" [operations/puppet] - https://gerrit.wikimedia.org/r/130073
[13:59:02] hoo: I can rebase it for you, that would make your change easier to review as it'll make less changes
[13:59:26] nobody is going to review any of this, anyway... probably
[14:01:13] (CR) Ottomata: archiva: add ferm rule (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/130061 (owner: Matanya)
[14:03:54] (PS7) Krinkle: toollabs/sql: Fix argument forwarding (-v breaks mysql) and clean up [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man)
[14:04:28] (PS3) Krinkle: toollabs/sql: Add handling for connecting to "meta_p" [operations/puppet] - https://gerrit.wikimedia.org/r/130073
[14:07:04] Krinkle: did you figure things out wrt jsduck?
[14:07:15] akosiaris had updated the ticket since last week, aiui
[14:07:17] (PS8) Krinkle: toollabs/sql: Fix argument forwarding (-v breaks mysql) and clean up [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man)
[14:07:36] paravoid: I'm waiting for someone to tell me where and how I can test the package.
[14:07:47] paravoid: akosiaris put one on the repo last week, but it didn't work
[14:07:59] Krinkle: jsduck 5 ?
[14:08:06] quack
[14:08:07] I think akosiaris was going to or has attempted to fix it, but I haven't heard back yet
[14:08:09] matanya: yes
[14:08:34] yes and I am still on it.
I have packaged rkelly-remix (being a dependency). I will be contacting you later today so you can test [14:08:47] (CR) Krinkle: "Clarified more of the changes in the commit message." [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man) [14:09:22] (CR) Krinkle: "This change did even more than the commit message says, effectively moved those out into separate changes by rebasing onto my change." [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man) [14:09:32] akosiaris: awesome :) [14:10:16] (CR) Reedy: Vary twemproxy config location based on getRealmSpecificFilename() (take 2) (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/129663 (https://bugzilla.wikimedia.org/62836) (owner: Reedy) [14:11:02] hoo: can you undo your -1? [14:11:22] (CR) Andrew Bogott: [C: 2] Change five UIDs to match ldap [operations/puppet] - https://gerrit.wikimedia.org/r/130069 (owner: Andrew Bogott) [14:11:37] done [14:12:02] (CR) Krinkle: "This is not needed. We already have phantomjs in slave-scripts, via that same npm." [operations/puppet] - https://gerrit.wikimedia.org/r/130049 (owner: Hashar) [14:12:20] hashar: [14:12:31] Krinkle: in conf call sorry [14:12:48] (CR) Krinkle: [C: -1] "https://github.com/wikimedia/integration-jenkins/blob/master/bin/phantomjs" [operations/puppet] - https://gerrit.wikimedia.org/r/130049 (owner: Hashar) [14:19:58] (PS1) Reedy: Use twemproxy on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130078 [14:20:21] Krinkle: where is the other phantomjs? :-) [14:20:27] read [14:20:29] up [14:20:40] yeah :D [14:20:49] I thought you wanted to drop that wmf grunt stuff entirely ? [14:20:55] so went with yet another repo [14:21:27] Which is why I am moving that dependency up from wmfgrunt to the root. But there is no need for setting up more cruft. 
slave-scripts has a package.json for this reason [14:21:39] ahhhhhh [14:21:47] Also, you should've included the pckage.json in ingegration/phantomjs.git so that it can be verified by someone. [14:21:47] I guess the same can be done for kss js [14:21:51] some kind of css linter [14:23:35] bin/phantomjs@ -> ../tools/node_modules/grunt-contrib-qunit/node_modules/grunt-lib-phantomjs/node_modules/phantomjs/bin/phantomjs :( [14:23:44] hashar: the other phantomjs has been there for a while, not new. But I've moved the dependency up a few directories so that it won't be removed if we drop wmfgrunt [14:23:44] (CR) Giuseppe Lavagetto: [C: 1] "Looks good; added a small suggestion" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/129663 (https://bugzilla.wikimedia.org/62836) (owner: Reedy) [14:23:47] Krinkle: should I add phantomjs to the /tools/package.json ? [14:24:01] hashar: you're behind a few minutes.. [14:24:23] yeah in conf call :D [14:24:48] hashar: I just moved it [14:24:52] (CR) Hashar: [C: -1] "Ahhh will have to cleanup that mess so :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/130049 (owner: Hashar) [14:25:12] (PS1) Odder: Add image-reviewer group to Persian Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130079 (https://bugzilla.wikimedia.org/64532) [14:26:03] (CR) Dzahn: [C: 1] Remove mysql client from bastionhost [operations/puppet] - https://gerrit.wikimedia.org/r/126027 (owner: Hoo man) [14:26:23] (CR) Reedy: Vary twemproxy config location based on getRealmSpecificFilename() (take 2) (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/129663 (https://bugzilla.wikimedia.org/62836) (owner: Reedy) [14:27:57] Krinkle: great. Can we bump phantom js now ? :) [14:29:06] hashar: Already new enough afaik. [14:29:14] hashar: Which version do you need? 
[14:29:20] "phantomjs": "~1.9.0-1", [14:29:21] not sure [14:29:28] asking zeljkof [14:29:40] hashar: Also, read the commit messages before changing it. [14:30:20] (CR) Calak: [C: 1] "Thanks." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130079 (https://bugzilla.wikimedia.org/64532) (owner: Odder) [14:30:26] Krinkle: lets follow up in #wikimedia-qa :) [14:34:42] (PS3) Reedy: Vary twemproxy config location based on getRealmSpecificFilename() (take 2) [operations/puppet] - https://gerrit.wikimedia.org/r/129663 (https://bugzilla.wikimedia.org/62836) [14:36:27] (Abandoned) Hashar: contint: get phantomJS on Jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/130049 (owner: Hashar) [14:38:59] (PS4) Reedy: Vary twemproxy config location based on getRealmSpecificFilename() (take 2) [operations/puppet] - https://gerrit.wikimedia.org/r/129663 (https://bugzilla.wikimedia.org/62836) [14:46:56] (CR) Giuseppe Lavagetto: [C: 2] "Merging." [operations/puppet] - https://gerrit.wikimedia.org/r/129663 (https://bugzilla.wikimedia.org/62836) (owner: Reedy) [14:51:12] manybubbles: Since you have changes in it, do you want to do today's SWAT? [14:51:29] anomie: yeah, I'll do it! [14:54:10] doesn't look like odder is around. [14:54:25] hoo|away: you going to be available in case something goes wrong with you SWAT changes when I start in 5? [14:54:43] hoo|away: also, do you mind if I sync them together or should they go one after the other? [14:55:29] manybubbles: oh, I'm here. [14:55:55] twkozlowski: cool. sweet. your changes: do you want them one after another of all at the same time? 
[14:56:27] Let me look it up again to see what exactly is scheduled [14:56:55] manybubbles: Whatever suits you better, I have no preference [14:58:09] (CR) Manybubbles: [C: 2] Add a new namespace to Hebrew Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129666 (https://bugzilla.wikimedia.org/64353) (owner: Odder) [14:58:16] * manybubbles has the conch [14:58:19] (Merged) jenkins-bot: Add a new namespace to Hebrew Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129666 (https://bugzilla.wikimedia.org/64353) (owner: Odder) [14:59:01] (CR) Manybubbles: [C: 2] Add Library of Congress to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130043 (https://bugzilla.wikimedia.org/64487) (owner: Odder) [14:59:10] (Merged) jenkins-bot: Add Library of Congress to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130043 (https://bugzilla.wikimedia.org/64487) (owner: Odder) [14:59:25] twkozlowski: mind rebasing :https://gerrit.wikimedia.org/r/#/c/129675/ [14:59:59] Sure. [15:00:04] (CR) BBlack: "Are we still waiting on verification that there's no ill effects for analytics?" 
[operations/puppet] - https://gerrit.wikimedia.org/r/129714 (owner: Dr0ptp4kt) [15:00:35] (PS4) BBlack: Set domain to TLD on GeoIP cookie [operations/puppet] - https://gerrit.wikimedia.org/r/127131 (owner: Ori.livneh) [15:02:47] (PS2) Odder: National Library of Scotland to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129675 (https://bugzilla.wikimedia.org/64357) [15:03:47] manybubbles: ^^ [15:05:58] RECOVERY - twemproxy port on fenari is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [15:05:58] RECOVERY - twemproxy process on fenari is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [15:06:03] (CR) Manybubbles: [C: 2] National Library of Scotland to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129675 (https://bugzilla.wikimedia.org/64357) (owner: Odder) [15:06:17] (Merged) jenkins-bot: National Library of Scotland to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129675 (https://bugzilla.wikimedia.org/64357) (owner: Odder) [15:06:35] twkozlowski: k. deploying [15:06:52] (CR) BBlack: [C: 2 V: 2] Set domain to TLD on GeoIP cookie [operations/puppet] - https://gerrit.wikimedia.org/r/127131 (owner: Ori.livneh) [15:08:28] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'Add new sources to gwtoolset and namespaces to hewikisource' [15:08:33] Logged the message, Master [15:08:35] twkozlowski: ^^^^ [15:08:42] please make sure everything looks good [15:09:41] manybubbles: Looks fine to me. [15:09:44] hoo|away: I'll do yours when you come online [15:09:52] twkozlowski: sweet, consider yourself SWATed [15:09:59] :-) [15:22:40] apergos: dataset alerts? [15:22:46] (dataset2 & dataset1001) [15:28:30] anyone around to support hoo's swat deploy? [15:29:41] what is it? 
[15:30:14] Nikerabbit: https://gerrit.wikimedia.org/r/#/c/129707/ and https://gerrit.wikimedia.org/r/#/c/129708/ [15:30:44] seem simple enough but I don't want to do them without him around to verify that they worked properly [15:30:56] looking [15:31:48] I wonder if it has been verified on testwikidata [15:35:15] Krinkle: wget http://apt.wikimedia.org/pending/ruby-rkelly-remix_0.0.6-1_all.deb && ruby-rkelly-remix_0.0.6-1_all.deb [15:35:34] and test [15:36:01] and if everything is ok I will update apt.wikimedia.org [15:38:37] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 28 Apr 2014 03:36:28 PM UTC [15:39:27] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 28 15:39:25 UTC 2014 [15:39:37] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 28 Apr 2014 03:39:25 PM UTC [15:39:50] chasemp: could you have a look at https://gerrit.wikimedia.org/r/#/c/129728/ ? I'm hoping it can fit in with your user refactor... [15:40:40] if you want to wait this would be accomplished by adding a person to a meta absent group in yaml [15:40:51] basically doing the same thing but w/ a different flow [15:41:17] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 28 15:41:11 UTC 2014 [15:41:18] akosiaris: OK, testing now [15:41:27] although it's not wiping out remaining owned files [15:41:28] hashar: integration-dev doesn't have npm. Good or bad? [15:41:36] I'll use slave1002 for testing then [15:41:39] (home directory of course) [15:41:42] which I guess I would not be in favor of automating outside of /home [15:41:55] Krinkle: integration-dev is not pooled. Will setup jenkins/zuul/gerrit there whenever I have some time :] [15:42:28] It shouldn't be pooled, but it should have most of the same packages I suppose. [15:42:39] I guess the reason is because stuff isn't puppetised.. 
:/ [15:42:44] (PS2) Reedy: Use twemproxy on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130078 [15:42:48] (CR) Reedy: [C: 2] Use twemproxy on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130078 (owner: Reedy) [15:43:03] Krinkle: well there is a bunch of puppet class to setup zuul/jenkins/gerrit on labs [15:43:07] hashar: :D [15:43:10] Krinkle: gotta adjust them though since some IP have changed [15:43:14] to clarify, /home/user should be killed on user removal, but anything tied to their username on the system getting wiped out? unintended consequences abound in my mind and very little benefit (to me) [15:43:18] Reedy: awesome! :-] [15:43:20] akosiaris: Hm.. and which one for jsduck? (assuming it isn't included) [15:43:33] Reedy: if that works, make sure to post some announce on qa-l :-] [15:43:44] chasemp: if you look at email thread 'Disabled users in admins.pp' I got the impression there was support for automated removal of file ownership. [15:43:55] akosiaris: And how should I install the deb? (sorry don't usually use debian packages, I found a few instructions via google but I'd rather trust you) [15:43:57] But if you're already on top of all this I don't mind abandoning my patch; there's nothing urgent about it. [15:44:08] Krinkle: dpkg -i filename.deb [15:44:13] Krinkle: http://apt.wikimedia.org/pending/ruby-jsduck_5.3.4-1wmf1_all.deb if you don't have it already from last time [15:44:24] it is practically the same [15:44:28] OK [15:44:28] Krinkle: mailed you about phantomjs. The version specifier is a range, so we got 1.9.7 now (which fit our browser tests needs) [15:44:42] (Merged) jenkins-bot: Use twemproxy on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130078 (owner: Reedy) [15:44:43] and sudo dpkg -i filename.deb and sudo apt-get install -f [15:45:36] akosiaris: OK, dpgk'ed both. 
$ jsduck --version; JSDuck 5.3.4 (Ruby 1.8.7); So far so good, further than last time when it er'ed right away. [15:45:42] Running more extensive test now, will get back to you [15:45:49] :-) [15:45:58] andrewbogott: I wish I knew what it was going to take to get that changeset I have have going, could be stuck in review or who knows. I had seen this and I didn't think to say hold off because it's good cleanup I think, and now is better than later? The outside of /home auto removal stuff it's all good to me, just meant I'm not handling it and I hadn't considered adding it until now [15:46:42] chasemp: does my patch do work that you haven't done already, or do you already have remove/disable/cleanup code in your patch? (Which I haven't read all of yet...) [15:47:05] In any case, let's talk about your patch in the meeting today, maybe we can get it merged quickly. [15:47:33] my stuff basically nukes any account not in a supplementary group that is above UID 500 as it is now [15:47:48] meaning if your account isn't explicitly added to a group it will be removed [15:47:56] the /home and all [15:48:09] Oh… that seems good I think. [15:48:25] what I'd like is that we get notification of what's owned by that user outside of the home directory, before we remove the user (this goes into accoumt mgmt outside of puppet though) [15:48:42] we could do that, but notification how? [15:48:46] Anything to avoid the problem where re-use of username or uid = inheriting an ex-employee's rights [15:49:11] this is why we used to keep the account in existence (uid is claimed) but disable logins (remove the ssh auth keys file) [15:49:17] to avoid that re-use [15:49:23] * andrewbogott nods [15:49:35] as far as notification, I was thinking of something more akin to an email from an audit [15:50:16] I'm interested in this topic because it bears on data retention; we are going to want to do automatic cleanup of .. 
some things created by users over 90 days [15:50:34] where 'some things' is up for discussion :-D [15:50:40] ha [15:50:47] so my cleanup stuff, not meant to be a security audit [15:50:50] replacment you know? [15:50:56] just, you don't belong here now so you are now removed [15:51:06] so I grep through some logs, I create a temp file, it sits there, logs are rotated by my file remains on there a year later... clearly no good [15:51:08] I didn't want to make it so that it seemed like it was a replacement for real cleanup [15:51:14] true [15:51:15] right, ok [15:51:37] I feel like putting logic I would feel good about in every puppet run [15:51:44] is just...going to not work out and be unreliable [15:51:49] better to make cleanup very simple [15:51:50] well my short answer then would be, don't remove stuff outside /home, but we should find out about those before the uid becomes just a number [15:51:57] yeah [15:52:01] er the user name becomes a uid I should say [15:52:02] a pre-allocate thing [15:52:09] that makes complete sense to me [15:52:48] so maybe your stuff could do the check and say 'user owns some stuff. get it cleaned up first'... dunno [15:53:03] finds n some of these boxes will take forever [15:53:06] yeah I feel like that will be ignored [15:53:09] We definitely want to avoid dangling files that are owned by a mysterious UID. Something that blocks account removal as long as those files exist, and send an email? [15:53:32] and I would rather nuke the user account and leave dangling files and leave it forever? [15:53:34] (Or take the brutal approach my patch takes :) ) [15:53:41] I feel like email is a bad notification ssytem here for now [15:53:42] Obv. we can't run the find on every puppet run though [15:53:50] and you could get an email from every server [15:53:51] no we don't want to do that [15:53:55] chasemp: icinga? 
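[editor's note] The pruning rule chasemp describes at 15:47 (remove any account above UID 500 that is not explicitly added to a group) can be sketched roughly as below. This is only an illustration under assumptions, not the code in the admin module patch: the function name, the dict-based inputs, and reading "above UID 500" as strictly greater than 500 are all hypothetical.

```python
def accounts_to_remove(accounts, group_members, min_uid=500):
    """Return sorted account names that the rule would remove.

    accounts:      dict of account name -> uid (hypothetical input shape)
    group_members: dict of group name -> iterable of member names
    min_uid:       accounts at or below this uid are never touched
                   (assumption: "above UID 500" means uid > 500)
    """
    # Every account that is explicitly listed in at least one group is kept.
    kept = set()
    for members in group_members.values():
        kept.update(members)

    # Everything else above the uid floor is a removal candidate.
    return sorted(
        name
        for name, uid in accounts.items()
        if uid > min_uid and name not in kept
    )
```

System accounts stay untouched because of the uid floor; whether files outside /home are reported first is a separate step, as discussed in the channel.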
[15:54:23] I really think, the cleanup we can run on account allocation w/ puppet is just to remove them if they don't belong [15:54:36] the dangling files and duplicate issued UID has to be part of human thing [15:54:40] and part of a real process [15:54:56] hm [15:55:00] automatiing it or saything it requires an email before an account is removed from any system that doesn't belong [15:55:06] man that will just stack up and stack up [15:55:30] I wouldn't hate it if we just got a puppet failure -- like "puppet is trying to remove X but X has dangling files, abort" which would get us notified about a puppet failure. [15:55:44] I just see that as infeasible [15:55:48] one account removal [15:55:52] could hose puppet on 50 hosts [15:55:54] I don't think it will stack up that much. we don't go through that many users [15:56:00] that is also now hosed for every puppet managed thing [15:56:10] It wouldn't actually break the rest of the things that puppet is doing though [15:56:17] so wouldn't damage the system. Would just cause alerts, which seems correct. [15:56:39] do we get alerts on puppet failures now? I thought staleness only? [15:56:43] I would not expect someone to copy their uer-owned files around to a pile of hosts [15:56:48] Hm, not sure [15:57:08] and we only get alerts now on failure to get an OK notification form a run after I guess it's 3 hours [15:57:10] What's the objection to just downing an automated chown when an account is removed? [15:57:22] chown to what? [15:57:25] 0 [15:57:31] that's what my current patch does [15:57:33] <_joe_> I completely back up chasemp :) [15:57:39] 0? [15:57:43] root? 
[15:57:48] not loving thaat at all [15:57:51] So the file remains but is no longer owned by mystery user [15:58:02] Secure without being destructive [15:58:20] ok so [15:58:21] We're never going to have the option of an ex-employee going through all their files and explaining which matter and which don't :( [15:58:22] <_joe_> we do not want to remove a file that has been accidentally owned by a user, I would not remove the home dir... [15:58:24] I'd rather have a uid with no login and minimal privs that owns cruft, if we were going to do that [15:58:30] but not root so much [15:58:48] _joe_: things I have now would use deluser and backup /home to /tmp w/ default removal [15:58:52] just fyi [15:59:07] <_joe_> chasemp: looks sane to me. [15:59:16] so [15:59:21] can we separate this [15:59:29] are we accounting fron cron jobs owed by the user too btw? [15:59:35] a iciinga check that looks for files owned by nonexistent user [15:59:38] because it will happen sooner or later [15:59:53] and just have account cleanup be 1:1 w/ yaml account assignment [16:00:08] <_joe_> apergos: that someone has a cron under its user and that cron failing will create a problem? [16:00:10] let's not hang on to accounts everywhere because of one user owned file [16:00:26] <_joe_> chasemp: agree. [16:00:30] _joe_: that people have put production stuff under their own names [16:00:34] Aren't stray files a security risk though? [16:00:41] <_joe_> apergos: we do have that now? [16:00:44] Or do you have some way to ensure uids are not reused? [16:00:54] I'd have to check :-D but we sure might [16:01:07] yes but too complex to be effectively dealt w/ in this context [16:01:23] there are so many edge cases and ways to make sure ppl are not gaming the system [16:01:33] let jenkins check a user id matches LDAP ? 
[16:01:56] so there's gaming and there's simple 'woops forgot' or 'didn't think of that', it's the latter that I'm targeting [16:02:24] apergos: what would you like to happen for a file outside of home for a user who no longer belongs on a box? [16:02:29] I think I missed your suggestion [16:03:26] (For reference… Bryan recently emailed me "... chances are that there are a lot of files owned by me [16:03:27] scattered around the cluster. I know there are files owned by me in [16:03:29] in /a/common on tin that I own from doing a few branch deploys." ) [16:03:29] they and ops have to look at the list to see if the files can be tossed, shoul be kept but chowned, should be made available to the user [16:03:33] So I don't think we're talking about stray outliers [16:04:06] tin will be bad [16:04:13] apergos: ok agreed [16:04:21] one common, collected report from all servers wouldn't be a bad thing [16:04:26] (but not trivial to implement) [16:04:30] I suppose the rsync of that, if it preserves uids, to all the apaches, will be bad too. ugh [16:04:33] yes this [16:05:02] I'm certainly not wanting 50 emails (or alerts), that's annoying rather than helpful [16:06:02] I can make an a check that alerts on leftovers, using salt or check_mk or something [16:06:13] my guess is this would be a weeks worth of cleanup from teh get go [16:06:14] so there's two things.. or 5, who knows... 
we don't ever have to reuse that uid even if we remove the account [16:06:33] ok so we disable in ldap instead of remove [16:06:36] this means that running a report oncvec a week, even if we clean up a bunch of folks from $x_host, is not a problem, [16:06:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 1.69491525424% of data exceeded the critical threshold [500.0] [16:06:42] and it doesn't allocate the uid again [16:06:47] we might have several uids how up but we know who they were [16:06:49] that would be a good idea yeah [16:06:56] no uid reuse if at all possible [16:07:26] I have done that w/ a kind of archive grouping that basically just reserved UID's? [16:07:35] as long as we dont delete labs users from ldap, and since we match the UIDs and always ask for labs user first.. that should already happen ? [16:07:51] (and i dont think we delete labs users, or) [16:07:51] let's make sure of the first part of that :-D [16:07:55] (PS1) Alexandros Kosiaris: Publish carbon's IPv6 address in DNS [operations/dns] - https://gerrit.wikimedia.org/r/130087 [16:08:28] mutante: I think in theory yes this handled? [16:08:32] feel free to comment/rain fiery hell on me guys ^ [16:08:41] depends on how ops accounts are 'removed' or prod accounts in ldap [16:09:14] akosiaris: +1 [16:09:45] chasemp: they are just disabled in admins.pp but a new user would not get the same UID because it still exists in LDAP [16:09:55] :-D [16:10:15] umm... yay ipv6? :-) [16:10:29] mutante: so the new workflow would be, disable in ldap, generates a disabled yaml user [16:10:39] and ensures consistency ....foreverrrrrrrr [16:10:58] (PS1) Manybubbles: Upgrade highlighter again [operations/software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/130088 [16:11:33] chasemp: i see .. 
[16:12:07] this decouple it form the file checks completely, liking it [16:12:18] *decouples [16:12:20] andrewbogott: so above when I mean it hoses puppet, I mean more like a broken puppet run is not a good alert because it could be broken for any number of reasons and if you remove a user and then we get 50 broken puppet runs it's easy become complacent when it could be hiding other issues being merged [16:12:21] *from [16:12:43] using puppet runs as a notification for specific issues being hosed just becomes monstrous over time [16:13:00] we don't hear about puppet errors, only about 'omg I couldn't compile the catalog' [16:13:10] but someday we do want to hear about errors [16:13:21] yeah, so it doesn't really work. [16:14:31] would a check that compiles and alerts on orphaned files work for people from icinga? [16:14:53] so for the files owned by old users, every host would have to run a cron with a find / -uid for each uid that does NOT exist in admin.pp [16:15:07] and if it finds stuff. send snmtrapd to icinga? [16:15:12] or really in passwd [16:15:21] That's fine with me, although… I can think of lots of reasons to not want to deal with orphans by hand, and I don't understand the objection to just chownign or clobbering them automatically. [16:15:23] yep, maybe once a week find, (in passwd, exactly) [16:15:23] since even if they exist they maybe are not supposed to exist there [16:15:41] and the results collected up and ... sad to say that one email stuffed somewhere [16:15:49] and not lost/ignored. [16:16:00] I am ok w/ chowning them or whatnot actually andrew, just don't want it stuffed into the puppet run [16:16:10] clobber automatically means we might break something. 
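[editor's note] The weekly scan mutante and apergos describe (look for files whose owning uid has no passwd entry, and only report them) could look something like the sketch below. It is a minimal illustration under assumptions: the function names are hypothetical, nothing here is taken from an existing check, and `pwd.getpwall()` only sees what the host's NSS configuration enumerates, so LDAP-backed accounts may need to be passed in explicitly. On a real host, `find <dir> -nouser` does the same job.

```python
import os
import pwd


def known_uids():
    """Return the set of uids that currently have a passwd entry."""
    return {entry.pw_uid for entry in pwd.getpwall()}


def find_orphaned(root, valid_uids=None):
    """Yield (path, uid) for entries under root owned by an unknown uid.

    Read-only: it never chowns or deletes, it only reports, matching the
    "make a person look" approach discussed in the channel.
    """
    if valid_uids is None:
        valid_uids = known_uids()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that a symlink is judged by its own owner
                uid = os.lstat(path).st_uid
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if uid not in valid_uids:
                yield path, uid
```

The output of a run like this would still need to be collected somewhere central (one report rather than one email per host), which is the part the discussion leaves open.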
[16:16:12] i wouldn't let it automatically do stuff with the files, just report them to icinga to make a person look [16:16:13] and I don't like nagios checks that make changes really [16:16:20] so it almost has to be outside of those two [16:16:27] chown just masks the issue [16:16:52] we need to actually do something with them (toss, chown, hand over) [16:17:05] thats a good point [16:17:13] auto chowning means leaving them around forever probably [16:17:15] hard to find again [16:17:25] yep [16:17:43] I guess icinga is ok as long as we have some kind of default what-to-do policy. [16:17:58] Otherwise I know that if I'm facing 3000 mystery files I will opt to do nothing for fear of breaking stuff. [16:18:36] it could send mail to the user in question? [16:18:57] well, but then they are already not users anymore [16:19:06] in almost all cases I think we will either know by looking or can ask the user [16:19:12] yeah, and possibly gone/disgruntled/busy with new things [16:19:50] they may not be at wmf or they may have given us a personal email, but not sure we can expect icinga to deal with it [16:20:12] you can just make it mail first, like cronspam [16:20:22] and see how much it really is before caring about icinga [16:20:23] so cron jobs. there's some tendril related cron under a personal account right now (I don't consider this a problem, people stage stuff all the time) [16:20:26] just an example though [16:20:30] anyways... [16:22:30] If a cron is set to run with a uid that's no longer in /etc/passwd… what happens? Will it still run, no problems? [16:22:57] no idea [16:23:17] no, it won't run [16:23:23] I mean it's nt set by uid [16:23:25] it's set by name [16:23:30] Oh, great. [16:23:31] the name won't map to anything, right? 
[16:23:41] so not really a danger then, just a cleanup issue [16:23:47] it will stall out [16:23:58] well it's a danger if something suddenly stops running that was useful [16:24:27] is it not considered bad for something to run under the context of a user? [16:24:36] who is not root or a service I mean [16:24:37] well yes if that's how it is permanently [16:24:52] temporary is always permanent :) [16:24:52] if we'd have and keep all 'lastlog' from all hosts we'd know where they logged in and we would only have to check there instead of ALL hosts? [16:25:11] we wouldn't keep that forever (lastlog) [16:25:21] data retention, eh apergos [16:25:25] i expected that:) [16:25:25] indeed [16:25:28] :-) [16:25:29] ahhh good though mutante [16:25:34] thought I meant [16:25:39] is not logstash doing that now? [16:25:56] in fact I don't know how long our archives are but they should be 90 days [16:26:07] I haven't gotten to the accumulated logs yet, still working through the clusters [16:26:09] logstash isn't getting any system logs and only keeps ~31 days of data [16:26:28] bd808, you get a gold star! 
[16:27:17] I saw apache and elasticsearch when I was over there looking [16:27:24] "we have this important cron job that is important so we can't turn it off, but because of our data retention policy we don't know anymore who the user is":) [16:27:53] puppet will tell us, because [16:28:04] commit message: remove X, uid Y (rt #blah) [16:28:06] :-P [16:28:07] well, just saying, we should not have important cron jobs running as regular users, right [16:28:26] yes we should not, lots of things, but let's hedge our bets too [16:28:33] but then there are those "more like a system user"-users adin puppet [16:28:58] yes, those are a separate issue I don't even want to think about right now [16:29:25] ok this is going far afield :) but the original discussion on whether https://gerrit.wikimedia.org/r/#/c/129728 would be replaced by https://gerrit.wikimedia.org/r/#/c/129501/ [16:29:32] :-D [16:29:48] I think yes, with an alert from icinga about orphaned files in the environment [16:29:51] I'm tempted to say that in both these cases (stray files or stray cron jobs) just breaking/deleting them is fine. It's a case where things were already broken, we just didn't know about it before :) [16:30:17] But, yes, I'll abandon my patch in favor of chase's yaml-based stuff. 
[16:30:44] (Abandoned) Andrew Bogott: Add the decom-user resource [operations/puppet] - https://gerrit.wikimedia.org/r/129728 (owner: Andrew Bogott) [16:31:10] (Abandoned) Andrew Bogott: Decom users preilly and dsc [operations/puppet] - https://gerrit.wikimedia.org/r/129729 (owner: Andrew Bogott) [16:31:16] andrewbogott: I'm sorry man I would have mentioned it I just figured sooner the better and yours is easier to get approved [16:31:25] np [16:33:02] apergos: The logstash cleanup script is https://github.com/wikimedia/operations-puppet/blob/production/modules/logstash/files/logstash_delete_index.sh and it's cron'd by https://github.com/wikimedia/operations-puppet/blob/production/modules/logstash/manifests/output/elasticsearch.pp [16:33:36] anomie: how'd this morning's SWAT go? [16:34:14] greg-g: Ask manybubbles|away, he did it today [16:34:33] hi sarah, welcome :) [16:34:44] ah excellent, thanks bd808 [16:34:47] adding this to my notes [16:46:51] (PS1) Reedy: Add deployment-db2 as slave [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130092 [16:51:14] !log executed graceful-stop, start for apaches in order to load the new php-luasandbox apache module [16:51:21] Logged the message, Master [16:51:57] manybubbles|away: how was the SWAT this morning? 
[16:52:10] anomie: he was away, so went to you ;) [16:57:52] (PS1) Faidon Liambotis: Kill check_job_queue [operations/puppet] - https://gerrit.wikimedia.org/r/130095 [16:59:00] yay [16:59:03] (ish) [17:04:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1% data above the threshold [250.0] [17:08:07] (PS1) Faidon Liambotis: Remove references to sdtpa devices [operations/puppet] - https://gerrit.wikimedia.org/r/130098 [17:08:36] (CR) Rush: admin module for user/group/permissions cleanup (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/129501 (owner: Rush) [17:08:57] greg-g: let me update the page [17:10:19] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=111144&oldid=111131 [17:10:23] paravoid, akosiaris: I'd like to re-assign your UID on the cluster sometime soon. Shouldn't be much of a headache for you, but it does require me to killall -u you beforehand. [17:10:24] greg-g: ^^^ [17:10:38] manybubbles: ty [17:10:38] Would it be OK if I do that towards the end of my workday today? 4-5 hours from now? [17:10:58] (Or, actually, if you just want to do yourselves at your leisure, that's also fine with me :) ) [17:11:10] whenever you please [17:11:18] I suppose you're also chowning files? [17:11:22] yep [17:11:30] yeah, no worries [17:11:36] just give me a 3' heads-up [17:11:36] I'd prefer not to kill your sessions when you're actually doing things, hence aiming for when you're asleep [17:12:03] I shouldn't have any sessions open right now [17:12:11] hm, ok. [17:12:30] would it be useful to add checks on each host for cpu/memory usage? [17:12:52] paravoid: partly, it takes a big window for things to settle down (about an hour) since I don't want to force a cluster-wide puppet run [17:12:56] either by polling ganglia or using nrpe ? 
[17:13:08] (PS2) Rush: admin module for user/group/permissions cleanup [operations/puppet] - https://gerrit.wikimedia.org/r/129501 [17:13:10] (PS2) Rush: one-off to convert admins.pp to yaml [operations/puppet] - https://gerrit.wikimedia.org/r/129541 [17:13:22] andrewbogott: can you do it in 40-50ish minutes or so? [17:13:27] ops meeting and all that [17:13:36] chasemp: you are probably the right person to ask [17:13:37] (CR) Rush: "fixed up sudo to use a numeric perpend, fixed linting concerns" [operations/puppet] - https://gerrit.wikimedia.org/r/129501 (owner: Rush) [17:13:45] Sure, I can merge the puppet patch right when the meeting starts. [17:13:51] akosiaris: does that suit you as well? [17:13:52] andrewbogott: thanks btw for doing this, it doesn't sound fun at all [17:13:57] matanya: ? [17:14:09] Eh, little bit of manual labor makes for a nice break now and then. [17:14:21] chasemp: would it be useful to add checks on each host for cpu/memory usage? [17:14:23] (PS1) Andrew Bogott: Sync UIDS with ldap for Faidon and Alexandros [operations/puppet] - https://gerrit.wikimedia.org/r/130101 [17:14:47] (CR) Andrew Bogott: [C: -2] "Andrew will self-merge this during a pre-arranged window." [operations/puppet] - https://gerrit.wikimedia.org/r/130101 (owner: Andrew Bogott) [17:15:58] mutante: is there not already in ganglia? and if you mean, diamond then I think yes. 
I had some stuff ready, I think it's waiting on some nice changes for a dimaond module in ori's hands now tho [17:16:01] (CR) jenkins-bot: [V: -1] one-off to convert admins.pp to yaml [operations/puppet] - https://gerrit.wikimedia.org/r/129541 (owner: Rush) [17:16:04] (PS1) Faidon Liambotis: swift: fix syntax error in swift-drive-audit [operations/puppet] - https://gerrit.wikimedia.org/r/130102 [17:16:13] chasemp: matanya :) [17:16:16] MaxSem: hey [17:16:29] mutante: oops :) sorry sir [17:16:53] but i didnt know about diamond module in ori's hands, so cool [17:16:55] (CR) Faidon Liambotis: [C: 2] Kill check_job_queue [operations/puppet] - https://gerrit.wikimedia.org/r/130095 (owner: Faidon Liambotis) [17:17:06] paravoid, weee [17:17:07] matanya: see above [17:17:14] (CR) Faidon Liambotis: [C: 2] Remove references to sdtpa devices [operations/puppet] - https://gerrit.wikimedia.org/r/130098 (owner: Faidon Liambotis) [17:17:21] (CR) Faidon Liambotis: [C: 2] swift: fix syntax error in swift-drive-audit [operations/puppet] - https://gerrit.wikimedia.org/r/130102 (owner: Faidon Liambotis) [17:17:40] (CR) Faidon Liambotis: [V: 2] swift: fix syntax error in swift-drive-audit [operations/puppet] - https://gerrit.wikimedia.org/r/130102 (owner: Faidon Liambotis) [17:17:49] matanya: https://gerrit.wikimedia.org/r/#/c/129075/ [17:19:36] MaxSem: question for you [17:19:44] sure [17:19:52] there's a cronjob that runs on terbium [17:19:54] update-geodata [17:20:18] for which we receive the stderr by mail, as we probably should keep doing [17:20:34] update-geodata does a foreachwiki, and for one of the wikis (iirc, testwiki) this is printed on stderr: [17:20:37] This script is only for wikis with Solr GeoData backend [17:20:54] therefore, we get an email every 30 minutes :) [17:21:09] anything useful in it?:P [17:21:15] just this line [17:21:57] meh [17:22:06] let's just mute it [17:22:09] 1 sec [17:26:31] (PS1) MaxSem:
STFU update-geodata [operations/puppet] - https://gerrit.wikimedia.org/r/130106 [17:26:45] paravoid, ^ [17:26:56] chasemp: i updated the diamond module patch, go ahead and re-review [17:27:03] diamond module patch? [17:27:07] sounds interesting! [17:27:26] paravoid: https://gerrit.wikimedia.org/r/#/c/129075/ [17:27:32] (CR) Faidon Liambotis: [C: 2 V: 2] STFU update-geodata [operations/puppet] - https://gerrit.wikimedia.org/r/130106 (owner: MaxSem) [17:27:37] MaxSem: oh danke :) [17:27:57] thanks all, i should go on with those stupid questions, they engage reviews :D [17:27:58] paravoid: i have a couple of debian packaging questions if you have a moment [17:28:04] hit me [17:28:32] paravoid: ok, so i tried my hand with statsite. i know our requirements are less strict than debian's, but i wanted to try and meet the maximal standard [17:28:56] and ottomata i would like to hear if you have any comment on the ferm patches from earlier [17:29:18] ori: okay [17:29:39] ori: do you mean this? https://github.com/armon/statsite [17:29:57] paravoid: one issue is: the source tree contains libev and builds against it rather than debian's. i need to figure out which dh_override_foo to customize to change that. i'm also wondering if the debian/copyright needs to list libraries bundled with the upstream tarball if they're not included in the package [17:30:20] heh, welcome to the Debian hell [17:30:43] embedding libraries is generally frowned upon [17:30:47] the policy is against it [17:30:53] but some upstreams really love it [17:30:58] oh matanya, thanks I see you moved the base::firewall [17:31:01] +2s all the way for me!
[17:31:11] i will merge and test shortly [17:31:20] the code that will use the embedded libev is probably going to be in upstream's autoconf [17:31:27] ori: gage and I were packaging up https://github.com/deviantART/pystatsd, I've used it quite a bit in the past and it's nice and simple [17:31:40] if you're lucky, they'll have an option --with-system-libev or something, in which case you'd need to override dh_auto_configure [17:31:44] and just pass the option [17:32:01] if not, you'd have to patch autoconf, and hence you will probably also need to use dh-autoreconf [17:32:26] it uses scons :P [17:32:31] what I usually do, is nag upstreams to remove it themselves [17:32:36] that's how I met bblack actually :) [17:32:49] i might just push to a personal github repo so i can show you whati 'm talking about [17:33:13] sometimes upstreams aren't as nice as brandon, though, cf. https://github.com/twitter/twemproxy/pull/121 [17:34:21] heh [17:35:03] but yeah, I always say that packaging is the easy part [17:35:08] as in debian/ [17:35:17] dealing with embedded & license hells is the worst part :) [17:35:53] as for d/copyright [17:36:03] you need to either document libev's copyright there indeed [17:36:15] or alternatively, strip the embedded source from the tarball and repackage it [17:36:27] i went with the former [17:40:49] paravoid: https://github.com/atdt/statsite-pkg/blob/master/debian/copyright (incomplete, still has *.ex files, etc.) 
[17:41:44] needs a License: LGPL section but I guess you knew that [17:41:50] yep [17:41:57] also, the dep5 url is deprecated [17:42:12] https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ is the canonical copy nowadays [17:42:21] it's slightly difference but I don't remember the differences [17:42:23] ah [17:42:37] *different [17:42:40] like, some tiny details [17:43:27] I'd also add a Comment: not actually being used in the binary (or something better phrased along these lines) for libev [17:44:29] might not be worthwhile, chasemp evaluated statsite before and found the efficiency of a C implementation not worth the diminished debugabbility / extensibility that came with it [17:44:30] ori: it's easy, isn't it :) [17:44:38] (CR) Rush: [C: 2] "this all looks good to me, will need to readd the cpu / mem stuff to a base roll but otherwise sweetness." [operations/puppet] - https://gerrit.wikimedia.org/r/129075 (owner: Ori.livneh) [17:44:52] paravoid: i don't want to admit how long it took me to get to that spot [17:45:01] really? [17:45:07] well, an afternoon [17:45:32] I mean, I learned this a very long time ago, so I might have forgotten about how difficult it was for someone new [17:45:41] greg-g: any objections to me pushing an upgrade to our highlighter plugin to the elasticsearch machines today? [17:45:46] and maybe I'm comparing with the pre-dh days which were significantly more complicated [17:45:50] paravoid: that question I asked you last week that took you 3 seconds, i spent like 3 hours on it :) [17:45:56] paravoid: it's a lot of sifting through outdated / inaccurate documentation [17:47:15] chasemp: bd808 has a cleaned-up / simplified statsd implementation based on pystatsd [17:47:25] chasemp: it looked at it quickly and it was very small and self-contained, which i liked [17:47:31] is it in service now? [17:47:36] i don't think so, no [17:47:36] not yet [17:47:54] linky linkify me?
[17:47:55] chasemp: https://github.com/bd808/yastatsd [17:47:55] (PS1) Dzahn: WIP - turn imagescaler into a module [operations/puppet] - https://gerrit.wikimedia.org/r/130113 [17:48:32] It's a rewrite of the statsd server I wrote for $DAYJOB-1 [17:49:23] php > $DAYJOB = "Developer at Wikimedia"; [17:49:23] php > var_dump( $DAYJOB-1 ); [17:49:23] int(-1) [17:49:33] I don't care which statsd-like thing we use as long as we find one that works. [17:49:39] I feel the same [17:49:41] manybubbles: I don't think so, why should I be worried? [17:49:44] ditto [17:49:51] manybubbles: why should I/should I? :) [17:49:55] this is ridiculously similar to what I came up based on the same origin :D [17:50:05] greg-g: no worries [17:50:13] manybubbles: coolio then :) [17:50:18] a few differences are I used twisted as the stock epoll kept crashing on vulnerability scan and they have solved a lot of things in that sense [17:50:22] and a few others [17:50:31] will circle back here when I get through it all [17:50:47] (PS2) Dzahn: WIP - turn imagescaler into a module [operations/puppet] - https://gerrit.wikimedia.org/r/130113 [17:50:55] (CR) Manybubbles: [C: 2 V: 2] Upgrade highlighter again [operations/software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/130088 (owner: Manybubbles) [17:52:15] (PS1) Ottomata: Adding haithams on bast1001 in order to access stat1003 [operations/puppet] - https://gerrit.wikimedia.org/r/130115 [17:52:29] (CR) Ottomata: [C: 2 V: 2] Adding haithams on bast1001 in order to access stat1003 [operations/puppet] - https://gerrit.wikimedia.org/r/130115 (owner: Ottomata) [17:53:14] (PS1) Dzahn: change my UID to match labs [operations/puppet] - https://gerrit.wikimedia.org/r/130116 [17:54:15] !log deploying a new version of our Elasticsearch highlighter by doing a rolling restart on Elasticsearch machines - should cause no interruption of service [17:54:22] Logged the message, Master [17:55:05] chasemp:
disregard the statsite distraction then; it sounds like you're in a position to make good choices with respect to that setup so i'll move to something else [17:55:49] ori: https://gerrit.wikimedia.org/r/#/c/129930/ [17:57:04] mutante: paravoid was against changing imagescaler into a module iirc [17:57:46] moar statsd implementations! [17:57:49] ottomata1: my git-deploy didn't take! [17:58:06] I have to say, a statsd server based in C sounds appealing, even without knowing anything about it [17:58:10] I haven't restarted anything yet, but it is making tons of 4.0K just files [17:58:12] I guess that says a lot about my bias [17:58:25] next weekend [17:58:28] paravoid: only if its evented properly, I guess [17:58:30] i'm gonna write my own statsd [17:58:31] in asm [17:58:35] and force us to use it [17:58:44] ori: it's all good to me man, just didn't want to use cycles in parallel. I wish there was tigher integration between tickets and changesets to avoid that kind of toe stepping or make it better...bwahahah I said it [17:58:53] please write it in closure [17:59:31] ottomata1: I did it a second time and they all seem to have worked [18:00:31] hmmmm [18:00:33] is there something I can do to pybal to deactiviate a machine with a script? [18:00:43] like, that I can add to my elasticsearch restart script [18:01:16] paravoid: can you please enlighten me on your view of modularized imagescaler [18:01:51] because right now we're relying on retries in Cirrus to smooth over the downtime blips - but it'd be neat to just remove the server from the rotation while it is down [18:01:52] we're in a meeting [18:02:21] sorry, when you have time. [18:02:23] (CR) Andrew Bogott: [C: 2] Sync UIDS with ldap for Faidon and Alexandros [operations/puppet] - https://gerrit.wikimedia.org/r/130101 (owner: Andrew Bogott) [18:06:45] manybubbles: you can depool a machine with pybal [18:06:58] but it involves editing a file on...uh used to be fenari, maybe on iron now?
[18:07:05] ottomata: ah [18:07:12] I was hoping I could curl somethign [18:07:19] or scp a file or something [18:07:21] you can curl a tarball of openssh [18:07:23] :P [18:07:26] ha [18:07:28] yeah, i mean [18:07:29] http://noc.wikimedia.org/pybal/eqiad/search [18:07:31] its that file [18:07:34] and that file is just hosted somewhere [18:07:37] ah [18:07:39] and can be edited live [18:07:54] woudl be good if you could edit it [18:07:59] not sure who to ask...:) [18:08:14] I imagine you are the person I'd ask:) [18:08:46] it's fenari iirc [18:08:52] fenari:/home/wikipedia/conf/pybal/eqiad/search [18:09:05] yeah, but is it still fenari since pmtpa move? [18:09:29] edit the file, comment out a node, watch bytes in/out on ganglia :) [18:10:08] ottomata: feari is still alive [18:10:45] aye k [18:12:20] pybal still on fenari.. noc still on fenari.. apache config deploy still on fenari [18:12:25] for a little while longer [18:14:22] aye k [18:19:43] (CR) Dr0ptp4kt: "We are not waiting on that. QChris +1'd." [operations/puppet] - https://gerrit.wikimedia.org/r/129714 (owner: Dr0ptp4kt) [18:20:16] ^d: does gitblit returns 500 [18:21:38] * why return [18:22:07] ..to get to the other side?
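[editor's sketch] The "edit the file, comment out a node" step above is the kind of thing a restart script could do before taking a host down. What follows is only a minimal sketch, assuming the server list is plain text with one node per line and that a leading "#" disables a line; the host names, the sample line format and the function names are made up for illustration and are not taken from the real pybal file or tooling.

```python
def depool(lines, host):
    """Comment out every active line that mentions `host`.

    Matching is a plain substring test on the host name (an assumption
    about the file format). Returns (new_lines, number_of_lines_changed).
    """
    out, changed = [], 0
    for line in lines:
        if host in line and not line.lstrip().startswith("#"):
            out.append("#" + line)
            changed += 1
        else:
            out.append(line)
    return out, changed


def repool(lines, host):
    """Undo depool(): strip one leading '#' from lines that mention `host`."""
    out, changed = [], 0
    for line in lines:
        if line.startswith("#") and host in line:
            out.append(line[1:])
            changed += 1
        else:
            out.append(line)
    return out, changed


# Hypothetical sample content; the real file may look different.
SAMPLE = [
    "{ 'host': 'search-a.example', 'weight': 10, 'enabled': True }",
    "{ 'host': 'search-b.example', 'weight': 10, 'enabled': True }",
]

if __name__ == "__main__":
    pooled_out, n = depool(SAMPLE, "search-a.example")
    print(n, "line(s) commented out")
    print("\n".join(pooled_out))
```

A wrapper would call depool() before restarting a node and repool() afterwards, writing the result back to wherever the list is actually hosted; that write-back step is left out here because the log does not settle where the file lives.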
[18:22:51] :) [18:24:13] Wikimedia Platform operations, serious stuff | Log: http://bit.ly/wikisal | Channel logs: http://ur1.ca/edq22 | MediaWiki error counts: http://ur1.ca/edq1f | Requests: ops-requests@rt.wikimedia.org | on RT duty: Andrew Bogott [18:24:17] whoops [18:25:51] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [18:37:37] apergos: we can hear you just fine [18:37:41] You're super loud :) [18:39:49] !log Running deleteEqualMessages.php on abwiki (bug 43917) [18:39:55] Logged the message, Master [18:40:28] PROBLEM - Varnishkafka Delivery Errors on cp1038 is CRITICAL: Return code of 137 is out of bounds [18:40:28] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: Return code of 137 is out of bounds [18:40:28] PROBLEM - Varnishkafka Delivery Errors on cp4003 is CRITICAL: Return code of 137 is out of bounds [18:41:12] ok well that's weird [18:41:20] every time people have some different complaint [18:41:22] !log Jenkins disconnected lanthanum slave, killed all jenkins-slave process on it and repooled server. [18:41:29] Logged the message, Master [18:41:47] my mic is turned way down too [18:42:20] whatever, google has been tell me 'are you talking to people? unmute now' (except for then) when I am hardware muted so it's getting annoying [18:42:44] this is a new development, but it does seem that every time I'm on google voice there is some 'new development' [18:43:16] neon not talking to me [18:43:20] eh? [18:43:25] no ssh or icinga web ui [18:43:31] hm [18:43:43] oh wait... 
just slooooow [18:43:54] _joe|away https://github.com/jkrauska/check_graphite.py/blob/master/check_graphite.py [18:43:54] load average: 70.48, 60.17, 37.74 [18:43:56] hehe [18:44:06] PROBLEM - Varnishkafka Delivery Errors on amssq47 is CRITICAL: Return code of 137 is out of bounds [18:44:08] II'm on it [18:44:11] yeah slow I guess [18:44:24] on it = on th host, not on the error [18:44:39] RECOVERY - Varnishkafka Delivery Errors on amssq47 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:44:44] lots of max concurrent checks reached messages in log [18:44:46] RECOVERY - Varnishkafka Delivery Errors on cp4003 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:44:46] RECOVERY - Varnishkafka Delivery Errors on cp1038 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:44:55] akosiaris jinxed it [18:45:14] swap [18:45:34] heh [18:45:46] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:46:14] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+eqiad&h=neon.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [18:48:16] !log Running deleteEqualMessages.php on afwikiquote (bug 43917) [18:48:22] Logged the message, Master [18:50:50] <_joe|away> cajoel: seen it, it does not really do what we want to do with graphite-to-nagios feedback loop [18:51:10] <_joe|away> cajoel: thanks for the pointer anyway! [18:51:22] I've love to learn more about what you did.. (I wrote that simple one..) [18:51:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 1.69491525424% of data exceeded the critical threshold [500.0] [18:52:43] <_joe|away> cajoel: welll take a look here: https://github.com/lavagetto/nagios-plugins/blob/master/check_graphite.py [18:53:35] <_joe|away> cajoel: I can go deeper in that after the meeting [18:53:50] _joe|away cool - thx [18:58:09] (PS1) Dr0ptp4kt: Remove OM tagging for 635-10.
[operations/puppet] - https://gerrit.wikimedia.org/r/130121 [18:59:29] bblack, when you have a moment, ^^ your review and +2 with merge and deploy would be appreciated [19:01:07] !log Running deleteEqualMessages.php on bat-smgwiki (bug 43917) [19:01:14] Logged the message, Master [19:01:24] (PS1) Springle: Enable TokuDB for dbstore100[12]. [operations/puppet] - https://gerrit.wikimedia.org/r/130122 [19:01:45] <_joe_> springle: toku is *cool* [19:02:52] (CR) BBlack: [C: 2 V: 2] Remove OM tagging for 635-10. [operations/puppet] - https://gerrit.wikimedia.org/r/130121 (owner: Dr0ptp4kt) [19:03:27] <_joe_> springle: does mariadb come with toku incorporated? if not, how do you manage its inclusion with puppet? [19:04:31] _joe_: built the test mariadb 10 packages from source. the plugin is built by default [19:05:38] (PS2) Springle: Enable TokuDB for dbstore100[12]. [operations/puppet] - https://gerrit.wikimedia.org/r/130122 [19:05:43] <_joe_> springle: oh ok :) [19:06:11] (CR) Tim Landscheidt: [C: -1] "I'm an idiot. This associates the hostnames with the external IP, not the internal. *argl*" [operations/puppet] - https://gerrit.wikimedia.org/r/123149 (https://bugzilla.wikimedia.org/54052) (owner: Tim Landscheidt) [19:06:25] the default tokudb compression results in a disk footprint about 30% of equivalent innodb, for shards 1-7 [19:06:30] crazy [19:06:58] innodb with zlib compression can't get close [19:07:27] DDL is much faster too, and partly online [19:07:38] <_joe_> springle: seen that in my previous gig - toku is incredible [19:07:49] have to see how concurrency goes for our traffic [19:08:05] no hot backup without $$$ htough :/ [19:08:25] <_joe_> cajoel: It's a bit late here in Italy - maybe we can speak better about this another time?
Basically, I think any nagios plugin should have 3 steps: 1) fetching the data 2) parsing them in classes of data 3) presenting the result and fire an alarm [19:08:56] <_joe_> so I wanted to have something that clearly separated the three steps and that will allow us to create almost any type of alarms from graphite in a pluggable way [19:10:05] <_joe_> for example, it's relatively easy to add a check that works on the change speed of graphite variables with my check, or add forecasting techniques other than HoltWinters (which is what graphite offers by default). [19:10:18] <_joe_> now I'm really going to dinner :) [19:10:24] dinner! yes [19:10:26] _joe_: beside lack of hot backup, tokudb would also rule out clustering tech like galera. hoping they decide to add support for wsrep [19:11:09] (CR) Springle: [C: 2] Enable TokuDB for dbstore100[12]. [operations/puppet] - https://gerrit.wikimedia.org/r/130122 (owner: Springle) [19:11:13] <_joe_> springle: no hot backup is their strategy to make you pay [19:11:21] yep [19:11:23] fair enough [19:11:26] (PS2) Andrew Bogott: change my UID to match labs [operations/puppet] - https://gerrit.wikimedia.org/r/130116 (owner: Dzahn) [19:11:27] but won't fly here [19:14:25] (CR) Andrew Bogott: [C: 2] change my UID to match labs [operations/puppet] - https://gerrit.wikimedia.org/r/130116 (owner: Dzahn) [19:14:37] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [19:19:54] (PS1) Ori.livneh: Unset $wgUseXVO [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130132 [19:21:06] (PS1) Faidon Liambotis: appserver: set PHP's expose_php to Off [operations/puppet] - https://gerrit.wikimedia.org/r/130135 [19:21:09] ori: ^ [19:21:25] we're ambushing all the headers!
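[editor's sketch] The three steps _joe_ describes above (fetch, classify, present) can be kept apart roughly like this. It is a minimal sketch, not the check_graphite.py plugin linked earlier: the function names, the injected `source` callable and the 1% cut-off are assumptions, and the message wording only imitates the tungsten alerts seen in this log. The exit codes 0/1/2/3 are the standard Nagios OK/WARNING/CRITICAL/UNKNOWN values.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes


def fetch(source):
    """Step 1: get raw datapoints. `source` is any zero-argument callable
    (hypothetical; a real check would query graphite here)."""
    return source()


def evaluate(points, warn, crit):
    """Step 2: classify. Returns the percentage of non-null datapoints
    above each threshold, or None when there is nothing to judge."""
    values = [p for p in points if p is not None]
    if not values:
        return None
    total = len(values)
    return {
        "warn_pct": 100.0 * sum(1 for v in values if v > warn) / total,
        "crit_pct": 100.0 * sum(1 for v in values if v > crit) / total,
    }


def present(result, max_pct=1.0):
    """Step 3: turn the classification into (exit_code, message)."""
    if result is None:
        return UNKNOWN, "UNKNOWN: no valid datapoints"
    if result["crit_pct"] > max_pct:
        return CRITICAL, "CRITICAL: %.2f%% of data exceeded the critical threshold" % result["crit_pct"]
    if result["warn_pct"] > max_pct:
        return WARNING, "WARNING: %.2f%% of data exceeded the warning threshold" % result["warn_pct"]
    return OK, "OK: Less than %g%% data above the threshold" % max_pct


def run_check(source, warn, crit, max_pct=1.0):
    """Chain the three steps; each one can be swapped independently."""
    return present(evaluate(fetch(source), warn, crit), max_pct)


if __name__ == "__main__":
    code, message = run_check(lambda: [10, 20, None, 600], warn=250.0, crit=500.0)
    print(message)
    raise SystemExit(code)
```

Because evaluate() only sees a list of numbers, a rate-of-change or forecasting check of the kind mentioned above would replace that one function and leave fetching and presentation alone.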
[19:22:06] (CR) Ori.livneh: [C: 1] appserver: set PHP's expose_php to Off [operations/puppet] - https://gerrit.wikimedia.org/r/130135 (owner: Faidon Liambotis) [19:22:11] manybubbles: elsatic1004? [19:22:28] !log Running deleteEqualMessages.php on rowiktionary (bug 43917) [19:22:28] (CR) Faidon Liambotis: [C: 1] ":)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130132 (owner: Ori.livneh) [19:22:34] Logged the message, Master [19:22:38] ottomata: ? is it blowing up on you? [19:22:40] we should had cross +2 them [19:22:46] :P [19:23:06] ottomata: I just restarted it [19:23:15] (CR) Ori.livneh: [C: 2] Unset $wgUseXVO [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130132 (owner: Ori.livneh) [19:23:27] ok cool, just saw icinga upset [19:23:39] (CR) Krinkle: "Hm.. original file creation powered by Windows CRLF?" [operations/puppet] - https://gerrit.wikimedia.org/r/130135 (owner: Faidon Liambotis) [19:23:43] am I missing those or has it not made it to irc?
[19:23:47] Krinkle: apparently [19:24:13] (Merged) jenkins-bot: Unset $wgUseXVO [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130132 (owner: Ori.livneh) [19:24:36] (PS1) Krinkle: appserver: Convert CRLF to LF in php.ini [operations/puppet] - https://gerrit.wikimedia.org/r/130139 [19:24:45] that's where I saw it [19:24:45] fair enough :) [19:24:49] here [19:24:59] !log ori updated /a/common to {{Gerrit|I5e0709ef0}}: Unset $wgUseXVO [19:25:01] oh damn, this breaks my commit :) [19:25:03] icinga-wm [19:25:03] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [19:25:03] 3:14 [19:25:07] Logged the message, Master [19:25:07] (CR) Krinkle: [C: 1] appserver: set PHP's expose_php to Off [operations/puppet] - https://gerrit.wikimedia.org/r/130135 (owner: Faidon Liambotis) [19:25:28] (CR) Faidon Liambotis: [C: 2 V: 2] appserver: Convert CRLF to LF in php.ini [operations/puppet] - https://gerrit.wikimedia.org/r/130139 (owner: Krinkle) [19:25:41] !log ori synchronized wmf-config/InitialiseSettings.php 'I5e0709ef0: Unset $wgUseXVO' [19:25:48] Logged the message, Master [19:26:35] wtf [19:27:01] Status Submitted, Merge Pending [19:27:06] what is that supposed to mean? [19:27:19] it doesn't merge [19:27:33] ^d: here? [19:28:14] <_joe_> paravoid: you +2'd before jenkins verified [19:28:17] springle: so, there's missing data [19:28:23] yes, I know [19:28:38] jenkins doesn't know how to verify php.ini so I wanted to save some time [19:28:40] <_joe_> just hit 'submit patch 1'? [19:28:43] I do that often :) [19:28:47] I already did that [19:28:53] that's why it says "Submitted, Merge Pending" [19:28:55] <_joe_> oh, ok. [19:29:23] (Abandoned) Rush: debian packaging directory with our details [operations/debs/ircd-ratbox] - https://gerrit.wikimedia.org/r/129835 (owner: Rush) [19:29:26] ori: oh, you did the sync check with logs?
[19:29:53] ^d: when you get back, I have a patchset in limbo for you, https://gerrit.wikimedia.org/r/#/c/130139/ [19:30:05] ^d: submit does nothing, it doesn't merge [19:30:12] !log Running deleteEqualMessages.php on simplewiki (bug 43917) [19:30:18] Logged the message, Master [19:30:46] hmm [19:30:51] it's dependent isn't merged [19:30:59] (CR) Faidon Liambotis: [C: 2] appserver: set PHP's expose_php to Off [operations/puppet] - https://gerrit.wikimedia.org/r/130135 (owner: Faidon Liambotis) [19:31:12] ah there we go [19:31:14] ^d: nevermind :) [19:31:47] springle: not comprehensively, no. (had to sign off unexpectedly last night.) but on db1048: "select count(*) from WikimediaBlogVisit_5308166 where `timestamp` >= 20140425000000 and `timestamp` < 20140426000000;" returns 0 [19:32:22] springle: whereas on vanadium "zgrep /var/log/eventlogging/archive/WikimediaBlog client-side-events.log-20140426.gz | head" shows events [19:32:34] er, paste fail [19:32:37] (PS1) Rush: change .gitreview for debian branch [operations/debs/ircd-ratbox] (debian) - https://gerrit.wikimedia.org/r/130142 [19:33:00] "zgrep WikimediaBlog /var/log/eventlogging/archive/client-side-events.log-20140426.gz | head" even [19:34:23] springle: but i'm not sure exactly how things went down. IIRC i told you the uuid column was unique, and my entire understanding of the migration process was predicated on that, and it turns out to be false [19:35:10] that's a much bigger chunk missing than I would have expected from the uuid/id mixup though [19:35:14] entire day [19:35:18] seems odd [19:36:03] bblack, thx for review [19:36:05] <_joe_> uhm cant reach icinga [19:36:21] <_joe_> somebody knows what's up? [19:36:45] <_joe_> I can't ssh into neon either [19:37:38] same here, ping but no ssh. anyone on console? [19:38:03] <_joe_> jgage: I'm getting to that [19:38:06] springle: db1047 is also atrociously slow in responding to simple queries.
'select * from WikimediaBlogVisit_5308166 limit 1;' takes anywhere from 5 to 10 seconds to produce a result [19:38:07] * jgage looks around for mgmt passwd [19:38:11] it was swapping earlier. apergos was looking into it? (maybe?) [19:38:15] <_joe_> give me 4 mins [19:38:45] it seemed to be recovering [19:38:50] ori: that's the federated tables. db1046 is just about back up, after which we can switch back to replication [19:39:05] not db1047, neon [19:39:19] (CR) Jkrauska: [C: 1] "yes please" [operations/puppet] - https://gerrit.wikimedia.org/r/130062 (owner: Dzahn) [19:40:01] <_joe_> apergos: what's up with neon? [19:40:07] (CR) Jkrauska: [C: 1] "yep" [operations/puppet] - https://gerrit.wikimedia.org/r/130060 (owner: Dzahn) [19:40:16] <_joe_> I can't get into console either... [19:40:46] springle: kk. so i guess the remaining TODO for me is to have a consumer load missing data into db1048 by reading from the file logs. any advice on how to do that, since there's no unique key? if not select exists .. insert ? [19:40:48] * _joe_ votes for a power cycle [19:41:09] _joe_ sounds reasonable to me. where did you find the mgmt passwd? [19:41:21] <_joe_> jgage: can you manage this? it 10 PM here, so if not strictly necessary I'd bail out [19:41:40] ori: that would be fine if it's single threaded. otherwise, can we make the uuid fields properly unique? [19:41:41] _joe_ sure [19:41:46] ori: and not null [19:42:01] can anyone tell me where to find the mgmt passwd? [19:43:02] * jgage switches channels [19:43:33] springle: i'd prefer to do that (make the field properly unique). yesterday we discussed a more ambitious plan to make the uuid the primary key as well, but i forget what the verdict was re: just making uuid unique + not null. iirc, it was: will probably hurt performance a little, but not significantly [19:43:36] is that right? [19:43:50] ori: correct.
i think do it [19:43:54] uh, it was ok when I got off [19:44:02] (PS2) Rush: change .gitreview for debian branch [operations/debs/ircd-ratbox] (debian) - https://gerrit.wikimedia.org/r/130142 [19:44:04] (PS1) Rush: debian packaging directory with our details [operations/debs/ircd-ratbox] (debian) - https://gerrit.wikimedia.org/r/130145 [19:44:21] ori: i can run it through the online schema change script [19:45:10] springle: that would be awesome. i can update eventlogging's code to make sure any new tables created from now on make the uuid unique [19:45:47] can I ask someone else to pick up neon? (this is the third time I have put my dad off a sip call) [19:46:08] ori: sounds good [19:46:10] <_joe_> apergos: what was hogging the memory? [19:46:31] <_joe_> apergos: it's me and gage trying to troubleshoot this but I cannot get into console [19:46:37] I couldn't tell by the time I got on, it was already clearing up [19:46:56] cajoel: I've played with conntrackd previously; it was dead easy to set up [19:47:06] ok lemme tell him to try again :-D [19:47:07] sec [19:48:13] so you can't get in on mgmt either, _joe_? [19:48:35] <_joe_> apergos: no [19:49:13] yay ganglia is unhappy now too [19:50:31] (CR) Rush: [C: 2] "this is the existing code that was running on ekrem, I am making it master branch and doing a self +2...god help me" [operations/debs/ircd-ratbox] - https://gerrit.wikimedia.org/r/129832 (owner: Rush) [19:52:23] (PS1) Mattflaschen: Change format of GettingStarted config to use namespace prefix [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130147 [19:53:53] paravoid: did you try some live cutover testing? (how effective is it?)
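[editor's sketch] The plan agreed above (make uuid UNIQUE and NOT NULL, then reload the missing events from the file logs) makes the reload idempotent: rows already present are skipped by the constraint, so no "select exists .. insert" round trip is needed. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for the real MySQL hosts; the table name, columns and function names are hypothetical. On MySQL the corresponding statement would be INSERT IGNORE rather than SQLite's INSERT OR IGNORE.

```python
import sqlite3


def make_table(conn):
    """Create a toy event table whose uuid column is unique and not null.
    The auto-increment id stays a storage artifact, as in the discussion."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " uuid TEXT NOT NULL UNIQUE,"
        " payload TEXT)"
    )


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


def backfill(conn, events):
    """Insert (uuid, payload) pairs, silently skipping uuids that already
    exist. Returns how many rows were actually added."""
    before = count_rows(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO events (uuid, payload) VALUES (?, ?)",
        events,
    )
    conn.commit()
    return count_rows(conn) - before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    make_table(conn)
    print(backfill(conn, [("a", "first"), ("b", "second")]))
    # Replaying an overlapping batch only adds the rows that were missing.
    print(backfill(conn, [("b", "second"), ("c", "third")]))
```

With the constraint in place the same log file can be replayed more than once, or by more than one consumer, without creating duplicates, which is the concern springle raises about the single-threaded alternative.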
[19:55:08] (CR) Mattflaschen: [C: 2] Change format of GettingStarted config to use namespace prefix [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130147 (owner: Mattflaschen) [19:56:18] (CR) Rush: [V: 2] "jenkins is nowhere to be found" [operations/debs/ircd-ratbox] - https://gerrit.wikimedia.org/r/129832 (owner: Rush) [19:56:20] (Merged) jenkins-bot: Change format of GettingStarted config to use namespace prefix [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130147 (owner: Mattflaschen) [19:57:46] (CR) Rush: [C: 2 V: 2] ".gitreview for master branch" [operations/debs/ircd-ratbox] - https://gerrit.wikimedia.org/r/129833 (owner: Rush) [19:58:52] !log mflaschen synchronized php-1.24wmf1/extensions/GettingStarted/ 'Sync GettingStarted for Growth team deploy' [19:58:58] Logged the message, Master [19:59:27] !log mflaschen synchronized php-1.24wmf2/extensions/GettingStarted/ 'Sync GettingStarted for Growth team deploy' [19:59:30] hashar: is this a good time for me to reassign your UID? You'll have to stay logged out of production for ~ an hour. [19:59:34] Logged the message, Master [19:59:45] greg-g: What SWAT's aren't done if I'm not around? [20:00:03] andrewbogott: sure :] [20:00:13] andrewbogott: are you changing my production UID? [20:00:15] hoo: is the question "why do I need to be around for a SWAT deploy?" [20:00:16] akosiaris: jsduck verified. Works great. [20:00:20] hashar: yep, labs won't be affected. [20:00:26] greg-g: Yep, those were simple things [20:00:26] andrewbogott: great.
[20:00:42] andrewbogott: I get accounts on tin fenari gallium and lanthanum [20:00:49] !log mflaschen synchronized wmf-config/InitialiseSettings.php 'Update GettingStarted config for new format' [20:00:51] If they were more complex, I would have created an own window and did it mysefl [20:00:55] Logged the message, Master [20:01:01] hoo: simple is never easily defined, and thus it is best if the requesting developer, or someone familiar with the request, can be around to deal with anything/any questions that might arise [20:01:14] aud.e would have been around [20:01:40] then next time put her name down? :) [20:02:20] !log deployed Parsoid cab9348e using deploy 9e9030d [20:02:24] Logged the message, Master [20:02:56] greg-g, done. [20:03:26] paravoid: akosiaris: jsduck verified in labs. Works great. Can be pushed to production as far as I'm concerned [20:03:34] (PS1) Andrew Bogott: Remove access for Daniel Bauer [operations/puppet] - https://gerrit.wikimedia.org/r/130151 [20:03:36] (PS1) Andrew Bogott: Change Hashar's UID to match labs. [operations/puppet] - https://gerrit.wikimedia.org/r/130152 [20:04:11] (CR) Hashar: [C: 1] "Thank you!" [operations/puppet] - https://gerrit.wikimedia.org/r/130152 (owner: Andrew Bogott) [20:04:43] superm401: cool [20:06:46] greg-g: rescheduled my stuff... but tomorrow I'm for sure not around, and I'm not sure aude will be [20:07:19] mh, there's another SWAT today, but it's insanely late [20:07:26] (CR) Andrew Bogott: [C: 2] Remove access for Daniel Bauer [operations/puppet] - https://gerrit.wikimedia.org/r/130151 (owner: Andrew Bogott) [20:08:01] (CR) Andrew Bogott: [C: 2] Change Hashar's UID to match labs. [operations/puppet] - https://gerrit.wikimedia.org/r/130152 (owner: Andrew Bogott) [20:08:11] hoo: so what's going to be diffrent tomorrow?
[20:08:36] !log restarted gmetad on nickel [20:08:41] Logged the message, Master [20:08:50] greg-g: Tomorrow I *plan* to not be around (which is != to the unplanned headaches today) [20:09:25] Anyway, I'm ok with the late deploys, I guess... got enough stuff to do anyway [20:09:28] (PS2) Reedy: Add deployment-db2 as slave [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130092 [20:09:32] (CR) Reedy: [C: 2] Add deployment-db2 as slave [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130092 (owner: Reedy) [20:10:23] (CR) jenkins-bot: [V: -1] Add deployment-db2 as slave [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130092 (owner: Reedy) [20:10:35] hoo|away: what I mean is, tomorrow morning the same thing will happen if you're not around. [20:12:07] (CR) Reedy: [V: 2] "Unrelated test failures. Will fix seperately" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130092 (owner: Reedy) [20:12:33] !log reedy synchronized wmf-config/db-labs.php [20:12:39] Logged the message, Master [20:14:59] greg-g: hoo|away ? [20:16:31] * aude see https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=111187&oldid=111180 [20:16:49] tomorrow is ok [20:16:57] aude: the 15:00 SWAT deploy today included things from hoo|away, they were not done since hoo wasn't there, he scheduled for tomorrow at 15:00 which has the same problem :) [20:17:04] unless you'll be here tomorrow [20:17:46] we have some other wikidata backports [20:18:30] maybe i can prepare them for later today [20:19:15] springle: hey [20:19:28] bd808: Re. bug 53972, you're wonderful. :-) [20:19:31] ottomata: nfs1 slow cirrussearch alert? [20:19:51] paravoid, i thought I had disabled active checks of that (maybe I don't know what that does) [20:19:52] paravoid: hi :) [20:19:55] is nfs1 going away? [20:19:56] springle: what's with https://ishmael.wikimedia.org/sample/?host=db1052 ?
[20:20:02] was hoping it would just disappear.... [20:20:03] James_F: Save the hug until I get it to work. :) [20:20:11] ottomata: remove the check? [20:20:11] but i guess since the tampa move is mostly done for now [20:20:11] it won't? [20:20:19] bd808: How about you get two, one now, one on delivery? :-) [20:20:44] AaronSchulz: ishmael data collection was disabled recently due to causing stalls on slaves. haven't fixed it yet [20:20:45] in puppet...it's in the role that is used on fluorine [20:20:45] i really have no idea why mw-log udp2log instance is even on those nfs boxes [20:20:48] i guess I could remove it... [20:21:06] springle: (this might be a very silly question) I've noticed you use icinga's "silence notifications" rather than "acknowledge problem"; is there a reason for that? [20:21:40] paravoid, i'm finishing up some camus stuff, should be done with that in 15 mins, then will visit the CirrusSearch thing [20:21:57] k [20:22:24] paravoid: partly bad habit. i silence if i know something will blip during maintenance. i ack/schedule if it's already blipped... sometimes [20:22:43] ok [20:22:53] do you mind if I acknowledge alerts that you've previously silenced?
[20:23:00] I've already done so a few times [20:23:14] np [20:23:19] I'm OCDing a bit over Icinga's "All Unhandled Problems" [20:23:22] I like it to be empty [20:23:33] managed to do so, then the hackathon came, then sdtpa, it's all a mess again :) [20:23:53] :) [20:24:40] <_joe_> paravoid: I tend to have the same paranoid attention towards alerts, but for now I'm slow [20:25:58] (PS8) Ottomata: Running update-server-info for submodules [operations/puppet] - https://gerrit.wikimedia.org/r/126846 [20:27:44] (PS9) Ottomata: Running update-server-info for submodules [operations/puppet] - https://gerrit.wikimedia.org/r/126846 [20:30:41] (CR) jenkins-bot: [V: -1] Running update-server-info for submodules [operations/puppet] - https://gerrit.wikimedia.org/r/126846 (owner: Ottomata) [20:31:09] ottomata: what's with the elasticsearch alerts? [20:31:20] is it manybubbles' rolling upgrade? [20:34:04] paravoid: as far as I know, I pinged him before about it and he said he knows about it [20:34:15] but he's not online...hm [20:34:18] manybubbles: [20:34:19] oh, is he? [20:34:25] oh, yes he is, sorry [20:38:05] !log apache-graceful-all after tuning php.ini's expose_php setting [20:38:10] Logged the message, Master [20:38:27] (PS2) BryanDavis: [WIP] Provision scap scripts using trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/129814 [20:38:29] (PS1) BryanDavis: Add scap/scap trebuchet target [operations/puppet] - https://gerrit.wikimedia.org/r/130211 [20:39:42] Commit titles are so cool. [20:40:01] I often notice I only understand one or two words of them. [20:41:32] (PS1) Andrew Bogott: Rename abaso to dr0ptp4kt and change his UID to match labs ldap. [operations/puppet] - https://gerrit.wikimedia.org/r/130212 [20:41:40] dr0ptp4kt: for your review ^ [20:41:47] heh, I kinda preferred abaso :) [20:41:53] no offence :) [20:42:00] (not vetoing this or anything) [20:42:04] "Remove unused uses" is a very good example.
[20:42:11] paravoid: it's rolling upgrades, yeah. [20:42:32] twkozlowski: "Add gibberish" [20:42:35] did i mute that bot somehow? [20:42:37] * bd808 changes his shell name to 1337h4|<3R [20:42:48] paravoid: me too, but I'm more confident of my ability to change a username in production than in labs :) [20:43:15] actually in labs it should be pretty easy [20:43:26] greg-g: I browse through a lot of commits every week, maybe I should start documenting the best of them somewhere. [20:43:35] just change the ldap attribute [20:43:41] twkozlowski: :) :) [20:43:48] that's the unix shell name, right? [20:43:51] not the username [20:43:55] so it should be pretty trivial [20:43:59] (famous last words) [20:44:14] paravoid: I think shared storage would get scrambled, probably need some hand-tuning there. [20:44:29] hmm [20:44:30] right [20:44:31] andrewbogott, paravoid, i'm okay with whatever you guys decide. just lemme know if i need to do anything, um, fun, with regard to gerrit or elsewhere [20:44:32] But yeah, it should be possible, I just haven't thought about it much. [20:45:32] (PS3) BryanDavis: [WIP] Provision scap scripts using trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/129814 [20:46:02] (CR) Andrew Bogott: [C: 2] Rename abaso to dr0ptp4kt and change his UID to match labs ldap.
[operations/puppet] - https://gerrit.wikimedia.org/r/130212 (owner: Andrew Bogott) [20:46:33] (PS1) Ottomata: Importing eventlogging logs into HDFS via Kafka and Camus [operations/puppet] - https://gerrit.wikimedia.org/r/130216 [20:47:22] (PS2) Ottomata: Importing eventlogging logs into HDFS via Kafka and Camus [operations/puppet] - https://gerrit.wikimedia.org/r/130216 [20:47:39] (PS3) Ottomata: Importing eventlogging logs into HDFS via Kafka and Camus [operations/puppet] - https://gerrit.wikimedia.org/r/130216 [20:47:39] Running deleteEqualMessages.php on sqwiki (bug 43917) [20:47:50] !log Running deleteEqualMessages.php on sqwiki (bug 43917) [20:47:55] Logged the message, Master [20:48:30] Nemo_bis: Hm.. I see lots of wikis have makebot*/makesysop* translations still [20:48:49] Nemo_bis: Can you maybe look into that? Will probably need gs to delete those. [20:49:36] I can't do that myself. And the script can't either since there is no knowledge of those messages (they're not detected as "equal", they're completely out of the system by now, so they're just local custom messages as if they're gadgets basically) [20:49:49] They're ancient. [20:50:58] (CR) Ottomata: [C: 2 V: 2] Importing eventlogging logs into HDFS via Kafka and Camus [operations/puppet] - https://gerrit.wikimedia.org/r/130216 (owner: Ottomata) [20:51:52] Anyone know who Gregory Maxwell is? He has an account defined in admins.pp but it's not included anywhere. And, no email attached to either his labs or prod account. [20:52:17] yeah [20:52:18] maybe in wikipedia-en-admins or wikipedia-en channel? [20:52:25] gmaxwell [20:52:44] I can get you an email if you want [20:52:47] he's a former volunteer dev [20:52:59] from aeons ago [20:53:00] ...and network admin as well [20:53:56] yeah, ex juniper, now at Mozilla [20:53:57] greg-g, an email would be great, thank you. [20:54:03] greg-g: Oh, found it!
[20:54:07] "Use new banana checker as a linter to avoid lacking qqq messages" [20:54:19] My favourite <3 [20:55:18] I meant wikimedia network admin [20:55:24] um, andrewbogott are there weird puppet user account things happening right now? [20:55:26] he still has access to junipers as well, andrewbogott [20:55:37] i'm getting Could not find class accounts::abaso on analytics nodes right now [20:55:46] he is also included on stat1002, but i'm not getting that error there... [20:55:49] ottomata: yes, I'm in the process of renaming adam [20:55:52] ottomata: he's renaming him [20:55:55] AH [20:56:00] hm [20:56:02] paravoid: gotcha, well, he has good credentials then ;) [20:56:08] eta? puppet is broken :/ [20:56:17] ottomata: I'm surprised that's breaking puppet though. Let me look. [20:56:18] andrewbogott: if he responds something along the lines of "I've lost my key" or "who cares about wikimedia" (doubtful), lemme know so I can remove his network access as well [20:56:29] paravoid: yep, ok :) [20:56:31] doubtful on both of those, actually ;) [20:56:34] Could not find class accounts::abaso for analytics1026.eqiad.wmnet at /etc/puppet/manifests/role/analytics.pp:46 [20:56:45] ottomata: Must be his account is included someplace that I didn't look... [20:56:55] ah, that's it... [20:57:34] * andrewbogott is surprised to see all those accounts handled outside of admins.pp [20:58:31] (PS1) Andrew Bogott: Rename abaso to dr0ptp4kt in a couple other places. [operations/puppet] - https://gerrit.wikimedia.org/r/130218 [20:58:56] andrewbogott: afaik there are accounts:: in many places :( [20:59:10] yeah, I'll remember to grep next time I do a rename [20:59:21] oh my that's what he wants his shell name to be [20:59:22] haha [21:00:22] ah, I will have to rename his hdfs home dir too :/ [21:00:35] can I merge that andrewbogott? [21:00:50] ottomata: yep, as soon as Jenkins approves. [21:00:52] (CR) Ottomata: [C: 2] Rename abaso to dr0ptp4kt in a couple other places.
[operations/puppet] - https://gerrit.wikimedia.org/r/130218 (owner: Andrew Bogott) [21:00:58] i see jenkins! [21:01:03] (CR) Ottomata: [V: 2] Rename abaso to dr0ptp4kt in a couple other places. [operations/puppet] - https://gerrit.wikimedia.org/r/130218 (owner: Andrew Bogott) [21:01:04] :) [21:01:05] thanks [21:01:24] ottomata: I'm going to do a salt 'mv /home/abaso /home/dr0ptp4kt/' after puppet catches up. [21:01:29] And chmods and such. [21:02:04] ok danke [21:02:20] ottomata, any bets on how long until someone deletes my account? ZOMG it's a h4x0r [21:02:41] hah [21:03:31] dr0ptp4kt: due to my missing that last bit, your changeover is delayed by a bit. I'll do the last bits in about half an hour. [21:03:45] andrewbogott: cool [21:05:17] ori, should I be able to access current EventLogging data from s1-analytics-slave.eqiad.wmnet [21:05:37] Right now, MySQL is hanging when I ask it to give me a single row of an EL table from there (LIMIT 1). [21:05:52] Wondering if there are any known issues. /cc springle [21:09:07] Got: [21:09:09] ERROR 2013 (HY000): Lost connection to MySQL server during query [21:09:30] andrewbogott, do you know of any issues with s1-analytics-slave.eqiad.wmnet ? [21:09:40] superm401: nope [21:09:50] mutante: cmjohnson1, either of you know what's up with nfs1? [21:09:55] i see there is a ticket for a decom [21:10:19] will it be decomed?
[21:10:25] i know you guys moved it to 12th floor [21:12:16] mu tante is gone by now I hope (11 pm his time) [21:12:27] nfs1 will live until 12th is shut down [21:12:35] the decom ticket, I think all hosts on 12 have one [21:13:01] ottomata: [21:13:28] to be more accurate, it will live til we no longer need it (all dependencies on it are removed) [21:14:46] ok, i am going down a long chain of git commits trying to find out why this instance is even on nfs1 [21:14:51] i think that it was on nfs1 originally [21:14:59] but then fluorine came around and someone added it there [21:15:03] but didn't remove it from the nfs* boxes [21:15:07] ...i'm going to be bold and remove it. [21:17:00] (PS1) Ottomata: Removing mw-log udp2log instance from nfs1 - this now lives on fluorine [operations/puppet] - https://gerrit.wikimedia.org/r/130222 [21:17:14] (PS2) Ottomata: Removing mw-log udp2log instance from nfs1 - this now lives on fluorine [operations/puppet] - https://gerrit.wikimedia.org/r/130222 [21:18:04] (CR) Ottomata: [C: 2 V: 2] Removing mw-log udp2log instance from nfs1 - this now lives on fluorine [operations/puppet] - https://gerrit.wikimedia.org/r/130222 (owner: Ottomata) [21:23:48] sweet! [21:24:17] (PS1) Ottomata: Fixing quieting of logster job [operations/puppet] - https://gerrit.wikimedia.org/r/130223 [21:25:53] (PS2) Ottomata: Fixing quieting of logster job [operations/puppet] - https://gerrit.wikimedia.org/r/130223 [21:26:13] (CR) Ottomata: [C: 2 V: 2] Fixing quieting of logster job [operations/puppet] - https://gerrit.wikimedia.org/r/130223 (owner: Ottomata) [21:27:55] springle: hey [21:28:26] springle: icinga bot is down for some reason, but the web page says db1047 down for 17m now [21:32:24] finally [21:35:52] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2032: active_shards: 6035: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:35:52] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [21:36:56] (PS1) Ottomata: Changing Camus job name [operations/puppet] - https://gerrit.wikimedia.org/r/130225 [21:37:19] (CR) Ottomata: [C: 2 V: 2] Changing Camus job name [operations/puppet] - https://gerrit.wikimedia.org/r/130225 (owner: Ottomata) [21:50:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1% data above the threshold [250.0] [21:54:52] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2032: active_shards: 6035: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:55:31] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [22:04:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 1.69491525424% of data exceeded the critical threshold [500.0] [22:05:31] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2032: active_shards: 6035: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:05:31] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2032: active_shards: 6035: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:05:41] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [22:07:39] * AaronSchulz looks at http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path [22:07:49] paravoid: why do we need our own LRU daemon again? [22:08:39] AaronSchulz: that feature md5s the URL, so you can't reconstruct it to purge it [22:09:10] ah, that gets to my other question...how the purging would actually be efficient (having looked at the hash scheme) [22:12:09] (CR) Gage: [C: 2] add 'rhenium' (netflow box) to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/130060 (owner: Dzahn) [22:14:48] hrm i did +2 but I hit "publish comments" instead of "publish and submit" and puppet-merge doesn't see my change. what else do i need to do?
[22:15:13] hit "submit" [22:15:32] hm i have no such button [22:15:39] Submit Patchset 1 [22:15:52] it's gray [22:15:53] https://gerrit.wikimedia.org/r/#/c/130060/ [22:16:07] Need Rebase or Has Dependency [22:16:12] arr, ok [22:16:18] right below the reviewers [22:16:22] hit the "rebase" button [22:16:33] Also "Can Merge No" [22:16:36] if you're lucky and that works, it should be easy [22:16:41] (PS3) Gage: add 'rhenium' (netflow box) to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/130060 (owner: Dzahn) [22:17:23] (CR) Gage: [C: 2 V: 2] add 'rhenium' (netflow box) to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/130060 (owner: Dzahn) [22:17:37] great, thanks faidon [22:17:42] i mean paravoid [22:17:48] both work :) [22:17:52] np [22:18:59] (PS2) Gage: apply role::pmacct on node rhenium [operations/puppet] - https://gerrit.wikimedia.org/r/130062 (owner: Dzahn) [22:19:00] springle: ping [22:19:33] (CR) Gage: [C: 2 V: 2] apply role::pmacct on node rhenium [operations/puppet] - https://gerrit.wikimedia.org/r/130062 (owner: Dzahn) [22:19:51] https://bugzilla.wikimedia.org/show_bug.cgi?id=64573 [22:19:54] greg-g: ^' [22:19:59] I texted superm401 just in case [22:21:21] StevenW, back [22:22:20] (PS1) Reedy: Update unit tests to drop pmtpa [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130227 [22:22:47] That's really weird, looking at it now. [22:23:11] superm401 / StevenW revert if needed, plz [22:24:32] StevenW, WFM. Did you test on a page where you add a GS task assigned? [22:24:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=64573#c2 [22:24:50] https://en.wikipedia.org/w/index.php?title=User:Superm401/Sandbox&diff=606245196&oldid=605237279 [22:25:44] Okay, reverting, will comment more afterwards. [22:25:54] The last edit to WP:VPP was 19 minutes ago. [22:26:04] So weird that it's only certain pages. [22:26:32] what's up?
[22:26:49] I see the bug and I see superm401 is reverting; do you need anything? [22:26:55] paravoid, don't think so, in progress. [22:27:16] k, thanks. I'll stick around for the next 10-15', let me know if anything comes up [22:27:50] does neon.eqiad.wmnet not resolving from the iron host seem odd to anyone [22:27:57] jgage or chasemp: db1047 seems down and people are already complaining but springle is not responsive, right now. can you have a look? [22:28:06] chasemp: it's neon.wikimedia.org [22:28:16] sure but it has no internal dns? [22:28:21] no, why would it? [22:28:27] !log mflaschen synchronized php-1.24wmf1/extensions/GettingStarted/ 'Revert token/TrackedPageContentSaveComplete GettingStarted change' [22:28:35] Logged the message, Master [22:28:38] internal dns is for hosts under the internal ip space (10/8) [22:28:40] !log mflaschen synchronized php-1.24wmf2/extensions/GettingStarted/ 'Revert token/TrackedPageContentSaveComplete GettingStarted change' [22:28:46] Logged the message, Master [22:28:48] hosts with a public IP just have a wikimedia.org hostname [22:28:52] Revert done. [22:29:19] ok, working that out, it's just unusual (from my perspective) no big deal just getting used to it I guess [22:29:26] Confirmed [22:29:53] on db1047, if I can't hit it over ssh I have no idea what to do, can someone sidebar w/ me on console access? [22:30:08] (PS1) Jkrauska: Add jkrauska to rhenium [operations/puppet] - https://gerrit.wikimedia.org/r/130231 [22:30:09] hi, was afk. i'll connect to db1047 on console. [22:30:09] login to db1047.mgmt.eqiad.wmnet using the mgmt password [22:30:17] then it's iDRAC [22:30:21] unless chasemp beats me to it :) [22:30:29] I'm about to go to bed, hence me dropping this on you guys :) [22:30:33] jgage: can you loop me in on this so i can observe? [22:30:40] more suitable timezone and everything [22:30:47] paravoid, no problem [22:30:50] have a good night [22:30:50] chase, sure, but how?
[22:31:01] hangout can screen share [22:31:02] ? [22:31:12] huh really [22:31:19] what was that thing that's like screen but for sharing [22:31:35] is it terminal? [22:32:08] yes [22:32:10] totally trying to call you on hangout as we speak [22:32:16] dtach can do terminal sessions sharing [22:32:34] I usually use that I wasn't sure if this was some crappy UI thing like they usually are [22:32:39] MaxSem: heh, I forgot to submodule update before doing cmake ;) [22:34:12] thanks superm401 (paravoid, you are relieved of duty ;) ) [22:34:20] :) [22:34:26] good night! [22:34:29] g'night! [22:34:43] (CR) Jkrauska: "Comes from https://rt.wikimedia.org/Ticket/Display.html?id=7368" [operations/puppet] - https://gerrit.wikimedia.org/r/130231 (owner: Jkrauska) [22:36:41] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2032: active_shards: 6035: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:37:55] MaxSem: yeah so I tweaked the includes slightly and swapped the allocator used...I think that's it [22:38:02] of course you can diff the files to be sure [22:38:19] (PS1) Gergő Tisza: Enable MediaViewer survey on Spanish Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130234 [22:39:09] also added some casts to get rid of c++ warnings [22:43:51] !log rebooting db1047 due to unpingable and unresponsive on mgmt console [22:43:57] Logged the message, Master [22:47:32] /dev/sda1 has gone 576 days without being checked, check forced. [22:47:47] is this the right time to talk about regularly scheduled reboots?
:) [22:47:52] RECOVERY - puppet disabled on db1047 is OK: OK [22:48:01] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [22:48:31] RECOVERY - check configured eth on db1047 is OK: NRPE: Unable to read output [22:48:32] RECOVERY - check if dhclient is running on db1047 is OK: PROCS OK: 0 processes with command name dhclient [22:48:41] RECOVERY - RAID on db1047 is OK: OK: optimal, 3 logical, 6 physical [22:48:41] RECOVERY - SSH on db1047 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [22:49:57] ok well if it wasn't clear from icinga, db1047 is back :) [22:52:20] (PS1) BryanDavis: Scap: remove mergeCdbFileUpdates symlink [operations/puppet] - https://gerrit.wikimedia.org/r/130238 [22:52:22] (PS1) BryanDavis: Scap: Remove mergeCdbFileUpdates symlink [operations/puppet] - https://gerrit.wikimedia.org/r/130239 [22:52:58] (PS2) BryanDavis: Scap: purge mergeCdbFileUpdates symlink [operations/puppet] - https://gerrit.wikimedia.org/r/130238 [22:53:47] (PS2) BryanDavis: Scap: Remove mergeCdbFileUpdates symlink [operations/puppet] - https://gerrit.wikimedia.org/r/130239 [22:54:49] ori: ^ Trivial puppet cleanup for scap changes from last week. [22:55:29] One to ensure=absent and a followup to drop the file management entirely.
[23:03:49] (PS1) Jkrauska: Add jkrauska to rhenium using an admin group [operations/puppet] - https://gerrit.wikimedia.org/r/130241 [23:03:51] (CR) Ori.livneh: [C: 2] Scap: purge mergeCdbFileUpdates symlink [operations/puppet] - https://gerrit.wikimedia.org/r/130238 (owner: BryanDavis) [23:04:02] (CR) Ori.livneh: [C: 2] Scap: Remove mergeCdbFileUpdates symlink [operations/puppet] - https://gerrit.wikimedia.org/r/130239 (owner: BryanDavis) [23:19:12] bd808: ran puppet on tin [23:20:15] ori: I see a dangling /usr/local/bin/mergeCdbFileUpdates there :( [23:20:55] I also see a dangling /usr/local/bin/scappy from times gone by :/ [23:21:47] bd808: i'll clear those manually [23:21:55] cool beans [23:26:18] !log aaron synchronized php-1.24wmf2/maintenance/runJobs.php '91dddcaffa58430204e2bf3c612d893b2710f33b' [23:26:24] Logged the message, Master [23:26:40] * hoo wonders about today's SWAT [23:26:57] greg-g, did anyone take swat today? [23:27:41] ori, RoanKattouw, ^? [23:27:56] Nope [23:28:20] I probably should have looked but I totally forgot [23:28:42] duh: https://gerrit.wikimedia.org/r/129640 that needs to be merged and a core deploy branch patch needs to be prepared [23:28:47] for some reason; I find 4 o'clock a ridiculously hard time to do things [23:28:49] but I can do it [23:29:10] hoo: yes but I can't merge on wmf/* [23:29:16] duh: Oh really? [23:29:29] I just have +1/-1 [23:29:36] duh: +2ed [23:29:40] now do the rest :P [23:30:45] (CR) Mwalker: [C: 2] Remove Wikidata as an importsource for testwikidatawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129707 (owner: John F.
Lewis) [23:31:26] (CR) Mwalker: [C: 2] Add two languages not supported by MediaWiki to wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129708 (https://bugzilla.wikimedia.org/59905) (owner: Hoo man) [23:33:13] (Merged) jenkins-bot: Remove Wikidata as an importsource for testwikidatawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129707 (owner: John F. Lewis) [23:37:08] (PS2) BBlack: Deter keeprefreshing.com noise. [operations/puppet] - https://gerrit.wikimedia.org/r/129714 (owner: Dr0ptp4kt) [23:38:02] gerrit seems to be moving very slowly... but it is running tests [23:38:09] *jenkins seems to be moving slowly [23:38:36] and by moving slowly; I mean that it's not using all of its available build threads [23:39:46] mwalker: reduce consumption mannnnn [23:40:23] yay, new languages! [23:40:33] thanks hoo and mwalker [23:40:35] aude: Oh, you're still around :) [23:40:53] done with time parsers for tonight :) [23:41:04] we'll have some things for swat tomorrow [23:41:34] aude: Probably not going to be around at the 5pm slot [23:41:47] thanks to our users finding a few issues on test.wikidata during the weekend [23:41:54] hoo: not a problem [23:42:11] :) [23:43:09] (CR) BBlack: [C: 2 V: 2] Deter keeprefreshing.com noise. [operations/puppet] - https://gerrit.wikimedia.org/r/129714 (owner: Dr0ptp4kt) [23:43:15] bblack thx [23:43:24] np [23:44:16] oh well... I'll need a power adapter for the Zürich Hackathon :/ [23:45:24] AaronSchulz, it looks like I'm going to be deploying https://gerrit.wikimedia.org/r/#/c/130246/ -- is that OK?
[23:47:05] sure, though I thought I synced that already [23:48:30] (PS1) Ori.livneh: mobile varnishes: enable GeoIP [operations/puppet] - https://gerrit.wikimedia.org/r/130256 [23:48:36] bblack: ^ [23:49:00] !log mwalker Started scap: SWAT for {{gerrit|129813}}, {{gerrit|129640}}, {{gerrit|129708}}, {{gerrit|129707}}, and {{gerrit|130246}} [23:49:07] Logged the message, Master [23:53:25] (Abandoned) Aude: Re-connect test.wikipedia to wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/124321 (https://bugzilla.wikimedia.org/63619) (owner: Aude) [23:56:17] AaronSchulz, you had actually deployed it; git log was just being interesting [23:56:41] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka [23:56:57] that's me [23:57:54] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka ori.livneh Ori doing maintenance [23:57:58] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka ori.livneh Ori doing maintenance