[00:00:55] matanya: ^ [00:01:15] mutante: you are great, thanks for chasing this down [00:01:18] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.484 second response time [00:02:10] !log reedy synchronized wmf-config/ [00:02:11] !log restarting gitblit on antimony [00:02:17] Logged the message, Master [00:02:25] Logged the message, Master [00:02:44] (PS1) Reedy: Fix arrays for $wgContactConfig [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116005 [00:03:03] (CR) Reedy: [C: 2] Fix arrays for $wgContactConfig [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116005 (owner: Reedy) [00:03:10] (Merged) jenkins-bot: Fix arrays for $wgContactConfig [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116005 (owner: Reedy) [00:03:18] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 365717 bytes in 9.096 second response time [00:06:02] !log By what can only be described as kicking-down-the-door-style deployment, mwalker and I managed to deploy four FundraisingChart Jenkins jobs after about 15 tries each. [00:06:10] Logged the message, Master [00:06:14] Life is fun. [00:06:56] (PS3) Reedy: Disable and remove ContactPageFundraiser [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110292 [00:08:50] (PS4) Reedy: Disable and remove ContactPageFundraiser [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110292 [00:10:38] Hrm [00:10:52] There are mass -1s coming from jenkins, is the accepted fix to restart zuul? [00:11:44] (Abandoned) Reedy: Remove 1.18 back compat [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/93116 (owner: Reedy) [00:14:30] Oh, no, never mind [00:22:09] (PS1) BryanDavis: Additional ssh key for Bryan Davis. 
[operations/puppet] - https://gerrit.wikimedia.org/r/116014 [00:23:53] (CR) BryanDavis: "Public SSH key listed on https://office.wikimedia.org/wiki/User:BDavis_(WMF) for verification of ownership." [operations/puppet] - https://gerrit.wikimedia.org/r/116014 (owner: BryanDavis) [00:26:09] (PS2) Reedy: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592 [00:26:23] (CR) jenkins-bot: [V: -1] Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592 (owner: Reedy) [00:28:20] mutante: Any reason why we don't have a bastion only user group, so that people who only need specific hosts only get access to the bastions and the hosts they actually need? [00:28:48] hoo: we do [00:28:55] admins::restricted [00:29:56] mutante: Admins::restricted also you to jump onto terbium and have root on the DBs... and to view the logs, ... that's not what I call a bastion only thing :P [00:29:57] (CR) Dzahn: [C: 2] Remove scap-recompile [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/109950 (owner: Reedy) [00:30:11] s/also/allows/ [00:30:11] (CR) Dzahn: [V: 2] Remove scap-recompile [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/109950 (owner: Reedy) [00:30:25] (PS3) Reedy: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592 [00:31:10] hoo: i don't know about root on DBs, are you sure [00:31:41] mutante: Yep, run mysql_root_pass [00:31:41] or mysql_root_password or so [00:31:57] hoo: dunno, should ping springle-away about that [00:32:09] (PS4) Reedy: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592 [00:32:34] mutante: mortals have a use for root on the MySQL's, but people only wanting to 
upload eg. releases (like Markus) don't :P [00:32:59] hoo: you don't need to convince me, sounds reasonable [00:33:25] :) [00:33:42] that's something that was added today, right [00:33:49] people being mw uploaders [00:33:51] and having shell [00:34:03] that's why I'd split admins::restricted into admins::bastion and the actual restricted [00:34:18] manybubbles|away: Done [00:34:19] likely a good idea, yea [00:34:28] Fail [00:35:24] springle-away: ^ re: mysql_root_pass [00:35:55] Jeff_Green: ^ re: mw-uploaders [00:35:58] hoo: ^ [00:35:59] :) [00:36:10] mutante: Ok, shall I create a (draft) patch? [00:36:29] yes, sure [00:37:20] mutante: Can you check when someone last logged in into the cluster? [00:37:32] hoo: define "the cluster" [00:37:37] a random mw machine? [00:37:56] (PS2) Reedy: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257 [00:38:09] (CR) jenkins-bot: [V: -1] Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257 (owner: Reedy) [00:38:25] mutante: That's pretty hard... like on all bastions? Eg. dab is probably inactive (former TS root) [00:38:25] hi springle [00:39:33] hoo: what do you want to find out? [00:40:04] just if he is still using it? [00:40:14] doesnt reply to just asking if he still needs it? [00:41:34] (PS3) Reedy: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257 [00:44:48] mutante: I just wonder whether he still uses the access (eg. whether TS uses his account to pull data... which would be pretty bad, but you never know) [00:44:48] mh... fenari is both a bastion and has private data... bad [00:44:49] mutante: Yep [00:44:49] Is there another tampa bastion? [00:45:14] We could also do that, I suppose :P He's not on IRC atm though... 
[00:45:14] mutante: Is there another tampa bastion (one that doesn't have private data like fenari) [00:45:31] (Abandoned) Reedy: Make sync-dblist report done, don't echo mediawiki-installation [operations/puppet] - https://gerrit.wikimedia.org/r/110092 (owner: Reedy) [00:46:35] hoo: no, don't know, we already use eqiad [00:47:05] If I search for "equinix ashburn" in Google Maps, the buildings on Filigree Ct. seem to be in the direct approach path to runway 19C. Very odd choice. [00:47:30] or wait... you can jump from bast1001 to tampa also, I guess... so not really needed [00:48:14] (PS2) Reedy: Remove query.php from filters. query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535 [00:49:01] (PS1) Hoo man: Introduce an admins::bastion user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019 [00:50:48] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [00:52:46] (PS1) Dzahn: ganglia, pdf1 is dead, monitor pdf2/3 not pdf1/2 [operations/puppet] - https://gerrit.wikimedia.org/r/116020 [00:54:18] (CR) Dzahn: [C: 2] ganglia, pdf1 is dead, monitor pdf2/3 not pdf1/2 [operations/puppet] - https://gerrit.wikimedia.org/r/116020 (owner: Dzahn) [00:57:25] (PS1) Dzahn: decom pdf1,remove from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116022 [00:59:35] (CR) Hoo man: [C: 1] Remove query.php from filters. 
query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535 (owner: Reedy) [00:59:39] (PS2) Dzahn: decom pdf1,remove from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116022 [01:04:47] !log pdf1 - disable monitoring - downtime until ∞ [01:04:57] Logged the message, Master [01:05:16] (CR) Dzahn: [C: 2] decom pdf1,remove from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116022 (owner: Dzahn) [01:06:37] mutante: https://gerrit.wikimedia.org/r/116019 [01:07:24] mutante: Ok, so I guess this wont matter much anyway... they probably only need bast1001 and the release host [01:07:25] we probably have more users in there who only need some specific host(s) [01:07:26] springle: hey, around? [01:08:49] hoo: am now [01:09:19] (reading log) [01:09:26] springle: We were just talking about mysql_root_pass [01:09:44] hoo: RT 5612 [01:09:55] as springle just mentioned correctly [01:10:18] ah ok, fine then [01:11:16] hoo: re: the question if dab is still active.. Jan 12 [01:11:24] springle: Just out of interest... what's the current process for schema updates? Wikitech only has information from 1954 on that... :P Do you use like the percona online schema change thing? [01:11:45] pt-online-schema-change, I mean [01:12:38] yes, mostly pt-online-schema-change unless it's something horrible like adding a PK [01:12:50] heh :) [01:12:54] !log pdf1 - revoke puppet cert, kill from stored configs,... [01:13:04] sometimes also just depool a slave, alter, repool [01:13:04] Logged the message, Master [01:14:41] springle: Ok... what about say a drop table? depool, drop, repool? ... innodb_file_per_table ... [01:16:17] that would break replication. incremental delete in master, check replag, then drop [01:17:18] hoo: what about innodb_file_per_table? it's on some hosts, not yet all. 
[01:17:34] springle: The actual drop table time depends highly on it [01:17:46] hence incremental delete ;) [01:17:46] * highly depends on it [01:18:04] ah, ok [01:18:22] if it is definitely out of service, then can drop on individual machines [01:19:02] springle: Yeah... like the "cur" table... unused since 2007 (or so), new wikis don't even have it, the data in there is damn unuseful... [01:19:37] there are a few tables that need dropping [01:20:35] not high prioity. some we keep just on principle. anything _old or _delete should go [01:21:43] hoo: mutante: who needs to review https://gerrit.wikimedia.org/r/#/c/116019 ? [01:21:53] jeff? [01:23:10] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:15] springle: Yeah... that was my main idea... IMO mortals should still have all the access the current restricted groups have (incl. the MySQL bits), but many of them just don't need it [01:23:35] * many of the restricted users [01:28:00] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.147 second response time [01:30:04] * springle_ stabs irc [01:31:35] hoo: interesting test of drop table recently on an s5 slave with: innodb_file_per_table, partitioned `revision`, ~48G buffer pool. no appreciable stall. it might be worse on the new slaves coming which will have 96G buffer pool [01:34:29] springle_: Nice... drop tables are some of the less unfunny schema changes because of that [01:36:59] one can always delay drop table. look, we do it since 2007 :) [01:38:20] heh... stale tables aren't harming :) [01:38:52] springle_: One last question ... :P How much ram do the master boxes currently have? 
[01:39:11] (I guess they have at least as much as the slaves) [01:40:16] hoo: Look in ganglia [01:40:52] enwiki master: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=MySQL+eqiad&h=db1052.eqiad.wmnet&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [01:41:33] heh [01:41:39] oh, right :P [01:42:31] hoo: db1051-db1060 are 96G, others are 64G. do you can see from dbtree [01:43:43] slaves should have same or more resources than master. we break this rule where 96G box is master and one or more 64G slaves are in the shard. but not a perfect world [01:45:41] I see :) [01:47:39] (PS1) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 [01:47:54] (CR) jenkins-bot: [V: -1] Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 (owner: Reedy) [01:48:53] (PS2) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 [01:49:02] (CR) jenkins-bot: [V: -1] Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 (owner: Reedy) [01:50:59] (PS3) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 [01:51:10] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:17] hoo: first three 128G boxes will arrive soon. i expect they will all be S1 slaves. the 96G boxes will be prioritised among S1, S2, and S4; not entirely sure how yet [01:51:40] :) [01:52:40] I guess you saw our growth plans for wikidata (and also the plans to make the data queriable)... at some point we will probably surpass s1 in resource usage [01:52:54] oh wikidata makes me cry [01:53:13] it will need it's own shard someday [01:53:56] Yeah... we'll probably need to throw hardware at the problem... 
let's see how fast it grows [01:54:20] (PS4) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 [01:54:32] (PS4) Dzahn: removed pdf1, decom [operations/dns] - https://gerrit.wikimedia.org/r/115581 (owner: Matanya) [01:55:08] (CR) Dzahn: [C: 2] removed pdf1, decom [operations/dns] - https://gerrit.wikimedia.org/r/115581 (owner: Matanya) [01:55:29] !log DNS update - removing pdf1 [01:56:00] wb_terms is the wikidata hotspot for slow queries [01:56:43] i tried a partitioned wb_terms on hash term_language recently; it had some improvement, but data is heavily skewed to certain languages [01:57:10] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:57:25] (CR) Reedy: [C: -1] Remove db and job queue pmtpa files (6 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 (owner: Reedy) [01:59:08] Yeah, that table is awry [02:01:13] one interesting mysql 5.7 optimization which i hope mariadb will get: Index Condition Pushdown for partitioned tables. we could really work with that. [02:01:24] (PS5) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 [02:03:03] (PS6) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 [02:03:44] (CR) Dr0ptp4kt: "Shoot, I think we can only remove the specific 'set req.http.X-ZeroTLS = "1";' for a carrier, not remove the vcl_deliver. That said, can y" [operations/puppet] - https://gerrit.wikimedia.org/r/115669 (owner: BBlack) [02:06:44] hoo: fun http://bugs.mysql.com/bug.php?id=51325 and http://bugs.mysql.com/bug.php?id=69316 ... 5.5 ftw :) [02:10:38] springle_: Yeah... 
but that's only much of an issue if the table has actually been accessed [02:10:45] at least it should, I guess [02:11:14] (PS1) Andrew Bogott: Turn down respawn limit for manage-volumes and manage-nfs-volumes. [operations/puppet] - https://gerrit.wikimedia.org/r/116038 [02:11:23] mariadb 5.5.36 is out now too.. [02:18:01] wikitech is down again [02:23:00] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [02:23:00] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [02:23:20] PROBLEM - Certificate expiration on virt1000 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed [02:23:50] !log reedy synchronized php-1.23wmf16/includes/htmlform/HTMLFormField.php 'I2741ef940d83eeb564e89e20378fb4004cfe5b83' [02:24:00] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.000 second response time on port 389 [02:24:00] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.929 second response time [02:24:00] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636 [02:24:08] !log restarted apache on virt0 [02:24:34] * Reedy wonders why wikitech usually breaks when springle_ is around [02:25:09] * springle_ looks shifty [02:26:00] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.714 second response time [02:26:06] Logged the message, Master [02:26:14] !log rebuilt ldap indexes on virt1000 [02:26:26] Logged the message, Master [02:26:29] springle_: it's an ldap issue I think [02:26:36] I've been looking at it [02:26:45] (possibly fixed already, hard to tell) [02:27:05] andrewbogott: should i not have touched apache? [02:27:15] springle_: it's fine, harmless [02:27:34] it hit MaxClients, as usual [02:28:17] man, ldap is still going bananas. I wonder who/what is hitting us so hard? 
[02:28:29] It's almost certainly self-inflicted but I've already stopped the usual suspects [02:34:52] !log LocalisationUpdate completed (1.23wmf15) at 2014-02-28 02:34:35+00:00 [02:34:58] Logged the message, Master [02:50:21] Reedy: Can you run rebuildInterwiki please? [02:57:08] *sigh* fix s1 contributions-page slow queries and what crawls back to the top of the slow list? good old LogPager [03:05:29] !log LocalisationUpdate completed (1.23wmf16) at 2014-02-28 03:05:28+00:00 [03:05:37] Logged the message, Master [03:06:24] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [03:06:32] Logged the message, Master [03:06:49] thanks, Reedy [03:06:58] !log reedy updated /a/common to {{Gerrit|Icb3159198}}: Fix arrays for $wgContactConfig [03:07:02] (PS1) Reedy: Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 [03:07:06] Logged the message, Master [03:07:26] (CR) Reedy: [C: 2] Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 (owner: Reedy) [03:07:30] (CR) Hoo man: [C: 2] Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 (owner: Reedy) [03:07:33] (Merged) jenkins-bot: Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 (owner: Reedy) [03:07:42] +4 :D [03:14:30] (PS1) coren: Labs: Disable labs LVM until image fixes [operations/puppet] - https://gerrit.wikimedia.org/r/116045 [03:16:28] (CR) coren: [C: 2] "Annoying, but necessary." [operations/puppet] - https://gerrit.wikimedia.org/r/116045 (owner: coren) [03:43:34] (PS1) coren: vmbuilder: tweak partition sizes [operations/puppet] - https://gerrit.wikimedia.org/r/116048 [03:45:13] (CR) coren: [C: 2] "Trivial tweak" [operations/puppet] - https://gerrit.wikimedia.org/r/116048 (owner: coren) [03:45:33] * Coren rebuilds the image. 
[03:50:40] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-28 03:50:40+00:00 [03:50:49] Logged the message, Master [03:51:40] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [04:01:35] (PS1) coren: Labs: Some tweaks and saner defaults for labs LVM [operations/puppet] - https://gerrit.wikimedia.org/r/116050 [04:05:25] (PS1) Andrew Bogott: Fixed a typo in the 'nova stop' phase. [operations/puppet] - https://gerrit.wikimedia.org/r/116051 [04:06:29] heya mark, you there? qq. is there a not-much-used or non-critical node in esams that Snaps and I can run rdkafka_performance tests from? [04:06:40] we want to see if we can reproduce this intermittent problem [04:08:23] (PS2) Andrew Bogott: Fixed a typo in the 'nova stop' phase. [operations/puppet] - https://gerrit.wikimedia.org/r/116051 [04:10:49] (CR) Andrew Bogott: [C: 2] Fixed a typo in the 'nova stop' phase. [operations/puppet] - https://gerrit.wikimedia.org/r/116051 (owner: Andrew Bogott) [04:19:17] (PS8) Dzahn: turn wikistats into module [operations/puppet] - https://gerrit.wikimedia.org/r/94409 [04:22:35] (CR) Dzahn: "PS8: removed the entire SSL config stuff, because now this is behind a proxy that does SSL termination anyways" [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [04:52:13] (CR) Dzahn: [C: 2] turn wikistats into module [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [04:54:03] (CR) Dzahn: "no worries, just used on labs instance and nothing special in here" [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [04:55:36] (CR) Aude: Introduce an admins::bastion user group (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man) [04:58:45] (CR) Dzahn: "@RobiH, it's merged and a module now, also doesn't need public IP anymore and doesn't 
need to use star SSL certs, it's behind proxy, next " [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [05:02:25] (CR) Dzahn: "some extra changes in there that seem unintended and you'd have to actually use this on bast1001 but +1 for the idea" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man) [05:16:11] (PS1) Dzahn: enhanced comments re UIDs and key verification [operations/puppet] - https://gerrit.wikimedia.org/r/116055 [05:16:42] (CR) Andrew Bogott: [C: 2] Turn down respawn limit for manage-volumes and manage-nfs-volumes. [operations/puppet] - https://gerrit.wikimedia.org/r/116038 (owner: Andrew Bogott) [05:18:14] (CR) Dzahn: [C: 2] enhanced comments re UIDs and key verification [operations/puppet] - https://gerrit.wikimedia.org/r/116055 (owner: Dzahn) [05:31:08] (PS2) Dzahn: Introduce an admins::bastion user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man) [05:31:48] (PS3) Hoo man: Introduce an admins::bastion user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019 [05:32:15] (PS1) TTO: Allow more upload file types for sewikimedia sysops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116057 [05:32:51] (CR) Dzahn: [C: 1] "rebased, fixed tabs and stuff" [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man) [05:44:49] (PS1) TTO: Remove useless "confirmed" permission assignments [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116059 [05:51:05] (PS1) Dzahn: add README.md for wikistats module [operations/puppet] - https://gerrit.wikimedia.org/r/116061 [05:53:12] (CR) Dzahn: [C: 2] add README.md for wikistats module [operations/puppet] - https://gerrit.wikimedia.org/r/116061 (owner: Dzahn) [06:27:40] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, 
have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [06:29:40] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active [06:52:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [07:00:11] (PS1) Dzahn: WIP - turn RT from misc/* into puppet module [operations/puppet] - https://gerrit.wikimedia.org/r/116064 [07:00:13] (CR) Matanya: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/74592 (owner: Reedy) [07:01:27] (CR) Matanya: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/81257 (owner: Reedy) [07:02:32] (CR) Dzahn: [C: -1] "just started, not done yet" [operations/puppet] - https://gerrit.wikimedia.org/r/116064 (owner: Dzahn) [07:09:49] (CR) Matanya: [C: 1] Remove query.php from filters. 
query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535 (owner: Reedy) [07:15:22] (PS1) Ryan Lane: Enable keystone redis driver and switch replication around [operations/puppet] - https://gerrit.wikimedia.org/r/116065 [07:18:45] (CR) Ryan Lane: [C: 2] Enable keystone redis driver and switch replication around [operations/puppet] - https://gerrit.wikimedia.org/r/116065 (owner: Ryan Lane) [07:23:40] (PS1) Ryan Lane: Use Token and not TokenNoList redis driver for folsom keystone [operations/puppet] - https://gerrit.wikimedia.org/r/116066 [07:27:56] lots of odd things in fluorine:/a/mw-log [07:28:15] yeah I saw that [07:28:25] happens every blue moon [07:28:44] dberror is gone [07:28:47] at least [07:30:03] (CR) Ryan Lane: [C: 2] Use Token and not TokenNoList redis driver for folsom keystone [operations/puppet] - https://gerrit.wikimedia.org/r/116066 (owner: Ryan Lane) [08:35:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:37:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:39:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:41:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:43:46] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:45:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:47:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:49:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:54:37] PROBLEM - Puppet freshness on mw1109 is 
CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:55:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:57:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [08:58:12] !log disabled puppet on labstore1001 to allow unattended file copies [08:58:20] Logged the message, Master [09:00:21] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC [09:01:02] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Fri Feb 28 09:01:00 UTC 2014 [09:02:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 09:01:00 AM UTC [09:30:47] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Fri Feb 28 09:30:46 UTC 2014 [09:32:27] !log restarting Jenkins, some jobs registration is broken :( [09:32:35] Logged the message, Master [09:39:14] !log Jenkins restarted. [09:39:23] Logged the message, Master [09:42:24] (CR) Mark Bergsma: "Let's just use $::mw_primary now. That should be (made) equal to 'eqiad' in production, and to $::site in labs." 
[operations/puppet] - https://gerrit.wikimedia.org/r/115910 (owner: Hashar) [09:47:26] (PS2) Hashar: beta: fix upload cache directors [operations/puppet] - https://gerrit.wikimedia.org/r/115910 [09:47:36] (CR) Hashar: "Done :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/115910 (owner: Hashar) [09:53:37] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [10:02:38] (PS3) Hashar: Use mw_primary for Varnish swift backends [operations/puppet] - https://gerrit.wikimedia.org/r/115910 [10:03:47] (CR) Mark Bergsma: [C: 2] Use mw_primary for Varnish swift backends [operations/puppet] - https://gerrit.wikimedia.org/r/115910 (owner: Hashar) [10:05:00] good [10:05:10] still broken but at least now I have a much better error message [10:42:39] (PS1) Andrew Bogott: Added even more error checking to dc-migrate. [operations/puppet] - https://gerrit.wikimedia.org/r/116071 [10:44:37] (CR) Andrew Bogott: [C: 2] Added even more error checking to dc-migrate. [operations/puppet] - https://gerrit.wikimedia.org/r/116071 (owner: Andrew Bogott) [10:52:54] (PS1) Andrew Bogott: Clear the puppet cert on the proper host. [operations/puppet] - https://gerrit.wikimedia.org/r/116073 [10:54:39] (CR) Andrew Bogott: [C: 2] Clear the puppet cert on the proper host. 
[operations/puppet] - https://gerrit.wikimedia.org/r/116073 (owner: Andrew Bogott) [11:29:42] (CR) Hashar: contint: webproxy for maven on CI production slaves (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar) [11:30:10] (PS3) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597 [11:30:28] (PS4) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597 [11:46:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [12:15:30] (CR) Alexandros Kosiaris: [C: -1] "1 minor comment, otherwise LGTM." [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar) [12:54:37] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [13:24:42] the logs on fluorine seem to have grown a number of files this morning [13:26:15] all the new log files that I've checked mention mobile [13:26:56] anyone from mobile on? [13:36:11] manybubbles: [13:36:25] I suspect mobilecontext somehow or other (I guess you do too) [13:36:49] I suspect it but don't really know what to do about it :) [13:36:51] thanks [13:37:03] yeah I think someone from the mobile team will have to take a look [13:37:36] are there new files since 6 or 7 am? [13:37:49] whose time zone? [13:37:58] fluorine's [13:38:02] ie utc [13:38:02] -rw-r--r-- 1 udp2log udp2log 200 Feb 28 11:07 7):.log [13:38:15] hm that's more recent. ok, looks like problem is ongoing still [13:38:16] oh cool, there is one named .log [13:38:23] hahaha nice [14:11:30] Would anyone object to me deploying a one-file fix to a regression in 1.23wmf15 in a few minutes (https://gerrit.wikimedia.org/r/116095)? 
[14:15:11] !log anomie synchronized php-1.23wmf15/includes/htmlform/HTMLFormField.php 'Backport fix for bug 61942 to 1.23wmf15 (reedy already did wmf16 last night)' [14:15:18] Logged the message, Master [14:15:34] thanks :) [14:22:05] hashar: Shouldn't tools/scap things rather go here than in -dev? [14:22:56] hoo: it is merely for platform engineering / devs use [14:23:02] hoo: I dont think we need it for operations [14:23:17] there is enough spam here I think :] [14:23:37] enough spam in both... it's moreover a DevOps thing, I guess (yay, Buzzword) [14:34:37] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 11:34:07 AM UTC [14:47:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [15:14:15] (PS1) Hashar: contint: pip shared cache on labs slaves [operations/puppet] - https://gerrit.wikimedia.org/r/116111 [15:14:40] (CR) Alexandros Kosiaris: [C: 2] "I would love a man page at some point. lintian is complaining already but you can add it at some point later in time. help2man might be us" [operations/debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/113018 (owner: Ottomata) [15:16:45] (CR) coren: contint: pip shared cache on labs slaves (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/116111 (owner: Hashar) [15:17:31] (PS3) Reedy: Remove query.php from filters. query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535 [15:17:51] (CR) Ottomata: [C: 2 V: 2] "Oo, ok thanks!" [operations/puppet] - https://gerrit.wikimedia.org/r/96535 (owner: Reedy) [15:22:15] (PS2) Hashar: contint: pip shared cache on labs slaves [operations/puppet] - https://gerrit.wikimedia.org/r/116111 [15:27:23] (CR) Ottomata: "Aye, I'd do that, except git-fat's help message is pretty useless." 
[operations/debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/113018 (owner: Ottomata)
[15:37:02] (CR) coren: [C: 2] "lgtm" [operations/puppet] - https://gerrit.wikimedia.org/r/116111 (owner: Hashar)
[15:38:24] (PS1) Ottomata: Adding elastic1013-elastic1016 to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116113
[15:38:36] (PS2) Ottomata: Adding elastic1013-elastic1016 to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116113
[15:39:01] manybubbles: just running puppet on these new servers with the same setup should be all I need to do, right?
[15:39:07] allocate the partition; run puppet;
[15:39:07] right?
[15:39:15] no!
[15:39:17] stop!
[15:39:31] ok !
[15:39:32] you'll have to install Elasticsearch 0.90.10 on them manually because puppet'll do 0.90.11
[15:39:33] i stopped!
[15:39:36] oh hm
[15:39:42] ok so I should apt-get that first
[15:39:52] because we pulled it into apt
[15:39:52] is the version manually specified in puppet?
[15:39:59] or does it just say ensure => 'installed'?
[15:40:00] ?
[15:40:03] no, it isn't manually specified
[15:40:10] ok cool
[15:40:12] logstash upgraded but I didn't take the upgrade
[15:40:19] so if we install 0.90.10
[15:40:21] then puppet should be ok?
[15:40:21] I should probably manually specify the version in puppet
[15:40:27] it should be ok, yeah
[15:40:41] ah, hm, i need to still run puppet at least once on these first, ok
[15:40:45] i will puppetize them without ES first
[15:40:52] ah
[15:41:05] If you wanted to manually specify the version that'd be fine with me
[15:41:08] I'll +1 it
[15:41:26] either way is fine
[15:41:34] but we'll upgrade to 1.0.1 in a week
[15:41:57] if the version is wrong in puppet, will puppet install the next version?
[15:42:08] I'm pretty sure I don't want that because I want to manually run the upgrades
[15:42:20] so I can do them one at a time when that is right and all at once when that is right
[15:42:21] (PS3) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113
[15:42:37] no, yeah
[15:42:41] i think => installed is fine
[15:42:55] (PS4) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113
[15:43:33] (CR) jenkins-bot: [V: -1] Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 (owner: Ottomata)
[15:44:59] (PS5) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113
[15:45:17] (PS6) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113
[15:50:27] (CR) Ottomata: [C: 2 V: 2] Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 (owner: Ottomata)
[15:55:37] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[15:57:48] ha, manybubbles
[15:57:55] we ahve both 0.9.11 and 0.9.7 in our apt
[15:58:01] http://apt.wikimedia.org/wikimedia/pool/universe/e/elasticsearch/
[16:04:11] mark, can you comment on this?
[16:04:12] https://rt.wikimedia.org/Ticket/Display.html?id=6948
[16:04:18] should I just remove the IPv6 addies from DNS?
[16:04:22] * mark checks
[16:04:32] (CR) Hashar: "@akosiaris your comment is missing :]" [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar)
[16:04:42] and I am off till monday! Have fun folks
[16:04:47] are they all on a fixed port?
[16:04:53] always 9092?
[16:04:58] kafka yes
[16:05:09] ok then I'll just open that up
[16:05:13] until we resolve the private link situation
[16:05:25] will the IPv6 addy be routed via internet or via dedicated link?
[16:05:45] (CR) Alexandros Kosiaris: contint: webproxy for maven on CI production slaves (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar)
[16:06:40] (CR) Hashar: contint: webproxy for maven on CI production slaves (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar)
[16:07:08] (PS5) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597
[16:07:13] ok, manybubbles
[16:07:26] ok?
[16:07:27] elasticsearch 0.9.10 is installed on elastic101[3-6]
[16:07:38] btw, the elasticsearch .deb package does not depend on java!
[16:07:44] sweet.
[16:07:45] it let me install it without having installed java :p
[16:07:46] stupid
[16:07:52] so dumb
[16:08:00] just said it couldn't start up after installing, pssh
[16:08:00] heh
[16:08:06] anyway, that's fine
[16:08:10] ok, so that is installed, what next
[16:08:12] might want to get the same java version as is on the other machines too
[16:08:14] if possible
[16:08:16] now should I puppetize it?
[16:08:17] hm
[16:08:22] puppet should do that, right?
[16:08:28] I think it'll just install whatever
[16:08:33] don't think so
[16:08:36] rather, the openjdk
[16:08:42] but not the same package
[16:08:45] which would be helpful
[16:08:52] java version "1.7.0_25"
[16:08:53] OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
[16:08:53] OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
[16:08:56] that's elastic1003
[16:09:24] class elasticsearch::packages {
[16:09:24] package { [ 'openjdk-7-jdk', 'elasticsearch', 'curl' ]: }
[16:09:24] }
[16:09:29] s'ok, ja?
[16:09:35] ottomata: internet
[16:09:50] should work now
[16:09:54] hm, mark, I think we want to go over the link, that's what we'e been doing thus far anyway
[16:09:59] interesting.
[16:10:03] perhaps you should try both
[16:10:11] yeah, that's kinda why we were setting it up
[16:10:12] hm
[16:10:13] it will go over the dedicated link when we fix issues
[16:10:14] the trouble is
[16:10:31] the brokers themselves report the brokers that are assigned to partitions
[16:10:35] so, that is configured dynamically
[16:10:41] i'm pretty sure they report by hostname
[16:10:48] unless there is a way to make them report by IP
[16:10:49] what's the problem with over the internet?
[16:11:04] well, nothing I suppose, except we haven't done it yet and we are trying to be scientific here!
[16:11:07] to many variables!
[16:11:08] too*
[16:11:28] in the end it boils down to the same really
[16:11:39] Snaps: do you know if it is possible to make brokers report IPs instead of hostnames?
[16:11:47] on metadata request for a toppar?
[16:11:47] dont think so
[16:11:50] hm
[16:11:57] mark, its the same?
[16:12:08] internet would be higher latency, no?
[16:12:11] no
[16:12:14] it's going over the same equipment
[16:12:16] but I can filter on AF in rdkafka if that helps
[16:12:34] we have control of the routes over the internet between DCs?
[16:12:34] ottomata: so long as ew get 1.7.0_25 on the new ones too
[16:12:35] well it's probably going over some other network but that's just random
[16:12:39] then we're good
[16:12:42] ottomata: we do
[16:12:48] more than we have control over the route that private link takes
[16:12:53] hahah ok
[16:13:11] if the private link were more stable we'd use it for that also
[16:13:13] but right now it isn't
[16:13:35] ahhhHHHHH ok ok ok
[16:13:43] i geuss this is fine, but it messes with my science experiment!
[16:13:48] yes it does :)
[16:14:01] you can remove the ipv6 addresses from dns then
[16:14:04] up to you
[16:14:19] hmm, ok, i think I will collect some perf data with Snaps like this
[16:14:22] and then maybe do that too
[16:14:25] thank you!
[16:14:29] yw
[16:14:54] oh, mark, is this setup the same on other remote DCs?
[16:14:56] ulsfo?
[16:15:00] no
[16:15:05] ulsfo has two links
[16:15:11] they will be using the same IPs
[16:15:23] for brokers
[16:15:29] ulsfo to eqiad is always over private links
[16:15:48] ok, but IPv6 :9092 is allowed?
[16:16:01] ah guess, so, everything is allowed via private link?
[16:16:37] hm, manybubbles, openjdk-7 installed java version "1.7.0_51" on these new nodes
[16:16:51] can you switch it? :( sorry for the trouble!
[16:17:07] 0_51 is actually known not to work well with lucene/elasticsearch
[16:18:30] ottomata: we only filter across the internet
[16:18:36] on our private links, everything is allowed
[16:19:00] ok cool, danke
[16:19:06] someone just fixed rancid?
[16:25:18] manybubbles: i am not sure how, trying to see if I can do that
[16:28:35] PROBLEM - DPKG on elastic1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:29:35] RECOVERY - DPKG on elastic1013 is OK: All packages OK
[16:37:34] PROBLEM - DPKG on elastic1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:42:34] RECOVERY - DPKG on elastic1013 is OK: All packages OK
[16:48:04] PROBLEM - DPKG on elastic1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:50:04] RECOVERY - DPKG on elastic1015 is OK: All packages OK
[16:50:09] phew, manybubbles, i got it
[16:50:19] 1.7.0_25 on the new servers
[16:50:22] thanks!
[16:50:23] what next?
[16:50:26] you are wonderful
[16:50:38] uh, puppet them into the cluster, I guess
[16:50:42] oook
[16:50:54] I would like to head to lunch now so many merge puppet commit in an hour when I'm back?
[16:50:57] just in case?
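The Java mismatch discussed above (openjdk-7 pulling in 1.7.0_51 on the new nodes while the existing ones run 1.7.0_25) is the sort of thing a small check could catch before a node joins the cluster. What follows is only a minimal Python sketch, not anything used in the log: it assumes the `java -version` output has already been collected per host by some other means, and the function names and the host-to-output dictionary are hypothetical. The sample text is the elastic1003 paste from earlier in the conversation.

```python
import re

# Sample `java -version` output, copied from the elastic1003 paste in the log.
SAMPLE = '''java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)'''


def parse_java_version(output):
    """Return the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


def mismatched_hosts(outputs, expected):
    """Given {host: java -version output}, list hosts not on the expected version.

    `outputs` is a hypothetical dictionary; how it gets filled is left open.
    """
    return sorted(
        host for host, out in outputs.items()
        if parse_java_version(out) != expected
    )
```

For example, calling `mismatched_hosts` with one host on the sample output and another on a 1.7.0_51 variant, and `expected="1.7.0_25"`, would list only the second host.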
[16:51:03] yeah let's do it then
[16:51:07] i want to work with nuria on some stuff
[16:51:22] ottomata: cool. can you reploy to 1.0 upgrade email?
[16:52:00] reply about the deploy, aka, reploy
[16:52:13] (PS6) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597
[16:52:16] hahah
[16:56:46] (CR) Alexandros Kosiaris: [C: 2] contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar)
[17:01:13] (CR) Matthias Mullie: [C: 1] "LGTM - to be deployed Mon 03/03" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112639 (owner: MZMcBride)
[17:13:42] where's the wikimedia freenode irc server?
[17:13:53] or what's it's name, and I'll figure out where it is.. :)
[17:14:53] cajoel: Hm? There is no Wikimedia Freenode IRC server.
[17:15:03] Unless I am missing something over the past few days :p
[17:16:31] JohnLewis: cajoel: http://irc.netsplit.de/servers/dickson.freenode.net/
[17:17:00] thx
[17:17:50] hoo: I never knew that xD
[17:18:00] it's new
[17:18:11] greg-g: Fair enough then.
[17:18:18] like, couple months-ish
[17:18:25] Also; when's the wedding again? :D
[17:18:48] JohnLewis: tomorrow :)
[17:18:51] * greg-g is in Michigan
[17:19:53] greg-g: Have a good day then and tell me what it was like :p I lovely daily installments of the Greg Grossmeier show ;)
[17:20:07] So far: Cold.
[17:20:33] It's cold here anyway - Granted I'm in a different contintent.
[17:21:12] *Continent
[17:25:52] superm401: btw, thanks for the [[wikitech:Deployments]] fix. I would have just click "Thanks" on the edit history, but we don't have that extension :)
[17:26:11] greg-g, no problem.
:)
[17:27:01] greg-g: Manage it to be released then :D
[17:27:41] :P
[17:35:34] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 11:34:07 AM UTC
[17:39:35] (CR) Alexandros Kosiaris: "Sure" [operations/debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/113018 (owner: Ottomata)
[17:47:54] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 318 seconds
[17:48:04] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 309 seconds
[17:48:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC
[17:48:54] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 93 seconds
[17:49:05] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 77 seconds
[17:52:26] (PS1) Alexandros Kosiaris: Some minor cleanups in osmlabsdb.cfg [operations/puppet] - https://gerrit.wikimedia.org/r/116122
[17:52:49] (CR) Alexandros Kosiaris: [C: 2 V: 2] Some minor cleanups in osmlabsdb.cfg [operations/puppet] - https://gerrit.wikimedia.org/r/116122 (owner: Alexandros Kosiaris)
[17:54:24] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Fri Feb 28 17:54:20 UTC 2014
[17:57:27] (CR) Gergő Tisza: Remove ArticleFeedbackv5 from Wikimedia wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112639 (owner: MZMcBride)
[17:58:50] (PS1) Jgreen: remove deprecated fundraising development cnames [operations/dns] - https://gerrit.wikimedia.org/r/116126
[17:59:49] (CR) Daniel Kinzler: [C: 1] Redirect all *.wikidata.org subdomains to www.wikidata.org [operations/apache-config] - https://gerrit.wikimedia.org/r/113972 (owner: Thiemo Mättig (WMDE))
[18:01:03] (CR) Jgreen: [C: 2 V: 1] remove deprecated fundraising development cnames [operations/dns] - https://gerrit.wikimedia.org/r/116126 (owner: Jgreen)
[18:25:11] (PS1)
ArielGlenn: Kunal Mehta access to terbium/flow database, rt #6895 [operations/puppet] - https://gerrit.wikimedia.org/r/116133
[18:26:01] apergos: Yay, more tabs vs. spaces... :P
[18:26:08] xD
[18:26:10] (CR) jenkins-bot: [V: -1] Kunal Mehta access to terbium/flow database, rt #6895 [operations/puppet] - https://gerrit.wikimedia.org/r/116133 (owner: ArielGlenn)
[18:26:16] ah tradition
[18:26:25] woudln't have been right to get that the first try
[18:28:13] (PS2) ArielGlenn: Kunal Mehta access to terbium/flow database, rt #6895 [operations/puppet] - https://gerrit.wikimedia.org/r/116133
[18:29:05] hoo, yeah I'm not excited but the fix for that is for someone to do the tabs->space conversion on the entire file (and maybe no other fix on that round)
[18:29:28] :%s/\t/ /
[18:29:28] :P
[18:29:39] That's going to be a messy diff
[18:30:14] (CR) ArielGlenn: "please wait to merge til the user verifies that this is the account name he wants." [operations/puppet] - https://gerrit.wikimedia.org/r/116133 (owner: ArielGlenn)
[18:30:39] yes, which is why that should be the only thing in the diff
[18:30:57] First I want https://gerrit.wikimedia.org/r/116019 in
[18:32:14] (PS1) Alexandros Kosiaris: Include standard in role::osm [operations/puppet] - https://gerrit.wikimedia.org/r/116134
[18:33:54] !log Reloading zuul to deploy Ic3aeb6a5f13086b108
[18:34:01] Logged the message, Master
[18:34:37] greg-g: So um...there's a bug in MMV where if you open the lightbox, then close it, and try to e.g. type in the search bar...nothing happens
[18:34:57] greg-g: tgr tells me he's working on a fix but we'd sort of like to push it out today...would that be OK?
[18:35:13] (CR) Alexandros Kosiaris: [C: 2] Include standard in role::osm [operations/puppet] - https://gerrit.wikimedia.org/r/116134 (owner: Alexandros Kosiaris)
[18:37:39] (PS2) Aaron Schulz: Removed unused SwiftCloudFiles extension [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115959
[18:38:44] Reedy: ^
[18:41:54] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[18:44:03] manybubbles|away: lemme know when you are back and wanna bring this up
[18:44:06] these*
[18:44:11] ottomata: ready
[18:44:30] oO
[18:44:33] OK!
[18:45:46] (PS1) Ottomata: Adding elastic101[3-6] to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116138
[18:46:02] (PS2) Ottomata: Adding elastic101[3-6] to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116138
[18:46:13] ok manybubbles, doing it(?)
[18:46:34] do it!
[18:46:42] (CR) Ottomata: [C: 2 V: 2] Adding elastic101[3-6] to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116138 (owner: Ottomata)
[18:47:36] running puppet on es13
[18:47:42] elastic1013
[18:48:50] done on 1013 manybubbles
[18:48:51] can you see it?
[18:49:01] yeah
[18:49:03] "number_of_nodes" : 14
[18:49:09] awesome
[18:49:10] easy
[18:49:14] ok running on the rest
[18:49:23] sweet
[18:49:44] it'll take it an hour or so to move shards over to the new machines
[18:49:57] it takes its time so it won't hammer the network or the disk
[18:50:57] rdwrer: bug number?
[18:51:23] rdwrer: also, does this effect all wikis or just those on wmf16?
[18:51:49] aand, manybubbles how about now?
[18:51:51] see all of them?
[18:52:04] !b 62033 | greg-g
[18:52:15] greg-g: I believe wmf16, sec
[18:52:18] gj wm-bot
[18:52:34] Yeah, commons is fine
[18:52:34] "number_of_nodes" : 16,
[18:52:40] looks good
[18:52:45] rdwrer: in that case, can we wait until Monday?
[18:52:56] We can, I suppose
[18:52:58] or are there a ton of media viewer people on mw.org who are bitching? :)
[18:53:01] Heh
[18:53:04] I guess it's fine
[18:53:08] tgr: ^^
[18:53:12] cool
[18:53:58] manybubbles: ya'll need your Monday morning window for any prep stuff? if not, I'd like to give it to rdwrer / tgr for a media viewer fix (if rdwrer is ok with a 9am pacific deploy)
[18:54:09] Ugh
[18:54:16] greg-g: I'll do it, but I won't be happy about it
[18:54:24] greg-g: have it
[18:54:27] :)
[18:54:36] rdwrer: 9;30?
[18:54:37] my only prep is another puppet change I'll write this afternoon
[18:54:38] :)
[18:54:42] Whatever, either way
[18:54:56] 9:30, sleep in
[18:54:57] It's not a scapping change so I can wander in at 0930 and probably still get the deploy done
[18:55:04] With time to spare
[18:55:26] Then again I haven't scapped in a while, maybe the improvements mean I could do it anyway
[18:56:15] rdwrer: :)
[18:56:16] rdwrer: can we also push https://gerrit.wikimedia.org/r/#/c/116044/ and https://gerrit.wikimedia.org/r/#/c/116074/ then?
[18:56:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[18:56:37] tgr: Yeah, let me write my name in The Book and then you can dump patches there
[18:57:14] greg's little black book
[18:58:30] ottomata: it is moving shards: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Elasticsearch%20cluster%20eqiad&m=es_query_time&r=hour&s=by%20name&hc=4&mc=2&st=1393613869&g=network_report&z=large
[19:02:02] (PS1) BryanDavis: Remove use of deprecated --extended argument to mwversionsinuse [operations/puppet] - https://gerrit.wikimedia.org/r/116140
[19:02:05] nice
[19:03:00] (CR) Aaron Schulz: [C: 1] Remove use of deprecated --extended argument to mwversionsinuse [operations/puppet] - https://gerrit.wikimedia.org/r/116140 (owner: BryanDavis)
[19:09:12] rdwrer: will those other two patches be applied to just wmf16, or both 16 and 15?
[19:09:22] * greg-g is editing calendar now, don't edit conflict me bro
[19:09:59] Just 16
[19:10:04] otherwise greg-g will schedule your unrelease :p
[19:10:16] rdwrer: cool
[19:10:35] JohnLewis: :)
[19:22:55] hmm, manybubbles
[19:23:01] there aren't that many jars in archiva
[19:23:02] yes?
[19:23:06] in the binary release anyway
[19:23:09] k
[19:23:16] think i could just google and find the licenses for each?
[19:23:19] most of the jars are jetty jars
[19:23:28] OHH
[19:23:31] there's more in WEB-INF
[19:23:32] not just in lib
[19:24:27] ah poo yeah I take it back
[19:24:29] there are alot
[19:25:13] manybubblesSSSzzzsSSSSSS, maybe you feel especially favorable towards me for getting those new serves up ehhhHH ehhhh?
[19:25:23] want to get me a jar -> license mapping?
[19:25:25] or tell me how you do that?
[19:25:28] I'm finishing that license mapping now
[19:25:30] !@!!!
[19:25:31] :D
[19:25:41] you are a true pal
[19:27:55] anyone know the details on servermon? apparently was once used, but no longer?
[19:28:26] it was used, then removed during the puppet migration, but we plan on reinstating it again at some point
[19:28:39] ok, thanks
[19:28:54] ok, i wasnt sure
[19:29:08] the sockpuppet to palladium migration right?
[19:29:19] cuz i wasnt sure where the hell it went and didnt see any puppetization to push it into place
[19:30:40] yes
[19:31:06] I installed it manually almost when I first joined
[19:31:11] servermon was manually installed onto the old puppetmaster then? ahhh
[19:31:12] it didn't survive the migration because of that
[19:31:13] ok
[19:31:16] now it all makes sense
[19:31:16] that's my fault :)
[19:31:24] and it's alex's fault for not puppetizing it now :P
[19:31:59] in the future, we've had enough turnover for you to just say 'someone who isnt here' and no one will know.
;]
[19:32:07] heh
[19:32:08] heh
[19:32:42] but we've talked a bit about it
[19:32:48] we'll reinstate it at some point
[19:32:59] the big plans have it replacing racktables, but it needs some more work for that
[19:35:57] yep, its how it came up in discussion
[19:36:22] racktables replacement
[19:36:40] then i realized i had no idea what happened to it post migration
[19:36:58] which means I've not done a manual run for puppet freshness since then
[19:37:02] * RobH feels shame
[19:38:25] PROBLEM - Puppet freshness on lanthanum is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 04:37:38 PM UTC
[19:39:59] akosiaris: got a chance to look this over too?
[19:39:59] https://gerrit.wikimedia.org/r/#/c/110454/
[19:40:04] webserver.pp lint from matanya
[19:40:13] been on my list of reviews for a while
[19:40:29] it shouldn't affect anything, but webserver.pp is widely used
[19:40:30] i think
[19:43:21] (PS1) Cmcmahon: update to read the fatals log and report problem extensions [operations/puppet] - https://gerrit.wikimedia.org/r/116146
[19:44:14] (PS2) Cmcmahon: update to read the fatals log and report problem extensions [operations/puppet] - https://gerrit.wikimedia.org/r/116146
[19:45:13] ottomata: thanks for working on the jvm deployment stuff! I'll be so happy when that works
[19:45:32] also, do you know what rack the Elasticsearch machine is on?
[19:45:34] Row D
[19:45:43] machine(s). the new ones
[19:46:49] no, i can't find those nodes in racktables (yet?)
[19:46:55] and and chris is not online
[19:46:58] hmmm
[19:47:00] (PS3) Cmcmahon: update to read the fatals log and report problem extensions [operations/puppet] - https://gerrit.wikimedia.org/r/116146
[19:47:02] yeah can't tell
[19:49:22] thanks though!
[19:52:58] ottomata: do I have access to racktables?
[19:53:14] dunno
[19:53:14] https://racktables.wikimedia.org/
[19:53:51] (PS4) Cmcmahon: update to read the fatals log and report problem extensions [operations/puppet] - https://gerrit.wikimedia.org/r/116146
[19:53:59] nope
[19:59:32] manybubbles: do you need something?
[19:59:37] can i help?
[19:59:52] cmjohnson1: oh cool! I was just asking which rack the row d elasticsearch servers were in
[19:59:58] (PS1) RobH: fixing etherpad certificate chain [operations/puppet] - https://gerrit.wikimedia.org/r/116148
[20:00:04] they are in d3
[20:00:47] thanks!
[20:02:39] (CR) Chad: [C: 1] "Oh cool, I had hit this recently-ish and thought maybe it was just me since it seemed to go away shortly. Thanks!" [operations/puppet] - https://gerrit.wikimedia.org/r/116148 (owner: RobH)
[20:03:21] queue times are not fast.
[20:03:37] but its movin.
[20:06:10] why is it so slow?
[20:06:13] =[
[20:09:18] (CR) RobH: [C: 2] fixing etherpad certificate chain [operations/puppet] - https://gerrit.wikimedia.org/r/116148 (owner: RobH)
[20:12:33] !log fixing cert errors on etherpad.w.o, sorry if folks have service interruption
[20:12:39] Logged the message, RobH
[20:14:37] !log i killed bz during puppet run on server, fixing now
[20:14:41] i can't reach bugzilla
[20:14:46] Logged the message, RobH
[20:14:50] oh
[20:14:51] there we go :D
[20:14:52] you broke it you bastard
[20:14:55] i just posted in dev too, its puppet run that takes two runs
[20:15:17] its fixed
[20:15:27] !log and i fixed it too, huzzah
[20:15:32] huzzah
[20:15:36] Logged the message, RobH
[20:18:47] manybubbles: running out for a bit, back on soon, email me if you got that thar mapping :)
[20:19:06] (PS1) RobH: further fix for etherpad cert chain [operations/puppet] - https://gerrit.wikimedia.org/r/116151
[20:26:25] PROBLEM - Puppet freshness on gallium is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 05:25:59 PM UTC
[20:34:49] (CR) RobH: [C: 2] further fix for etherpad
cert chain [operations/puppet] - https://gerrit.wikimedia.org/r/116151 (owner: RobH)
[20:36:10] Did anybody file a bug report about the crazy log spew on fluorine from udp2log yet?
[20:49:25] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC
[21:03:06] Did someone just restart Jenkins?
[21:04:15] hashar is usually good about !log'ing that kind of thing
[21:04:30] Yeah, but he didn't this time. I can see he's the one doing it
[21:04:33] terrible timing
[21:05:34] Krinkle: what's it blocking?
[21:05:40] everything?
[21:05:52] well, heh, didn't know if you meant something specific
[21:06:15] I'm also in the middle of orchestrating various jenkins changes, and having it restart in the middle of a deployment is not cool
[21:06:15] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12
[21:06:32] Krinkle: eek
[21:10:45] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10
[21:11:15] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12
[21:11:19] ^ on mediawiki.org, search gives "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later."
[21:12:22] search suggestions don't work but if your search matches a page name you're taken to it
[21:12:26] manybubbles: ^^
[21:12:40] yeah, I'm looking at it
[21:12:46] it started doing that a few mintues
[21:12:48] ago
[21:12:56] adding the extra machines seems to have pissed it off
[21:12:59] I'm not sure why though
[21:13:18] mostly it is sitting on searches for no reason rather than trying to run them
[21:13:29] so we're filling the pool queue
[21:14:05] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11
[21:14:45] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10
[21:16:45] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10
[21:17:05] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11
[21:17:24] manybubbles: need anyone/thing?
[21:17:44] greg-g: I'm not sure what anyone/thing would do
[21:18:45] oh, bad: /usr/lib/jvm/java-7-openjdk-amd64//bin/java -Xms256m -Xmx1g -Xss256k
[21:20:16] see, it should be java-7-openjdk-amd64//bin/java -Xms30G -Xmx30G -Xss256k
[21:20:27] oh god
[21:20:55] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11
[21:23:28] see, JVM's don't do well when they run out of memory. shit blows up
[21:23:32] but the thing is, I'd have hoped (not really expected, I guess) elasticsearch to time out on the broken machines and retry on working machines
[21:25:38] and we're back
[21:25:46] we're still in yellow health while it recovers from me kill -9ed four nodes, but we didn't lose any data that it doesn't have backup copies of
[21:25:46] it's restoring those now
[21:26:28] greg-g: mostly back to normal
[21:26:35] geez
[21:26:35] what happened?
[21:27:15] can write up
[21:27:15] will email you and ops
[21:27:26] anyone else?
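The two command lines quoted above differ only in their heap flags (`-Xms256m -Xmx1g` on the new nodes versus the intended `-Xms30G -Xmx30G`). As a rough Python sketch of how such a mismatch could be spotted from a process command line: this is an illustration only, not the check anyone in the log ran, and the helper names `heap_flags` and `heap_ok` are made up for the example. It assumes sizes are written as an integer with an optional k/m/g suffix.

```python
import re

# Multipliers for the size suffixes accepted on -Xms/-Xmx (case-insensitive).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_flags(cmdline):
    """Extract -Xms/-Xmx sizes, in bytes, from a java command line."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)\b", cmdline):
        sizes[flag] = int(number) * _UNITS[unit.lower()]
    return sizes


def heap_ok(cmdline, expected_bytes):
    """True only if both the initial and maximum heap equal expected_bytes."""
    sizes = heap_flags(cmdline)
    return sizes.get("Xms") == expected_bytes and sizes.get("Xmx") == expected_bytes


# The two command lines as quoted in the log.
BAD = "/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Xms256m -Xmx1g -Xss256k"
GOOD = "java-7-openjdk-amd64//bin/java -Xms30G -Xmx30G -Xss256k"
```

With these definitions, `heap_ok(BAD, 30 * 1024 ** 3)` is false and `heap_ok(GOOD, 30 * 1024 ** 3)` is true; the `-Xss` stack-size flag is deliberately ignored.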
[21:32:45] manybubbles: that's good
[21:40:55] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1360: active_shards: 4007: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0
[21:40:55] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1360: active_shards: 4007: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0
[21:41:35] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1360: active_shards: 4007: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0
[21:42:08] wee
[21:42:55] manybubbles: wikitech-l?
[21:43:12] greg-g: that is so far behind:)
[21:43:16] odder: sure
[21:44:21] manybubbles: is that also a problem? :)
[21:44:33] everything is a problem, I think :)
[21:45:44] if everything is a problem then nothing is a problem. Let's go for a beer.
[21:50:36] bd808, ping
[21:53:42] MaxSem: pong
[21:54:02] bd808, MW logs are screwed up:P
[21:54:33] MaxSem: yeah. I've been looking at that
[21:54:40] I see a bunch of files named like leContext.php(849):.log
[21:54:56] I think it's a combination of a bug in MobileFrontend and a bug in udp2log
[21:55:14] MF? that's my domain:)
[21:55:48] uhhhh
[21:56:18] I have a feeling that it's because multiline stack traces are being logged?
[21:56:25] but it used to work before
[21:57:19] multiline is anunderstatement :)
[21:57:24] I don't even see anything like that in sources
[21:57:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[21:58:08] MaxSem: it looks like a 5000+ line backtrace from MF broke something in the logging stack
[21:58:19] uhoh
[21:58:40] shotgun logging:P
[21:58:40] * greg-g just parroted what bd808 said in #mediawiki-core
[21:59:31] * MaxSem wonders why that chan is not in his autojoin list
[22:06:11] bd808, found the culprit
[22:21:23] greg-g: sent email
[22:39:25] PROBLEM - Puppet freshness on lanthanum is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 04:37:38 PM UTC
[22:51:10] (PS2) Dzahn: WIP - turn RT from misc/* into puppet module [operations/puppet] - https://gerrit.wikimedia.org/r/116064
[23:18:03] (CR) Aaron Schulz: [C: 2] Removed unused SwiftCloudFiles extension [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115959 (owner: Aaron Schulz)
[23:18:14] (Merged) jenkins-bot: Removed unused SwiftCloudFiles extension [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115959 (owner: Aaron Schulz)
[23:19:20] !log aaron synchronized wmf-config 'Removed unused SwiftCloudFiles extension'
[23:19:28] Logged the message, Master
[23:26:47] (PS5) Ottomata: Initial 2.0.0-1 debian release [operations/debs/archiva] (debian) - https://gerrit.wikimedia.org/r/115323
[23:27:25] PROBLEM - Puppet freshness on gallium is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 05:25:59 PM UTC
[23:44:54] csteipp, regarding my current problem with my lost soft token on wikitech -- given that I cant find technical documentation for how to reset that preference -- do you have any insight?
[23:45:04] I'm guessing it's a preference listed in the user_properties table
[23:45:48] mwalker: I don't know, sorry.
Maybe try Ryan Lane if he's on line
[23:46:22] Ryan_Lane: ping?
[23:46:25] !log maxsem synchronized php-1.23wmf16/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/116174/'
[23:46:33] Logged the message, Master
[23:47:10] actually; this would more properly be in #wikimedia-labs -- but I'll be around in both
[23:50:25] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC
[23:51:44] !log maxsem synchronized php-1.23wmf16/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/116174/'
[23:51:52] Logged the message, Master