[00:00:55] matanya: ^
[00:01:15] mutante: you are great, thanks for chasing this down
[00:01:18] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.484 second response time
[00:02:10] !log reedy synchronized wmf-config/
[00:02:11] !log restarting gitblit on antimony
[00:02:17] Logged the message, Master
[00:02:25] Logged the message, Master
[00:02:44] (PS1) Reedy: Fix arrays for $wgContactConfig [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116005
[00:03:03] (CR) Reedy: [C: 2] Fix arrays for $wgContactConfig [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116005 (owner: Reedy)
[00:03:10] (Merged) jenkins-bot: Fix arrays for $wgContactConfig [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116005 (owner: Reedy)
[00:03:18] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 365717 bytes in 9.096 second response time
[00:06:02] !log By what can only be described as kicking-down-the-door-style deployment, mwalker and I managed to deploy four FundraisingChart Jenkins jobs after about 15 tries each.
[00:06:10] Logged the message, Master
[00:06:14] Life is fun.
[00:06:56] (PS3) Reedy: Disable and remove ContactPageFundraiser [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110292
[00:08:50] (PS4) Reedy: Disable and remove ContactPageFundraiser [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110292
[00:10:38] Hrm
[00:10:52] There are mass -1s coming from jenkins, is the accepted fix to restart zuul?
[00:11:44] (Abandoned) Reedy: Remove 1.18 back compat [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/93116 (owner: Reedy)
[00:14:30] Oh, no, never mind
[00:22:09] (PS1) BryanDavis: Additional ssh key for Bryan Davis. [operations/puppet] - https://gerrit.wikimedia.org/r/116014
[00:23:53] (CR) BryanDavis: "Public SSH key listed on https://office.wikimedia.org/wiki/User:BDavis_(WMF) for verification of ownership." [operations/puppet] - https://gerrit.wikimedia.org/r/116014 (owner: BryanDavis)
[00:26:09] (PS2) Reedy: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592
[00:26:23] (CR) jenkins-bot: [V: -1] Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592 (owner: Reedy)
[00:28:20] mutante: Any reason why we don't have a bastion only user group, so that people who only need specific hosts only get access to the bastions and the hosts they actually need?
[00:28:48] hoo: we do
[00:28:55] admins::restricted
[00:29:56] mutante: Admins::restricted also you to jump onto terbium and have root on the DBs... and to view the logs, ... that's not waht I call a bastion only thing :P
[00:29:57] (CR) Dzahn: [C: 2] Remove scap-recompile [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/109950 (owner: Reedy)
[00:30:11] s/also/allows/
[00:30:11] (CR) Dzahn: [V: 2] Remove scap-recompile [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/109950 (owner: Reedy)
[00:30:25] (PS3) Reedy: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592
[00:31:10] hoo: i don't know about root on DBs, are you sure
[00:31:41] mutante: Yep, run mysql_root_pass
[00:31:41] or mysql_root_password or so
[00:31:57] hoo: dunno, should ping springle-away about that
[00:32:09] (PS4) Reedy: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php [operations/puppet] - https://gerrit.wikimedia.org/r/74592
[00:32:34] mutante: mortals have a use for root on teh MySQL's, but people only wanting to upload eg. releases (like Markus) don't :P
[00:32:59] hoo: you don't need to convince me, sounds reasonable
[00:33:25] :)
[00:33:42] that's something that was added today, right
[00:33:49] people being mw uploaders
[00:33:51] and having shell
[00:34:03] that's why I'd split admins::restricted into admins::bastion and the actual restricted
[00:34:18] manybubbles|away: Done
[00:34:19] likely a good idea, yea
[00:34:28] Fail
[00:35:24] springle-away: ^ re: mysql_root_pass
[00:35:55] Jeff_Green: ^ re: mw-uploaders
[00:35:58] hoo: ^
[00:35:59] :)
[00:36:10] mutante: Ok, shall I create a (draft) patch?
[00:36:29] yes, sure
[00:37:20] mutante: Can you check when someone last logged in into the cluster?
[00:37:32] hoo: define "the cluster"
[00:37:37] a random mw machine?
[00:37:56] (PS2) Reedy: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257
[00:38:09] (CR) jenkins-bot: [V: -1] Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257 (owner: Reedy)
[00:38:25] mutante: That's pretty hard... like on all bastions? Eg. dab is probably inactive (former TS root)
[00:38:25] hi springle
[00:39:33] hoo: what do you want to find out?
[00:40:04] just if he is still using it?
[00:40:14] doesnt reply to just asking if he still needs it?
[00:41:34] (PS3) Reedy: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257
[00:44:48] mutante: I just wonder whether he still uses the access (eg. whether TS uses his account to pull data... which would be pretty bad, but you never know)
[00:44:48] mh... fenari is both a bastion and has private data... bad
[00:44:49] mutante: Yep
[00:44:49] Is there another tampa bastion?
[00:45:14] We could also do that, I suppose :P He's not on IRC atm though...
[00:45:14] mutante: Is there another tampa bastion (one that doesn't have private data like fenari)
[00:45:31] (Abandoned) Reedy: Make sync-dblist report done, don't echo mediawiki-installation [operations/puppet] - https://gerrit.wikimedia.org/r/110092 (owner: Reedy)
[00:46:35] hoo: no, don't know, we already use eqiad
[00:47:05] If I search for "equinix ashburn" in Google Maps, the buildings on Filigree Ct. seem to be in the direct approach path to runway 19C. Very odd choice.
[00:47:30] or wait... you can jump from bast1001 to tampa also, I guess... so not really needed
[00:48:14] (PS2) Reedy: Remove query.php from filters. query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535
[00:49:01] (PS1) Hoo man: Introduce an admins::bastion user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019
[00:50:48] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[00:52:46] (PS1) Dzahn: ganglia, pdf1 is dead, monitor pdf2/3 not pdf1/2 [operations/puppet] - https://gerrit.wikimedia.org/r/116020
[00:54:18] (CR) Dzahn: [C: 2] ganglia, pdf1 is dead, monitor pdf2/3 not pdf1/2 [operations/puppet] - https://gerrit.wikimedia.org/r/116020 (owner: Dzahn)
[00:57:25] (PS1) Dzahn: decom pdf1,remove from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116022
[00:59:35] (CR) Hoo man: [C: 1] Remove query.php from filters. query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535 (owner: Reedy)
[00:59:39] (PS2) Dzahn: decom pdf1,remove from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116022
[01:04:47] !log pdf1 - disable monitoring - downtime until ∞
[01:04:57] Logged the message, Master
[01:05:16] (CR) Dzahn: [C: 2] decom pdf1,remove from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116022 (owner: Dzahn)
[01:06:37] mutante: https://gerrit.wikimedia.org/r/116019
[01:07:24] mutante: Ok, so I guess this wont matter much anyway... they probably only need bast1001 and the realese host
[01:07:25] we probably have more users in there who only need some specific host(s)
[01:07:26] springle: hey, around?
[01:08:49] hoo: am now
[01:09:19] (reading log)
[01:09:26] springle: We were just talking about mysql_root_pass
[01:09:44] hoo: RT 5612
[01:09:55] as springle just mentioned correcly
[01:10:18] ah ok, fine then
[01:11:16] hoo: re: the question if dab is still active.. Jan 12
[01:11:24] springle: Just out of interested... what's the current process for schema updates? Wikitech only has information from 1954 on that... :P Do you use like the percona online schema change thing?
[01:11:45] pt-online-schema-change, I mean
[01:12:38] yes, mostly pt-online-schema-change unless it's something horrible like adding a PK
[01:12:50] heh :)
[01:12:54] !log pdf1 - revoke puppet cert, kill from stored configs,...
[01:13:04] sometimes also just depool a slave, alter, repool
[01:13:04] Logged the message, Master
[01:14:41] springle: Ok... what about say a drop table? depool, drop, repool? ... innodb_file_per_table ...
[01:16:17] that would break replication. incremental delete in master, check replag, then drop
[01:17:18] hoo: what about innodb_file_per_table? it's on some hosts, not yet all.
[01:17:34] springle: The actual drop table time depends highly on it
[01:17:46] hence incremental delete ;)
[01:17:46] * highly depends on it
[01:18:04] ah, ok
[01:18:22] if it is definitely out of service, then can drop on individual machines
[01:19:02] springle: Yeah... like the "cur" table... unused since 2007 (or so), new wikis don't even have it, the data in there is damn unuseful...
[01:19:37] there are a few tables that need dropping
[01:20:35] not high prioity. some we keep just on principle. anything _old or _delete should go
[01:21:43] hoo: mutante: who needs to review https://gerrit.wikimedia.org/r/#/c/116019 ?
[01:21:53] jeff?
[01:23:10] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:23:15] springle: Yeah... that was my main idea... IMO mortals should still have all the access the current restricted groups have (incl. the MySQL bits), but many of them just don't need it
[01:23:35] * many of the restricted users
[01:28:00] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.147 second response time
[01:30:04] * springle_ stabs irc
[01:31:35] hoo: interesting test of drop table recently on an s5 slave with: innodb_file_per_table, partitioned `revision`, ~48G buffer pool. no appreciable stall. it might be worse on the new slaves coming which will have 96G buffer pool
[01:34:29] springle_: Nice... drop tables are some of the less unfunny schema changes because of that
[01:36:59] one can always delay drop table. look, we do it since 2007 :)
[01:38:20] heh... stale tables aren't harming :)
[01:38:52] springle_: One last question ... :P How much ram do the master boxes currently have?
[01:39:11] (I guess they have at least as much as the slaves)
[01:40:16] hoo: Look in ganglia
[01:40:52] enwiki master: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=MySQL+eqiad&h=db1052.eqiad.wmnet&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[01:41:33] heh
[01:41:39] oh, right :P
[01:42:31] hoo: db1051-db1060 are 96G, others are 64G. do you can see from dbtree
[01:43:43] slaves should have same or more resources than master. we break this rule where 96G box is master and one or more 64G slaves are in the shard. but not a perfect world
[01:45:41] I see :)
[01:47:39] (PS1) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036
[01:47:54] (CR) jenkins-bot: [V: -1] Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 (owner: Reedy)
[01:48:53] (PS2) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036
[01:49:02] (CR) jenkins-bot: [V: -1] Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 (owner: Reedy)
[01:50:59] (PS3) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036
[01:51:10] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:51:17] hoo: first three 128G boxes will arrive soon. i expect they will all be S1 slaves. the 96G boxes will be prioritised among S1, S2, and S4; not entirely sure how yet
[01:51:40] :)
[01:52:40] I guess you saw our growth plans for wikidata (and also the plans to make the data queriable)... at some point we will probably surpass s1 in resource usage
[01:52:54] oh wikidata makes me cry
[01:53:13] it will need it's own shard someday
[01:53:56] Yeah... we'll probably need to throw hardware at the problem... let's see how fast it grows
[01:54:20] (PS4) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036
[01:54:32] (PS4) Dzahn: removed pdf1, decom [operations/dns] - https://gerrit.wikimedia.org/r/115581 (owner: Matanya)
[01:55:08] (CR) Dzahn: [C: 2] removed pdf1, decom [operations/dns] - https://gerrit.wikimedia.org/r/115581 (owner: Matanya)
[01:55:29] !log DNS update - removing pdf1
[01:56:00] wb_terms is the wikidata hotspot for slow queries
[01:56:43] i tried a partitioned wb_terms on hash term_language recently; it had some improvement, but data is heavily skewed to certain languages
[01:57:10] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:57:25] (CR) Reedy: [C: -1] Remove db and job queue pmtpa files (6 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036 (owner: Reedy)
[01:59:08] Yeah, that table is awry
[02:01:13] one interesting mysql 5.7 optimization which i hope mariadb will get: Index Condition Pushdown for partitioned tables. we could really work with that.
[02:01:24] (PS5) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036
[02:03:03] (PS6) Reedy: Remove db and job queue pmtpa files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116036
[02:03:44] (CR) Dr0ptp4kt: "Shoot, I think we can only remove the specific 'set req.http.X-ZeroTLS = "1";' for a carrier, not remove the vcl_deliver. That said, can y" [operations/puppet] - https://gerrit.wikimedia.org/r/115669 (owner: BBlack)
[02:06:44] hoo: fun http://bugs.mysql.com/bug.php?id=51325 and http://bugs.mysql.com/bug.php?id=69316 ... 5.5 ftw :)
[02:10:38] springle_: Yeah... but that's only much of an issue if the table has actually been accessed
[02:10:45] at least it should, I guess
[02:11:14] (PS1) Andrew Bogott: Turn down respawn limit for manage-volumes and manage-nfs-volumes. [operations/puppet] - https://gerrit.wikimedia.org/r/116038
[02:11:23] mariadb 5.5.36 is out now too..
[02:18:01] wikitech is down again
[02:23:00] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused
[02:23:00] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused
[02:23:20] PROBLEM - Certificate expiration on virt1000 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
[02:23:50] !log reedy synchronized php-1.23wmf16/includes/htmlform/HTMLFormField.php 'I2741ef940d83eeb564e89e20378fb4004cfe5b83'
[02:24:00] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.000 second response time on port 389
[02:24:00] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.929 second response time
[02:24:00] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636
[02:24:08] !log restarted apache on virt0
[02:24:34] * Reedy wonders why wikitech usually breaks when springle_ is around
[02:25:09] * springle_ looks shifty
[02:26:00] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.714 second response time
[02:26:06] Logged the message, Master
[02:26:14] !log rebuilt ldap indexes on virt1000
[02:26:26] Logged the message, Master
[02:26:29] springle_: it's an ldap issue I think
[02:26:36] I've been looking at it
[02:26:45] (possibly fixed already, hard to tell)
[02:27:05] andrewbogott: should i not have touched apache?
[02:27:15] springle_: it's fine, harmless
[02:27:34] it hit MaxClients, as usual
[02:28:17] man, ldap is still going bananas. I wonder who/what is hitting us so hard?
[02:28:29] It's almost certainly self-inflicted but I've already stopped the usual suspects
[02:34:52] !log LocalisationUpdate completed (1.23wmf15) at 2014-02-28 02:34:35+00:00
[02:34:58] Logged the message, Master
[02:50:21] Reedy: Can you run rebuildInterwiki please?
[02:57:08] *sigh* fix s1 contributions-page slow queries and what crawls back to the top of the slow list? good old LogPager
[03:05:29] !log LocalisationUpdate completed (1.23wmf16) at 2014-02-28 03:05:28+00:00
[03:05:37] Logged the message, Master
[03:06:24] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache'
[03:06:32] Logged the message, Master
[03:06:49] thanks, Reedy
[03:06:58] !log reedy updated /a/common to {{Gerrit|Icb3159198}}: Fix arrays for $wgContactConfig
[03:07:02] (PS1) Reedy: Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043
[03:07:06] Logged the message, Master
[03:07:26] (CR) Reedy: [C: 2] Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 (owner: Reedy)
[03:07:30] (CR) Hoo man: [C: 2] Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 (owner: Reedy)
[03:07:33] (Merged) jenkins-bot: Update interwiki cache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116043 (owner: Reedy)
[03:07:42] +4 :D
[03:14:30] (PS1) coren: Labs: Disable labs LVM until image fixes [operations/puppet] - https://gerrit.wikimedia.org/r/116045
[03:16:28] (CR) coren: [C: 2] "Annoying, but necessary." [operations/puppet] - https://gerrit.wikimedia.org/r/116045 (owner: coren)
[03:43:34] (PS1) coren: vmbuilder: tweak partition sizes [operations/puppet] - https://gerrit.wikimedia.org/r/116048
[03:45:13] (CR) coren: [C: 2] "Trivial tweak" [operations/puppet] - https://gerrit.wikimedia.org/r/116048 (owner: coren)
[03:45:33] * Coren rebuilds the image.
[03:50:40] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-28 03:50:40+00:00
[03:50:49] Logged the message, Master
[03:51:40] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[04:01:35] (PS1) coren: Labs: Some tweaks and saner defaults for labs LVM [operations/puppet] - https://gerrit.wikimedia.org/r/116050
[04:05:25] (PS1) Andrew Bogott: Fixed a typo in the 'nova stop' phase. [operations/puppet] - https://gerrit.wikimedia.org/r/116051
[04:06:29] heya mark, you there? qq. is there a not-much-used or non-critical node in esams that Snaps and I can run rdkafka_performance tests from?
[04:06:40] we want to see if we can reproduce this intermittent problem
[04:08:23] (PS2) Andrew Bogott: Fixed a typo in the 'nova stop' phase. [operations/puppet] - https://gerrit.wikimedia.org/r/116051
[04:10:49] (CR) Andrew Bogott: [C: 2] Fixed a typo in the 'nova stop' phase. [operations/puppet] - https://gerrit.wikimedia.org/r/116051 (owner: Andrew Bogott)
[04:19:17] (PS8) Dzahn: turn wikistats into module [operations/puppet] - https://gerrit.wikimedia.org/r/94409
[04:22:35] (CR) Dzahn: "PS8: removed the entire SSL config stuff, because now this is behind a proxy that does SSL termination anyways" [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn)
[04:52:13] (CR) Dzahn: [C: 2] turn wikistats into module [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn)
[04:54:03] (CR) Dzahn: "no worries, just used on labs instance and nothing special in here" [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn)
[04:55:36] (CR) Aude: Introduce an admins::bastion user group (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[04:58:45] (CR) Dzahn: "@RobiH, it's merged and a module now, also doesn't need public IP anymore and doesn't need to use star SSL certs, it's behind proxy, next " [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn)
[05:02:25] (CR) Dzahn: "some extra changes in there that seem unintended and you'd have to actually use this on bast1001 but +1 for the idea" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[05:16:11] (PS1) Dzahn: enhanced comments re UIDs and key verification [operations/puppet] - https://gerrit.wikimedia.org/r/116055
[05:16:42] (CR) Andrew Bogott: [C: 2] Turn down respawn limit for manage-volumes and manage-nfs-volumes. [operations/puppet] - https://gerrit.wikimedia.org/r/116038 (owner: Andrew Bogott)
[05:18:14] (CR) Dzahn: [C: 2] enhanced comments re UIDs and key verification [operations/puppet] - https://gerrit.wikimedia.org/r/116055 (owner: Dzahn)
[05:31:08] (PS2) Dzahn: Introduce an admins::bastion user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[05:31:48] (PS3) Hoo man: Introduce an admins::bastion user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019
[05:32:15] (PS1) TTO: Allow more upload file types for sewikimedia sysops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116057
[05:32:51] (CR) Dzahn: [C: 1] "rebased, fixed tabs and stuff" [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[05:44:49] (PS1) TTO: Remove useless "confirmed" permission assignments [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116059
[05:51:05] (PS1) Dzahn: add README.md for wikistats module [operations/puppet] - https://gerrit.wikimedia.org/r/116061
[05:53:12] (CR) Dzahn: [C: 2] add README.md for wikistats module [operations/puppet] - https://gerrit.wikimedia.org/r/116061 (owner: Dzahn)
[06:27:40] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[06:29:40] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[06:52:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[07:00:11] (PS1) Dzahn: WIP - turn RT from misc/* into puppet module [operations/puppet] - https://gerrit.wikimedia.org/r/116064
[07:00:13] (CR) Matanya: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/74592 (owner: Reedy)
[07:01:27] (CR) Matanya: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/81257 (owner: Reedy)
[07:02:32] (CR) Dzahn: [C: -1] "just started, not done yet" [operations/puppet] - https://gerrit.wikimedia.org/r/116064 (owner: Dzahn)
[07:09:49] (CR) Matanya: [C: 1] Remove query.php from filters. query.php died a long time ago [operations/puppet] - https://gerrit.wikimedia.org/r/96535 (owner: Reedy)
[07:15:22] (PS1) Ryan Lane: Enable keystone redis driver and switch replication around [operations/puppet] - https://gerrit.wikimedia.org/r/116065
[07:18:45] (CR) Ryan Lane: [C: 2] Enable keystone redis driver and switch replication around [operations/puppet] - https://gerrit.wikimedia.org/r/116065 (owner: Ryan Lane)
[07:23:40] (PS1) Ryan Lane: Use Token and not TokenNoList redis driver for folsom keystone [operations/puppet] - https://gerrit.wikimedia.org/r/116066
[07:27:56] lots of odd things in fluorine:/a/mw-log
[07:28:15] yeah I saw that
[07:28:25] happens every blue moon
[07:28:44] dberror is gone
[07:28:47] at least
[07:30:03] (CR) Ryan Lane: [C: 2] Use Token and not TokenNoList redis driver for folsom keystone [operations/puppet] - https://gerrit.wikimedia.org/r/116066 (owner: Ryan Lane)
[08:35:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:37:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:39:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:41:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:43:46] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:45:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:47:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:49:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:54:37] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:55:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:57:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[08:58:12] !log disabled puppet on labstore1001 to allow unattended file copies
[08:58:20] Logged the message, Master
[09:00:21] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:30:08 AM UTC
[09:01:02] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Fri Feb 28 09:01:00 UTC 2014
[09:02:42] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 09:01:00 AM UTC
[09:30:47] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Fri Feb 28 09:30:46 UTC 2014
[09:32:27] !log restarting Jenkins, some jobs registration is broken :(
[09:32:35] Logged the message, Master
[09:39:14] !log Jenkins restarted.
[09:39:23] Logged the message, Master
[09:42:24] (CR) Mark Bergsma: "Let's just use $::mw_primary now. That should be (made) equal to 'eqiad' in production, and to $::site in labs." [operations/puppet] - https://gerrit.wikimedia.org/r/115910 (owner: Hashar)
[09:47:26] (PS2) Hashar: beta: fix upload cache directors [operations/puppet] - https://gerrit.wikimedia.org/r/115910
[09:47:36] (CR) Hashar: "Done :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/115910 (owner: Hashar)
[09:53:37] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[10:02:38] (PS3) Hashar: Use mw_primary for Varnish swift backends [operations/puppet] - https://gerrit.wikimedia.org/r/115910
[10:03:47] (CR) Mark Bergsma: [C: 2] Use mw_primary for Varnish swift backends [operations/puppet] - https://gerrit.wikimedia.org/r/115910 (owner: Hashar)
[10:05:00] good
[10:05:10] still broken but at least now I have a much better error message
[10:42:39] (PS1) Andrew Bogott: Added even more error checking to dc-migrate. [operations/puppet] - https://gerrit.wikimedia.org/r/116071
[10:44:37] (CR) Andrew Bogott: [C: 2] Added even more error checking to dc-migrate. [operations/puppet] - https://gerrit.wikimedia.org/r/116071 (owner: Andrew Bogott)
[10:52:54] (PS1) Andrew Bogott: Clear the puppet cert on the proper host. [operations/puppet] - https://gerrit.wikimedia.org/r/116073
[10:54:39] (CR) Andrew Bogott: [C: 2] Clear the puppet cert on the proper host. [operations/puppet] - https://gerrit.wikimedia.org/r/116073 (owner: Andrew Bogott)
[11:29:42] (CR) Hashar: contint: webproxy for maven on CI production slaves (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar)
[11:30:10] (PS3) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597
[11:30:28] (PS4) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597
[11:46:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC
[12:15:30] (CR) Alexandros Kosiaris: [C: -1] "1 minor comment, otherwise LGTM." [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar)
[12:54:37] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC
[13:24:42] the logs on fluorine seem to have grown a number of files this morning
[13:26:15] all the new log files that I've checked mention mobile
[13:26:56] anyone from mobile on?
[13:36:11] manybubbles:
[13:36:25] I suspect mobilecontext somehow or other (I guess you do too)
[13:36:49] I suspect it but don't really know what to do about it :)
[13:36:51] thanks
[13:37:03] yeah I think someone from the mobile team will have to take a look
[13:37:36] are there new files since 6 or 7 am?
[13:37:49] who's time zone?
[13:37:58] fluorine's
[13:38:02] ie utc
[13:38:02] -rw-r--r-- 1 udp2log udp2log 200 Feb 28 11:07 7):.log
[13:38:15] hm that'smore recent. ok, looks like problem is ongoing still
[13:38:16] oh cool, there is one named .log
[13:38:23] hahaha nice
[14:11:30] Would anyone object to me deploying a one-file fix to a regression in 1.23wmf15 in a few minutes (https://gerrit.wikimedia.org/r/116095)?
[14:15:11] !log anomie synchronized php-1.23wmf15/includes/htmlform/HTMLFormField.php 'Backport fix for bug 61942 to 1.23wmf15 (reedy already did wmf16 last night)' [14:15:18] Logged the message, Master [14:15:34] thanks :) [14:22:05] hashar: Shouldn't tools/scap things rather go here than in -dev? [14:22:56] hoo: it is merely for platform engineering / devs use [14:23:02] hoo: I dont think we need it for operations [14:23:17] there is enough spam here I think :] [14:23:37] enough spam in both... it's moreover a DevOps thing, I guess (yay, Buzzword) [14:34:37] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 11:34:07 AM UTC [14:47:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [15:14:15] (03PS1) 10Hashar: contint: pip shared cache on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/116111 [15:14:40] (03CR) 10Alexandros Kosiaris: [C: 032] "I would love a man page at some point. lintian is complaining already but you can add it at some point later in time. help2man might be us" [operations/debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/113018 (owner: 10Ottomata) [15:16:45] (03CR) 10coren: contint: pip shared cache on labs slaves (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116111 (owner: 10Hashar) [15:17:31] (03PS3) 10Reedy: Remove query.php from filters. query.php died a long time ago [operations/puppet] - 10https://gerrit.wikimedia.org/r/96535 [15:17:51] (03CR) 10Ottomata: [C: 032 V: 032] "Oo, ok thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96535 (owner: 10Reedy) [15:22:15] (03PS2) 10Hashar: contint: pip shared cache on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/116111 [15:27:23] (03CR) 10Ottomata: "Aye, I'd do that, except git-fat's help message is pretty useless." 
[operations/debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/113018 (owner: Ottomata) [15:37:02] (CR) coren: [C: 2] "lgtm" [operations/puppet] - https://gerrit.wikimedia.org/r/116111 (owner: Hashar) [15:38:24] (PS1) Ottomata: Adding elastic1013-elastic1016 to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116113 [15:38:36] (PS2) Ottomata: Adding elastic1013-elastic1016 to elasticsearch cluster [operations/puppet] - https://gerrit.wikimedia.org/r/116113 [15:39:01] manybubbles: just running puppet on these new servers with the same setup should be all I need to do, right? [15:39:07] allocate the partition; run puppet; [15:39:07] right? [15:39:15] no! [15:39:17] stop! [15:39:31] ok ! [15:39:32] you'll have to install Elasticsearch 0.90.10 on them manually because puppet'll do 0.90.11 [15:39:33] i stopped! [15:39:36] oh hm [15:39:42] ok so I should apt-get that first [15:39:52] because we pulled it into apt [15:39:52] is the version manually specified in puppet? [15:39:59] or does it just say ensure => 'installed'? [15:40:00] ? [15:40:03] no, it isn't manually specified [15:40:10] ok cool [15:40:12] logstash upgraded but I didn't take the upgrade [15:40:19] so if we install 0.90.10 [15:40:21] then puppet should be ok? [15:40:21] I should probably manually specify the version in puppet [15:40:27] it should be ok, yeah [15:40:41] ah, hm, i need to still run puppet at least once on these first, ok [15:40:45] i will puppetize them without ES first [15:40:52] ah [15:41:05] If you wanted to manually specify the version that'd be fine with me [15:41:08] I'll +1 it [15:41:26] either way is fine [15:41:34] but we'll upgrade to 1.0.1 in a week [15:41:57] if the version is wrong in puppet, will puppet install the next version?
[15:42:08] I'm pretty sure I don't want that because I want to manually run the upgrades [15:42:20] so I can do them one at a time when that is right and all at once when that is right [15:42:21] (PS3) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 [15:42:37] no, yeah [15:42:41] i think => installed is fine [15:42:55] (PS4) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 [15:43:33] (CR) jenkins-bot: [V: -1] Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 (owner: Ottomata) [15:44:59] (PS5) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 [15:45:17] (PS6) Ottomata: Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 [15:50:27] (CR) Ottomata: [C: 2 V: 2] Puppetizing elastic1013-elastic1016 (not adding to ES cluster yet) [operations/puppet] - https://gerrit.wikimedia.org/r/116113 (owner: Ottomata) [15:55:37] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [15:57:48] ha, manybubbles [15:57:55] we have both 0.9.11 and 0.9.7 in our apt [15:58:01] http://apt.wikimedia.org/wikimedia/pool/universe/e/elasticsearch/ [16:04:11] mark, can you comment on this? [16:04:12] https://rt.wikimedia.org/Ticket/Display.html?id=6948 [16:04:18] should I just remove the IPv6 addies from DNS? [16:04:22] * mark checks [16:04:32] (CR) Hashar: "@akosiaris your comment is missing :]" [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar) [16:04:42] and I am off till monday! Have fun folks [16:04:47] are they all on a fixed port?
[16:04:53] always 9092? [16:04:58] kafka yes [16:05:09] ok then I'll just open that up [16:05:13] until we resolve the private link situation [16:05:25] will the IPv6 addy be routed via internet or via dedicated link? [16:05:45] (CR) Alexandros Kosiaris: contint: webproxy for maven on CI production slaves (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar) [16:06:40] (CR) Hashar: contint: webproxy for maven on CI production slaves (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114597 (owner: Hashar) [16:07:08] (PS5) Hashar: contint: webproxy for maven on CI production slaves [operations/puppet] - https://gerrit.wikimedia.org/r/114597 [16:07:13] ok, manybubbles [16:07:26] ok? [16:07:27] elasticsearch 0.9.10 is installed on elastic101[3-6] [16:07:38] btw, the elasticsearch .deb package does not depend on java! [16:07:44] sweet. [16:07:45] it let me install it without having installed java :p [16:07:46] stupid [16:07:52] so dumb [16:08:00] just said it couldn't start up after installing, pssh [16:08:00] heh [16:08:06] anyway, that's fine [16:08:10] ok, so that is installed, what next [16:08:12] might want to get the same java version as is on the other machines too [16:08:14] if possible [16:08:16] now should I puppetize it? [16:08:17] hm [16:08:22] puppet should do that, right? [16:08:28] I think it'll just install whatever [16:08:33] don't think so [16:08:36] rather, the openjdk [16:08:42]