[00:15:43] (PS1) Manybubbles: Cirrus config updates [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107108
[02:11:41] !log LocalisationUpdate completed (1.23wmf9) at Mon Jan 13 02:11:41 UTC 2014
[02:11:49] Logged the message, Master
[02:21:30] !log LocalisationUpdate completed (1.23wmf10) at Mon Jan 13 02:21:29 UTC 2014
[02:21:35] Logged the message, Master
[02:39:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jan 13 02:39:15 UTC 2014
[02:39:22] Logged the message, Master
[03:04:41] PROBLEM - Puppet freshness on mw32 is CRITICAL: Last successful Puppet run was Mon 13 Jan 2014 03:00:04 AM UTC
[03:30:41] RECOVERY - Puppet freshness on mw32 is OK: puppet ran at Mon Jan 13 03:30:34 UTC 2014
[03:32:41] PROBLEM - Puppet freshness on mw32 is CRITICAL: Last successful Puppet run was Mon 13 Jan 2014 03:30:34 AM UTC
[04:00:29] RECOVERY - Puppet freshness on mw32 is OK: puppet ran at Mon Jan 13 04:00:22 UTC 2014
[05:18:09] <^d> !log enwiki reporting lsearchd hasn't updated in days. Cursory investigation says this is right. Nothing in searchidx1001's logs seems telling, yet.
[05:18:15] Logged the message, Master
[05:20:30] <^d> Hmm, getting tons of timeouts trying to obtain locks.
[05:23:05] <^d> enwiki index seems *very* out of date :\
[05:39:21] (CR) Ottomata: [C: +2 V: +2] imported Mercurial ganglios from https://bitbucket.org/maplebed/ganglios/overview [operations/software/ganglios] - https://gerrit.wikimedia.org/r/106505 (owner: Matanya)
[06:51:39] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Mon 13 Jan 2014 03:50:33 AM UTC
[09:52:39] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Mon 13 Jan 2014 03:50:33 AM UTC
[10:26:30] !log upgrading packages on gallium and lanthanum
[10:26:37] Logged the message, Master
[10:28:27] (CR) Dzahn: [C: +2] "seems this was meanwhile also fixed on the remote side, but prefer non-capitalized anyways" [operations/puppet] - https://gerrit.wikimedia.org/r/107105 (owner: Nemo bis)
[10:28:40] (CR) Dzahn: [V: +2] "seems this was meanwhile also fixed on the remote side, but prefer non-capitalized anyways" [operations/puppet] - https://gerrit.wikimedia.org/r/107105 (owner: Nemo bis)
[10:30:27] (PS2) Dzahn: [Planet] Add Virginia Gentilini to Italian Planet [operations/puppet] - https://gerrit.wikimedia.org/r/107106 (owner: Nemo bis)
[10:32:42] (CR) Dzahn: [C: +2] identd:lint [operations/puppet] - https://gerrit.wikimedia.org/r/107032 (owner: Matanya)
[10:36:16] (CR) Dzahn: [C: +2] "lgtm, feed works" [operations/puppet] - https://gerrit.wikimedia.org/r/107106 (owner: Nemo bis)
[10:39:57] (CR) Dzahn: [C: +2] "per Chad, search in Tampa is decom" [operations/puppet] - https://gerrit.wikimedia.org/r/106622 (owner: Chad)
[10:45:11] hashar: have you seen https://bugzilla.wikimedia.org/show_bug.cgi?id=59980 ?
[10:46:59] matanya: yeah and replied on it
[10:47:15] matanya: I think it is going to be a wontfix :/
[10:47:48] was talking to ops about it. We use 'puppet parser validate', which doesn't have much knowledge about which parameters are valid
[10:48:35] ok, hashar, thanks. I had an idea, but i guess it is too much of a hassle and too resource intensive to implement
[10:49:02] matanya: apparently we should compile the puppet catalog
[10:50:10] hashar: yes, i thought of bringing up a vm in labs for every patch and compiling the catalog
[11:02:07] that is more or less the idea I want to eventually achieve one day
[11:02:18] matanya: the crazy idea would be to have a dedicated CI project on wmflabs
[11:02:32] that would run on specific servers isolated from the network
[11:02:41] sounds great
[11:02:42] then spawn a pool of VMs to be consumed by Jenkins jobs
[11:02:50] is that doable?
[11:02:54] unfortunately, there is not much horse power on ops side to make it happen :-]
[11:02:57] yeah it is doable
[11:02:58] entirely
[11:03:00] OpenStack did it
[11:03:19] they wrote a python daemon that interacts with the OpenStack cloud API to maintain a pool of VMs
[11:03:33] then get a Jenkins slave installed on the VM and have it register with the Jenkins master
[11:03:52] so a job can be run in the vm. Once the job is done, some magic thing deletes the vm
[11:04:05] that is all that should be done in order to achieve this?
[11:04:14] (CR) Dzahn: "i'd wait for consensus on the bug here" [operations/puppet] - https://gerrit.wikimedia.org/r/106892 (owner: Tinaj1234)
[11:04:18] this sounds like something nice to do
[11:05:15] matanya: http://tinyurl.com/pmgqb4c
[11:05:40] that shows the number of VMs being built, available, running tests and finally being deleted
[11:06:05] the little daemon attempts to maintain a pool of 100 VMs apparently (yellow + green)
[11:06:23] tempting to do
[11:06:26] definitely
[11:06:27] :D
[11:06:32] but need labs to be migrated to EQIAD first
[11:06:43] and then find out how to get an isolated box or two in there
[11:07:15] matanya: I don't want to put pressure on ops though :/ They are busy enough as it is
[11:07:50] do you want me to try and help out with this a bit? if there is anything i can do?
[11:08:08] (CR) Dzahn: [C: +1] "lgtm (quoting, ensure first, etc), but I'll leave merge to people who were involved writing it and can babysit it to make sure" [operations/puppet] - https://gerrit.wikimedia.org/r/107035 (owner: Matanya)
[11:17:04] (CR) Hashar: [C: +1] "Added a bunch of folks that might be interested in casting their voice." [operations/puppet] - https://gerrit.wikimedia.org/r/106892 (owner: Tinaj1234)
[11:19:00] (CR) Dzahn: [C: +1] "personal opinion, i like the new format better" [operations/puppet] - https://gerrit.wikimedia.org/r/106892 (owner: Tinaj1234)
[11:20:39] out for lunch
[11:28:00] (PS1) Matanya: ganglia_new: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107128
[11:30:13] (PS2) Alexandros Kosiaris: retab certs.pp [operations/puppet] - https://gerrit.wikimedia.org/r/104742 (owner: Hashar)
[11:31:50] (CR) Dzahn: [C: +1] ldap : lint cleanup (6 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/102629 (owner: Matanya)
[11:32:41] (CR) Dzahn: "some inline comments" [operations/puppet] - https://gerrit.wikimedia.org/r/102629 (owner: Matanya)
[11:39:20] (CR) Dzahn: "other comments here? don't let my -1 from Sept. block it, some platform eng. reviews could get it going again" [operations/puppet] - https://gerrit.wikimedia.org/r/83574 (owner: Reedy)
[11:44:08] matanya: you know what would be helpful, wikitech editing when you find references to Tampa and know it's already eqiad now
[11:45:04] mutante: I wish i knew what is eqiad now. i'll try to fix it when i meet it, but i really don't know where servers are
[11:45:08] and that table of "https-less domains"
[11:45:21] checking which of them can be removed or checked as resolved/wontfix
[11:45:26] what table?
[11:45:38] matanya: yea, only when you can be really sure it's done from logs
[11:46:02] matanya: https://wikitech.wikimedia.org/wiki/Httpsless_domains
[11:46:09] and there is a matching tracking bug in BZ
[11:46:14] for https/cert issues etc
[11:46:24] ok, i'll sort it out
[11:46:29] thank you! :)
[11:48:05] matanya: https://wikitech.wikimedia.org/wiki/Tampa_cluster (just fix ticket links/status updates if you see them, fyi)
[11:48:30] don't mark services as moved/"done" though without double-checking with the people who did it
[11:49:00] yeah, sure :)
[11:49:08] cool, tyvm
[11:49:19] (CR) Alexandros Kosiaris: [C: +2] retab certs.pp [operations/puppet] - https://gerrit.wikimedia.org/r/104742 (owner: Hashar)
[11:49:26] (CR) Matanya: ldap : lint cleanup (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/102629 (owner: Matanya)
[11:53:37] (CR) Alexandros Kosiaris: [C: +2] stages.pp puppet lint fixes [operations/puppet] - https://gerrit.wikimedia.org/r/104919 (owner: Hashar)
[11:54:16] (CR) Dzahn: "bump" [operations/puppet] - https://gerrit.wikimedia.org/r/96413 (owner: Dzahn)
[11:55:59] (CR) Dzahn: [C: +1] "acked by ezachte, now ideally this would be +2ed/merged by another ops" [operations/puppet] - https://gerrit.wikimedia.org/r/106738 (owner: Dzahn)
[12:00:43] (PS7) Matanya: ldap : lint cleanup [operations/puppet] - https://gerrit.wikimedia.org/r/102629
[12:02:11] conflicts are so much fun :/ especially when i create them with my other patches
[12:11:34] mutante: just to make sure i got it right: ekrem.wikimedia.org redirects to https://meta.wikimedia.org/wiki/IRC and the https version links to /dev/null
[12:12:10] this means it is not https enabled, yeah?
[12:14:51] matanya: redirecting it to IRC was just a convenience thing to get people to the right docs, yes
[12:15:22] what is ekrem anyway?
[12:15:26] because it's role::ircd
[12:15:35] so IRC related docs
[12:15:48] but besides that it's not an http server, it's an IRC server
[12:16:16] so it should not have http anyway
[12:16:19] before that redirect you got a different error
[12:16:24] long time ago
[12:17:01] i think it doesn't and the redirect is in the cluster redirects.conf
[12:17:14] but need to check exactly that
[12:17:41] where is that file?
[12:18:06] matanya: correction, i know why
[12:18:13] it runs apache for another reason
[12:18:21] root@ekrem:/etc/apache2/sites-enabled# ls
[12:18:21] irc.wikimedia.org mobile.wikipedia.org wap.wikipedia.org
[12:18:36] need to find out about the other 2 being deprecated etc
[12:18:52] and after that, remove the httpd from it, correct
[12:19:17] so it had apache anyways and also ircd, and then the redirect was just to make it better than giving you "it works"
[12:19:30] when people enter the URL in a browser
[12:19:39] mobile does work, doesn't ekrem serve it?
[12:19:52] wap doesn't
[12:20:02] find the related bugs and latest status there
[12:20:21] they are already somewhere waiting for comment afair
[12:20:35] bz tickets or rt?
[12:20:38] both :p
[12:20:46] ekrem has RT as a host and services on it
[12:20:57] and BZ has tickets about issues with redirects, certs, ..
[12:21:17] and it should be in the "what's left in Tampa" tracking bug in RT
[12:21:35] search for the hostname
[12:21:54] and in that wikitech "Tampa cluster" template, links to tickets
[12:25:29] mutante: ok, found https://rt.wikimedia.org/Ticket/Display.html?id=4784 and the relevant ircd role you created.
[12:26:02] this means the ircd is still in tampa, and needs a replacement in eqiad. what host is allocated for that?
[12:26:57] (CR) Alexandros Kosiaris: [C: +2] add nuria to privatedata admins [operations/puppet] - https://gerrit.wikimedia.org/r/106738 (owner: Dzahn)
[12:36:42] akosiaris: thanks, that worked, resolved 6617
[12:36:52] matanya: put exactly that on the ticket please, valid question :)
[12:37:59] mutante: :-)
[12:41:12] hey, can someone restart the poolcounter service on helium and potassium? it looks sickly, resulting in jawiki's main page displaying nothing but errors for anons
[12:42:09] (PS1) Alexandros Kosiaris: Remove all occurences of old etherpad [operations/puppet] - https://gerrit.wikimedia.org/r/107136
[12:42:23] akosiaris, apergos, paravoid, mark ^
[12:42:46] MaxSem: done
[12:42:52] thanks :)
[12:42:54] we had a bunch of connection refused errors
[12:42:55] ah, i had the shell open
[12:42:59] aren't they monitored ?
[12:43:03] almost restarted twice
[12:43:12] wee, worked
[12:43:27] great
[12:44:02] I love our wfDebugLog( 'poolcounter' ) messages:
[12:44:02] 2014-01-13 12:43:43 mw1207 ruwiki: Ошибка при подключении к серверу-счётчику пула: Connection refused (Russian: "Error connecting to the pool counter server: Connection refused")
[12:44:05] log it?
[12:44:40] !restarted poolcounter on potassium, helium after MaxSem's request
[12:44:44] hasharAway, bad example if I can read it ;)
[12:44:51] !log restarted poolcounter on potassium, helium after MaxSem's request
[12:44:58] Logged the message, Master
[12:44:59] PROBLEM - poolcounter on helium is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd
[12:45:04] hmmm
[12:45:09] PROBLEM - poolcounter on potassium is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd
[12:45:12] !log that was https://bugzilla.wikimedia.org/show_bug.cgi?id=59993
[12:45:18] Logged the message, Master
[12:45:23] ah, we have a monitor but that just verifies the process is around :(
[12:45:42] the problem was that this process was stuck
[12:46:02] well it is not running now
[12:46:15] monitoring not recovered yet, yea
[12:46:27] nope... it is really not running
[12:46:33] not only monitoring, logs indicate that it doesn't work
[12:46:50] !log starting poolcounter on heloum
[12:46:57] Logged the message, Master
[12:46:58] * Starting poolcounter poolcounter [ OK ]
[12:46:59] RECOVERY - poolcounter on helium is OK: PROCS OK: 1 process with command name poolcounterd
[12:47:02] ?
[12:47:05] (PS1) Dan-nl: adding '*.openbeelden.nl' to the wgCopyUploadsDomains array. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107138
[12:47:09] RECOVERY - poolcounter on potassium is OK: PROCS OK: 1 process with command name poolcounterd
[12:47:25] !log started poolcounter on potassium
[12:47:31] Logged the message, Master
[12:47:32] damn, i always get the typos into the log :p
[12:47:36] ok... why did this happen ?
[12:47:44] heh, well, there we are, but why
[12:48:42] where does poolcounter log to ?
[12:48:46] different restart method? init script?
[12:48:49] grrr
[12:48:49] poolcounterd*
[12:48:51] vs puppet?
[12:49:03] puppet seems to use the init script
[12:49:11] Jan 13 12:43:08 helium puppet-agent[2959]: (/Stage[main]/Poolcounter/Service[poolcounter]/ensure) change from stopped to running failed: Could not start Service[poolcounter]: Execution of '/etc/init.d/poolcounter start' returned 1: at /etc/puppet/manifests/poolcounter.pp:19
[12:49:19] it now works but is getting constant lock timeouts
[12:50:14] it runs as "109"
[12:50:27] UID? permissions?
[12:50:53] poolcounter:x:109:113:PoolCounter,,,:/:/bin/false
[12:50:58] so no problem there
[12:51:02] hmm
[12:51:31] well, to start it I used the exact same command line puppet says it used
[12:51:33] mutante: i asked that. moving on to the next one. thanks for the tutorial :)
[12:52:17] matanya: welcome, the tickets probably need just those questions to be un-stalled
[12:52:40] rt bugmeister :)
[12:53:20] akosiaris: your restart was also /etc/init.d/ ?
[12:53:25] yes
[12:53:35] uhm, then i start to run out of ideas
[12:53:39] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Mon 13 Jan 2014 03:50:33 AM UTC
[12:53:45] I did it via cssh
[12:53:54] so I restarted both at pretty much the same time
[12:54:12] MaxSem: would this cause any problems ?
[12:54:38] akosiaris, no idea
[12:54:47] maybe the restart method has a timing issue, when trying to kill and wait before restart ?
[12:54:51] (PS1) Hashar: poolcounter.pp: retab/puppet lint fix [operations/puppet] - https://gerrit.wikimedia.org/r/107140
[12:54:51] but now it looks just like before the restart
[12:54:53] (PS1) Hashar: poolcounter: monitor TCP port 7531 replies [operations/puppet] - https://gerrit.wikimedia.org/r/107141
[12:54:57] ^^^^ might give us TCP monitoring for poolcounter.
[12:54:59] it was a clean stop, start btw
[12:54:59] remember we had some "sleep x" hacks in slightly similar things
[12:55:03] hmm
[12:55:31] at least, jawiki's main page now works :)
[12:55:47] and hashar lints poolcounter.pp, hehe :) nice
[12:55:52] (PS2) Hashar: poolcounter: monitor TCP port 7531 replies [operations/puppet] - https://gerrit.wikimedia.org/r/107141
[12:55:59] MaxSem: good
[12:56:11] (CR) Hashar: "Patchset 2 explains how I found out port 7531." [operations/puppet] - https://gerrit.wikimedia.org/r/107141 (owner: Hashar)
[12:56:23] likes ja though he can't read it
[12:57:56] also checked there was no package upgrade of poolcounter by grepping /var/log/apt
[12:58:42] it might be hosed due to a libevent upgrade or some other dependency
[12:59:17] bah, poolcounter.log spams us for jawiki
[12:59:21] albeit with an empty message :(
[12:59:37] or maybe japanese is filtered out by wfDebugLog() :D
[12:59:47] potassium, last thing in apt/history.log is just 2014-01-09 upgrading puppet itself
[13:00:03] mutante, akosiaris: bugzilla is on zirconium which is in eqiad, right?
[13:00:36] matanya: yes and no, it's not done
[13:01:07] what is missing? test_user?
[13:01:09] matanya: while we speak prod is still kaulen, but zirconium is prepared for the new version, that's because we do multiple things at one time
[13:01:25] mutante: 4.4 ?
[13:01:31] moving server, upgrading BZ major version and making puppet a module
[13:01:38] to solve those tickets at once
[13:01:41] yes
[13:01:47] heads up, jawiki is broken again
[13:01:56] grmbl
[13:02:07] the question is what is purging it
[13:02:45] or do we have a memcached failure that kills that particular page's parser cache?
[13:03:06] paravoid: time to chime in here?
[13:04:18] worst case, we could disable PC completely and pray no cache stampede happens while we're fixing it
[13:04:39] the poolcounter process is still running
[13:04:45] weird thing, errors are only jawiki and a bit of ruwiki
[13:04:48] and the server doesn't look very busy
[13:05:26] the process is running fine from what it seems
[13:05:32] ack
[13:05:47] what is different with "ja"
[13:05:52] does it have any log?
[13:06:13] going through epoll (libevent), recvfrom, sendto... I actually see data passing through
[13:06:20] MaxSem: not that I can find out...
[13:08:26] do we need Platonides? (author of poolcounter?)
[13:09:59] Reedy, around? any ideas what might be causing constant main page purges on jawiki?
[13:10:39] (PS1) MaxSem: Disable PoolCounter on jawiki, lots of errors breaking main page [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107143
[13:11:06] not sure if I should deploy ^^ yet
[13:11:15] MaxSem: add ruwiki too
[13:11:32] that's relatively rare
[13:12:21] need some help re: a difference between commons beta cluster and production. gwtoolset can download images from http://www.europeana1914-1918.eu without issue on beta, but not on production. the domain has been whitelisted on both servers in the wgCopyUploadsDomains array. any way i can track down the issue? is there a log on production i could look at?
[13:18:32] mutante: so 4.4 is ready or not?
[13:19:17] matanya: not ready, just close to it
[13:19:46] but we know the missing steps and it's being worked on
[13:19:59] no need for new tickets there
[13:20:11] MaxSem: do we have a translated version of that error message?
[13:20:34] mark, lock wait timeout
[13:21:04] isn't that just poolcounter lock contention?
[13:21:25] i mean, disabling poolcounter might then just make things worse
[13:21:29] mark, another one is "queue full"
[13:21:47] that suggests there's a lot of contention, doesn't it
[13:22:17] yeah, that's why I'm trying to figure out why it gets constantly reparsed
[13:23:17] full_queues: 10772
[13:23:19] that's increasing
[13:26:16] so when did this start happening?
[13:26:45] (CR) Alexandros Kosiaris: [C: +2] Remove all occurences of old etherpad [operations/puppet] - https://gerrit.wikimedia.org/r/107136 (owner: Alexandros Kosiaris)
[13:27:01] was first reported an hour ago
[13:27:32] (CR) Mark Bergsma: [C: -1] "I'm not sure that's wise, as the poolcounter service itself seems to be functioning correctly." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107143 (owner: MaxSem)
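For readers following along: "lock wait timeout" and "queue full" are errors returned by MediaWiki's PoolCounter client when too many requests contend for the same piece of work. A minimal sketch of how calling code takes a PoolCounter lock, using core's generic PoolCounterWorkViaCallback (illustrative only; article views actually go through core's PoolWorkArticleView, and $page here is a hypothetical WikiPage object):

    // Sketch of the PoolCounter client API; not the exact production code path.
    $work = new PoolCounterWorkViaCallback(
        'ArticleView',            // pool type, configured per type in $wgPoolCounterConf
        'page:' . $page->getId(), // lock key: all parses of one page share a pool
        array(
            'doWork' => function () {
                // the expensive, guarded work goes here, e.g. parsing the page
            },
            'error' => function ( $status ) {
                // "Pool queue is full" / lock timeout errors surface here; they
                // are what ends up in poolcounter.log via wfDebugLog( 'poolcounter', ... )
                return $status;
            },
        )
    );
    $result = $work->execute();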
[13:27:53] although looking at poolcounter.log, it was this way before too, maybe in a less severe way
[13:28:08] likely at least through the weekend
[13:28:16] awww
[13:28:25] now other wikis also report problems
[13:28:42] 2014-01-13 13:28:02 mw1201 enwiki: Pool queue is full
[13:31:02] I see it even back in october
[13:31:14] 2013-10-19 07:53:22 mw1144 ruwiki: Накопитель запросов полон ("Pool queue is full")
[13:31:14] 2013-10-19 07:53:22 mw1201 enwiki: Pool queue is full
[13:31:14] 2013-10-19 07:53:22 mw1199 jawiki: プールキューがいっぱいです ("Pool queue is full")
[13:31:14] 2013-10-19 07:53:22 mw1130 dewiki: Poolwarteschlange ist voll ("Pool queue is full")
[13:31:14] 2013-10-19 07:53:22 mw1208 enwiki: Pool queue is full
[13:31:28] yeah, that log is never empty
[13:31:29] cute, those localised errors
[13:31:41] (CR) Dzahn: [C: +1] poolcounter: monitor TCP port 7531 replies [operations/puppet] - https://gerrit.wikimedia.org/r/107141 (owner: Hashar)
[13:31:47] we can perhaps increase the queue size a bit, see what that does
[13:32:05] however, looking in the archive, yesterday's log was more than twice as long as the day before it
[13:32:09] we don't have stats on pool queues, do we
[13:32:26] I was trying to find them, but couldn't
[13:35:37] The ja.wp error was reported in https://bugzilla.wikimedia.org/show_bug.cgi?id=59993
[13:35:48] mark ^
[13:36:13] that's what we're investigating
[13:36:37] the restart timestamps can be kind of found via root@neon:/var/log/icinga# grep poolcounter icinga.log
[13:38:58] most traffic is for lucene
[13:39:43] from the api I think
[13:40:06] there was a change earlier that removed tampa search, decom'ed per chad
[13:40:15] search regularly gives poolcounter errors
[13:40:25] searchidx2 was removed from dsh groups
[13:40:33] Nemo_bis, read the backscroll
[13:41:50] https://gerrit.wikimedia.org/r/#/c/106622/
[13:42:37] (Abandoned) Odder: Disable local uploads on Korean Wikinews [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/106273 (owner: Odder)
[13:42:40] that was earlier today, can it be related?
[13:42:51] since you say lucene traffic
[13:43:16] mutante: the large log started yesterday
[13:43:23] matanya: ok
[13:43:24] as noted above
[13:43:52] so i guess it is not directly related, though it might be adding to the situation
[13:44:03] -rw-r--r-- 1 udp2log udp2log 1229513 Jan 11 06:20 archive/poolcounter.log-20140111.gz
[13:44:03] -rw-r--r-- 1 udp2log udp2log 1249982 Jan 12 06:23 archive/poolcounter.log-20140112.gz
[13:44:03] -rw-r--r-- 1 udp2log udp2log 2718145 Jan 13 06:25 archive/poolcounter.log-20140113.gz
[13:44:03] -rw-r--r-- 1 udp2log udp2log 73958635 Jan 13 13:31 poolcounter.log
[13:44:20] related to the decom'ing of tampa search that occurred before that?
[13:44:30] maybe
[13:44:52] mutante, I doubt old search used parsed wikitext
[13:45:08] MaxSem: k, just ruling things out that happened today
[13:45:21] (PS1) Mark Bergsma: Raise ArticleView pool size by 50% [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107146
[13:45:53] the question is, does new search use the parser?
[13:46:05] no idea
[13:46:09] any objection to this change?
[13:46:14] nope
[13:46:18] let's try then
[13:46:24] (CR) Mark Bergsma: [C: +2] Raise ArticleView pool size by 50% [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107146 (owner: Mark Bergsma)
[13:46:58] !log mark updated /a/common to {{Gerrit|I0442878ea}}: Raise ArticleView pool size by 50%
[13:47:05] Logged the message, Master
[13:47:33] however, page views are affected, and in a weird way, as if the parser cache was not saving parse results
[13:47:50] !log mark synchronized wmf-config/PoolCounterSettings-eqiad.php 'Raise ArticleView pool queue size by 50%'
[13:47:56] Logged the message, Master
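The knob being turned here lives in wmf-config/PoolCounterSettings-eqiad.php. For orientation, this is the general shape of a pool definition in $wgPoolCounterConf (the key names are the real ones; the numbers below are invented, not the production values):

    // Illustrative pool definition; real values are in operations/mediawiki-config.
    $wgPoolCounterConf = array(
        'ArticleView' => array(
            'class'    => 'PoolCounter_Client', // talks to poolcounterd on helium/potassium
            'timeout'  => 15,  // seconds to wait for a lock ("Timeout waiting for the lock")
            'workers'  => 2,   // concurrent workers allowed per key
            'maxqueue' => 100, // waiters allowed per key; past this, "Pool queue is full"
        ),
    );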
[13:48:54] no real improvement
[13:54:21] the big log started yesterday, but it was also the weekend
[13:54:27] it could have been some deployment on friday or thursday
[13:55:19] and there have been both mediawiki upgrades as well as search migrations then
[13:56:25] mutante: netmon1001 can host manutius and streber services?
[13:58:34] matanya: something netmon*, not sure if all on the same, maybe netmon1002
[13:58:50] should i prepare patches for those?
[13:59:25] just update the tickets for now asking that (which service to which hardware)
[13:59:36] and i don't see any netmon1002, i guess not deployed yet
[14:00:03] i made it up to say i'm not sure about that, i think there has simply been no discussion yet
[14:00:23] for what?
[14:00:33] !log powering off hooper
[14:00:38] streber services have already moved
[14:00:40] the question is if netmon1001 replaces streber and manutius at the same time
[14:00:40] Logged the message, Master
[14:00:46] or there should be 2 hosts
[14:00:51] sure
[14:01:07] they were both torrus and ganglia at some point
[14:01:18] and streber was rancid
[14:01:29] mark: site.pp shows streber is still in heavy use
[14:01:43] site.pp can never show whether anything is in heavy use
[14:02:02] streber is certainly not in use atm
[14:02:25] mark: read as: has many roles :)
[14:02:49] PROBLEM - Host hooper is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:06] https://wikitech.wikimedia.org/wiki/Tampa_cluster#manutius
[14:03:18] MaxSem: might be helpful to clue up those poolcounter error messages a bit
[14:03:23] matanya:
[14:03:32] queue full, which queue, for what article, etc
[14:03:47] mark, to clue up = to make them all English?
[14:03:51] ah
[14:03:54] not necessarily
[14:03:58] but some more debugging info couldn't hurt
[14:04:06] do we know it's just the main page?
[14:04:28] most likely not only
[14:05:44] mutante: so if i understand correctly, manutius and streber should be replaced by some netmon servers in eqiad, but it's not clear which, or what the current status is. is this right?
[14:06:00] all services on streber have already moved to netmon1001
[14:06:01] stupid fucking Status class doesn't allow you to pass both technical and end-user facing error information
[14:06:06] smokeping almost, the rest is done
[14:06:19] and as for manutius, at least torrus still needs to be moved, but it's not really important as torrus is pretty broken
[14:06:21] matanya: no, the "observium" TODO is done meanwhile, just not updated on the wiki
[14:06:28] and ganglia aggregators need to be moved elsewhere
[14:06:45] torrus can go onto netmon1001 too
[14:07:29] thanks, that made it clearer
[14:07:32] mark: would it be ok if i push puppet patches for those changes you mention?
[14:10:43] do we even need torrus now?
[14:11:01] it was mostly squid stats, wasn't it?
[14:11:16] and power usage I think?
[14:11:48] can't we just move these elsewhere and have one tool less? e.g. librenms has some power stuff for example
[14:11:58] I wanted to ask that same question ...
[14:12:43] * matanya looks at the channel ceiling
[14:13:36] i like torrus better
[14:13:43] for?
[14:13:47] what do you use torrus for?
[14:13:48] everything
[14:14:04] but mostly aggregated stats
[14:14:09] I've used it once or twice, I'm not very familiar with it
[14:16:59] anyhow, puppet patches for it, yes or no?
[14:36:21] so http://gdash.wikimedia.org/dashboards/poolcounter/ shows that it indeed most likely started yesterday morning
[14:36:39] i see one config change by reedy, on the CategoryTree extension
[14:36:53] no idea if that could possibly cause this, parser cache related issues maybe
[14:37:01] oh, you moved it here
[14:37:04] I have another theory then
[14:37:15] jawiki & ruwiki are s6, along with frwiki
[14:37:32] db1006 alerted at 07:55 UTC yesterday, briefly
[14:37:34] and is s6
[14:37:36] timing out on a db query inside a poolcounter lock?
[14:37:42] yeah
[14:37:46] possible yeah
[14:39:38] hey hashar, bd808|BUFFER, having an issue with gwtoolset working on production. when david tries to use the extension to download media from an external domain that has been whitelisted by wgCopyUploadsDomains it fails. it works fine on beta. bawolf thinks there may be a proxy whitelist as well … do you know if that's the case or if there might be another config we have to consider?
[14:40:58] or it could be the other way around of course, whatever is causing poolcounter contention is also causing db load
[14:43:19] right now it's lots of wikis hitting a full pool
[14:43:31] it was a burst, now it's fine
[14:43:39] trouble is, that error is appearing a lot normally too
[14:45:16] !log revoked hooper.wikimedia.org in puppetCA, Salt, stored configs in puppet cleaned
[14:45:22] Logged the message, Master
[14:45:33] /* Title::loadRestrictions */ select pr_type, pr_expiry, pr_level, pr_cascade from `page_restrictions` where pr_page = ?
[14:45:47] 52% of queries, 30% of time for db1006 in the past 48h
[14:46:34] 600ms to run
[14:46:43] Who could gather information from Wikipedia's server logs for the amount of page requests that include Firefox's "Mozilla-search" URL parameter?
[14:46:45] See https://bugzilla.mozilla.org/show_bug.cgi?id=758857#c20
[14:47:06] yeah
[14:47:09] but most for frwiki
[14:48:32] why does it take 600ms to run, that doesn't seem right
[14:48:33] it's a pk
[14:48:44] in a 4-column table
[14:49:05] https://ishmael.wikimedia.org/more.php?host=db1006&hours=48&checksum=3714932819255757230
[14:49:18] AFT?
[14:49:21] yes
[14:49:21] ?
[14:49:24] * Nemo_bis runs
[14:49:35] /* ArticleFeedbackv5Permissions::getProtectionRestriction */ select pr_level, pr_expiry from `page_restrictions` where pr_page = ? and pr_type = ? and (pr_expiry = ? or pr_expiry >= ?) limit ?
[14:49:39] is the second most popular
[14:49:43] so yes, it matches pretty well
[14:50:29] fwiw AFT will be disabled on fr.wiki in a matter of days AFAIK
[14:50:30] mark: time avg. seems to spike at a pattern that matches the poolcounter graph; unsure yet if one is the cause of the other or both effects of something else
[14:50:30] funnily, there's no AFT on jawiki
[14:51:24] nor on ruwiki
[14:52:28] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=MySQL+eqiad&h=db1006.eqiad.wmnet&jr=&js=&v=196&m=mysql_innodb_read_views&vl=views&ti=mysql_innodb_read_views
[14:54:49] (CR) Dzahn: [C: +1] poolcounter.pp: retab/puppet lint fix [operations/puppet] - https://gerrit.wikimedia.org/r/107140 (owner: Hashar)
[14:55:19] (Abandoned) MaxSem: Disable PoolCounter on jawiki, lots of errors breaking main page [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107143 (owner: MaxSem)
[14:56:06] it's a table with just 28k rows
[14:56:14] which one?
[14:56:43] page_restrictions
[14:56:54] yeah it's nothing
[14:57:02] the query is super simple, with pr_page being the primary key
[14:57:29] too many of them? degraded server performance?
[14:58:03] I also don't understand why db1015, at the same weight as db1006, gets 1/4 of the queries
[14:58:16] I'd bet on the former, as PC failures result in errors being returned, not cached, and people making more requests hitting the apaches
[15:00:01] query count is only slightly elevated
[15:00:34] current transactions almost doubled
[15:00:43] which isn't as bad as it sounds, it went from 128 to 240
[15:01:00] so it's "100 more" I guess
[15:10:44] (PS1) Mark Bergsma: Revert "Remove $wgCategoryTreeDynamicTag = true" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107158
[15:10:52] fuck it, let's try it
[15:11:05] heh
[15:11:08] fair enough
[15:11:20] (CR) Mark Bergsma: [C: +2] Revert "Remove $wgCategoryTreeDynamicTag = true" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107158 (owner: Mark Bergsma)
[15:11:44] !log mark updated /a/common to {{Gerrit|Ib336530c8}}: Revert "Remove $wgCategoryTreeDynamicTag = true"
[15:11:51] Logged the message, Master
[15:12:35] !log mark synchronized wmf-config/CommonSettings.php 'Revert Remove = true'
[15:12:39] oh yeah
[15:12:42] Logged the message, Master
[15:12:46] poolcounter.log went quiet
[15:13:05] ohrly
[15:14:39] reedy owes us some beer next week
[15:14:50] (PS1) Alexandros Kosiaris: decom hooper,eiximenis [operations/puppet] - https://gerrit.wikimedia.org/r/107159
[15:14:57] beers?
[15:14:59] he can pay us in gadgets
[15:15:30] nexus 5s or something :P
[15:15:47] i don't need that crap
[15:17:16] https://graphite.wikimedia.org/render/?title=PoolCounter%20Client%20Average%20Latency%20%28ms%29%20log%282%29%20-1week&from=-1week&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28MediaWiki.PoolCounter.Client.*.tavg%29
[15:17:21] haha
[15:17:25] nice
[15:17:27] yeah
[15:17:39] sorry for derailing you temporarily :)
[15:17:46] it wasn't derailing
[15:17:51] most certainly related ;)
[15:18:20] can that be re-enabled on Meta?
[15:18:28] I don't know
[15:18:30] i'm not gonna try it
[15:18:35] talk to reedy
[15:19:28] mark: will you reply to the bug report or should I?
[15:19:31] https://bugzilla.wikimedia.org/show_bug.cgi?id=59798
[15:19:39] you can do the bug, i'm writing a partial outage report
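For the record, what the revert restores is a single line in wmf-config/CommonSettings.php; the reason it matters is the CategoryTree code that mark quotes a little further down:

    // Restored by the revert:
    $wgCategoryTreeDynamicTag = true;
    // CategoryTree contains:
    //     if ( $parser && $wgCategoryTreeDisableCache && !$wgCategoryTreeDynamicTag ) {
    //         $parser->disableCache();
    //     }
    // and $wgCategoryTreeDisableCache defaults to true, so with the dynamic
    // tag turned off every page embedding a category tree became uncacheable
    // and was re-parsed on every view -- which is what kept filling the
    // ArticleView pools.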
[15:20:07] cool
[15:23:07] oh man
[15:23:13] db1006 is so much quieter now
[15:23:20] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1006.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1389626543&v=106&m=mysql_innodb_read_views&vl=views&ti=mysql_innodb_read_views&z=large
[15:23:24] among others
[15:26:07] (CR) Anomie: [C: +1] Changed date format in l10nupdate-1 [operations/puppet] - https://gerrit.wikimedia.org/r/106892 (owner: Tinaj1234)
[15:32:34] if ( $parser && $wgCategoryTreeDisableCache && !$wgCategoryTreeDynamicTag ) {
[15:32:37] $parser->disableCache();
[15:32:38] uhm...
[15:32:40] }
[15:33:14] $wgCategoryTreeDisableCache defaults to true
[15:35:14] akosiaris: pdf servers are still hardy? (please tell me they aren't)
[15:35:22] they are
[15:35:25] and they are being replaced soon
[15:35:27] dammit
[15:35:32] don't bother
[15:35:47] paravoid: replaced with what?
[15:36:33] https://www.mediawiki.org/wiki/PDF_rendering
[15:37:39] good news. thanks paravoid
[15:37:49] matanya: manifests/role/ocg.pp i believe
[15:38:02] offline content generator
[15:38:12] indeed
[15:38:19] the updated puppet manifests changeset is https://gerrit.wikimedia.org/r/#/c/102352/
[15:38:28] That is what i thought. I hope i didn't cause an RT-mailing storm in the ops mails because of your request mutante
[15:39:11] what server will hold this module? something in eqiad?
[15:39:38] i see it applied on rhodium.eqiad
[15:39:40] yes, we have 4 servers assigned
[15:39:40] site.pp
[15:39:52] rhodium is the test server, there are three more
[15:39:54] there's an RT somewhere
[15:40:42] 1101 (puppetize PDF servers), 838 (upgrade pdf servers to precise)
[15:40:47] no
[15:41:12] #6149: final decision / migration of PDF servers
[15:41:20] Hardware request for new PDF render servers
[15:41:23] #6335
[15:41:26] #6335: eqiad: (4) pdf generation servers - one allocated for testing
[15:41:32] k :)
[15:41:32] right, those two
[15:41:53] no permission to view :/
[15:42:02] matanya: search RT for "pdf" and you got them all :p
[15:42:10] mutante: please link 6149 with 6335
[15:42:21] no
[15:42:24] separate tickets
[15:42:35] and a link, no merge, paravoid
[15:42:43] linked as "refers to" only
[15:42:46] yes
[15:42:53] no dependency
[15:53:54] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Mon 13 Jan 2014 03:50:33 AM UTC
[15:59:44] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Mon Jan 13 15:59:38 UTC 2014
[16:10:47] (PS1) Tim Landscheidt: Fix various typos [operations/puppet] - https://gerrit.wikimedia.org/r/107165
[16:15:04] (CR) Faidon Liambotis: [C: +2] Fix various typos [operations/puppet] - https://gerrit.wikimedia.org/r/107165 (owner: Tim Landscheidt)
[16:15:11] (PS2) Faidon Liambotis: Fix various typos [operations/puppet] - https://gerrit.wikimedia.org/r/107165 (owner: Tim Landscheidt)
[16:15:23] (CR) Faidon Liambotis: [C: +2 V: +2] Fix various typos [operations/puppet] - https://gerrit.wikimedia.org/r/107165 (owner: Tim Landscheidt)
[16:25:34] hey bd808, having an issue with gwtoolset working on production. when david tries to use the extension to download media from an external domain that has been whitelisted by wgCopyUploadsDomains it fails. it works fine on beta. bawolf thinks there may be a proxy whitelist as well … do you know if that's the case or if there might be another config we have to consider?
[16:26:06] (PS3) Physikerwelt: added basic hbase support [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381
[16:26:30] dan-nl: Is it a particular domain or just any external site?
[16:26:53] we only tried europeana1914-1918.eu
[16:27:52] I'll look around a bit and see if I can remember where the config for the proxy lives
[16:28:03] thanks!
[16:28:34] (CR) Chad: [C: +1] "lgtm, will merge when window opens" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107108 (owner: Manybubbles)
[16:45:48] dan-nl: The squid proxy that is used to fetch external content is configured by files/squid/copy-by-url-proxy.conf in the operations/puppet.git repository. I'm not seeing anything in the acls used there that references particular hosts that are allowed/denied other than internal network blocks.
[16:46:40] Do you have an example URL that I could try fetching via the proxy to see what its response is?
[16:55:07] (PS3) Jforrester: Enable VisualEditor by default on "phase 4" Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102208
[16:55:18] (PS4) Jforrester: Enable VisualEditor by default on "phase 4" Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102208
[16:55:35] (CR) Alexandros Kosiaris: [C: +2] poolcounter.pp: retab/puppet lint fix [operations/puppet] - https://gerrit.wikimedia.org/r/107140 (owner: Hashar)
[16:55:46] (CR) Alexandros Kosiaris: [C: +2] poolcounter: monitor TCP port 7531 replies [operations/puppet] - https://gerrit.wikimedia.org/r/107141 (owner: Hashar)
[16:56:03] (PS1) Mark Bergsma: Revert "Raise ArticleView pool size by 50%" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107176
[16:57:40] (CR) Mark Bergsma: [C: +2] Revert "Raise ArticleView pool size by 50%" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107176 (owner: Mark Bergsma)
[16:58:07] !log mark updated /a/common to {{Gerrit|I89b765424}}: Revert "Raise ArticleView pool size by 50%"
[16:58:13] Logged the message, Master
[16:58:47] !log mark synchronized wmf-config/PoolCounterSettings-eqiad.php
[16:58:52] Logged the message, Master
[16:59:06] (PS6) JanZerebecki: Varnish: don't mobile redirect www.$project.org [operations/puppet] - https://gerrit.wikimedia.org/r/89879
[16:59:08] (PS2) JanZerebecki: varnish: simplify the mobile redirect regexp [operations/puppet] - https://gerrit.wikimedia.org/r/106669 (owner: Faidon Liambotis)
[16:59:54] rebase?
[17:00:17] very unhelpful
[17:00:33] as it gives the impression akosiaris' comment is addressed at first glance, I almost merged
[17:01:41] mark: so, as of now, it's ok to do new search related things? ie: we're back to stable-state/normal?
[17:01:53] yes
[17:01:56] yes, it wasn't search related
[17:02:27] <^d> Search uses pool counter, even if it was the article view queues that were messed up :)
[17:02:32] * greg-g nods
[17:02:44] so, cool, go forth and break more things, ^d
[17:03:03] (CR) Chad: [C: +2] Cirrus config updates [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107108 (owner: Manybubbles)
[17:03:03] the majority of poolcounter traffic is search, yes
[17:03:13] (Merged) jenkins-bot: Cirrus config updates [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107108 (owner: Manybubbles)
[17:04:38] !log demon synchronized wmf-config/CirrusSearch-common.php 'Commonswiki + enwiki Cirrus settings'
[17:04:45] Logged the message, Master
[17:05:32] !log demon synchronized wmf-config/InitialiseSettings.php 'enwiki gets Cirrus as secondary index'
[17:05:38] Logged the message, Master
[17:06:09] <^d> !log cirrus indexes created for enwiki
[17:06:16] Logged the message, Master
[17:06:16] <^d> manybubbles: Go forth and index :)
[17:06:22] I shall!
[17:08:02] it has begun
[17:08:22] (PS7) Faidon Liambotis: Varnish: don't mobile redirect www.$project.org [operations/puppet] - https://gerrit.wikimedia.org/r/89879 (owner: JanZerebecki)
[17:08:24] (PS3) Faidon Liambotis: varnish: simplify the mobile redirect regexp [operations/puppet] - https://gerrit.wikimedia.org/r/106669
[17:14:07] (CR) JanZerebecki: [C: +1] varnish: simplify the mobile redirect regexp [operations/puppet] - https://gerrit.wikimedia.org/r/106669 (owner: Faidon Liambotis)
[17:28:17] (CR) GWicke: "Can we deploy the new code before merging this puppet change to avoid the bootstrapping problem?" [operations/puppet] - https://gerrit.wikimedia.org/r/106471 (owner: Subramanya Sastry)
[17:29:28] <^d> 1.7mil htmlCacheUpdate jobs, 0 claimed.
[17:29:30] <^d> This seems wrong.
[17:30:00] <^d> 2 claimed because I played with it, but meh, why aren't job runners picking them up?
[17:30:46] Aaron worked on the jobs to defer the expansion of backlink jobs
[17:31:21] might or might not be related
[17:31:29] that was before Christmas though
[17:31:33] <^d> Yeah
[17:34:07] domas!
[17:34:22] mark!
[17:34:24] whatsup!
[17:34:29] are you a good swimmer?
[17:34:44] what why
[17:34:45] no
[17:34:47] maybe
[17:34:53] well I keep hearing you're on a sinking ship ;p
[17:35:22] damn!
[17:35:58] and you keep hearing that where?
[17:36:04] in the news
[17:36:14] but i'll see you in the water ;p
[17:37:03] hm
[17:37:04] :)
[17:37:44] * domas looks at operational dashboards, at general dashboards, at the stock market, sips some coffee, and continues working :)
[17:38:00] :)
[17:38:06] mark: I hear that the only dashboard at wikipedia going up is page load times!
[17:38:17] stab stab stab
[17:38:18] :)
[17:38:46] though interesting, there have been page load reductions
[17:39:19] * domas eyes ori's page load metrics
[17:39:35] which one?
[17:39:55] the http://noc.wikimedia.org/~ori/metrics/page-load.html one
[17:40:01] good data
[17:41:14] what was the gesture to kill apps in ios7? :)
[17:41:24] stupid me
[17:41:25] got it
[17:41:29] <^d> Throw the iphone on the ground?
[17:41:43] size 10 boot heel
[17:41:55] up in the air is preferable, I think
[17:41:58] the spotify app toasts my battery from time to time
[17:46:52] <^d> gwicke: Well whatever the cause, the queue is definitely growing right now :\
[17:47:27] mark: how are things on your side?!
[17:47:45] ok I guess
[17:47:47] I'll be in SF next week
[17:47:50] let's meet up some day
[17:48:02] ok!
[17:50:12] (CR) Catrope: [C: +2] Enable VisualEditor by default on "phase 4" Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102208 (owner: Jforrester)
[17:50:22] (Merged) jenkins-bot: Enable VisualEditor by default on "phase 4" Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102208 (owner: Jforrester)
[17:53:36] !log catrope synchronized visualeditor-default.dblist 'Enable VE by default on phase 4 Wikipedias'
[17:53:43] Logged the message, Master
[17:54:04] (CR) GWicke: WIP: Update parsoid puppet config to use new repository (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/106471 (owner: Subramanya Sastry)
[17:55:37] ^d, let's ask Aaron when he is around
[17:55:48] <^d> Yeah, haven't seen him on IRC or in the office yet.
[17:58:27] ^d, sent a mail
[17:58:29] paravoid, akosiaris, mutante: to push into gerrit you just need the gerrit right push (usually without -f). that just bypasses creation of changesets to review.
[17:58:48] <^d> gwicke: I saw, thanks.
[17:58:55] i would not advise bypassing gerrit.
[17:59:15] ?
[17:59:17] the other gerrit permissions are only for specially modified changesets for review.
[17:59:46] paravoid: answer to a few days ago :)
[17:59:58] yeah I've found what I was looking for
[18:01:17] (PS1) Chad: Deletion jobs are also high priority [operations/puppet] - https://gerrit.wikimedia.org/r/107184
[18:01:23] jzerebecki: you mean only when you want to import existing git repos to fork, right
[18:02:13] to get history but not a million gerrit patches
[18:02:37] yes
[18:03:52] k, thx
[18:05:03] why don't i get updates from rt on tickets i edit when people reply to me?
[18:05:39] because you are neither requestor nor admincc?
[18:05:48] oh, not good
[18:06:11] ask jeremyb, he adds himself to tickets he wants mail for
[18:06:41] i look more at the web ui than mail anyways, personally
[18:06:46] for RT
[18:07:37] specifically the "Operations Activity" thing
[18:08:05] matanya: and/or you can also use the bookmark feature, little star icons
[18:08:39] thanks
[18:08:48] got the Bookmarked Tickets widget on my dashboard for that
[18:09:02] so i can check the ones i want to watch without necessarily taking them
[18:17:36] (PS1) Matthias Mullie: Fix notice due to incorrect capitalization [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107186
[18:35:38] !log catrope synchronized wmf-config/InitialiseSettings.php 'touch'
[18:35:44] Logged the message, Master
[18:39:08] (PS1) Chad: Raise number of job runners by 1 per server [operations/puppet] - https://gerrit.wikimedia.org/r/107189
[18:39:16] <^d> AaronSchulz: ^
[18:39:52] (CR) Aaron Schulz: [C: +1] Raise number of job runners by 1 per server [operations/puppet] - https://gerrit.wikimedia.org/r/107189 (owner: Chad)
[18:40:23] (CR) jenkins-bot: [V: -1] Raise number of job runners by 1 per server [operations/puppet] - https://gerrit.wikimedia.org/r/107189 (owner: Chad)
[18:41:02] (PS2) Chad: Raise number of job runners by 1 per server [operations/puppet] - https://gerrit.wikimedia.org/r/107189
[18:41:13] <^d> commas are hard, sometimes.
[18:42:34] which bastion should i use to reach gallium? i can't get to it from iron.
[18:43:35] !log aaron updated /a/common/php-1.23wmf9 to {{Gerrit|Ic44c352a8}}: Update MobileFrontend to wmf/1.23wmf9 tip
[18:43:41] Logged the message, Master
[18:43:45] got it, nm
[18:43:51] missed the -A arg to ssh
[18:44:12] !log aaron synchronized php-1.23wmf9/includes/job/jobs/HTMLCacheUpdateJob.php 'Remove live sleep() hack in html cache jobs'
[18:44:19] Logged the message, Master
[18:44:49] * AaronSchulz wonders what's up with that log entry
[18:49:48] AaronSchulz: sleep! Oh no!
[18:50:01] that certainly would have slowed everything down a bit
[18:50:40] hah
[18:51:02] and to think, I was joking in another project's channel (git-annex) about it being slow because of random sleep()s
[18:51:07] that was there for ages, though it was obsoleted by something else in core
[18:51:46] should we expect the job queue to do more things now?
[18:52:04] it has collected an impressive backlog of those jobs
[18:52:53] it should help, but I don't know if that's enough
[18:53:16] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/83867
[19:00:59] ori: I can't find RunJobs.execute-HTMLCacheUpdateJob.count in graphite, only the archived- stuff
[19:01:53] hmm, probably the cli stuff doesn't end up there, just stats... right
[19:12:10] you couldn't call the jobrunners busier https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[19:15:10] Nemo_bis: I would have expected them more evenly distributed
[19:15:49] maybe some of those do video scaling? or that's in another group, but they're not all created equal
[19:30:18] ^d: would be nice to get https://gerrit.wikimedia.org/r/#/c/106777/ into wmf10
[19:36:40] teaching myself salt; i was curious what eqiad hosts mount things via nfs. is there a less hacky way than this command? : salt '*.eqiad.wmnet' cmd.run 'grep -i nfs /proc/mounts' | grep -B1 -i nfs
[19:37:36] there's a yaml output argument
[19:38:19] and there's an exit code thing too iirc
[19:39:03] oh cool, ok.
[19:39:07] * jgage digs around
[19:46:57] cmd.retcode will give me 1/0, but i don't see a way to say "don't print a line for hosts returning no output"
[19:55:27] (CR) Aaron Schulz: [C: +2] Prevent ParsoidCacheUpdateJobOnDependencyChange from running in the main loops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107196 (owner: Aaron Schulz)
[19:56:27] !log aaron synchronized wmf-config/CommonSettings.php 'Prevent ParsoidCacheUpdateJobOnDependencyChange from running in the main loops'
[19:56:33] Logged the message, Master
[19:59:43] hey bd808, sorry, had to quickly take care of a set of twins ...
[20:00:03] bd808 here's a link to a media file: http://www.europeana1914-1918.eu/attachments/9459/1234.9459.full.jpg
[20:00:40] dan-nl: Cool. I'll try to fetch that with curl and see if I get any useful error messages
[20:01:40] !log aaron synchronized php-1.23wmf10/includes/filebackend/SwiftFileBackend.php 'c1ab935f1307876a2127b17ad8ba1108f3e877b8'
[20:01:47] Logged the message, Master
[20:04:51] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[20:05:00] AaronSchulz: I've been watching the job queue for enwiki lately. the htmlCacheUpdate jobs are accumulating. so are my cirrusSearchLinksUpdate jobs. it is like refreshLinks has a higher priority than expected
[20:06:02] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[20:08:48] dan-nl: The URL you gave me downloads fine via curl using `--proxy url-downloader.wikimedia.org:8080` from inside the cluster.
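bd808's curl check can also be reproduced from MediaWiki's side on an app server, e.g. in maintenance/eval.php (a sketch only; the URL is the test file and the proxy is the one just mentioned):

    // Rough PHP equivalent of the curl test, runnable from maintenance/eval.php:
    $req = MWHttpRequest::factory(
        'http://www.europeana1914-1918.eu/attachments/9459/1234.9459.full.jpg',
        array( 'proxy' => 'url-downloader.wikimedia.org:8080' )
    );
    $status = $req->execute();
    var_dump( $status->isOK(), $req->getStatus() ); // getStatus() returns the HTTP code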
[20:09:21] hello
[20:09:38] * bd808 waves to hashar
[20:11:05] ahh
[20:11:10] my hero :-]
[20:11:22] bd808: I wrote a basic command line json linter for PHP :-]
[20:11:29] it is not urgent though
[20:12:04] was looking at dan-nl's reported issue about production application servers not being able to do HTTP GETs :/
[20:12:16] and I am pissed off because I should have spotted that
[20:13:02] hashar: I saw the json linter. I've got the review for it open but haven't focused on it long enough to approve or comment.
[20:15:04] hashar: I cannot reproduce dan-nl's download problem with curl from terbium. That makes me think the problem is not the squid proxy.
[20:15:37] bd808: by default there is no proxy configured I think
[20:16:29] but that one works: curl -v http://www.europeana1914-1918.eu/ --proxy url-downloader.wikimedia.org:8080 --user-agent 'hashar-test'
[20:16:53] Is his code not using $wgCopyUploadProxy ? dan-nl ?
[20:18:28] (PS1) Ottomata: Adding mapreduce_shuffle_port parameter [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/107207
[20:18:42] ahh yet another $wg.*Proxy :(
[20:19:52] it is
[20:20:05] hashar: He's using UploadFromUrl which checks $wgCopyUploadProxy in UploadFromUrl::reallyFetchFile()
[20:20:29] bd808: it is … /mediawiki-config/wmf-config/InitialiseSettings.php
[20:21:03] * bd808 nods
[20:21:07] wgCopyUploadsDomains, not wgCopyUploadProxy
[20:21:25] it works on beta without issue
[20:21:37] yeah, beta instances have direct access to the web
[20:21:44] it doesn't need a web proxy
[20:21:50] i see
[20:21:55] the apaches/job runners in production do need a web proxy though
[20:22:24] so I guess you will have to figure out how to have your code use $wgCopyUploadProxy whenever it is set
[20:22:49] in production that is url-downloader.wikimedia.org:8080
[20:22:57] I tested it on a random application server and that works
[20:22:59] so do i need to add the domains in the wgCopyUploadsDomains array to the wgCopyUploadProxy array somewhere?
[20:23:01] albeit the box in pmtpa :/
[20:23:26] UploadFromUrl::reallyFetchFile() checks for and uses the value of $wgCopyUploadProxy
[20:23:38] ahh
[20:23:44] so different issue :/
[20:25:26] so can i just add the whitelisted domains to the InitialiseSettings.php wgCopyUploadProxy array just like the wgCopyUploadsDomains array?
[20:25:52] dan-nl: No. wgCopyUploadProxy has the right value
[20:26:10] (CR) Ottomata: [C: +2 V: +2] Adding mapreduce_shuffle_port parameter [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/107207 (owner: Ottomata)
[20:26:18] I don't see any sign that the squid proxy restricts the domains that can be accessed
[20:26:20] also wonder how flickr works for upload wizard then … do i need to alter the extension code then?
[20:27:04] bd808: yeah I tried that proxy using curl against http://www.europeana1914-1918.eu/ and that works
[20:27:30] Agreed. It worked for me as well.
[20:28:11] So what we need to see is the response body and possibly headers from the request that is failing in prod
[20:31:10] manybubbles: I have some extra runners for RL on terbium to get them down faster after all the RL2 jobs were converted
[20:31:13] * AaronSchulz was still running that
[20:31:29] ah!
[20:31:30] bd808, hashar i added this patch hoping to get some insight from the MWHttpRequest https://gerrit.wikimedia.org/r/#/c/107038, but need to wait for it to get merged
[20:31:45] is there anything else i could do?
[20:31:46] my linksUpdateJobs would have kept up without that, I think
[20:33:15] dan-nl: You are using MWHttpRequest directly?
[20:33:29] That's the bug
[20:33:36] for that evaluation where it fails, yes
[20:33:55] that one needs $wgHTTPProxy, doesn't it?
[20:34:24] Yes. It needs to set the 'proxy' option if $wgCopyUploadProxy !== false
[20:34:26] GWToolset/includes/Handlers/UploadHandler.php evaluateMediafileUrl
[20:34:55] we should probably make $wgHTTPProxy support an array definition
[20:35:00] with roles as keys :-D
[20:35:16] dan-nl: Look at UploadFromUrl::reallyFetchFile to see the logic that you need to add to use the http proxy in environments that configure it
[20:35:20] $wgHTTPProxy['gwtoolset'] = 'proxyhost:8080'
[20:35:34] k, looking now ...
[20:40:56] hashar: Interestingly we don't seem to use $wgHTTPProxy in our config at all. It's in includes/DefaultSettings.php but we don't change it in operations/mediawiki-config anywhere
[20:41:38] hashar, is it possible to set up the beta cluster to use a proxy server as well?
[20:41:41] bd808: time for yet another RFC :-D
[20:41:59] <^d> bd808: We use a proxy. See what I did in ExtensionDistributor.
[20:42:01] <^d> $wgExtDistProxy = 'url-downloader.wikimedia.org:8080';
[20:42:05] dan-nl: I am not sure how to prevent instances from reaching the outside world
[20:43:05] i'm just thinking that it would be "best" if the beta cluster could mimic the production set-up as much as possible, then we would have found this issue on it instead of on production ...
[20:43:17] ^d: I agree that we use proxies in several places but we don't set the "default" proxy value for MWHttpRequest in the cluster config
[20:43:33] <^d> No, we don't. Because we don't want things to just start making requests.
[20:43:40] <^d> It's whitelist-based, so they'd fail anyway.
[20:44:44] <^d> bd808: Anyway, it's sort of an ops question. I had to get help from ma rk I think when I rewrote ExtDist.
[20:44:49] <^d> To set things up right on linne.
[20:45:37] There is a squid on linne that seems to "do the right thing". Dan just didn't know to configure his code to use it and we all missed it in code review.
[20:46:23] It should be an easy fix though. ~3 lines of code
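Put together, the fix bd808 describes mirrors what UploadFromUrl::reallyFetchFile() already does; applied to GWToolset's direct MWHttpRequest call it looks roughly like this (a sketch of the logic, not the actual patch that landed; $url stands in for the external media file URL GWToolset is fetching):

    // Honour $wgCopyUploadProxy the same way UploadFromUrl::reallyFetchFile() does.
    global $wgCopyUploadProxy;

    $options = array( 'followRedirects' => true );
    if ( $wgCopyUploadProxy !== false ) {
        // url-downloader.wikimedia.org:8080 in production; stays false on beta,
        // where instances can reach the web directly.
        $options['proxy'] = $wgCopyUploadProxy;
    }
    $req = MWHttpRequest::factory( $url, $options );
    $status = $req->execute();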
[21:00:35] <^d> You weren't dreaming, but we're not going to rename.
[21:00:36] hashar: Abandoned.
[21:00:37] <^d> It'd break stuff.
[21:05:45] (03PS2) 10Ori.livneh: Deletion jobs are also high priority [operations/puppet] - 10https://gerrit.wikimedia.org/r/107184 (owner: 10Chad)
[21:05:51] (03CR) 10Ori.livneh: [C: 032 V: 032] Deletion jobs are also high priority [operations/puppet] - 10https://gerrit.wikimedia.org/r/107184 (owner: 10Chad)
[21:06:43] ^d: ^
[21:07:18] <^d> woot.
[21:07:34] <^d> Do we have to do anything else after they get merged, or will the job runners pick them up smartly?
[21:07:39] * ^d can't remember
[21:07:53] they will pick it up when puppet runs next, but i can run puppet manually
[21:07:56] what are the hostnames again?
[21:08:18] mw1001-mw1016
[21:08:41] * ori salts
[21:11:27] oh, i fucked up
[21:11:47] i forced puppet on 'mw10*'
[21:12:02] if the puppetmaster overloads, that's why
[21:13:40] looks ok, so not freaking out
[21:14:08] <^d> AaronSchulz: When do abandoned jobs get abandoned forever?
[21:14:32] 7 days after being abandoned they get deleted
[21:15:13] <^d> 7 days, gotcha
[21:21:56] <^d> I wonder if those extra runners are helping.
[21:22:04] <^d> I might be crazy, but I think the # of jobs is going down now.
[21:22:19] <^d> They are :D
[21:34:18] (03PS1) 10Chad: Fix variable name [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107234
[21:35:18] (03CR) 10Chad: [C: 032] Fix variable name [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107234 (owner: 10Chad)
[21:35:27] (03Merged) 10jenkins-bot: Fix variable name [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107234 (owner: 10Chad)
[21:37:13] !log demon synchronized wmf-config/CirrusSearch-common.php 'Typofix in variable name'
[21:37:20] Logged the message, Master
[21:39:49] (03CR) 10Matanya: "why two patches for lint stuff? this one and https://gerrit.wikimedia.org/r/#/c/104807/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/104806 (owner: 10Hashar)
[21:43:43] regarding https://wikitech.wikimedia.org/wiki/Cricket : is the proper thing to rename it to Obsolete:Cricket ?
[21:44:45] <^d> jgage: {{obsolete}}
[21:44:54] jgage: i don't think we have that namespace yet, but better than deleting it. maybe just add a template with a warning that it's old first
[21:44:54] <^d> At the top.
[21:44:58] ah, then that
[21:45:09] ok, thanks dr daemon
[21:45:39] (03CR) 10Matanya: added basic hbase support (033 comments) [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt)
[21:49:04] greg-g: We're skipping MW deploy next week? :-(
[21:49:16] yep
[21:49:20] Boo.
[21:49:23] blame freedom
[21:49:27] and RFCs
[21:49:38] We don't do roll-outs on Mondays anyway, so no blaming MLK.
[21:49:46] it'd be a two-day week
[21:49:51] what with the RFC stuff
[21:49:54] For some people.
[21:50:03] Lots of people aren't going to that ivory tower thing. :-)
[21:50:08] most people that can fix the site ;)
[21:50:23] {{cn}}. :-P
[21:50:41] * greg-g looks at platform's involvement
[21:51:07] roan's participating, too
[21:51:26] yep, pretty much all the go-to people :P
[21:58:39] <^d> Shouldn't we deploy though, since everyone will be here?
[21:58:47] <^d> So when the site breaks, we're all on hand.
[22:02:44] why isn't there a deploy, greg-g?
[22:06:06] (03CR) 10Hashar: "That one is only tabs to spaces, so it's easy to review and has no impact on production since no code is actually changed." [operations/puppet] - 10https://gerrit.wikimedia.org/r/104806 (owner: 10Hashar)
[22:18:30] (03CR) 10Matanya: [C: 04-1] "some consistency comments :)" (0326 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104807 (owner: 10Hashar)
[22:18:59] no offense hashar
[22:19:01] matanya: see above
[22:19:18] ^d: we'll all be a bit focused on other things
[22:19:45] freedom and RFCs? what is that, greg-g?
[22:20:01] <^d> greg-g: Focus?
[22:20:07] * ^d finds something shiny
[22:20:15] matanya: Martin Luther King Jr. Day, which is a holiday in the US, and the Architecture Summit (which goes over RFCs) on Thur/Fri
[22:20:20] ^d: sorry, you were saying?
[22:20:31] thanks
[22:20:50] <^d> greg-g: Oooh, a pigeon outside the window!
[22:23:16] (I looked)
[22:26:42] LeslieCarr: are you the right person to escalate issues that need RT attention?
[22:26:56] ticket 6629 needs some attention
[22:28:03] oh, not this week
[22:28:04] but i can look
[22:28:24] oh man, i have no idea....
[22:29:26] (03CR) 10Ottomata: "Thanks so much for this, btw. I'm going to try to find time to test and review this week. :)" [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt)
[22:31:30] LeslieCarr: can you forward it to the appropriate victim? I don't think anyone knows what the right way to start that thing is.
[22:39:37] um, shoot
[22:39:39] let me see
[22:39:59] do you know how the search cluster is set up ?
[22:40:14] it's sort of this black box....
[22:40:34] looks like bad things just happened to redis
[22:40:48] LeslieCarr: kinda.
[22:41:13] but I really don't know how to launch that job
[22:41:18] rather, what the right way to log the job is
[22:41:25] yeah...
[22:42:28] me neither ;)
[22:42:37] <^d> Me neither (for the record)
[22:42:38] Jeff_Green: is the one on duty... but i don't think he knows either ;)
[22:43:01] <^d> LeslieCarr: I hear notpeter knows about search ;-)
[22:43:04] hahaha
[22:43:05] I have a hunch that they just start this job in a screen session or something horrible.
[22:43:09] oh go
[22:43:10] god
[22:43:12] seriously ?
[22:43:17] dunno
[22:43:30] I _think_ it should always be running, but I can't find the way to make that happen
[22:43:31] PROBLEM - DPKG on searchidx1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:43:32] how is it launched in a "not the right way"?
[22:43:42] oh, that alert is me dist-upgrading as i logged in
[22:43:49] since i saw 500 un-updated packages
[22:43:56] brave of you.
[22:44:01] updating those on search...
[22:44:08] it's not going to get less broken
[22:44:09] at least enwiki won't care
[22:44:16] i mean more broken
[22:44:19] omg i'm on duty this week. totally forgot
[22:44:25] it can ALWAYS be more broken.
[22:44:27] Jeff_Green: short straw
[22:44:39] manybubbles: i volunteered :-(
[22:44:40] <^d> RobH: Yeah they will.
[22:44:45] <^d> It's secondary for enwiki.
[22:44:52] <^d> lsearchd still serves primary traffic.
[22:44:56] oh, well
[22:45:03] leslie is quite brave.
[22:45:19] (03Abandoned) 10MZMcBride: Scale back deployment of UniversalLanguageSelector (ULS) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107014 (owner: 10MZMcBride)
[22:45:23] and may want to be the cause of an outage on her last week
[22:45:24] and no, i have no clue either
[22:45:27] heh
[22:45:48] manybubbles: so how would you force a launch in the incorrect way ?
[22:45:53] and what is the process even called ?
[22:45:58] lucene-search ?
[22:46:42] wrong way?
[22:47:02] well, the command called by that
[22:47:16] well, you want it launched and logging correctly
[22:47:22] but how would you even launch it without logging ?
[22:47:42] basically, this is a black box that i am poking at, and i have no idea what to even look for
[22:47:49] minimum would be $(nohup /a/search/lucene.jobs.sh inc-updater-start) or something
[22:47:53] i mean, i can make sure apache is
[22:47:58] but I really don't know
[22:48:07] ok
[22:49:13] interesting ... since that's not referenced in its init file ...
[22:49:23] ah, there's supposed to be a cron with that ....
[22:49:50] either once a day or once a week for most commands ?
[22:53:30] I think it is all the time. it is in a while true loop
[22:55:19] so.... http://git.wikimedia.org/blob/operations%2Fpuppet.git/2e85a5187b5fce60a800c17020a811824a15d736/manifests%2Fsearch.pp line 171
[22:55:31] are those a complete set of the jobs that need to run ?
[23:00:15] LeslieCarr: I'm really not sure. I think the incremental update process isn't there but was running until that machine was restarted. I'm just piecing that together from the log files though
[23:00:31] hrm
[23:01:19] on reading the code, that incremental updater _looks_ useful. I _think_ we used it, and if I found some evidence besides the log files I'd advise starting it
[23:01:29] heck, I think we should start it anyway, but I don't know the "right" way to start it
[23:01:34] I'm 70% sure it'll fix it
[23:01:48] manybubbles: in .bash_history: sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start
[23:02:00] ori: you have that?
[23:02:06] well, root does
[23:02:17] sounds right
[23:02:28] until the next restart
[23:02:41] not good
[23:02:42] right
[23:02:49] better?
[23:03:13] the shell script backgrounds that, but I don't see it nohup it
[23:03:30] it certainly could someplace
[23:03:52] i have no idea, just figured i'd look
[23:03:56] thanks
[23:04:17] if you have sudo on the machine, look through root's .bash_history for a poor man's server admin log
[23:04:29] I can't sudo on that machine
[23:04:44] I would have done that though :(
[23:05:15] I have to head out soon too.
[23:05:26] We might be able to get away with leaving it for the morning
[23:05:37] I don't really know how urgent it is
[23:05:41] I mean, we want it fixed
[23:05:45] certainly
[23:12:46] Is manifests/iptables.pp the current best practice for adding iptables rules in puppet?
[23:14:17] seems to be used to add services mostly, but I need to use it to add a redirect to the NAT table....
[23:16:31] cajoel: I think ferm is the way to go, but I'm not sure.
[23:17:49] cajoel: ferm
[23:17:51] definitely
[23:18:24] LeslieCarr: found documentation!
[23:18:29] oh yay
[23:18:33] https://wikitech.wikimedia.org/wiki/Search#Indexing
[23:18:45] says
[23:18:46] root@searchidx1001:~# su -s /bin/bash -c "/a/search/lucene.jobs.sh inc-updater-start" lsearch
[23:18:49] like we thought
[23:18:55] wow, seriously ?
[23:18:56] ugh
[23:19:00] why is that not on startup
[23:19:11] ugh indeed
[23:19:22] cajoel: scfc_de: a potential issue with ferm is that the default policy is to drop :-]
[23:20:37] manybubbles|away: I had an upstart job for lucene.jobs.sh :-D
[23:20:49] but eventually we forgot about it since we wanted to migrate to something else
[23:21:06] Where the hell is $wmgUseDualLicense used?
[23:25:45] !log started indexing on searchidx1001 with " su -s /bin/bash -c "/a/search/lucene.jobs.sh inc-updater-start" lsearch"
[23:25:52] Logged the message, Mistress of the network gear.
[23:25:54] hashar: thanks for the ferm hookup... maybe we should deprecate our iptables.pp if ferm is the new new hotness.
[23:26:40] LeslieCarr: manybubbles|away: here is the old puppet change to wrap lucene inc-updater-start in an upstart job: https://gerrit.wikimedia.org/r/#/c/55406/ :D
[23:26:54] cajoel: well, that is being slowly migrated as time allows
[23:27:26] hi, matanya!
[23:27:31] sorry I missed this.
[23:27:54] I am off to bed *wave*
[23:53:11] (03CR) 10Mwalker: "@ori; good catch :[" (0316 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102352 (owner: 10Mwalker)
[23:54:53] TimStarling: is F30 the highest scap can go, even with the proxies?
[23:58:16] (03PS10) 10Mwalker: Collection Renderer (Now a module!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102352
[23:58:52] no, I think I just neglected to increase it after I introduced the proxies