[03:25:33] 3Wikimedia Labs / 3deployment-prep (beta): bits.beta.wmflabs.org down with 503 - 10https://bugzilla.wikimedia.org/69921#c4 (10Matthew Flaschen) 5NEW>3RESO/FIX a:3Ori Livneh The link works for me now, so it seems fixed. If it is still broken when you test, please re-open. [07:47:30] 3Wikimedia Labs / 3deployment-prep (beta): Cannot log in to Beta commons: Infinite redirect - 10https://bugzilla.wikimedia.org/69096#c8 (10Gilles Dubuc) (In reply to Andre Klapper from comment #6) > Gilles: Do you still see this problem? > If not we should close this ticket as RESOLVED WORKSFORME (and put my... [11:28:04] Hi. anybody available who knows about the tool-labs osm database setup? [11:35:26] jongleur: tried the pages in https://wikitech.wikimedia.org/wiki/Category:OpenStreetMap ? [11:37:32] Nemo_bis: yes, but there are remaining questions, especially as the Openstreetmap-Databases page is marked as out of date. [11:38:34] my biggest issue currently is how the database is created (what's the osm2pgsql-configuration used for import) [11:39:36] jongleur: outdated doesn't mean you mustn't read it, only that you need to be more careful [11:39:42] I did read it [11:39:44] Did you check puppet [11:39:47] ok [11:40:14] no, didn't check puppet yet; may be worth trying, thanks (have to look where I have to look...) [11:41:15] the second issue I have is that my tool fails to find a suitable driver (java tool running on tomcat, postgresql-driver is contained in the war file, I don't know what's failing exactly [11:45:52] Coren: Can you enable the translate extension on wikitechwiki? :O [11:46:00] - :O [11:47:29] hmm. I can't set filter at https://wikitech.wikimedia.org/wiki/Special:NovaInstance [11:50:08] sigh this is so hideous https://wikitech.wikimedia.org/wiki/Special:PrefixIndex/Shell_Request [12:37:22] Nemo_bis: yeah, I think Shell Request shouldn't exist, everyone should get one automatically [12:37:31] that's pretty much the case now except people have to push a button [12:38:48] I don't care much about that [12:38:57] I care about ns0 being full of unrelated junk [12:40:47] well, as a side effect the shell request stuff should go away [12:41:58] side effect of what [12:43:48] of making Shell Requests be granted automatically [13:00:39] wasn't that the case earlier? I think I never had to request it [13:05:05] Vogone: it's automatically *requested* but someone has to press buttons to grant it [13:05:12] valhallasw`cloud: also texlive on tools now :) [13:05:38] YuviPanda: cool [13:06:08] YuviPanda: well, makes sense :p imagine someone registers with a user name with is not shell-compatible [13:06:24] my shell name is also different from my wikitech username [13:06:32] Vogone: there's a separate 'shell name' thing [13:06:40] we could add code to ensure it's compatible [13:06:41] hm, k [13:12:41] just created a new labs instance and when trying to run puppetd -tv I get The program 'puppetd' is currently not installed.... [13:12:48] any ideas? This seems a bit odd to me! [13:13:36] addshore: 'puppet agent -tv' [13:13:51] it changed with the puppet3 migration [13:13:57] :> [13:20:56] YuviPanda: it's manual exactly because it has to be manual - not to prevent people from getting it, but to prevent floods of bots from registering [13:21:18] valhallasw`cloud: you can't do jack with just a shell account, can you? You need to be added to a project [13:21:27] you can connect to bastion? [13:21:31] also do floods of bots register past the captcha? 
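A side note on the puppetd error addshore hit above: nothing is missing from the instance, the command was simply renamed with the Puppet 3 migration. A minimal sketch of the manual run, assuming sudo access on the instance:

  # 'puppetd' no longer exists under Puppet 3; the agent subcommand replaces it.
  sudo puppet agent -tv        # shorthand for --test --verbose
  # --test forces a single foreground run (no daemonizing) and prints what
  # changed, which is the usual way to apply a fresh catalog by hand.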
[13:21:36] hmm, *that* is true [13:22:02] it's not meant to be fool-proof, it's meant to be a simple prevention mechanism [14:24:06] hi hashar do you know if beta labs is known to be slow right now? pages seem to be taking a while to load completely [14:24:35] hello chrismcmahon ! [14:24:42] good morning :] [14:24:58] on friday I gave a shot at limiting one broswster test job per instance [14:25:06] seems to work as intended at the expense of the run of all browser test jobs taking ~10 hours to complete :-/ [14:25:17] and yeah, I have noticed the browser tests being wonky all the week end [14:25:35] on friday evening (8 or 9pm UTC) ori looked at bits.beta.wmflabs.org yielding 503 [14:25:43] i.e. causing css/js to be broken entirely [14:25:46] apparently got solved [14:25:58] if any slowness is occurring, I havent looked at them today :-( [14:26:28] hashar: I'm looking at pages on beta just in my own regular browser and they seem slow, often waiting on bits or meta. and tests are timing out. [14:27:32] hashar: hit http://en.wikipedia.beta.wmflabs.org/wiki/Special:Random and see what you think. maybe I need to ask Ori [14:31:19] chrismcmahon: definitely slow :( [14:33:44] hashar: OK, good, it's not just me then. [14:33:54] !ping [14:33:54] !pong [14:33:57] that probably causes a bunch of timeout :( [14:34:09] ah mediawiki02 is overloaded [14:34:11] !ping [14:34:11] !pong [14:34:43] !log deployment-prep mediawiki02 / partition is 100% full [14:34:44] Logged the message, Master [14:35:18] poor instance [14:38:33] hashar: thanks for finding that! can it be fixed? [14:38:46] !log deployment-prep /var/log/upstart/hhvm.log is filled with hphp notices (Note: unable to serialize ... and others). That cause / to fill up [14:38:52] chrismcmahon: of course :-D [14:38:59] gotta figure out what needs to be fixed [14:39:02] hashar: forget it, I see you're talking to brian [14:41:13] filled up https://bugzilla.wikimedia.org/show_bug.cgi?id=69976 [14:41:16] 3Wikimedia Labs / 3Infrastructure: Log files on labs instance fill up disk (/var is only 2GB) (tracking) - 10https://bugzilla.wikimedia.org/69601 (10Antoine "hashar" Musso) [14:41:21] 3Wikimedia Labs / 3deployment-prep (beta): hhvm fill up /var/log/upstart/hhvm.log - 10https://bugzilla.wikimedia.org/69976 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None deployment-mediawiki02 labs instance only has 2GB of disk. The hhvm upstart process emits log to /var/log/upstart/hhv... [14:42:04] !log deployment-prep on mediawiki02 , clearing out some /var/log/upstart/hhvm.* log files see {{bug|69976}} [14:42:07] Logged the message, Master [14:42:54] !log mediawiki02 ran apt-get autoclean to reclaim some disk space [14:42:54] mediawiki02 is not a valid project. [14:43:02] hashar: Ori has better logging stuff for hhvm in prod now. You should poke _joe_ to get mediawiki0[12] running on puppet again. [14:43:33] ah [14:43:37] right now 02 is hand built by _joe_ and 01 is not working [14:43:42] :- [14:43:43] ( [14:43:46] 3Wikimedia Labs / 3deployment-prep (beta): hhvm fill up /var/log/upstart/hhvm.log - 10https://bugzilla.wikimedia.org/69976#c1 (10Chris McMahon) s:5normal>3major changed importance to "major", this makes beta hard to use for real people and causes a lot of false test failures. [14:44:49] bd808: also hhvm on mediawiki02 is starving CPU [14:44:54] maybe we need a backtrace of it and restart it ? 
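The full-root-partition hunt that follows is routine enough to sketch generically; the paths are the ones named above for deployment-mediawiki02, and whether to truncate or simply rm the offending logs (as hashar does below) is a judgment call:

  df -h /                                    # confirm / really is at 100%
  sudo du -xsh /var/log /var/cache /tmp      # -x keeps du on the root filesystem
  sudo du -ah /var/log | sort -rh | head     # largest individual files under /var/log
  # truncating keeps the file handle valid for the process still writing to it:
  sudo truncate -s 0 /var/log/upstart/hhvm.log
  sudo apt-get autoclean                     # drop superseded packages from the apt cache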
[14:46:00] 3Wikimedia Labs / 3deployment-prep (beta): hhvm fill up /var/log/upstart/hhvm.log - 10https://bugzilla.wikimedia.org/69976#c2 (10Antoine "hashar" Musso) Bryan told me the log issue is apparently fixed in production. Since mediawiki02 does not run puppet, it is not taking in account the recent changes made i... [14:46:33] !log deployment-prep restarting udp2log-mw service on -bastion. It is stalled for some reason [14:46:35] Logged the message, Master [14:47:25] 2014-08-25 14:47:18 deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused [14:47:26] 2014-08-25 14:47:18 deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused [14:47:29] that is never ending :( [14:47:45] nutcracker listens on port 11212 [14:53:46] !log deployment-prep mediawiki02 : removed /var/lib/puppet/state/agent_catalog_run.lock [14:53:49] Logged the message, Master [14:55:51] 3Wikimedia Labs / 3deployment-prep (beta): deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused - 10https://bugzilla.wikimedia.org/69978 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None For some reason the hhvm process on deployment-mediaw... [14:58:25] Unexpected end of buffer during unserialization. in /srv/common-local/wmf-config/CommonSettings.php on line 188 [14:58:28] youuuuououu [14:59:58] ahhh [14:59:59] nice [15:00:09] we cache mediawiki conf under /tmp [15:00:15] which is the same partition as /var [15:00:17] which is filled [15:00:23] so some configuration files are 0 [15:00:34] heh [15:00:47] !log deployment-prep mediawiki02 rm /var/log/upstart/hhvm* [15:00:49] Logged the message, Master [15:01:21] !log deployment-prep mediawiki02 has mw conf caches under /tmp/mw-cache-master/ and since that partition is filled up, that ends up with conf caches being null file [15:01:23] Logged the message, Master [15:01:46] !log deployment-prep mediawiki02 rm /tmp/mw-cache-master/conf* [15:01:48] Logged the message, Master [15:01:56] that is a nice thing [15:02:38] bd808: can we get rid of core files on mediawiki02 in /tmp ? [15:03:02] -rw------- 1 apache apache 625M Aug 22 22:37 hhvm.29585.core [15:03:02] -rw------- 1 apache apache 641M Aug 22 22:37 hhvm.3112.core [15:03:02] -rw------- 1 apache apache 2.1G Aug 22 21:45 hhvm.25555.core [15:03:02] -rw------- 1 apache apache 2.3G Aug 22 20:56 hhvm.3314.core [15:04:45] 3Wikimedia Labs / 3deployment-prep (beta): deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused - 10https://bugzilla.wikimedia.org/69978#c1 (10Antoine "hashar" Musso) The MediaWiki configuration cache is in /tmp/mw-cache-master which might end up having nul... [15:05:57] !log deployment-prep mediawiki02 rm /tmp/hhvm*.core . Filled as {{bug|69979}} [15:06:00] Logged the message, Master [15:06:03] 3Wikimedia Labs / 3deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - 10https://bugzilla.wikimedia.org/69979 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None On deployment-mediawiki02:/tmp -rw------- 1 apache apache 625M Aug 22 22:37... 
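On the memcached "connection refused" spam above: the log itself points at a port mismatch (MediaWiki trying 127.0.0.1:11211 while nutcracker listens on 11212), though the underlying cause isn't settled here. A quick way to see what is actually listening, assuming netcat is installed on the instance:

  sudo netstat -tlnp | grep -E ':1121[12]'   # which process, if any, owns 11211 / 11212
  nc -zv 127.0.0.1 11211                     # 'connection refused' means nothing is bound there
  nc -zv 127.0.0.1 11212                     # nutcracker's port, per the discussion above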
[15:06:53] that is a bit better now [15:08:07] We need to get those cores logged to nfs instead of /tmp [15:08:08] thanks hashar [15:08:45] The new puppet logging config puts them in /var/log/hhvm which maybe we could point to /data/project/logs/hhvm in beta [15:09:03] yeah, and hhvm coring multiple times on friday doesn't give me a great feeling of confidence [15:09:14] chrismcmahon: so basically, hhvm write its core file to /tmp/ which fills up the disk entirely. Causing the MediaWiki conf under /tmp/mw-cache-master to be empty file which cause a bunch of php errors in MEdiaWiki itself. Those errors are logged to /var/log/upstart/hhvm.log and /var/log/syslog which fill up the disk as eel and cause hhvm to use 400% CPU :] [15:09:18] I poked _joe_ in another channel but no response yet [15:09:50] bd808: maybe we could use the lvs extended disk for /var ? [15:10:03] We already used it for /srv [15:10:26] which is where I was putting the mw code [15:10:37] ah true [15:10:56] I don't know if we can have multiple lvs disks or not [15:11:03] we could [15:11:08] but not with our current puppet manifests [15:11:30] of course :/ [15:13:02] bd808: role::labs::lvm::srv allocates all the available disk space [15:13:22] anyway, the hhvm core could go to /srv [15:15:19] commented on https://bugzilla.wikimedia.org/show_bug.cgi?id=69979 [15:15:33] 3Wikimedia Labs / 3deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - 10https://bugzilla.wikimedia.org/69979#c1 (10Antoine "hashar" Musso) The instance has some local disk space allocated under /srv/ (via puppet class role::labs::lvm::srv ). Would be a... [15:15:46] chrismcmahon: so I guess most of the browser tests jobs that ran over the week-end can be discarded [15:21:59] hashar: we had high rates of failure, yes [15:24:08] we need to rethink the way we trigger all those jobs [15:24:17] I am not sure running them all iis a good idea :-] [15:25:04] or at least, have some kind of job that assert beta seems to be mostly running [15:25:10] and if that pre check job, trigger the actual browser test [15:45:48] 3Wikimedia Labs / 3deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - 10https://bugzilla.wikimedia.org/69979#c2 (10Bryan Davis) The latest production puppet code for setting up hhvm moves the cores to /var/log/hhvm. We need to get deployment-mediawiki02... [15:58:03] 3Wikimedia Labs / 3deployment-prep (beta): deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused - 10https://bugzilla.wikimedia.org/69978#c2 (10Bryan Davis) From mediawiki-config/wmf-config/mc-labs.php: $wgObjectCaches['memcached-pecl']['servers'] => array... [16:05:36] Coren (and whoever else cares): All of your sudo policy dreams should now be realized. Sudo as all users; Sudo as service user; service user sudo as , a less-broken default sudo policy for new projects. [16:05:40] Please let me know if there are any misfires. [16:07:46] 3Wikimedia Labs / 3deployment-prep (beta): bits.beta.wmflabs.org down with 503 - 10https://bugzilla.wikimedia.org/69921#c5 (10physikerwelt) 5RESO/FIX>3REOP It worked fine this morning but now I'm getting 503 again. It's somehow hard to decide when a Temporarily bug has been fixed. 
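One generic way to act on the suggestion above that the hhvm cores could go to /srv instead of /tmp is the kernel core_pattern; this is only a sketch, assuming the cores are ordinary kernel core dumps (the production fix bd808 mentions goes through puppet and /var/log/hhvm instead):

  cat /proc/sys/kernel/core_pattern           # where cores land today
  sudo mkdir -p /srv/cores
  # %e = executable name, %p = pid; a runtime-only change, so surviving a reboot
  # would need an /etc/sysctl.d/ entry or a puppet resource
  echo '/srv/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern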
[16:08:08] 3Wikimedia Labs / 3deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - 10https://bugzilla.wikimedia.org/69979#c3 (10Chris McMahon) multiple HHVM cores per day seems like a real problem [16:16:24] yuvipanda: around? [16:16:28] or a puppety person? :P [16:16:29] addshore: sup [16:16:33] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[php5-cli] is already declared in file /etc/puppet/modules/composer/manifests/init.pp:6; cannot redeclare at /etc/puppet/modules/mediawiki/manifests/packages.pp:12 on node wdjenkins-node1.eqiad.wmflabs [16:16:43] how can I resolve that? [16:16:46] 3Wikimedia Labs / 3deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - 10https://bugzilla.wikimedia.org/69979#c4 (10Bryan Davis) (In reply to Chris McMahon from comment #3) > multiple HHVM cores per day seems like a real problem Likely some new to us hh... [16:17:33] addshore: hmmm [16:17:42] addshore: remove the Package[php5-cli] from one of them [16:17:48] but..... [16:17:54] addshore: does the composer module depend on the mediawiki module? [16:18:00] no [16:18:08] ah, quintessential puppet problems :) [16:18:15] there is no easy solution to it, sadly [16:18:33] bah! thats just stupid :P [16:18:33] well, easaiest is to do ensure_package() puppet function in both places instead of package {} [16:18:37] andrewbogott: Yeay! [16:18:44] addshore: indeed, it is, perhaps, stupidest part of puppet [16:18:53] yuvipanda: would doing ensure_package in just one of the places work? [16:18:58] addshore: nope [16:19:02] >.> [16:19:07] bah! [16:19:10] inorite. [16:19:28] addshore: you could do !defined, but that will fail based on order of puppet execution [16:20:03] 3Wikimedia Labs / 3deployment-prep (beta): deployment-mediawiki02 enwiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused - 10https://bugzilla.wikimedia.org/69978#c3 (10Antoine "hashar" Musso) Yeah I though it was related to the /tmp/mw-cache-master being dirty (0 bytes files). But... [16:21:28] addshore: the 'clean' option is to abstract out the packages into their own class and do an include [16:21:44] addshore: so you could potentially abstract them into a packages::php class and include that. That might have already happened as well [16:36:47] Hi I wanted to get a dedicted instance for machine learning application I use, is somebody there to make it happen :D [16:36:48] https://wikitech.wikimedia.org/wiki/New_Project_Request/FATG,_Persian_Tajik_Translator. [16:37:51] Pouyan: Did you already fill in a project request? [16:38:13] :Coren https://wikitech.wikimedia.org/wiki/New_Project_Request/FATG,_Persian_Tajik_Translator [16:38:41] Is the OS Ubuntu 14.04? [16:38:47] or 12.04 [16:38:50] Pouyan: You can pick either [16:39:02] Ok 14.04 would be better [16:39:12] Moses works on my local machine [16:39:17] That link leads nowhere [16:39:30] Ah, I found it [16:40:56] Pouyan: Done [16:41:11] Coren: Thanks :D [17:12:19] 3Wikimedia Labs / 3tools: tools-db is down; need to flush hosts - 10https://bugzilla.wikimedia.org/69828#c3 (10Marc A. Pelletier) 5NEW>3RESO/FIX flush hosts done. [17:17:02] 3Wikimedia Labs / 3tools: Web services continually restarting - 10https://bugzilla.wikimedia.org/69934#c3 (10Marc A. Pelletier) 5NEW>3UNCO A quick perusal of the logs show that this happens only to a short (~12) list of webservices, in bursts. 
My current working hypothesis is that this is due to leaking... [17:32:31] 3Wikimedia Labs / 3tools: Install MLT and kdenlive on tool labs - 10https://bugzilla.wikimedia.org/69365#c1 (10Marc A. Pelletier) 5NEW>3ASSI kdenlive pulls in, basically, all of KDE and X11 and is not suitable for inclusion on normal compute nodes. (melt is marginally better and only pulls in a faction... [17:57:07] yuvipanda: maybe I should add https://blogs.atlassian.com/2014/07/git-guilt-blame-code-review/ to the gerrit reviewer bot :-D [18:04:48] valhallasw`cloud: is toolslabs only slow for me today? [18:06:49] valhallasw`cloud: :D [18:07:24] valhallasw`cloud: indeed, that does sound quite useful [18:08:15] valhallasw`cloud: that, along with defualts for people with 'most merges in last 30 days' and 'most commits in last 30 days' should make most manual config unnecessary [18:14:22] yuvipanda: yeah, exactly. Maybe a second username, though [18:14:36] valhallasw`cloud: Thanks for the link. Cool tool, but the results probably get skewed by changes in indentation & Co., i. e. if you have put a "if (condition) { }" around a code block, the tool will blame you for the whole code block. [18:14:48] Steinsplitter: tools-login seems OK for me [18:14:50] scfc_de: git blame has a parameter that ignores indents [18:14:51] (And so will "git blame".) [18:14:54] scfc_de: and other spacing changes [18:15:11] scfc_de: sure, as with all heuristics, it's probably not perfect [18:15:23] yuvipanda: Ah! Okay, I take that back, then :-). [18:15:26] scfc_de: but if it gives a better signal to noise ratio than the current reviewer bot, it's an improvement :-) [18:15:28] scfc_de: :D [18:15:28] k :) [18:15:42] scfc_de: valhallasw`cloud -w [18:16:16] 3Wikimedia Labs / 3tools: Some users can't connect to replica DB servers - 10https://bugzilla.wikimedia.org/69679 (10Marc A. Pelletier) 5NEW>3ASSI [18:18:41] scfc_de: it'll still miss things when you move blocks of code around, tho [18:30:32] !log deployment-prep deployment-mediawiki02: cleared /tmp; running puppet [18:30:35] Logged the message, Master [18:31:40] beta labs was giving 503 errors for 5 minutes, now http://en.wikipedia.beta.wmflabs.org/w/api.php (load.php,index.php) all return a 404 [18:31:54] that's me [18:31:57] trying to fix [18:32:29] sure [18:32:37] ori: istr there was some issue with puppet on deployment-mediawiki02 not correct or overridden with other config or some such? [18:32:45] Hi, seems zhwiki_p is building up replag? [18:32:48] * chrismcmahon waves hands vaguely [18:36:24] 3Wikimedia Labs / 3Infrastructure: Log files on labs instance fill up disk (/var is only 2GB) (tracking) - 10https://bugzilla.wikimedia.org/69601#c1 (10Marc A. Pelletier) 5NEW>3ASSI There are two things that can be done to alleviate/fix this issue entirely: (a) /var/log itself can be made much bigger at... [18:37:43] re: ^topic , tools.wmflabs.org and other labs instances seem fine, I think only Beta labs is down. [18:37:58] oh, thanks, good point [18:53:31] beta's up [18:54:01] chrismcmahon, hashar, greg-g ^ [18:55:12] thanks ori [18:55:28] bd808: why is the trebuchet user not defined in labs btw? [18:55:52] not defined? [18:56:23] modules/deployment/manifests/deployment_server.pp, L28: if $::realm != 'labs' { [18:56:28] etc. [18:56:57] ldap craziness probably [18:57:03] uid=604(trebuchet) gid=604(trebuchet) groups=500(wikidev),604(trebuchet [18:57:38] is this something andrewbogott could fix? 
i recall there being something similar with the apache or mwdeploy user accounts [18:58:45] I'm in a meeting atm, I'll catch up with this sortly [18:58:48] *shortly [19:01:24] kk, sorry for the ping then [19:01:58] The mwdeploy user is still a problem. I ended up committing a local patch for that to work around it. [19:02:16] I think the "problem" is that the NFS server needs to know those users/groups as well, so you can't just define them in Puppet, they have to exist in LDAP, and as that's "read-only" for Puppet, I assume at most you could define it to be an expectation that raises an assertion error if it fails. [19:03:07] scfc_de: Yup. Things are ok as long as ldap and the puppet manifest agree, but if they differ puppet will break because I can't change ldap [19:03:22] s/I/it/ [19:03:40] ori: on labs whenever a user is shared on multiple instance, it should be defined in LDAP to have a constant uid [19:04:05] ori: or puppet will create the user via adduser which would yield a different uid on each host which is a mess whenever we hit the shared /data/project [19:04:12] ori: and thanks for fixing up hhvm [19:04:19] * andrewbogott is available now but hasn't read the backscroll yet [19:05:39] So… I can indeed create users in ldap to correspond with puppet, and have done so in the past. I don't /think/ that causes conflicts with puppet, but maybe scfc_de is remembering a corner case that I've forgotten. [19:06:41] andrewbogott: I think it is all good. LDAP has a trebuchet user and group [19:06:51] ok then :) [19:06:59] hashar: so is it ok to remove "if $::realm != 'labs' {" from modules/deployment/manifests/deployment_server.pp ? [19:07:12] We maybe just haven't pulled the conditional out of the puppet manifest to test yet [19:07:18] hashar, with the same uid? [19:07:22] ori: It's worth testing [19:07:36] bd808: But then you must ensure that no local user is created, right? [19:07:39] what's the trebuchet master on beta? deployment-salt? [19:07:53] deploymnet-prep [19:08:01] deployment-prep == tin [19:08:08] ah right, thanks [19:08:23] ori: na you dont want the user to be created by adduser [19:08:24] deployment-salt == pladium (I think) [19:08:32] so need to be skipped with a $::realm :/ [19:09:01] is that the case for *every* user resource then? [19:09:09] ori: It's going to fail because of the /bin/false shell [19:09:13] how does mediawiki/manifests/users.pp not work? [19:09:25] how does it work, rather [19:09:26] That's the same bug with mwdeploy user [19:09:54] puppet sees the ldap users as having /bin/bash shells even if ldap sets them to /bin/false [19:09:57] though for the apache user (which is created by the debian package), having the apache user in LDAP prevent adducer from creating a user local to the instance [19:10:02] so you just can't declare daemon user accounts in labs? [19:10:38] why do they have to be in ldap vs. local? [19:10:55] hashar, ori: user resources work in labs. But /bin/false shell is some other labs bug/feature [19:11:27] https://bugzilla.wikimedia.org/show_bug.cgi?id=65591 [19:11:48] fwiw the problem post apache restart was not hhvm-related; sites-enabled/99-monitoring.conf was intercepting all requests [19:11:53] ori: we need consistent UID whenever using some shared resources [19:12:05] ori: such as mwdeploy owned files, l10nupdate , apache logs [19:12:15] or the different UID numbers cause a mess [19:12:29] The UID will be consistent as long as the ldap user is created before the first puppet run [19:12:33] that's the source of a lot of bugs, no? 
i don't think beta should use shared storage [19:12:46] puppet won't create a local user if the user is found in ldap [19:13:09] but the shell thing is a problem still. [19:13:33] no nfs -> no need for ldap -> no need for consistent uid/gids [19:13:34] I have commited a local hack on deployment-salt for the mwdeploy user's shell; see https://bugzilla.wikimedia.org/show_bug.cgi?id=65591 [19:14:27] bd808: thanks for your help with this, btw [19:15:38] I'm trying to take a lesser role (pointing out who to bug instead of fixing) because I'm *supposed* to be working on other stuff. [19:16:05] * bd808 needs less things to juggle for sanity sake [19:16:42] * ori nods [19:19:06] 3Wikimedia Labs / 3deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - 10https://bugzilla.wikimedia.org/65591#c7 (10Tim Landscheidt) On tools-login, /etc/ldap.conf has: | [...] | nss_override_attribute_value loginshell /bin/bash | [...] I haven't rea... [19:28:19] 3Wikimedia Labs / 3deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - 10https://bugzilla.wikimedia.org/65591#c8 (10Tim Landscheidt) The line (in essence) was introduced by Ryan's commit f8724e60664a33a37a327434f5c3cb71837f4c20 (Sep 13 00:26:36 2011) w... [21:44:21] !log deployment-prep Deployed scap 116027f (Make sync-common update l10n cdb files by default) [21:44:24] Logged the message, Master [21:52:04] valhallasw`cloud: wikibugs looks stuck [21:52:19] ctcp seems to be alive still [21:52:29] it's not just that nothing is happening? :-p [21:53:14] mmm, last e-mail received seems to be From wikibugs-l-bounces@lists.wikimedia.org Mon Aug 25 21:00:26 2014 [21:53:39] so either there are no changes, wikibugs-l is broken, or labs mail delivery is broken [21:54:01] well, we already passed bug 70k and wikibugs didn't report anything in -feed. [21:54:10] hm. [21:55:18] https://www.mail-archive.com/wikibugs-l@lists.wikimedia.org/maillist.html looks to be fine [21:57:22] Unable to run job: error writing object "3453826" to spooling database [21:57:22] aborting transaction (rollback) [21:57:22] job 3453826 was rejected cause it couldn't be written. [21:57:22] Exiting. [21:57:22] cat: /data/project/.system/store/mail/.deliver.1409003819.16170.out: No such file or directory [21:57:25] Coren: ^ [21:57:51] or scfc_de, maybe? :-) [21:58:21] that's what exim replies to mails to wikibugs.whatever@tools.wmflabs.org [22:02:25] * Coren reads [22:02:59] Coren: a simple 'qlogin' also fails, by the way [22:03:40] Yep, somthing going on. I'm on it. [22:05:57] 3Wikimedia Labs / 3tools: Errors in e-mail pipes should go to local error log, not e-mail sender - 10https://bugzilla.wikimedia.org/70003 (10Merlijn van Deen) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier wikibugs' .forward contains: |jmail /data/project/wikibugs/toredis.py wikibugs-l 2>&1 > /data/p... [22:06:27] Coren: if you have an idea on how to get the error log in a file instead of in wikibugs-l@lists.wm.o nonexistant inbox would also be great for the future :-) [22:06:31] Not sure I understood that question. [22:06:32] Ah, I see what's going on. [22:06:32] Ah, the queue is alive again :-) [22:06:34] Well, bug 70003 just above, basically. [22:06:48] I tried piping, but errors still just get sent as reply [22:07:38] job queue should be fixed: I was out of space due to too much old accounting info kept. [22:08:24] Ah. Well, luckily wikibugs-l doesn't seem to unsubscribe users for sending bounces, so that's good. 
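One detail of the .forward line quoted above ("... 2>&1 > /data/p...") is worth spelling out, independent of Coren's point that jmail hands the work to a grid job: the order of those two redirections decides where stderr ends up. A minimal illustration, with somecommand and out.log as placeholders:

  # '2>&1' duplicates stderr onto whatever stdout points at *at that moment*, so
  # here stderr still goes to the original stdout (back to exim, i.e. mailed to
  # the sender) and only stdout reaches the file:
  somecommand 2>&1 > out.log
  # redirecting stdout first sends both streams to the file:
  somecommand > out.log 2>&1

Even with the order fixed, the grid job's own output is handled separately, which is what bug 70003 is about.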
[22:10:19] Coren: Did something change with ip routing / translation ? Please take a look at: https://bugzilla.wikimedia.org/show_bug.cgi?id=69995 [22:10:59] hedonil: afaik the servers always saw an internal IP [22:11:02] recent NAT changes? there was something requested [22:11:18] at least, I can remember making non-logged in wiki edits that were seen as from 10.something [22:12:03] valhallasw`cloud: Hmm, the problem: worked until ~2-4 days ago. Something must have changed [22:12:25] yeah, that surprises me. [22:12:39] 3Wikimedia Labs / 3tools: Errors in e-mail pipes should go to local error log, not e-mail sender - 10https://bugzilla.wikimedia.org/70003#c1 (10Marc A. Pelletier) 5NEW>3ASSI Redirection doesn't do what you expect because jmail actually fires up a grid job for processing the email; so the redirection send... [22:12:44] hedonil: Nothing I changed or that I heard has changed. [22:12:53] brb [22:13:31] anomie: so what to do? [22:14:07] hedonil: Adjust the IP restrictions to include the private IP range. [22:14:19] anomie: ok [22:14:51] (I should have mentioned that on the bug, sorry; asking here was just to see if anything did change) [22:16:09] 3Wikimedia Labs / 3tools: Errors in e-mail pipes should go to local error log, not e-mail sender - 10https://bugzilla.wikimedia.org/70003#c2 (10Merlijn van Deen) I actually meant the output from the jmail submission -- if job submission fails (e.g. SGE queue out of space), jmail (I assume) sends the error to... [22:27:39] 3Wikimedia Labs / 3tools: Errors in e-mail pipes should go to local error log, not e-mail sender - 10https://bugzilla.wikimedia.org/70003#c3 (10Marc A. Pelletier) Ah! Hm. I'll need to consider the right way to do that; that behaviour comes from exim itself. Perhaps an additional level of error checking aro... [22:31:09] hedonil: Ah, note the responce from Chris. Hitting the prod cluster from labs indeed was always using the private IPs; apparently the check was only enabled recently. [22:38:41] Coren: ok then. source of trouble identified. [22:39:39] csteipp: how long does it take for the restriction update to take effect? [22:40:08] hedonil: It should be immediate [22:40:35] Still getting the exception? [22:40:54] csteipp: Hmm, configured {"IPAddresses":["208.80.155.0/24","10.68.16.0/24","10.68.17.0/24","::/0"]}. still error [22:42:45] csteipp: error changed now to {"error":"mwoauth-oauth-exception"} [22:43:33] Well, it's progress... [22:44:29] I'll have to dig into where that one is coming from. [22:50:41] I think I killed beta labs again. It's crawling. ori, you want to have a look? [22:51:31] sure [22:52:20] there are 67 apache procs [22:52:22] Filesystem Size Used Avail Use% Mounted on [22:52:22] /dev/vda1 7.4G 5.6G 1.5G 79% / [22:52:31] but not due to disk space exhaustion [22:53:06] csteipp: mind if I try a new configuration? {"IPAddresses":["10.68.0.0/32","::/0"]} [22:53:21] remembers some change that was "new way to restart apaches" which is beta-only.. could be related? 
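While the slow app server is being poked at above, mod_status (the /server-status page ori links) also has a machine-readable form that makes it easy to tell an Apache worker backlog apart from hhvm hanging; a rough sketch, assuming /server-status is reachable from localhost as it appears to be on deployment-mediawiki02:

  # busy vs idle workers plus the scoreboard, one line each
  curl -s 'http://127.0.0.1/server-status?auto' | grep -E '^(BusyWorkers|IdleWorkers|Scoreboard)'
  # if apache looks healthy but requests still stall, check whether hhvm is the
  # process burning cpu, as it was here:
  top -b -n1 | grep -E 'hhvm|apache2' | head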
[22:53:36] used salt instead of dsh to restart them [22:54:09] /var/run/apache2/apache2.pid exists and contains the right PID [22:54:31] mutante: haven't heard about that change [22:55:51] /server-status: http://noc.wikimedia.org/~ori/mw2-status.html [22:56:30] maybe apache is doing what it's supposed to and hhvm is hanging [22:57:11] tried {"IPAddresses":["10.68.0.0/16","::/0"]}, still no luck [22:57:15] the longest-pending request has been 12 seconds [22:58:30] well, apache is utilizing 100% cpu [22:58:47] no, nvm; hhvm is [22:58:48] ori: it was this https://gerrit.wikimedia.org/r/#/c/125888/ but just a wild guess [22:59:18] and for some reason i thought that was more recent,, nevermind [22:59:27] hedonil: Might try 0.0.0.0/0 too, just to make sure that works [22:59:30] Krinkle: sry for the missed .svn's. cleared now [22:59:41] * hedonil tries [22:59:51] chrismcmahon: is it going to seriously mess things up if i leave it in that state and debug a little longer? [22:59:56] hedonil: no worries. One step at a time :) [23:00:08] chrismcmahon: i could just restart apache which would make it "better", though we can safely presume this will recur [23:00:15] chrismcmahon: so if now is not a good time i could just restart apache [23:00:20] Krinkle: ;) [23:00:44] ori: knock yourself out. I'd rather find the root cause than have this over and over [23:00:51] chrismcmahon: appreciate it; thanks [23:07:58] csteipp: tried with {"IPAddresses":["0.0.0.0/0"]} -> {"error":"mwoauth-oauth-exception"} [23:08:33] That's not good. hedonil, what the consumer key? [23:08:48] csteipp: eb9f0961f8b86ff83aeb3db49a55ec5f [23:09:22] restarted webservice & cleared sessions to be sure [23:12:21] hedonil: Can you try the restriction default? "IPAddresses":["0.0.0.0/0","::/0"] [23:12:37] If that fails, then there's something wrong with the consumer.. [23:13:33] * hedonil tries [23:15:37] csteipp: tried {"IPAddresses":["0.0.0.0/0","::/0"]} still error [23:15:52] csteipp: another consumer of mine works with {"IPAddresses":["208.80.155.0/24","10.68.16.0/24","10.68.17.0/24","::/0"]} [23:17:34] csteipp: works now [23:17:44] Something on your end? [23:18:36] csteipp: after the first restriction block I changed from [23:18:39] $mwOAuthUrl = 'https://meta.wikimedia.org/w/index.php?title=Special:OAuth'; [23:18:51] to $mwOAuthUrl = 'https://www.mediawiki.org/wiki/Special:OAuth'; [23:19:19] Ah, yeah, that only works if you explicitly tell your library to sign title.. [23:19:26] Most don't handle it [23:19:30] csteipp: switched that back now to meta [23:19:47] csteipp: ...and will try the new restriction there [23:20:09] hedonil: Yeah, the wiki isn't so important, it's the title= in the url [23:21:43] csteipp: ok. works now with {"IPAddresses":["208.80.155.0/24","10.68.16.0/24","10.68.17.0/24","::/0"]} [23:22:12] csteipp: double- error .. and w/index.php?title=Special:OAuth it is! [23:22:30] glad it's working! [23:22:57] csteipp: thx for looking into it [23:38:32] andrewbogott: The Nova interface isn't displaying any tables (instance and proxy lists) for me. I can select a project but only a h1 heading shows up for that project with no table. [23:39:13] Negative24: I hate to say this, but… do you mind logging out and in again? [23:39:25] Not at all... [23:40:27] Is it possible to regenerate backup codes for OAuth? 
[23:40:31] Krinkle: if you want to check out the proposed .lighttpd settings for static (caching etc.)- I have a temporary setup running with these settings and some files from static http://tools.wmflabs.org/newwebtest/res/ [23:40:59] chrismcmahon: is it still slow? (i don't hit beta often enough to have a well-calibrated sense of 'normal') [23:41:37] Negative24: by OAuth you mean two-factor auth? [23:41:45] Yea [23:42:01] And by 'is it possible' do you mean you don't have your keys anymore? :) [23:42:26] If you still have some codes left you can disable and reenable 2fa which will give you a new key and new paper codes. [23:42:54] Well my phone just had an update and is still busy so I can't get to any codes like I normally would so I was going to use a backup code. [23:43:33] ori: still seems sluggish. response should be similar to enwiki. Also, appending "?veaction=edit" to a small page should return in <5 seconds and I've been seeing >20 seconds when there is a problem e.g. http://en.wikipedia.beta.wmflabs.org/wiki/0.761408440987044?veaction=edit [23:44:26] andrewbogott: Nevermind. My update just finished. I'll let you know of the results. [23:44:35] ok! [23:45:19] <10 seconds at worst [23:46:13] chrismcmahon: heh, ok, figured it out [23:46:27] it's running a debug build [23:46:31] andrewbogott: Logging out and back in seems to have fixed it. Did Nova recently get updated? [23:46:39] i need to restart it to fix [23:47:05] Negative24: yes, although it shouldn't have caused that. I don't know what the problem was. [23:47:14] !log deployment-prep stopping hhvm/apache on deployment-mediawiki02 to replace debug build of hhvm with release build [23:47:17] Logged the message, Master [23:47:45] andrewbogott: Probably one of those problems that I only get. Nova doesn't like me all that much :) [23:48:04] ori: that's kind of weird. for a brief shining moment today performance on beta was mostly OK I think [23:48:42] chrismcmahon: heh, that's probably to hhvm's credit [23:48:47] it should get a lot better in a few [23:50:05] chrismcmahon: it's back up; could you try just hitting it with random requests for a few mins to get the jit warmed up, and then check the speed? [23:51:11] ori: yeah, its much much better now [23:53:30] :) [23:53:52] thanks for your patience with that [23:54:33] sure
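The "hit it with random requests" warm-up ori asks for at the end can be scripted as a throwaway loop; the request count is arbitrary and this is only a sketch:

  # request a batch of random articles so hhvm's jit compiles the hot code paths;
  # -L follows the redirect Special:Random issues, -w prints status and timing
  for i in $(seq 1 50); do
    curl -sL -o /dev/null -w '%{http_code} %{time_total}s\n' \
      'http://en.wikipedia.beta.wmflabs.org/wiki/Special:Random'
  done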