[00:00:16] mwalker: I don't have +2 on ops/puppet :( But I can cherry-pick it into beta for you
[00:00:45] hah; ok; that'll work too
[00:00:58] I'll get jeff to deploy it for tantalum tomorrow
[00:02:46] mwalker: cherry-picked into beta
[00:03:05] (CR) BryanDavis: [C: 1] "Cherry-picked into beta puppet master" [operations/puppet] - https://gerrit.wikimedia.org/r/134975 (owner: Mwalker)
[00:08:09] today i get a lot of errors like (backend-fail-move): Could not move file "mwstore://local-swift-eqiad/local-public/c/cb/NaughtyBoyBBC.jpg" to "mwstore://local-swift-eqiad/local-deleted/n/d/6/nd6c6jfv74vtiem2ovp2o9b2wraou8d.jpg". at Fri, 23 May 2014 00:05:59 GMT served by mw1207
[00:09:53] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 74939 MB (3% inode=99%):
[00:11:05] bd808, awesomely possums -- thanks much
[00:19:37] (CR) 20after4: "fwiw, I managed phabricator at deviantart manually because it really doesn't lend itself to proper puppetization." [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn)
[00:23:47] greg-g, would you have a problem if I pushed a change to the beta cluster CommonSettings?
[00:24:02] or would you rather I wait till tomorrow or monday?
[00:24:49] (PS1) Mwalker: Enable the new PDF renderer in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134977
[00:25:03] ^ that being the change I'd like to push
[00:35:23] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[00:47:23] (PS1) Dr0ptp4kt: Remove more noise. [operations/puppet] - https://gerrit.wikimedia.org/r/134984
[00:50:41] (PS2) Dr0ptp4kt: Remove more noise. [operations/puppet] - https://gerrit.wikimedia.org/r/134984
[00:52:00] (PS3) Dr0ptp4kt: Remove more noise.
[operations/puppet] - https://gerrit.wikimedia.org/r/134984
[00:54:04] RECOVERY - Disk space on analytics1019 is OK: DISK OK
[01:20:34] (CR) Yurik: [C: -1] Remove more noise. (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/134984 (owner: Dr0ptp4kt)
[02:39:36] !log LocalisationUpdate completed (1.24wmf5) at 2014-05-23 02:38:33+00:00
[02:39:42] Logged the message, Master
[03:03:30] mwalker: nope!
[03:23:10] !log LocalisationUpdate completed (1.24wmf6) at 2014-05-23 03:22:07+00:00
[03:23:14] Logged the message, Master
[03:51:03] (CR) 20after4: "So it seems to be that trebuchet would be pretty cool but the main opposition is that it requires a separate instance to run trebuchet in " [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn)
[04:17:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 23 04:16:15 UTC 2014 (duration 16m 14s)
[04:17:26] Logged the message, Master
[04:29:38] (PS1) Legoktm: Allow sysops on mw.o to add/remove the autopatrolled group [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134994
[04:29:48] ori, ^d: ^
[04:32:31] legoktm: looks fine, but it'd be good to have a working hypothesis about why it went away
[04:32:35] do you see anything in ?
[04:33:11] it stripped sysops of review, validate, unreviewedpages, autoreview rights
[04:33:18] are any of those tied to autopatrolled somehow?
[04:35:09] they shouldn't be
[04:35:35] I don't see anything in the extension itself either.
[04:35:42] me neither
[04:36:35] and I don't see anything else in https://github.com/wikimedia/operations-mediawiki-config/commits/master that could have done that
[04:43:03] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[04:52:43] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[05:02:03] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[06:24:03] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[07:30:22] <_joe_> !log powercycling ms-be1007, unresponsive, console blank, no way to debug
[07:30:27] Logged the message, Master
[07:32:11] (CR) Nemo bis: [C: -1] "We don't use autopatrol, we'll continue using autochecked/autoreview (which was renamed on wiki: https://www.mediawiki.org/w/index.php?tit" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134994 (owner: Legoktm)
[07:33:33] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[07:34:59] (CR) Nemo bis: "Rillke, it's to teach them that mediawiki.org is ruled by coders. ;-) Seriously, it was very kind of Chad to remove the rights on wiki man" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134935 (owner: Chad)
[07:39:45] (PS2) Nemo bis: Allow to add/remove the autoreview group on mediawiki.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134994 (owner: Legoktm)
[07:39:53] Nemo_bis: thanks :)
[07:40:06] (CR) Nemo bis: [C: 1] Allow to add/remove the autoreview group on mediawiki.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134994 (owner: Legoktm)
[07:52:59] (CR) Rillke: "> mediawiki.org is ruled by coders."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134935 (owner: Chad)
[07:56:31] (PS3) Giuseppe Lavagetto: move contents of mail.ini to standalone file [operations/puppet] - https://gerrit.wikimedia.org/r/134636 (owner: Ori.livneh)
[08:02:43] (CR) Giuseppe Lavagetto: [C: 2] "this is a no-brainer, apparently" [operations/puppet] - https://gerrit.wikimedia.org/r/134636 (owner: Ori.livneh)
[08:09:20] (PS7) Giuseppe Lavagetto: dissolve mediawiki::config::* [operations/puppet] - https://gerrit.wikimedia.org/r/134642 (owner: Ori.livneh)
[08:50:39] (PS1) Giuseppe Lavagetto: puppet3: correct pin file extension [operations/puppet] - https://gerrit.wikimedia.org/r/135004
[10:03:31] (CR) Giuseppe Lavagetto: [C: 2] puppet3: correct pin file extension [operations/puppet] - https://gerrit.wikimedia.org/r/135004 (owner: Giuseppe Lavagetto)
[10:30:00] springle: revisions lost on mediawiki.org https://bugzilla.wikimedia.org/show_bug.cgi?id=65665
[10:32:47] Nemo_bis: thanks. looking
[10:33:01] old ones. odd
[10:34:23] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Fri 23 May 2014 07:33:18 AM UTC
[10:43:59] springle: can you send a quick bug link to ops list? so that it's on the radar when you go to sleep
[10:44:38] Nemo_bis: odd, because the example revs do appear to exist in all dbs
[10:44:47] might need dev help to track this one down
[10:46:20] yep
[10:50:03] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0]
[10:54:10] <_joe_> for the record, this was the usual spike
[10:54:19] <_joe_> checking it now
[10:56:21] <_joe_> ulsfo again
[10:56:42] <_joe_> those spikes have to do with connectivity
[10:56:50] ulsfo to eqiad?
[10:57:15] <_joe_> yes
[10:57:21] <_joe_> the other day, that was the problem
[10:57:22] we have two links between them
[10:57:28] packet loss, or?
[10:57:31] <_joe_> it happened again today
[10:57:53] <_joe_> let me check to be sure :)
[11:01:49] !log Setup BFD on GTT link between cr1-ulsfo and cr2-eqiad
[11:01:54] Logged the message, Master
[11:03:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[11:05:01] !log Setup BFD on Zayo link between cr2-ulsfo and cr1-eqiad
[11:05:06] Logged the message, Master
[11:32:37] (CR) Odder: [C: 1] "Thank you for your patience, Nemo, and for explaining this to me in detail. The patch looks OK to me." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 (owner: Nemo bis)
[11:33:03] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Fri May 23 11:33:02 UTC 2014
[12:00:16] (CR) Nemo bis: "Ok, now we're ready to ask Reedy to confirm that adding in this way to groupOverrides2 is ok. :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 (owner: Nemo bis)
[13:17:34] !log restarting jenkins because it seems stuck
[13:17:39] Logged the message, Master
[13:32:30] ^d: I tried to do a self service jenkins restart but I think it is too stuck for that. can you ssh into gallium and kill -9 it and restart it for me? I don't have rights but I'm told you do.
[13:42:34] anyone want to unstick jenkins for me? or give me access to it so I can unstick it?
[13:42:39] gallium is the machine it is on
[13:42:56] <_joe_> manybubbles: what do you mean with 'unstick'?
[13:43:31] _joe_: find the java process, grab a stack trace (maybe), maybe sniff and see if it is stuck gcing, then just kill -9 it and restart it
[13:43:45] we're used to it crashing from time to time and we just live with it
[13:43:49] <_joe_> a stack trace of a java process is useless
[13:43:50] so mostly we kill -9 and restart
[13:43:55] <_joe_> :)
[13:44:01] _joe_: :)
[13:44:11] <_joe_> a thread dump maybe, but that will take a lot of time
[13:44:26] _joe_: heap dump would, but it probably isn't worth it
[13:44:29] <_joe_> java.lang.NullPointerException
[13:44:42] thread dump is pretty quick
[13:45:13] jstack
[13:45:36] well, sudo su -s /bin/bash then jstack > /tmp/place
[13:46:22] jstat -gcutil 1s 100 will spit out if it is gc bound
[13:46:28] but who knows
[13:46:39] <_joe_> give me 2 minutes I'm looking at the logs
[13:46:44] <_joe_> I doubt it's gc
[13:47:21] <_joe_> jenkins itself is working correctly
[13:47:27] <_joe_> it's the web app that is borked
[13:48:18] it isn't taking any jobs, or, rather, wasn't
[13:48:27] so I hit it with a "self service restart"
[13:48:35] I have no clue what that means on a lower level
[13:48:41] anyway, it hasn't come back
[13:50:51] <_joe_> !log killed & started jenkins, jvm stuck, unresponsive to jstack
[13:50:55] Logged the message, Master
[13:51:13] <_joe_> INFO: Green Balls!
[13:51:20] <_joe_> gotta love the logs of java apps
[13:53:20] <_joe_> I see a ton of gearman sessions coming up
[13:53:45] <_joe_> so it's up
[13:53:46] thanks
[13:56:33] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[13:57:33] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[14:01:02] looks like something is still stuck.....
[14:01:11] hashar isn't in
[14:01:20] and I don't know anyone that knows it
[14:02:30] <_joe_> manybubbles: what is stuck?
[14:02:54] _joe_: jenkins isn't picking up jobs from https://integration.wikimedia.org/zuul/?
[14:03:09] might be that something on the zuul side is stuck but I dunno
[14:03:13] <_joe_> I was taking a look
[14:03:18] <_joe_> manybubbles: I doubt that
[14:03:34] <_joe_> manybubbles: why did you try to restart jenkins in the first place?
[14:03:44] _joe_: it wasn't picking up jobs from zuul
[14:04:12] <_joe_> ok so, it may be zuul that is not sending jobs
[14:04:13] hashar sent out an email about a month ago saying that if it was stuck and he wasn't around then to try that
[14:04:16] might be
[14:04:43] but if it wasn't responding then it might be some kind of feedback caused by jenkins getting stuck
[14:04:44] <_joe_> well, it's usually better to understand what is happening before acting
[14:05:06] <_joe_> for instance, did you try to trigger a build manually?
[14:05:21] _joe_: no - I was just following instruction
[14:05:26] <_joe_> I do see builds running on jenkins
[14:05:57] <_joe_> so I'd say it's zuul
[14:06:13] <_joe_> let me study how it works.
[14:07:25] _joe_: I just did kick off a build and it hasn't picked it up yet. It's not waiting long enough for me to declare something broke, but it is odd
[14:08:27] <_joe_> manybubbles: from jenkins itself?
[14:08:42] <_joe_> because I did send a job to a slave
[14:08:52] <_joe_> and it worked fine
[14:08:57] <_joe_> and I do see other jobs
[14:10:03] _joe_: yeah, I just kicked off a job from jenkins' interface.
[14:10:19] it's still waiting for an executor
[14:10:22] <_joe_> manybubbles: https://www.mail-archive.com/wikibugs-l@lists.wikimedia.org/msg260625.html
[14:11:15] <_joe_> manybubbles: the url of your job?
[14:11:45] _joe_: https://integration.wikimedia.org/ci/job/search-highlighter/ is where I see it but the link isn't live for it yet because it hasn't started
[14:13:16] <_joe_> manybubbles: the problem is zuul I'd say
[14:14:53] <_joe_> manybubbles: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ this job has just executed, for example
[14:15:38] <_joe_> and zuul is not restarting.
[14:15:47] _joe_: weird - my manual one still hasn't though I don't normally start them manually so I can't be sure that it even works
[14:15:52] like, zuul won't come back up?
[14:16:13] <_joe_> like it won't stop :)
[14:16:16] _joe_: Queue only mode: preparing to exit, queue length: 6
[14:16:20] that is what the u said
[14:16:23] ui
[14:19:47] now queue length 11
[14:19:53] so, probably not going to ever stop....
[14:21:14] <_joe_> !log killed zuul server, as was stuck
[14:21:19] Logged the message, Master
[14:23:18] _joe_: that seems to have helped
[14:23:37] rather, I could kick off a build manually
[14:24:39] <_joe_> manybubbles: really?
[14:24:48] <_joe_> manybubbles: zuul refuses to start
[14:25:02] <_joe_> I know that was the reason why jenkins was stuck
[14:25:07] _joe_: well, it _looked_ like it was working
[14:26:53] (PS1) Mark Bergsma: Fix indent [operations/puppet] - https://gerrit.wikimedia.org/r/135035
[14:26:55] (PS1) Mark Bergsma: Add ulsfo targets to smokeping monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/135036
[14:28:05] <_joe_> manybubbles: I am tired and dumb, sorry
[14:28:12] <_joe_> it should be started now
[14:29:44] _joe_: that is better when I start my own job
[14:29:53] <_joe_> and I'd say it's sending everything in the right place
[14:30:08] it hasn't started any of the queued jobs yet, but maybe it is getting to i
[14:30:09] it
[14:30:17] (CR) Mark Bergsma: [C: 2 V: 2] Fix indent [operations/puppet] - https://gerrit.wikimedia.org/r/135035 (owner: Mark Bergsma)
[14:30:37] (CR) Mark Bergsma: [C: 2 V: 2] Add ulsfo targets to smokeping monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/135036 (owner: Mark Bergsma)
[14:34:12] (PS1) Mark Bergsma: Move bast4001 to a new Hosts sub menu [operations/puppet] - https://gerrit.wikimedia.org/r/135037
[14:34:31] _joe_: it still hasn't picked up many jobs....
[14:34:40] (CR) Mark Bergsma: [C: 2 V: 2] Move bast4001 to a new Hosts sub menu [operations/puppet] - https://gerrit.wikimedia.org/r/135037 (owner: Mark Bergsma)
[14:35:24] <_joe_> manybubbles: well, as far as the ops team is concerned, it is working now, so don't touch it for the moment
[14:35:38] <_joe_> manybubbles: else, clean up after your mess :)
[14:35:38] k
[14:35:56] <_joe_> manybubbles: if something is still not working in 30 mins, we'll check again
[14:35:58] there it goes
[14:36:46] <_joe_> sorry but I just figured out the coarse architecture of the zuul-jenkins integration and I'm not confident troubleshooting it
[14:50:49] hello
[14:51:37] <_joe_> hashar: ciao :)
[14:52:06] was traveling early and just woke up from nap
[14:52:11] poor Jenkins/Zuul died grr
[14:52:28] <_joe_> there was zuul stuck, I found an old mail by krinkle on the zuul dev ml
[14:52:36] <_joe_> restarted it, it worked
[14:52:40] (CR) Springle: [C: 1] "Reedy, is this appropriate:" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134103 (owner: Manybubbles)
[14:52:40] !log killed -9 a remaining Jenkins process
[14:52:45] Logged the message, Master
[14:53:04] yeah for some reason Zuul/Jenkins get stuck somehow :-(
[14:53:37] there is a gearman bus in between that is most probably causing jobs to slowly disappear
[14:54:12] hashar: when we killed it it lost the pipeline
[14:54:23] <_joe_> hashar: jenkins had a ton of gearman slots taken but they hung
[14:54:35] <_joe_> manybubbles: no, the zuul part was stuck in the first place
[14:54:52] <_joe_> we did restart jenkins but I'm not sure it was needed at all
[14:55:37] _joe_: yeah, I don't think bouncing jenkins was needed but zuul had a bunch of work waiting before the restart that it (probably) never sent to jenkins after
[14:56:00] usually restarting Jenkins solves it
[14:56:21] Zuul keeps the job around and Jenkins will reregister all its jobs with Zuul server
[14:56:36] <_joe_> hashar: looking at the
logs it was pretty clear the problem was on zuul's side
[14:59:40] so easiest way is usually to kill jenkins; The init script is crap though and leaves the java process running behind :-(
[15:13:54] (CR) BryanDavis: "There is a shared salt master for labs (on virt1000?), but it is outside the control of the labs users (only available to roots). Setting " [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn)
[15:14:09] (PS1) Faidon Liambotis: smokeping: also monitor esams (network + bastion) [operations/puppet] - https://gerrit.wikimedia.org/r/135045
[15:14:44] (CR) Faidon Liambotis: [C: 2 V: 2] smokeping: also monitor esams (network + bastion) [operations/puppet] - https://gerrit.wikimedia.org/r/135045 (owner: Faidon Liambotis)
[15:57:28] (PS1) Giuseppe Lavagetto: puppet_compiler: add ferm rule to allow web access [operations/puppet] - https://gerrit.wikimedia.org/r/135050
[16:07:18] (PS1) Rush: irc bot fixup for self.bot namespace exception [operations/puppet] - https://gerrit.wikimedia.org/r/135052
[16:08:16] (CR) Rush: [C: 2 V: 2] "needed for bot to run" [operations/puppet] - https://gerrit.wikimedia.org/r/135052 (owner: Rush)
[16:17:43] !log Elasticsearch on logstash1002 dead due to OOM at 2014-05-23T00:34:03Z
[16:17:52] Logged the message, Master
[16:20:01] !log restarted elasticsearch on logstash1002
[16:20:05] Logged the message, Master
[16:33:04] greg-g: As soon as the logstash elasticsearch cluster is unbroken (logstash1002 rebuilding replicas at the moment), I think I should upgrade the elasticsearch version there.
[16:33:28] I just did the beta cluster and logstash seems to be working fine.
[16:36:10] bd808: doit
[16:59:13] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[17:02:15] !log Starting rolling update of elasticsearch for logstash cluster
[17:02:21] Logged the message, Master
[17:08:29] i get a database error (after a long wait) at https://en.wikipedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Module%3ANavbar&namespace=10
[17:08:39] A database query error has occurred. This may indicate a bug in the software. Function: SpecialWhatLinksHere::showIndirectLinks Error: 0
[17:20:11] If anyone wants to ack the icinga warnings for logstash100[123] that would be cool. Apparently I don't have the rights to do so.
[17:29:49] (CR) Mwalker: [C: 2] Enable the new PDF renderer in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134977 (owner: Mwalker)
[17:30:00] (Merged) jenkins-bot: Enable the new PDF renderer in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134977 (owner: Mwalker)
[17:48:05] oooh
[17:48:18] * Nemo_bis pats mwalker
[17:48:35] heh; it doesn't work yet; but the link is there
[17:48:47] Hah
[17:49:53] I'll get it today; I think there's a configuration issue somewhere; but I have to do some fundraising work first; then figure out why the bundler isn't logging; and then fix whatever the underlying issue is
[17:49:58] *small steps*!
[17:57:38] Reedy or someone else: can you check and deploy https://gerrit.wikimedia.org/r/#/c/133228/
[17:57:47] it has been asked by a user again :/
[18:05:35] (PS1) Rush: send user and channel count to statsd for ircd [operations/puppet] - https://gerrit.wikimedia.org/r/135074
[18:06:44] (CR) jenkins-bot: [V: -1] send user and channel count to statsd for ircd [operations/puppet] - https://gerrit.wikimedia.org/r/135074 (owner: Rush)
[18:09:57] (PS2) Rush: send user and channel count to statsd for ircd [operations/puppet] - https://gerrit.wikimedia.org/r/135074
[18:11:04] (CR) jenkins-bot: [V: -1] send user and channel count to statsd for ircd [operations/puppet] - https://gerrit.wikimedia.org/r/135074 (owner: Rush)
[18:12:22] (CR) Jgreen: [C: 1] Move OCG default port to 8000 [operations/puppet] - https://gerrit.wikimedia.org/r/134975 (owner: Mwalker)
[18:14:58] (PS2) Mwalker: Move OCG default port to 8000 [operations/puppet] - https://gerrit.wikimedia.org/r/134975
[18:18:17] (PS3) Rush: send user and channel count to statsd for ircd [operations/puppet] - https://gerrit.wikimedia.org/r/135074
[18:18:30] (CR) Jgreen: [C: 1 V: 2] Move OCG default port to 8000 [operations/puppet] - https://gerrit.wikimedia.org/r/134975 (owner: Mwalker)
[18:19:51] (CR) Jgreen: [C: 1 V: 2] add wikimedia.community, link to wikimedia.com [operations/dns] - https://gerrit.wikimedia.org/r/134836 (owner: Dzahn)
[18:20:18] (CR) Jgreen: [C: 2] add wikimedia.community, link to wikimedia.com [operations/dns] - https://gerrit.wikimedia.org/r/134836 (owner: Dzahn)
[18:22:19] !log ran authdns-update to merge new wikimedia.community dns zone
[18:22:25] Logged the message, Master
[18:25:15] (CR) Dr0ptp4kt: Remove more noise.
(1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/134984 (owner: Dr0ptp4kt)
[18:34:38] (PS4) Rush: send user and channel count to statsd for ircd [operations/puppet] - https://gerrit.wikimedia.org/r/135074
[18:52:36] (CR) Dzahn: bast1001 to admin yaml (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[18:57:37] @seen hoo
[19:01:04] hoo: Think twkozlowski wants you :p
[19:01:34] * hoo hides :D
[19:01:37] What's up?
[19:03:21] (CR) Dzahn: [C: -1] "the following users have no keys:" [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:03:52] (CR) Gergő Tisza: "$wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] = 5 / 3600" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132112 (owner: Gergő Tisza)
[19:03:56] (CR) Rush: bast1001 to admin yaml (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:11:55] (CR) Rush: bast1001 to admin yaml (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:12:56] (CR) Rush: "mhoover -- disabled user -- removed entirely" [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:15:06] (CR) Dzahn: "all the keys in data.yaml match existing keys in admins.pp" [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:16:15] (PS1) MaxSem: Kill all vestiges of GeoData's Solr support [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/135088
[19:16:27] (PS2) Rush: bast1001 to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134921
[19:16:29] (PS2) Rush: data.yaml sanity testing [operations/puppet] - https://gerrit.wikimedia.org/r/134922
[19:17:39] (Abandoned) Rush: data.yaml sorted [operations/puppet] - https://gerrit.wikimedia.org/r/134923 (owner: Rush)
[19:18:17] (CR) jenkins-bot: [V: -1] data.yaml sanity
testing [operations/puppet] - https://gerrit.wikimedia.org/r/134922 (owner: Rush)
[19:18:28] (CR) MaxSem: [C: -2] "Will be deployed later." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/135088 (owner: MaxSem)
[19:20:33] * anomie is going to deploy a fix for bug 65665 to wmf6 in a few minutes, unless someone objects. greg-g already approved it.
[19:24:08] (CR) Dzahn: bast1001 to admin yaml (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:27:08] (CR) Dzahn: [C: 1] bast1001 to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:27:52] (PS3) Rush: bast1001 to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134921
[19:27:59] (CR) Rush: [C: 2 V: 2] "go" [operations/puppet] - https://gerrit.wikimedia.org/r/134921 (owner: Rush)
[19:28:03] woot
[19:30:23] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Fri 23 May 2014 04:30:05 PM UTC
[19:35:35] (PS1) Rush: nimishg does not exist [operations/puppet] - https://gerrit.wikimedia.org/r/135091
[19:35:47] (PS3) Rush: data.yaml sanity testing [operations/puppet] - https://gerrit.wikimedia.org/r/134922
[19:35:51] (CR) jenkins-bot: [V: -1] nimishg does not exist [operations/puppet] - https://gerrit.wikimedia.org/r/135091 (owner: Rush)
[19:36:04] (CR) Rush: [C: 2 V: 2] "merging to get to later fixes, also this is just POC" [operations/puppet] - https://gerrit.wikimedia.org/r/134922 (owner: Rush)
[19:36:14] (CR) Rush: [C: 2 V: 2] nimishg does not exist [operations/puppet] - https://gerrit.wikimedia.org/r/135091 (owner: Rush)
[19:36:15] (PS2) Rush: nimishg does not exist [operations/puppet] - https://gerrit.wikimedia.org/r/135091
[19:36:21] (CR) Rush: [V: 2] "go" [operations/puppet] - https://gerrit.wikimedia.org/r/135091 (owner: Rush)
[19:43:18] !log anomie synchronized
php-1.24wmf6/includes/HistoryBlob.php 'Backport fix for bug 65665 to 1.24wmf6 [[gerrit:135089]]'
[19:43:23] Logged the message, Master
[19:44:32] (PS1) Rush: misc admin yaml includes [operations/puppet] - https://gerrit.wikimedia.org/r/135094
[19:47:17] (CR) Rush: [C: 1] migrate cassandra roots to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134733 (owner: Dzahn)
[19:53:02] (PS4) Dr0ptp4kt: Remove more noise. [operations/puppet] - https://gerrit.wikimedia.org/r/134984
[19:53:07] (PS1) Rush: fenari to admins yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135095
[19:53:46] (CR) Dr0ptp4kt: "Feedback incorporated into PS4." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/134984 (owner: Dr0ptp4kt)
[19:55:07] (PS1) Rush: fluorine.eqiad.wmnet to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135096
[19:57:17] (PS2) Dzahn: migrate cassandra roots to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134733
[20:00:43] (PS1) Rush: misc hosts for admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135099
[20:05:58] (PS1) Rush: updating ops list [operations/puppet] - https://gerrit.wikimedia.org/r/135100
[20:06:15] (CR) jenkins-bot: [V: -1] updating ops list [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush)
[20:06:18] (PS3) Dzahn: migrate cassandra roots to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134733
[20:06:56] (PS2) Rush: misc admin yaml includes [operations/puppet] - https://gerrit.wikimedia.org/r/135094
[20:07:07] (PS2) Rush: fenari to admins yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135095
[20:07:22] (CR) Dzahn: [C: 2] migrate cassandra roots to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134733 (owner: Dzahn)
[20:09:18] (PS4) Dzahn: migrate cassandra roots to admin yaml [operations/puppet] -
https://gerrit.wikimedia.org/r/134733
[20:10:08] (CR) Dzahn: [C: 2] migrate cassandra roots to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/134733 (owner: Dzahn)
[20:19:37] (PS1) Dzahn: fix format of cassandra roots class [operations/puppet] - https://gerrit.wikimedia.org/r/135105
[20:20:18] (CR) Yurik: Remove more noise. (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/134984 (owner: Dr0ptp4kt)
[20:22:24] (PS2) Dzahn: fix format of cassandra roots class [operations/puppet] - https://gerrit.wikimedia.org/r/135105
[20:23:07] (PS3) Dzahn: fix format of cassandra roots class [operations/puppet] - https://gerrit.wikimedia.org/r/135105
[20:23:27] (CR) Dzahn: [C: 2] fix format of cassandra roots class [operations/puppet] - https://gerrit.wikimedia.org/r/135105 (owner: Dzahn)
[20:23:35] (CR) Dzahn: [V: 2] fix format of cassandra roots class [operations/puppet] - https://gerrit.wikimedia.org/r/135105 (owner: Dzahn)
[20:26:21] (CR) Dzahn: "yep, that fixed it (cerium et al)" [operations/puppet] - https://gerrit.wikimedia.org/r/135105 (owner: Dzahn)
[20:28:08] (CR) Dzahn: [C: 1] misc admin yaml includes [operations/puppet] - https://gerrit.wikimedia.org/r/135094 (owner: Rush)
[20:28:27] (PS3) Rush: misc admin yaml includes [operations/puppet] - https://gerrit.wikimedia.org/r/135094
[20:28:34] (CR) Rush: [C: 2 V: 2] "go" [operations/puppet] - https://gerrit.wikimedia.org/r/135094 (owner: Rush)
[20:31:35] (PS5) Dr0ptp4kt: Remove more noise. [operations/puppet] - https://gerrit.wikimedia.org/r/134984
[20:32:19] (CR) Dr0ptp4kt: "See PS5.
Grouped www..com" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/134984 (owner: Dr0ptp4kt)
[20:34:21] (PS3) Dzahn: fenari to admins yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135095 (owner: Rush)
[20:34:36] (PS4) Rush: fenari to admins yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135095
[20:34:43] (CR) Rush: [C: 1] fenari to admins yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135095 (owner: Rush)
[20:35:21] (CR) Rush: [C: 2 V: 2] fenari to admins yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135095 (owner: Rush)
[20:38:26] (CR) Dzahn: [C: -2] "needs to be done in yaml now and is about downgrading users from restricted to bastiononly" [operations/puppet] - https://gerrit.wikimedia.org/r/126941 (owner: Dzahn)
[20:39:41] (PS2) Dzahn: fluorine.eqiad.wmnet to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135096 (owner: Rush)
[20:40:52] (CR) Dzahn: [C: 2] fluorine.eqiad.wmnet to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135096 (owner: Rush)
[20:59:09] (PS1) Andrew Bogott: Replace the bits of labsmediawiki that aren't wikidata-related. [operations/puppet] - https://gerrit.wikimedia.org/r/135107
[20:59:43] chasemp: still around by any chance? :)
[21:00:01] chasemp: mutante: I can get the admin data_admin.py lint command added in Jenkins for linting the admin data.yaml :D
[21:00:05] (PS2) Andrew Bogott: Replace the bits of labsmediawiki that aren't wikidata-related.
[operations/puppet] - https://gerrit.wikimedia.org/r/135107
[21:00:13] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Fri May 23 21:00:05 UTC 2014
[21:00:15] (PS1) Hashar: admin: make data linter exit 1 on error [operations/puppet] - https://gerrit.wikimedia.org/r/135108
[21:00:17] (PS1) Hashar: admin: wrap data_admin.py lint in a tox env [operations/puppet] - https://gerrit.wikimedia.org/r/135109
[21:00:22] (CR) jenkins-bot: [V: -1] admin: make data linter exit 1 on error [operations/puppet] - https://gerrit.wikimedia.org/r/135108 (owner: Hashar)
[21:00:29] if I knew how to use puppet
[21:01:33] (CR) Hashar: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/135108 (owner: Hashar)
[21:01:56] (CR) jenkins-bot: [V: -1] admin: wrap data_admin.py lint in a tox env [operations/puppet] - https://gerrit.wikimedia.org/r/135109 (owner: Hashar)
[21:03:02] (CR) Andrew Bogott: [C: 2] Replace the bits of labsmediawiki that aren't wikidata-related.
[operations/puppet] - https://gerrit.wikimedia.org/r/135107 (owner: Andrew Bogott)
[21:05:29] (PS1) Hashar: admin: lint data_admin.py [operations/puppet] - https://gerrit.wikimedia.org/r/135110
[21:06:56] (CR) jenkins-bot: [V: -1] admin: lint data_admin.py [operations/puppet] - https://gerrit.wikimedia.org/r/135110 (owner: Hashar)
[21:07:59] (PS2) Hashar: admin: lint data_admin.py [operations/puppet] - https://gerrit.wikimedia.org/r/135110
[21:15:48] (PS2) Hashar: admin: make data linter exit 1 on error [operations/puppet] - https://gerrit.wikimedia.org/r/135108
[21:16:10] (PS2) Hashar: admin: wrap data_admin.py lint in a tox env [operations/puppet] - https://gerrit.wikimedia.org/r/135109
[21:22:33] (PS2) Dzahn: helium,holmium,hooft,manutius,iron to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135099 (owner: Rush)
[21:22:49] (CR) jenkins-bot: [V: -1] helium,holmium,hooft,manutius,iron to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135099 (owner: Rush)
[21:23:16] (PS3) Dzahn: helium,holmium,hooft,manutius,iron to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135099 (owner: Rush)
[21:27:20] (CR) Dzahn: [C: 2] helium,holmium,hooft,manutius,iron to admin yaml [operations/puppet] - https://gerrit.wikimedia.org/r/135099 (owner: Rush)
[21:29:11] hey, deployers, did you know you also have hooft, btw?
[21:29:18] is that ever used [21:29:20] (CR) Hashar: "Jenkins job defined with https://gerrit.wikimedia.org/r/135111" [operations/puppet] - https://gerrit.wikimedia.org/r/135109 (owner: Hashar) [21:29:58] box in esams where you have shells [21:30:28] mutante: poke greg-g about hooft server :-) [21:31:06] mortals are not called mortals anymore [21:31:11] you are now deployers [21:31:35] and you all have a shell in Europe as well [21:33:24] hashar: yay for lint admin stuff [21:33:35] i assume you talked to chase [21:34:41] mutante: hooft I never heard about it. Might want to remind people about it :) [21:34:54] the lint hmm. there is a few puppet changes that needs to be reviewed/merged [21:35:04] and I can get it enabled [21:36:06] mutante: I have added both chase and you as reviewers :D [21:37:25] alright!:) [21:37:28] thx [21:37:43] hashar: have you seen this before? [21:37:51] FATAL: Unable to delete script file /tmp/hudson8543464839401118848.sh [21:37:52] if you two can review it this afternoon, I can deploy the job over the week-end or next monday [21:37:55] doh [21:37:57] never [21:37:59] as a reason for failed builds [21:38:01] Job ? 
[21:38:11] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35/console [21:38:18] it's the puppet3 compiler thing [21:38:34] but that error sounded like not related to actual change [21:39:01] _joe_ and I have added that job this week [21:39:09] yea, he asked me to rebuild it [21:39:13] that is #36 [21:39:31] currently running [21:39:46] i had not really logged in on jenkins before and manually triggered that [21:39:58] just looked without logging in [21:40:03] the file did not get deleted indeed [21:40:22] <_joe_> mutante: the 35 had failed because I rebooted the slave [21:40:23] <_joe_> :) [21:40:27] sounds like a bug in Jenkins :-( [21:40:35] _joe_: aaah:) [21:41:22] ahhhh [21:41:31] mutante: I think the master Jenkins has lost connection with the instance [21:42:39] or something weird along those lines. Anyway, mutante, you can restart your job I guess [21:42:56] what _joe_ said (slave rebooted hehe) [21:43:33] yep, i have clicked rebuild earlier [21:43:36] _joe_: I am super happy to see you namespaces the diff results http://puppet-compiler.wmflabs.org/change/134642/html/ [21:43:39] \O/ [21:44:36] <_joe_> hashar: that is the change id [21:44:44] <_joe_> hashar: namespace for individual builds will come next week [21:45:56] sounds great [21:46:08] mutante: I used hooft *once* since I started. I was debugging cache purge issues and Mar-k told me about it. [21:46:08] I am off for this week [21:46:43] bye hashar o/ [21:46:44] mutante: if you can get chasemp to review the admin lint change that would be very nice :) [21:47:01] bd808: :) [21:48:01] hashar: i have like one more of Chase's changes i'm reviewing.. i'll look, enjoy your weekend [21:49:40] more or less :) [21:49:46] I am writing some doc tomorrow ! 
[21:51:31] * hashar vanishes [22:00:08] (CR) Dzahn: [C: -1] "keys for Tim and Domas need fixing" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:00:21] hm [22:00:28] no worries domas [22:00:36] we are just switching it to yaml based admin file [22:00:48] and i'll fix it before merging that [22:00:51] ok [22:00:54] meeting! [22:17:30] (CR) Dzahn: [C: 1] "taking that back, it's just that they actually have multiple keys unlike other users. and it should work, array of strings" [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:17:55] (PS2) Dzahn: updating ops list [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:19:10] _joe_: and #36 success [22:20:23] <_joe_> mutante: output looks good as well - no evident regressions [22:20:41] <_joe_> although the test is not really probing on your changes [22:21:30] (PS8) Ori.livneh: dissolve mediawiki::config::* [operations/puppet] - https://gerrit.wikimedia.org/r/134642 [22:21:54] (CR) Ori.livneh: "Bump. This doesn't represent the end-all-be-all of the refactoring. I want to do this in small pieces, and being stalled sucks :/" [operations/puppet] - https://gerrit.wikimedia.org/r/134642 (owner: Ori.livneh) [22:26:33] (CR) BryanDavis: "Drive-by yaml style comment" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:27:58] (CR) Dzahn: updating ops list (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:41:55] (PS3) Dzahn: updating ops list [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:46:47] (CR) Dzahn: [C: 2] updating ops list [operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:48:33] (CR) Dzahn: "yaml style comments: done :) thx!" 
[operations/puppet] - https://gerrit.wikimedia.org/r/135100 (owner: Rush) [22:55:31] (CR) Dzahn: [C: 2] admin: lint data_admin.py [operations/puppet] - https://gerrit.wikimedia.org/r/135110 (owner: Hashar) [22:57:26] (CR) Dzahn: [C: 2] admin: make data linter exit 1 on error [operations/puppet] - https://gerrit.wikimedia.org/r/135108 (owner: Hashar) [23:03:23] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Fri 23 May 2014 08:02:59 PM UTC [23:06:36] icinga-wm_: lies [23:07:18] (CR) Dzahn: [C: 2] admin: wrap data_admin.py lint in a tox env [operations/puppet] - https://gerrit.wikimedia.org/r/135109 (owner: Hashar) [23:08:34] (PS1) Ori.livneh: Rather than suppress with '@', call isset() first [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/135125 [23:09:23] (CR) MaxSem: [C: 1] Rather than suppress with '@', call isset() first [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/135125 (owner: Ori.livneh) [23:10:24] (CR) Ori.livneh: [C: 2] Rather than suppress with '@', call isset() first [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/135125 (owner: Ori.livneh) [23:10:32] (Merged) jenkins-bot: Rather than suppress with '@', call isset() first [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/135125 (owner: Ori.livneh) [23:12:44] (CR) 20after4: "So most of phabricator's configuration is stored in the database. And the schema needs to be set up somehow. 
So there are a lot of manual" [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn) [23:22:23] PROBLEM - Puppet freshness on caesium is CRITICAL: Last successful Puppet run was Fri 23 May 2014 08:21:46 PM UTC [23:23:00] (PS1) Aaron Schulz: Avoid letting warnings into the runJobs.php DB name param [operations/puppet] - https://gerrit.wikimedia.org/r/135133 [23:29:38] (PS1) Dzahn: let reseacher group read researchdb password file [operations/puppet] - https://gerrit.wikimedia.org/r/135134 [23:30:52] (PS2) Dzahn: let reseacher group read researchdb password file [operations/puppet] - https://gerrit.wikimedia.org/r/135134 [23:34:12] (CR) Ori.livneh: "I don't think this will fix it:" [operations/puppet] - https://gerrit.wikimedia.org/r/135133 (owner: Aaron Schulz) [23:34:47] (CR) Dzahn: "Change-Id: I9c63c560f4" [operations/puppet] - https://gerrit.wikimedia.org/r/122401 (owner: Ottomata) [23:35:37] (PS1) Aaron Schulz: Removed maxvirtualmemory stuff from jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/135136 [23:39:44] (CR) Ori.livneh: [C: 1] Removed maxvirtualmemory stuff from jobs-loop [operations/puppet] - https://gerrit.wikimedia.org/r/135136 (owner: Aaron Schulz) [23:42:37] (PS5) Dzahn: admins: add manybubbles and elasticsearch group [operations/puppet] - https://gerrit.wikimedia.org/r/134796 (owner: Matanya) [23:44:31] (PS6) Dzahn: admins yaml - add elasticsearch group [operations/puppet] - https://gerrit.wikimedia.org/r/134796 (owner: Matanya) [23:47:45] (CR) Dzahn: [C: 2] admins yaml - add elasticsearch group [operations/puppet] - https://gerrit.wikimedia.org/r/134796 (owner: Matanya) [23:48:41] (PS3) Dzahn: admins: add elasticsearch-roots to elasticsearch nodes [operations/puppet] - https://gerrit.wikimedia.org/r/134797 (owner: Matanya) [23:52:39] (CR) Dzahn: [C: 2] admins: add elasticsearch-roots to elasticsearch nodes 
[operations/puppet] - https://gerrit.wikimedia.org/r/134797 (owner: Matanya)