[00:00:11] the other one would be really low but not 0 [00:01:22] binasher: heh, 1e9 works [00:01:53] a very high value for server_failure_limit would be good, but probably just 1 for retry_timeout [00:02:15] that's what I was thinking, want to try that? [00:03:04] there's a dead timeout option too [00:03:53] paravoid: are you looking at libmemcached? [00:04:22] I did [00:04:31] I don't know what else to look [00:04:44] look into that is [00:05:33] what do you need? [00:05:43] the pecl extension doesn't have constants for everything, and is a bit out of sync with libmemcached [00:05:46] * AzaToth throws a http://paste.debian.net/ towards paravoid  [00:06:08] probably more out of sync now [00:07:00] (on an unrelated note, I'm glad you didn't use this: https://github.com/wuakitv/puppet-twemproxy/blob/master/manifests/install.pp ) [00:07:13] heh [00:07:20] New patchset: Aaron Schulz; "Tweaked memcached options as 0 is not valid." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68319 [00:07:26] https://graphite.wikimedia.org/render/?width=801&height=394&_salt=1371080615.671&target=Setup.php-memcached.tavg [00:08:09] so 12.5 vs. 3.75 vs. 0.75 or something [00:08:15] not bad :) [00:08:25] s/tavg/tp90 [00:08:41] good point [00:08:53] same pattern :) [00:08:59] that's pretty cool [00:11:22] we could abandon twemproxy, the libmemcached fix is significant enough [00:11:33] lol [00:11:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68319 [00:11:46] you had other improvements in mind too, didn't you? [00:11:54] with twemproxy I mean [00:12:07] eh, just the site not going down [00:13:06] that's always a nice goal [00:13:14] what specifically though? 
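The effect of the settings being tested above (server_failure_limit pushed to 1e9, retry_timeout dropped to 1) is that a flaky memcached server is never permanently ejected, only skipped for about a second after each failure. A toy Python model of that accounting; this is an illustrative sketch, not libmemcached's actual implementation, and all names here are made up:

```python
class ServerState:
    """Toy model of libmemcached-style per-server failure accounting.

    Illustrative sketch only: a server is ejected after `failure_limit`
    consecutive failures, and a recently-failed server is skipped until
    `retry_timeout` seconds have passed since the last failure.
    """

    def __init__(self, failure_limit=int(1e9), retry_timeout=1.0):
        self.failure_limit = failure_limit
        self.retry_timeout = retry_timeout
        self.failures = 0
        self.last_failure = None

    def record_failure(self, now):
        self.failures += 1
        self.last_failure = now

    def record_success(self):
        self.failures = 0
        self.last_failure = None

    def usable(self, now):
        if self.failures >= self.failure_limit:
            return False  # ejected -- with a limit of 1e9 this never triggers
        if self.last_failure is not None and now - self.last_failure < self.retry_timeout:
            return False  # inside the retry window: skipped, not ejected
        return True


s = ServerState()
s.record_failure(now=0.0)
print(s.usable(now=0.5))  # False: still inside the 1-second retry window
print(s.usable(now=1.5))  # True: retried again, never permanently dropped
```

With these values a downed host costs at most one failed attempt per second instead of being dropped from the pool, which is why "it didn't fail 1e9 consecutive times first" is the expected behaviour.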
:) [00:13:28] !log asher synchronized wmf-config/mc-eqiad.php 'large server_failure_limit, small retry_timeout for twemproxy test hosts' [00:13:37] Logged the message, Master [00:14:23] * Aaron|home hmms at https://graphite.wikimedia.org/render/?width=840&height=398&_salt=1371082426.763&target=MemcachedPeclBagOStuff.getMulti.tp90 [00:14:42] paravoid: better timeout handling of downed hosts and rehashing keys of ejected hosts [00:15:09] sounds nice [00:15:19] if we end up using it, I'll create a proper package and upload it to Debian [00:15:22] :-) [00:15:32] I assume there would be less packets in the air too, less to drop heh [00:16:16] hrm, getmulti times [00:16:59] same as get btw [00:17:35] it may just be the deferred connections showing up in the timing [00:17:52] that would make sense [00:17:54] presumably it no longer dumbly connections to everything anymore in startup [00:19:08] New patchset: Asher; "try twemproxy on all eqiad hosts" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68322 [00:20:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68322 [00:21:19] !log asher synchronized wmf-config/mc-eqiad.php 'trying twemproxy on all eqiad hosts' [00:21:29] Logged the message, Master [00:23:11] Aaron|home: still some occurrences of SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [00:23:35] not many though, odd [00:23:35] i'm pretty sure it didn't fail 1e9 consecutive times first [00:23:41] !log ceph readding osd.81, osd.125, osd.137, osd.141 after disk replacement (#5202, #5228, #5248, #5263) [00:23:49] Logged the message, Master [00:23:55] 2013-06-13 00:23:50.293977 mon.0 [INF] pgmap v8556272: 16760 pgs: 15476 active+clean, 1237 active+remapped+wait_backfill, 40 active+remapped+backfilling, 7 active+clean+scrubbing+deep; 44673 GB data, 138 TB used, 123 TB / 261 TB avail; 21536152/854606636 degraded (2.520%); recovering 701 o/s, 110MB/s [00:24:06] 144MB/s etc., not bad [00:24:15] 
binasher: right, I was wondering if there is some other criteria [00:24:30] Aaron|home: https://bugs.php.net/bug.php?id=60049 [00:24:34] not great either, but certainly much better than swift :) [00:25:02] binasher: you should merge https://gerrit.wikimedia.org/r/#/c/67316/ btw ;) [00:26:26] you don't want to reply to nikerabbit / double hash every key or add a str length check to wfMemcKey?! [00:27:51] New patchset: Asher; "disable persistent conns to twemproxy (test)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68324 [00:28:13] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68324 [00:28:19] New patchset: Dzahn; "remove Apache_site[no_default] line, this is now in webserver.pp and creates a duplicate definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68325 [00:28:59] !log asher synchronized wmf-config/mc-eqiad.php 'disable persistent conns to twmprxy' [00:29:06] Logged the message, Master [00:30:18] binasher: I see libmemcached does unix socket as well [00:30:33] maybe it could connect via unix socket to twemproxy? [00:30:35] twemproxy supports that as well [00:30:41] that could be better [00:30:55] well, we're down to < 1ms, so it doesn't get a lot better I guess [00:31:01] New review: Dzahn; "fix duplicate def." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/68325 [00:31:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68325 [00:31:14] but for now i'm going to head home.. think i should revert twemproxy for now? [00:31:38] what was what TimStarling said during one of the eqiad meetings? [00:32:05] i wonder if ruwiki represents most of the apache traffic [00:32:10] oh? [00:33:14] nope, just most of the SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY requests.. odd [00:33:46] non-latin1 key names? 
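For context on the key-check idea floated here: the memcached text protocol caps keys at 250 bytes and forbids whitespace and control characters, so a length/character check at the wfMemcKey level would catch keys that can corrupt the command stream. A sketch of such a check; the function name is illustrative, not MediaWiki's:

```python
MAX_KEY_LEN = 250  # memcached text-protocol key limit, in bytes

def is_valid_memcached_key(key: str) -> bool:
    """Return True if `key` is safe for the memcached text protocol:
    non-empty, at most 250 bytes once encoded, and free of whitespace
    and control characters (which would corrupt the command stream).
    """
    raw = key.encode("utf-8")
    if not raw or len(raw) > MAX_KEY_LEN:
        return False
    # Bytes <= 0x20 cover space and the C0 control chars; 0x7F is DEL.
    return not any(b <= 0x20 or b == 0x7F for b in raw)

print(is_valid_memcached_key("enwiki:page:12345"))  # True
print(is_valid_memcached_key("two words"))          # False: contains a space
print(is_valid_memcached_key("k" * 300))            # False: over 250 bytes
```

Note that non-latin1 key names are fine by these rules once UTF-8-encoded (bytes above 0x7F pass); the protocol only rejects whitespace/control bytes and over-length keys.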
[00:33:50] Ryan_Lane: @wikistats-01 - Finished catalog run in 54.25 seconds (side error: E: Couldn't find package libweb-scraper-perl, but doesn't break runs) [00:34:10] awesome. thanks [00:34:43] paravoid: could be.. there aren't any in the actual "SERVER HAS FAILED" messages, but a prior non-latin1 key could be triggering something [00:35:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:39] Aaron|home: opinion on whether or not i should revert before heading home? [00:36:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [00:36:27] * Aaron|home checks the logs [00:37:14] Ryan_Lane: oh, wow, somebody merged stuff in that but meant the other wikistats i believe :p [00:37:53] https://github.com/wikimedia/analytics-wikistats/tree/master/pageviews_reports/t [00:38:03] heh, that's so unrelated in this file:) [00:38:10] binasher: it can be left if people are around that know how to revert [00:38:20] I'm not planning to stay around [00:40:20] * paravoid is annoyed by the twemproxy vs. nutcracker duality [00:40:20] I mean the 1 second retry makes it unlikely to explode, and it is easy to revert [00:40:21] hmm, to email ops@ or just revert… [00:40:25] i like http://bit.ly/18z9F7m [00:40:26] gah just pick oneee [00:41:05] i'm just glad we aren't using twemcached, saying that would bug me [00:42:46] okay, packaging seems trivial [00:42:57] a single embedded library that is unmodified, so easily worked around [00:43:11] good copyright/license status, well documented [00:44:55] New patchset: Dzahn; "Revert "wikistats packages needed for Jenkins environment"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [00:47:48] New review: Dzahn; "Hey all, there must have been some confusion here about wikistats again. There are different project..." 
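For reference, the twemproxy (nutcracker) knobs discussed above — host ejection, retry timeout, and listening on a unix domain socket — all live in its per-pool configuration. A hypothetical nutcracker.yml sketch; the pool name, socket path, and values are illustrative, not the production config:

```yaml
mc-local:
  listen: /var/run/nutcracker/nutcracker.sock 0666  # unix socket instead of a TCP port
  hash: fnv1a_64
  distribution: ketama          # consistent hashing: keys of ejected hosts rehash to the rest
  auto_eject_hosts: true
  server_retry_timeout: 1000    # milliseconds before retrying an ejected server
  server_failure_limit: 3       # consecutive failures before ejection
  servers:
    - 10.0.0.1:11211:1
    - 10.0.0.2:11211:1
```

A unix-socket listener avoids a TCP round trip between PHP and the local proxy, which is why it comes up as a possible further improvement once per-request latency is already under a millisecond.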
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/60965 [00:49:48] New patchset: Dzahn; "Revert "wikistats packages needed for Jenkins environment"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [00:50:22] hey mutante, is there another way this can be resolved? [00:50:40] this will break unit-testing for wikistats [00:51:14] drdee: but it was never actually applied to any analytics machine that way [00:51:24] ok, mailed ops about how to revert, now taking off [00:51:25] drdee: all this did was apply it on an unrelated labs instance [00:51:50] mmmmmmmm, 1 sec [00:51:55] drdee: and..even there it cant find the packages [00:52:15] but.. i was about create one more patch set anyways, hold on [00:52:39] binasher: would be nice if graphite .count graphs hits/min or something [00:53:06] mutante: okay ignore me :) [00:59:21] New patchset: Dzahn; "Revert "wikistats packages needed for Jenkins environment"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [01:00:34] drdee: https://gerrit.wikimedia.org/r/#/c/68327/3 reverts it but also adds the same packages that were meant to be installed and puts it back in contint.pp just like before https://gerrit.wikimedia.org/r/#/c/68327/2/modules/contint/manifests/packages.pp [01:00:41] noone writes manpages anymore [01:00:57] just READMEs in markdown [01:01:48] thanks mutante! 
[01:01:52] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.001399517059 secs [01:03:02] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.006385445595 secs [01:03:06] New review: Dzahn; "confirmed all 3 packages were already installed on gallium, now puppet will also ensure this" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/68327 [01:03:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [01:07:20] drdee: np, all done and puppet is fine on both sides, my labs instance and gallium (jenkins) as well, it had the packages before, now puppet just ensures it [02:08:12] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 02:08:12 UTC 2013 [02:08:21] Logged the message, Master [02:10:32] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [02:14:58] !log LocalisationUpdate completed (1.22wmf5) at Thu Jun 13 02:14:58 UTC 2013 [02:15:10] Logged the message, Master [02:28:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 13 02:28:27 UTC 2013 [02:28:35] Logged the message, Master [04:53:36] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [04:53:36] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:12:08] PROBLEM - RAID on analytics1019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:18:01] New review: Mxn; "I haven?t gotten around to porting the Vietnamese IME script over to ULS yet, so there?s little effe..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/68146 [06:34:24] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:34:24] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:34:35] RECOVERY - Disk space on ms-be1 is OK: DISK OK [06:34:37] RECOVERY - swift-container-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:34:37] RECOVERY - DPKG on ms-be1 is OK: All packages OK [06:34:43] RECOVERY - swift-container-updater on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:34:44] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:34:53] RECOVERY - swift-object-auditor on ms-be1 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [06:34:53] RECOVERY - RAID on ms-be1 is OK: OK: State is Optimal, checked 1 logical device(s) [06:34:53] RECOVERY - swift-object-updater on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [06:35:03] RECOVERY - swift-account-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:35:03] RECOVERY - swift-object-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:35:03] RECOVERY - swift-object-server on ms-be1 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:37:10] apergos: that you? [06:37:17] yes [06:37:22] ah, cool [06:37:28] log? 
:) [06:37:34] RECOVERY - swift-account-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:37:42] RECOVERY - swift-account-reaper on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:37:48] nothing to log yet [06:37:55] what's not cool is ms-be9 [06:38:15] its disk usage is far below all the other hosts and I can't figure out why that is [06:38:42] when I check the obj replicators everywhere they all look fine [06:39:03] and yet its partitions are using 800gb or less while everywhere else its 1.1 or 1.2T, I forget [06:41:42] is it balanced? [06:41:49] the rings that is [06:42:29] they were the last time they went around. ms-be9 has a bunch of incoming network traffic (much more than the rest of the boxes) and I suppose that's why [06:42:36] but how it got that way, that's what's bugging me [06:42:40] ms-be9 is @ 66% [06:43:23] ah :-D I should have looked... but then why is it getting so much more incoming network traffic?
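A quick check of the figures in this exchange (ms-be9 at 66% of a 1.1–1.2T partition set, against the observed "800gb or less"):

```python
per_host_capacity_tb = 1.2   # "everywhere else its 1.1 or 1.2T"
used_fraction = 0.66         # "ms-be9 is @ 66%"
used_gb = per_host_capacity_tb * used_fraction * 1000
print(round(used_gb))        # 792 -- consistent with the observed "800gb or less"
```

So the usage number itself is in line with the reported percentage; the open question in the log is only where the extra incoming traffic comes from.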
that's actually what caught my eye [06:43:31] 66% of 1.1-1.2T is 800gb, so about right [06:43:46] it should be looking like all the rest, and while there is one disk that's a problem (and an open ticket) that can't account for it [06:43:58] looking [06:44:07] thanks, hopefully your eyes are shaarper [06:44:49] another set of eyes always helps [06:44:59] doesn't need to be sharper :) [06:45:12] also true :-) [06:45:32] (and they're not sharper, I pulled an allnighter again :) [06:45:40] woops [06:45:48] you gotta pull an all-dayer [06:45:52] heh [06:45:52] it's better for your sleep schedule [06:50:30] nothing that stands out [06:51:25] ok well I'm going to do what I was doing and ignore it for now but keep an eye out [06:51:27] thanks for looking [06:51:32] RECOVERY - NTP on ms-be1 is OK: NTP OK: Offset -0.0143879652 secs [06:52:00] back in a few (mus do dishes), hope you have gone sleep by then :-D [06:52:02] *must [06:52:06] *gone to [06:52:15] maybe I need sleep instead! [07:32:22] !log the last of the c2100s in the swift pool has been replaced (and new rings pushed), finally. now we "only" have the H310 controllers to replace... 
[07:32:31] Logged the message, Master [07:42:53] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:54] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:55] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours [07:42:55] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [07:42:56] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:56] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:57] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [07:45:27] reminder for me (to pass this on to cmjohnson later): https://wikitech.wikimedia.org/wiki/Swift/Deploy_Plan_-_R720xds_in_tampa [08:02:40] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.009769558907 secs [08:05:30] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [08:09:04] good morning [08:12:22] 'lo [08:13:44] morning [08:15:26] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [08:18:45] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server 
[08:31:45] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.01350402832 secs [08:32:46] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.01323068142 secs [08:34:27] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:55] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [08:42:57] hashar: any idea what is up with jenkins here? https://gerrit.wikimedia.org/r/#/c/68117/ [08:43:14] should VE simply increase its qunit timeout setting? [08:54:09] ori-l: possibly [08:54:23] ori-l: honestly I am not sure how qunit works :( [08:55:08] note that the job has always failed apparently. [08:55:35] I am not sure why it triggers two run as well [08:55:49] ori-l: are the qunit tests passing on your local install ? [08:56:55] hashar: i can't get them to run [09:01:34] New review: Akosiaris; "LGTM" [operations/debs/kafka] (debian) C: 2; - https://gerrit.wikimedia.org/r/68026 [09:06:43] New review: Nemo bis; "Thanks, this fixed LocalisationUpdate on Wikimedia projects for core. Extensions are still broken." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68110 [09:28:11] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [09:30:11] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [09:36:10] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [09:43:50] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [09:44:32] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [09:45:31] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [09:57:10] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [10:00:30] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [10:01:30] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [10:03:09] New patchset: Odder; "(bug 49358) Remove MoodBar from it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68352 [10:09:48] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [10:15:58] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [10:21:08] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [10:21:08] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [10:21:19] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [10:23:08] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [10:23:58] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [10:37:09] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [10:39:16] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.012 second response time [10:41:06] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 
OK - 1373 bytes in 0.005 second response time [10:44:23] New review: Mark Bergsma; "Comments inline." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [11:28:52] New review: Akosiaris; "Comments mostly inline. Biggest hurdle i see is the JDK7 requirement. " [operations/debs/buck] (master) C: -1; - https://gerrit.wikimedia.org/r/67999 [11:47:06] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [11:48:26] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [11:48:47] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [11:50:55] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [11:51:06] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [11:53:06] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [11:53:18] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [11:54:05] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [12:02:06] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [12:10:36] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [12:11:04] New patchset: Mark Bergsma; "Add FIXMEs to be dealt with when configuring the new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68368 [12:11:07] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [12:12:08] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [12:14:20] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [12:15:53] haha [12:15:59] due to a puppet bug, arrays are mutable [12:18:06] RECOVERY - Puppet freshness on ms-be1 is OK: puppet ran at Thu Jun 13 12:18:04 UTC 2013 [12:30:33] New patchset: Mark Bergsma; "Add warning note" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68373 [12:30:33] New patchset: Mark Bergsma; "Move (commented out) packages version class instance to ancestor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68374 [12:30:34] New patchset: Mark Bergsma; "Factor out Varnish logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68375 [12:31:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68368 [12:31:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68373 [12:32:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68374 [12:34:29] New patchset: Mark Bergsma; "Factor out Varnish logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68375 [12:36:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68375 [12:44:46] New patchset: Mark Bergsma; "Factor out addition of localhost IPs to ancestor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68379 [12:45:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68379 [12:57:54] New patchset: Mark Bergsma; "Factor out $varnish_directors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68382 [12:57:54] New patchset: Mark Bergsma; "Mummy says dashes in puppet names are bad!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68383 [12:58:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [13:04:18] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [13:04:35] New review: Peachey88; "> Mummy says dashes in puppet names are bad!" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68383 [13:04:38] New patchset: Mark Bergsma; "Add another level of class hierarchy, descend bits from a 1layer class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68384 [13:05:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68382 [13:06:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68383 [13:07:20] New patchset: Mark Bergsma; "Add another level of class hierarchy, descend bits from a 1layer class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68384 [13:08:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68384 [13:20:16] hey guys. sdb in neon is failing => sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed. mdadm has not yet detected it.... How do we handle this? ticket in rt ? [13:26:13] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [13:29:42] New patchset: Mark Bergsma; "Fix tier 2 bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68385 [13:29:52] akosiaris: yeah, a ticket in the RT queue for the respective datacenter [13:29:54] which is eqiad in this case [13:30:01] mark: ok thanx [13:31:12] New patchset: Mark Bergsma; "Fix tier 2 bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68385 [13:31:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68385 [13:32:42] New review: Diederik; "Ok" [operations/debs/kafka] (debian); V: 2 - https://gerrit.wikimedia.org/r/68026 [13:32:43] Change merged: Diederik; [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/68026 [13:33:37] New patchset: Mark Bergsma; "Typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68387 [13:34:13] Change merged: Mark 
Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68387 [13:40:23] New patchset: Mark Bergsma; "Fix tier 2 bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68388 [13:41:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68388 [13:46:28] New patchset: Mark Bergsma; "Provide a plain array of backend values" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68389 [13:47:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68389 [13:51:01] New patchset: Mark Bergsma; "Flatten harder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68390 [13:52:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68390 [13:56:08] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [13:57:19] New review: Andrew Bogott; "the latest output is here: https://dpaste.de/Do8x5/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [13:59:29] mark: ^ [13:59:54] ok [13:59:55] New patchset: Mark Bergsma; "Simply bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:00:23] doh I could have put the comments in the manifest instead hehe [14:00:43] *shrug* it was easy to follow as you did it [14:02:12] so mchenry needs to be added to relay_from_hosts [14:02:21] I see there's a hostlist for that but it's not included into relay_from_hosts [14:02:37] you can ditch it and include it directly instead if that's easier [14:02:53] hmm [14:03:03] perhaps it would be better to define a "mail relays" variable in some puppet manifest [14:03:05] and use that [14:03:12] instead of hardcoding mchenry [14:03:30] hmm wait [14:03:34] no this isn't needed [14:03:41] sorry [14:06:09] New review: Mark Bergsma; "(1 comment)" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:07:41] andrewbogott: I see in the original RT config that hostlist didn't really do anything [14:07:42] just ditch it [14:07:52] i'll run a diff now actually [14:08:58] ok, ditched [14:12:05] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:15:51] New review: Mark Bergsma; "Please change the acl_check_connect ACL from the original config to the following, in the template. ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:16:16] hehe [14:16:20] there's also no smarthost router yet [14:16:28] this template is clearly not very well tested yet ;) [14:18:47] New review: Mark Bergsma; "Add one last "router" which sends any remaining mail to the outbound mail relays (mchenry/sodium):" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:20:05] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68011 [14:20:20] andrewbogott: found a few more problems, added comments for them [14:20:24] but it should be nearly there [14:20:32] thanks [14:24:26] New patchset: Mark Bergsma; "Simply bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:24:58] the puppet parser is such a piece of shit [14:27:18] New patchset: Mark Bergsma; "Simplify bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:29:09] New patchset: Mark Bergsma; "Simplify bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:30:18] New patchset: Mark Bergsma; "Simplify bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:31:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:34:36] 
enable_external_mail implies "This server does not accept external mail"? [14:43:15] mark, do you see that the smart router block you suggested is already present in the imap_delivery section of the template? Should I just duplicate it, or does that need to be reconciled somehow? [14:43:42] reconcile it [14:43:52] get it out of imap_delivery, make it a separate smarthosts parameter [14:44:38] And… leave it to future imap definitions to fill the array? [14:44:48] I don't see any existing uses of imap delivery [14:53:22] New review: Nemo bis; "Seems good. I noticed their lack some days ago. :)" [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/67203 [14:56:31] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [14:57:21] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:57:30] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.010 second response time [14:57:57] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [14:58:09] andrewbogott: imap is now done by sanger, also not puppetized yet [15:00:39] mark: Yeah, the change clearly doesn't break any existing systems, but it does change the behavior of enable_imap_delivery for future users [15:00:58] that's fine [15:01:08] New review: Andrew Bogott; "the latest: https://dpaste.de/fnqcH/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [15:01:11] that template has only been put to one use before [15:01:16] which is the mailman server [15:01:27] but it's been compiled as an aggregate config of mchenry, sanger, and the previous lists server [15:01:31] because a lot of stuff is shared [15:01:38] so we'll figure out the bugs for the remaining uses as we go ;) [15:02:11] New patchset: Mark Bergsma; "Setup the new Parsoid 
caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:04:15] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:07:12] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:17:53] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:23:33] New review: Mark Bergsma; "Looks pretty good. Only a few minor comments left." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [15:27:52] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:27:55] i keep using dashes [15:38:14] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [15:40:11] mark, I have a question re 68404 [15:40:24] oh? [15:40:40] to me it looks like you remove the front/backend separation [15:40:46] that's correct [15:40:49] doesn't that also remove the cache splitting? [15:40:52] yes [15:40:54] which is the point [15:40:55] hmm [15:41:03] I think cache splitting is very bad if you have just two boxes :) [15:41:07] a single box dies and you lose half your cache [15:41:14] so if one of the caches goes down, 100% is gone rather than 50% [15:41:15] and other than that it's just needless complexity [15:41:25] now both caches should get the same content [15:41:40] that may not work the way you currently refresh stuff [15:41:46] thanks for all your help, mark [15:41:49] but that's why I'm changing it now ;) [15:41:58] mark: how will one box get the copy of the other's cache? 
[15:42:02] it won't [15:42:07] we don't want to render pages twice [15:42:21] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [15:42:51] so far I thought that the 50/50 split was a good compromise between resilience and performance [15:44:01] if you want hashing anyway, wouldn't that be better implemented inside mediawiki? [15:44:21] mediawiki is not involved in this really, except as a client [15:44:29] yes, as a client [15:44:42] we also perform requests to the caches from Parsoid [15:44:44] !log reedy synchronized php-1.22wmf7 'Initial file sync' [15:44:53] that's what I mean [15:44:54] Logged the message, Master [15:45:01] ok, also in parsoid then ;) [15:45:20] (to me, "mediawiki" is "all that wiki platform stuff I stay away from" ;) [15:45:30] then we'd have to do failover etc both in MW and Parsoid [15:45:53] when that is currently nicely abstracted [15:46:00] !log reedy synchronized docroot [15:46:10] Logged the message, Master [15:46:17] "nicely" [15:46:29] i think this whole setup is fugly tbh, as varnish is used for storage [15:46:34] instead of for caching [15:46:42] mark: is your goal to have 100% of the cache in case one of the servers goes down? 
[15:46:42] i know it's temporary [15:46:58] no, my goal is to not have double the amount of requests to parsoid or mediawiki [15:47:09] er [15:47:13] right now we only send one request [15:47:13] 50% of all requests [15:47:20] yeah I understand [15:47:23] that is a problem with a one layer setup [15:47:41] but the hashing isn't that reliable either [15:47:43] we used to send two and then relied on request coalescing on the single backend [15:47:51] but that does not work with forced missed [15:47:54] *misses [15:48:00] no [15:48:03] i had an alternative idea for that btw [15:48:07] but didn't want to put that in yet [15:48:17] (just purge on hit & restart the request) [15:48:25] the frontends don't need caching anyway [15:48:40] you're really only using them for hashing eh [15:48:44] yup [15:48:51] that was a bit hidden in the config [15:48:56] request rates average around 20/s [15:48:57] the proper way would be to return (pass) on everything [15:49:04] peaks maybe 100/s [15:49:15] so 0.5% cpu usage on one core at most ;) [15:49:57] we don't currently purge really [15:50:28] for edits, we request the new url (including the oldid) and a header pointing to the predecessor version [15:50:38] i know [15:50:40] the cached predecessor DOM is then used to speed up re-rendering [15:50:44] let me demonstrate it with code then ;) [15:51:30] a purge for the predecessor is sent out then, but currently does not reach the backends [15:51:58] mark: I'm not sure which problem you are trying to fix [15:52:23] the ugly dual layer setup [15:52:59] I agree that it is not super-pretty [15:53:57] but I don't know of a way to do a single-layer setup without duplicate backend requests [15:56:06] gerrit-wm: ping [15:56:13] mark: got to run to catch a train, should be back in ~10 minutes [15:56:26] here's an alternative patchset [15:56:26] New patchset: Mark Bergsma; "Alternative way to refresh content with purging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413
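Mark's parenthetical "(just purge on hit & restart the request)" compresses into a few lines of Varnish 3 VCL. This is a sketch of the idea as stated in the discussion, not the contents of change 68413:

```vcl
sub vcl_hit {
    # Forced refresh: the client asked for a re-render (no-cache),
    # so evict the stale object and restart the transaction, which
    # then misses and fetches a fresh copy from the backend.
    if (req.http.Cache-Control ~ "no-cache") {
        purge;
        return (restart);
    }
}
```

In Varnish 3, `purge;` inside `vcl_hit` invalidates the object that just matched, so the restarted request cannot hit the stale copy again.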
[15:56:30] with dual layer [15:56:33] but working purging [16:00:58] !log reedy synchronized wmf-config/InitialiseSettings.php [16:01:01] Logged the message, Master [16:01:08] New patchset: Mark Bergsma; "Alternative way to refresh content with purging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413 [16:01:20] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [16:02:22] New patchset: Reedy; "Set testwikidatawiki logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68414 [16:02:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68414 [16:03:09] New patchset: Nemo bis; "Update gitweb/gitblit RSS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68415 [16:03:40] !log reedy synchronized w [16:03:47] Logged the message, Master [16:04:10] New patchset: Reedy; "Add symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68416 [16:04:14] New patchset: Mark Bergsma; "Alternative way to refresh content with purging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413 [16:04:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68416 [16:10:21] New patchset: Nemo bis; "Whitelist WMF blog and GitBlit feeds again on MediaWiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68417 [16:10:21] no ops on Friday?! OH NOES! [16:11:24] !log reedy Started syncing Wikimedia installation... : testwiki to 1.22wmf7 and build l10n cache [16:11:32] Logged the message, Master [16:19:14] New review: GWicke; "I agree that the front/backend setup is relatively complex, but on the other hand that complexity is..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68404 [16:23:12] gwicke: also see https://gerrit.wikimedia.org/r/68413 [16:23:35] i'm still undecided on the dual layer setup [16:23:41] on one hand it's good to avoid dual backend requests [16:24:04] but on the other hand it's better to have double the requests consistently all the time, than to have a massive amount of backend requests when one cache goes down [16:24:17] mark: on template refresh we'd eventually like to reuse extensions and images [16:24:19] because the first is much easier to capacity plan for [16:24:25] purging will make that hard [16:25:03] are you talking about 68413 now? [16:25:08] yes [16:25:14] i don't know how that relates [16:25:24] this is just what you're currently doing, except it actually removes the old object from the cache too [16:25:43] the purge & restart logic means that we can't use the current version to speed up template refreshes [16:26:22] how would you use the current version? [16:27:50] while processing a no-cache request we can request the current version from the cache [16:28:07] as another request you mean? [16:28:09] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.22wmf7 and build l10n cache [16:28:16] and then return an updated version [16:28:22] Logged the message, Master [16:28:25] yes, with only-if-cached header set [16:28:39] so WHILE you're doing a refresh [16:28:43] yup [16:28:47] you're doing ANOTHER request with only-if-cached? 
[16:28:57] not right now, but that is the plan [16:29:01] ok, perhaps a few comments in the VCL to that effect would be appropriate [16:29:27] if there is some other way to do this then that would be great too [16:29:50] ori-l: hashar: "You're wrong" [16:30:03] maybe if you do some awful hack with setting a specific ttl [16:30:12] but yuck [16:30:26] ori-l: hashar: VE qunit doesn't time out, the http request served to phantomjs in there is broken [16:30:39] I don't like the fact that the no-cache refreshes are not cleared from storage [16:30:42] ori-l: With no qunit, jquery, mediawiki in the response. [16:30:50] that's what I was trying to fix for you [16:31:01] ori-l: So phantomjs (waiting for QUnit to appear and start dancing) "times out" [16:31:17] ori-l: Stop talking, start fixing. :-) [16:31:26] but I didn't get your extremely evil scheme to also request that object during that same request :P [16:31:36] (moving to #wikimedia-dev) [16:32:06] mark: parsoid can do another request to the cache, but it currently only does that for the predecessor version [16:32:35] for template / image updates, the idea is to request the current URL instead [16:32:47] i understand now [16:33:42] alright [16:34:12] it is a bit tricky, will try to amend http://www.mediawiki.org/wiki/User:GWicke/Minimal_performance_strategy_for_July_release a bit [16:36:07] New review: Mark Bergsma; "I agree that having double the amount of backend requests is not great. It might however still be be..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [16:36:37] New patchset: AzaToth; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [16:37:29] Change abandoned: Mark Bergsma; "Gabriel explained that this wouldn't work, because they want to be able to fetch the old object unde..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413 [16:38:24] New review: AzaToth; "The bundled jars are there bacuase they don't exists in debian/ubuntu, or is outdated or didn't work..." [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [16:41:29] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:44:06] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:44:12] New patchset: Jforrester; "Enable the EventLogging integration for VisualEditor on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68430 [16:45:00] New review: Jforrester; "Depends on https://gerrit.wikimedia.org/r/#/c/68117/ which is not yet merged." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/68430 [16:45:27] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:46:12] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:46:19] i need many patchsets today [16:47:17] aww [16:51:52] mark: will return(pass) preserve the Cache-Control header? [16:52:26] it was being eaten by the frontend caches before I transferred it to bereq explicitly [16:53:18] gwicke: really? [16:53:28] yes [16:53:31] pass should preserve it, but I didn't test [16:53:50] hmm [16:53:57] that's something we'll want to check for the text cluster as well [16:54:46] Varnish normally tries hard to ignore client-side cache refreshing.. 
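Since, as mark put it, the frontends are "really only using them for hashing", the whole frontend role reduces to a hash director plus an unconditional pass, which also forwards client request headers such as Cache-Control to the backend untouched. A hedged sketch in Varnish 3 syntax (backend names and addresses invented):

```vcl
backend be1 { .host = "10.0.0.1"; .port = "80"; }
backend be2 { .host = "10.0.0.2"; .port = "80"; }

# Hash on the URL so both frontends pick the same storage backend
# for any given page.
director parsoid_hash hash {
    { .backend = be1; .weight = 1; }
    { .backend = be2; .weight = 1; }
}

sub vcl_recv {
    set req.backend = parsoid_hash;
    return (pass);    # never cache in this tier
}
```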
[17:00:16] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [17:09:16] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.012 second response time [17:09:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:14:08] drdee: [17:14:08] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics102%5B12%5D.eqiad.wmnet&mreg[]=kafka_network_SocketServerStats.ProduceRequestsPerSecond&z=large>ype=stack&title=kafka_network_SocketServerStats.ProduceRequestsPerSecond&aggregate=1&r=hour [17:14:10] looks ok, no? [17:16:40] yah, weird [17:22:52] New patchset: Ottomata; "No need to have a special 'BROKER_JMX_PORT' variable if kafka.default is only read by kafka.init" [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/68443 [17:23:09] Change merged: Ottomata; [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/68443 [17:31:32] New patchset: Aaron Schulz; "Increased the Parsoid job pipeline (to account for non-template edits)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68446 [17:31:47] notpeter: ^ [17:34:55] heya paravoid, you around? [17:35:00] yes [17:35:08] i'm repuppetizing kafka for the 0.8 package, just got an annoying thing i'm not sure what's best to do with [17:35:38] https://gerrit.wikimedia.org/r/#/c/50385/6/manifests/server.pp [17:35:42] see line 42 [17:35:49] where i'm inferring the broker_id from numbers in the node's hostname? 
[17:35:54] i don't really want to do that anymore [17:36:10] aha [17:36:11] i thought it was kinda slick, but i'm leaning against it right now [17:36:17] AaronSchulz: cool, will merge [17:36:33] and thinking about making broker_id not ahve a default, so you always have to pass it [17:36:56] but that would mean that we'd either need parameterized role classes, or a role class for each kafka broker [17:37:28] actually, i guess I could do like we did for zookeeper [17:37:37] with the hash... [17:37:44] i *do* need a list of brokers in 0.8 [17:37:47] (I didn't in 0.7.2) [17:38:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68446 [17:38:20] right [17:38:24] I was about to suggest that [17:38:29] it sounded all very similar to zookeeper [17:38:41] AaronSchulz: live [17:38:52] class { 'kafka': [17:38:52] brokers => { 'analytics1021' => 1, 'analytics1022' => 2 … } [17:39:08] hmmm ok thanks for the brain bounce I will work with that :) [17:39:15] ori-l, do you know what's up with the 'kubo' instance in the editor-engagement project? Puppet seems very sad there [17:39:15] (hehe, sometimes you just need a mirror!) [17:39:22] yep [17:40:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [17:43:13] AaronSchulz: , notpeter: thanks! 
[17:43:40] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:41] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [17:43:41] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:42] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours [17:43:42] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:43] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:43] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours [17:53:13] mark: just tested, Cache-Control is preserved when using pass [17:57:52] gwicke: that's what i suspected yeah [18:06:16] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity and wikipedia to 1.22wmf6 [18:06:24] Logged the message, Master [18:06:54] New patchset: GWicke; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [18:07:05] mark: I tweaked your patch minimally [18:07:18] New patchset: Reedy; "testwiki to 1.22wmf7 ahead of later deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68459 [18:07:18] New patchset: Reedy; "Wikipedia 
and Wikiversity to 1.22wmf6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68460 [18:08:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68459 [18:08:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68460 [18:09:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:10:47] orenwolf: when do you head down to sf ? [18:12:23] New review: Mark Bergsma; "No, I want to keep the VCL hook functions themselves out of these "common" include files, because th..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68428 [18:12:51] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki, testwikidatawiki and mediawikiwiki to 1.22wmf7 [18:12:59] Logged the message, Master [18:13:38] gwicke: yeah, but I didn't like those vcl_recv and stuff put in the common files [18:13:49] New patchset: Reedy; "test2wiki, testwikidatawiki and mediawikiwiki to 1.22wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68466 [18:14:08] because they all get concatenated and with more include files it gets confusing what's run in which order [18:14:24] andrewbogott: dunno. i'll take a look. [18:14:44] andrewbogott: can you check out https://gerrit.wikimedia.org/r/#/c/68135/ ? [18:14:48] ok, I can move the host header clamping to the front/backend too [18:14:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68466 [18:15:02] why do you need it in the backend? [18:15:11] all queries arrive on the frontends first, no? 
[18:15:16] otherwise debugging and backend purging is harder than necessary [18:15:53] well you can put it in a vcl_recv_common function then [18:15:57] and call that from both [18:16:08] at least then the flow of the code is clear [18:16:44] ok [18:19:56] LeslieCarr: Monday! [18:20:03] I'll be in all week [18:21:36] try the veal ? [18:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:24:51] New patchset: Pyoungmeister; "moving en, fr, de, and ja search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68470 [18:27:09] New patchset: GWicke; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [18:29:21] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68470 [18:31:24] Ryan_Lane: is it ok I fold operations/debs/gerrit into operations/gerrit? [18:32:12] I think our normal naming scheme for debs is to have them in operations/debs/ [18:32:16] so maybe the opposite? [18:32:46] !log py synchronized wmf-config/lucene-production.php 'moving search traffic to pmtpa for en, de, fr, and ja' [18:32:46] Ryan_Lane: I meant more to only have one repo [18:32:52] yep. that works [18:32:55] Logged the message, Master [18:32:55] the one I made isn't that good [18:36:36] Change abandoned: AzaToth; "moving it all to the other gerrit repo" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/68281 [18:40:02] New review: GWicke; "I have renamed the shared vcl_recv to vcl_recv_common and am now explicitly calling that from front/..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68428 [18:44:53] New patchset: Pyoungmeister; "moving more search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68480 [18:46:56] ori-l: That morebots change looks good… it won't go live until I repackage the bot though [18:47:10] (you tested it, I trust?) [18:48:31] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68480 [18:49:39] !log py synchronized wmf-config/lucene-production.php 'moving more search traffic to pmtpa' [18:49:47] Logged the message, Master [18:49:55] New review: Andrew Bogott; "This looks fine -- it'll take a while to trickle into production since I'll need to repackage the bot." [operations/debs/adminbot] (master) C: 2; - https://gerrit.wikimedia.org/r/68135 [18:50:01] Change merged: jenkins-bot; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [18:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:54:16] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:06] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:55:18] New patchset: Pyoungmeister; "moving search pools 4 and 5 to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68482 [18:57:45] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68482 [18:58:24] !log py synchronized wmf-config/lucene-production.php 'moving last of search traffic to pmtpa' [18:58:35] Logged the message, Master [19:05:40] !log kaldari synchronized php-1.22wmf7/extensions/Thanks/modules/ext.thanks.thank.js 'fix IE Thanks bug' [19:05:51] Logged the message, Master [19:05:59] ^demon: 
can you create branch wmf-debian onto gerrit? [19:06:46] or anyone [19:07:02] * AzaToth can never remember who can do what [19:08:02] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [19:08:22] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [19:08:26] why does parsoid go away on these boxes? [19:08:45] * apergos looks around for RoanKattouw_away boo, away [19:08:52] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [19:08:59] Ryan_Lane: poke [19:09:04] gwicke: ^^ [19:09:22] paravoid: whom can add a branch? [19:09:22] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [19:09:37] create a branch where? [19:09:39] I checked one of them earlier, nothing of use in the log as far as I could gell, nor atop.. just no parsoid. [19:09:43] Ryan_Lane: gerrit [19:09:44] oh [19:09:45] *tell [19:09:45] in gerrit [19:09:49] wmf-debian [19:09:50] I can. one sec [19:09:55] I think you should be able to as well [19:09:59] hmm [19:10:08] apergos: I don't have the rights to restart Parsoid directly [19:10:18] but can restart them by pushing out new code [19:10:22] well I think puppet must do so (something does) [19:10:27] I just wonder why they die [19:10:32] AzaToth: just gerrit? [19:10:47] ooohhhh [19:10:47] I can go restart them sooner but it would be nice to know the cause [19:10:49] Ryan_Lane: ! 
[remote rejected] wmf -> wmf-debian (prohibited by Gerrit) [19:11:00] now I see why you wanted to merge gerrit and operations/debs/gerrit [19:11:01] apergos: one possibility is that the thermal issues in dmesg are actually relevant [19:11:06] that makes a lot more sense :D [19:11:07] Ryan_Lane: yup ツ [19:11:23] well, shit [19:11:23] apergos: I personally doubt it, but am not sure [19:11:25] I thought I could [19:11:26] I can't [19:11:28] doubtful, there owuld be something else other than parsoid disappearing and everything else running fine [19:11:36] *nod* [19:11:40] seems I don't have permissions in it [19:11:43] daym [19:11:45] I think you'll need ^demon [19:11:52] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:53] oh [19:11:54] wait [19:11:55] nevermind [19:11:59] interface has changed [19:12:12] done [19:12:22] https://gerrit.wikimedia.org/r/#/admin/projects/gerrit,access [19:12:33] ensure => running, [19:12:47] so puppet will restart if it's gone (might take up to 1/2 hour or so) [19:13:30] AzaToth: added. let me know if you have issues [19:13:34] Ryan_Lane: which did you base wmf-debian on? [19:13:39] HEAD [19:13:41] oh [19:13:45] bas it on wmf [19:13:47] base [19:13:52] ah [19:13:53] sorry [19:13:55] np [19:14:04] deleting [19:14:27] done [19:16:13] hmm [19:16:28] back in a bit. lunch [19:16:31] gerrit changes ends up in #wikimedia-dev instead of here [19:16:40] New patchset: AzaToth; "adding gitreview file" [gerrit] (wmf-debian) - https://gerrit.wikimedia.org/r/68484 [19:16:41] New patchset: AzaToth; "Copying and modifying operations/debs/gerrit into gerrit" [gerrit] (wmf-debian) - https://gerrit.wikimedia.org/r/68485 [19:18:19] ^demon: ↑ [19:19:36] <^demon> AzaToth: Do you have maybe a readme or something I can follow? I'm an idiot when it comes to doing debian packaging. [19:19:45] <^demon> I'm totally wanting to try out your work though. [19:19:50] ^demon: git buildpackage [19:19:56] <^demon> Ha, ok. 
[19:20:21] though changelog needs to be changes, or use flags "-uc -us" to not sign it [19:20:43] or flag "-k gnupg_id" [19:20:44] ^demon, this was really helpful for me [19:20:53] http://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.html [19:21:32] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [19:21:49] ^demon: so "default/debian" branch is "wmf-debian" and "upstream" is "wmf" [19:22:19] ^demon: you need buck installed first though [19:22:32] which is still in review :-P [19:22:52] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [19:23:04] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [19:23:38] ^demon: https://gerrit.wikimedia.org/r/#/c/67999/ [19:24:02] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [19:24:10] ^demon: every time I work with java, I somehow hates java more and more, dunno why [19:26:32] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:27:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [19:28:22] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:28:24] <^demon> AzaToth: Is buck or git-buildpackage missing some dependencies? I had to install debhelper, javahelper and ant by hand. [19:30:20] <^demon> Ok, well got further. 
reports as BUILD SUCCESSFUL but I get the following: [19:30:33] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [19:31:04] <^demon> gbp:info: buck_0.0+g410fc434.orig.tar.gz does not exist, creating from 'upstream/0.0+g410fc434' [19:31:17] <^demon> fatal: Not a valid object name upstream/0.0+g410fc434 [19:32:52] ^demon: they are build dependices [19:33:12] hmm [19:34:16] ^demon: git checkout upstream [19:34:38] you need to have the upstream branch setup [19:35:28] !log dns update [19:35:37] Logged the message, Master [19:35:45] <^demon> AzaToth: I've got upstream listed under git branch. [19:35:52] <^demon> And I'm on master, with your change cherry picked on top [19:36:22] in buck right? [19:36:49] <^demon> Yep [19:37:39] ^demon: gbp:info: buck_0.0+g410fcf34.orig.tar.gz does not exist, creating from 'upstream/0.0+g410fcf34' [19:37:39] dpkg-buildpackage -rfakeroot -D -us -uc -i -I [19:37:43] wfm [19:38:06] ^demon: http://paste.debian.net/10217/ [19:39:11] <^demon> I cloned from wmf repo on this box, not github. [19:39:18] <^demon> So I've only got one remote. [19:40:17] $ git describe --tags upstream [19:40:17] 0.0+g410fcf34 [19:40:56] <^demon> fatal: no names found, cannot describe anything [19:41:04] oh [19:41:06] offcourse [19:41:25] I can't make a patchset for a tag [19:42:18] ^demon: just make a tag 0.0+g410fcf34 pointing to upstream (410fcf34) [19:43:04] PROBLEM - Host mc1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:00] !log rebooted mc1009 as a firedrill, chaos monkey ahoy! [19:44:08] Logged the message, Master [19:45:03] <^demon> AzaToth: Bah, so I've got the tag, git describe looks sane, but still same error about upstream/... [19:46:14] RECOVERY - Host mc1009 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [19:46:14] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:05] ^demon: can you paste the whole log? 
[19:47:14] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:17] <^demon> Not easily, I'm in a VM [19:47:23] <^demon> I'll screenshot [19:47:34] New review: Ori.livneh; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [19:47:35] I can copy paste in VM:s [19:48:04] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [19:49:55] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:04] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [19:53:04] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [19:56:44] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.027 second response time [19:57:27] are these parsoid nodes recovering on their own, or is somebody restarting parsoid? [19:58:43] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [19:59:09] ottomata: ! [19:59:14] that was fast :) [19:59:20] :) [19:59:23] heya hashar, you around? [19:59:27] ahh nope [19:59:29] so [19:59:30] this [19:59:31] 19:58:45 err: Could not parse for environment production: Syntax error at 'producer_properties_template'; expected ')' at /srv/ssd/jenkins/workspace/operations-puppet-kafka-pplint-HEAD/manifests/init.pp:45 [19:59:31] 19:58:45 err: Try 'puppet help parser validate' for usage [19:59:32] oh btw, did I tell you I have packaging for librdkafka pretty much ready? [19:59:37] oh awesome! [19:59:40] no cooOOoooOoL! [19:59:45] also ITPed in Debian? [19:59:51] ITPed? [19:59:54] intent to package [19:59:59] Oooo super awesome [20:00:00] = I'll upload it to Debian [20:00:22] I need to commit it somewhere I gues... 
[20:00:24] PROBLEM - NTP on mc1009 is CRITICAL: NTP CRITICAL: Offset unknown [20:00:27] yeah so um, jenkins parser thinks that commas at the end of arg lists in puppet parameters are invalid [20:00:29] not so! [20:01:07] <^demon> AzaToth: I'm spinning up a new VM in labs to do this on instead. Will be way easier for me. [20:01:07] hmmm [20:01:08] i mean [20:01:16] it works anyway…. puppet validate says no [20:01:24] and I like ending commas! [20:01:25] oh well [20:01:48] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:02:22] OOps i was missing a comma there [20:03:03] but ja paravoid, this was mostly already done for 0.7.2, i just had to adapt to changes for 0.8 [20:03:14] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:03:18] when is that going to be released? [20:03:22] whenver [20:03:23] haha [20:03:27] dunno! [20:03:28] soon? [20:03:29] never? [20:03:34] we're not going to use it until it is though [20:04:02] 0.8 is a huge step forward in functionality from 0.7.x. There are still some rough edges and likely some bugs, but we are feature complete and are beginning the process of rolling it out to various test clusters to see what happens. We wanted to make an early pre-release version available for the brave to try out even before all the documentation is up to date and production hardening is complete. Please let us know any problems you find. 
[20:04:09] is what https://cwiki.apache.org/KAFKA/kafka-08-quick-start.html says [20:04:12] not very encouraging [20:04:14] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:35] grrr: root 21153 0.0 0.1 54348 5276 pts/1 T Jun11 0:00 vim wmnet [20:05:02] this makes me want to track that account back to a bastion and kill all their sessions :-P [20:05:24] RECOVERY - NTP on mc1009 is OK: NTP OK: Offset 0.000244140625 secs [20:05:50] I'm updating parsoid [20:06:04] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.014 second response time [20:06:34] !log updated Parsoid to a33980cf [20:06:42] Logged the message, Master [20:06:44] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [20:08:02] New patchset: AzaToth; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [20:08:18] ^demon: there [20:08:43] I realize as long there's no version to speak about, we need to only look on the branches [20:10:10] New review: Ottomata; "My biggest waffle is weather to go with "log_dir" or "data_dir" for server.properties "log.dir". " [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:10:35] * drdee mmmmmm waffles, particularly stroop waffles [20:11:09] oh man I ate so many of those [20:11:31] drdee, is it true that the dutch like to put stroop waffles on top of their hot coffee or tea mug to warm up the gooey insides? [20:12:06] drdee: http://www.flickr.com/photos/72939801@N00/461001594/ [20:12:19] that aint no stroopwaffle! [20:12:20] ^demon: it works now? [20:12:42] nope it aint :) [20:12:51] drdee, is it true that it is a dutch tradition to use the first stroop waffle of every batch as a frisbee? [20:12:58] ottomata: yes we like that just as we like putting eggs in our beer [20:12:59] thought so. [20:13:58] is it true Dutch hasa sprinkles on their sandwiches? 
[20:14:03] New review: Faidon; "A few minor comments." [operations/puppet/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/50385 [20:14:34] AzaToth: yes that's true as well [20:14:47] * AzaToth pukes [20:15:03] yeah i know man [20:15:06] that is weiirrrd [20:15:07] !log reedy synchronized php-1.22wmf5/cache/ [20:15:08] !log kaldari synchronized php-1.22wmf6/extensions/Thanks/modules/ext.thanks.thank.js 'fix IE Thanks bug' [20:15:14] Logged the message, Master [20:15:22] Logged the message, Master [20:15:43] ^demon: ? [20:15:52] <^demon> AzaToth: Waiting for VM to finish building. [20:15:54] <^demon> Silly puppet [20:15:56] ok [20:16:10] I wanted to know if the new patch worked [20:16:25] i.e. did it not throw the fatal error? [20:17:21] !log olivneh synchronized php-1.22wmf7/extensions/UploadWizard/includes/specials/SpecialUploadWizard.php 'Ia09ace5dc: Fix UploadWizard's config variables' [20:17:29] Logged the message, Master [20:18:01] can someone run this on tin: ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R mw1020 [20:18:02] !log olivneh synchronized php-1.22wmf7/extensions/UploadWizard/resources/mw.LanguageUpWiz.js 'Ia09ace5dc: Fix UploadWizard's config variables (2/2)' [20:18:10] Logged the message, Master [20:18:18] i presume the host identification is not a security issue, since it's been logged here multiple times and no one seemed alarmed [20:18:29] * marktraceur looks at commons [20:19:09] Lol, they hacked something in. [20:27:36] marktraceur: doesn't Commons ever not do so? [20:28:03] Nemo_bis: There was a notification on my login page that said "UploadWizard is broken in Opera" [20:28:08] This made me sad [20:36:41] <^demon> AzaToth: Better. http://p.defau.lt/?lRD5QPjeFwjITMFTTulBZA [20:39:45] deploying EventLogging to wmf7 [20:40:12] !log spage synchronized php-1.22wmf7/extensions/EventLogging 'latest EventLogging' [20:40:19] Logged the message, Master [20:40:33] mw1020: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! [20:40:39] !log authdns update [20:40:47] Logged the message, Master [20:41:08] should I worry? [20:42:31] I hope not. [20:42:41] LeslieCarr: ^ (just because you're on RT) [20:42:53] hrm, lemme check mw1020 [20:43:09] spagewmf: is that via bast1001, fenari, etc ? [20:43:30] ssh -A tin, then I ran sync-dir [20:43:36] <^demon> AzaToth: Take that back, works fine when I got off labs /home and onto /. [20:44:11] thanks [20:46:33] ah [20:46:43] it was reinstalled but never repuppetized [20:46:45] that is why [20:46:58] thanks for pointing it out greg-g , i'll fix this now [20:47:03] !log repuppetizing mw1020 after a reinstall [20:47:10] Logged the message, Mistress of the network gear. [20:47:13] sweet, good deal. [20:47:15] <^demon> AzaToth: I think libargs4j-java still needs to be a dependency. [20:47:23] <^demon> `buck --help` complained about missing class. [20:47:58] !log spage synchronized php-1.22wmf7/extensions/GuidedTour 'latest GuidedTour' [20:48:05] Logged the message, Master [20:48:54] That sync-dir as well as mw1020 ssh fail, I had two [20:48:55] mw1171: ssh: connect to host mw1171 port 22: Connection timed out [20:48:55] mw1173: ssh: connect to host mw1173 port 22: Connection timed out [20:49:25] well, why'd you kill those? :p [20:49:29] lemme check those hosts [20:49:46] 1171 has broken hardware, known [20:50:17] thanks LeslieCarr. FWIW my Bible https://wikitech.wikimedia.org/wiki/How_to_deploy_code says "you can consider this completely normal" [20:50:59] !log mw1173 has bad memory, rt 5294 [20:51:06] Logged the message, Mistress of the network gear. [20:51:14] spagewmf: it is, though it never hurts to double check/open tickets with the bad machines [20:51:20] well i guess it can hurt, but within reason ;) [20:56:02] <^demon> AzaToth: Also, I wonder if upstream would take some of the debian/* stuff. 
Less for us to maintain :) [20:59:19] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.002 second response time [21:01:18] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 3275 bytes in 0.011 second response time [21:02:18] RECOVERY - Frontend Squid HTTP on cp1004 is OK: HTTP OK: HTTP/1.0 200 OK - 1293 bytes in 0.005 second response time [21:06:19] RECOVERY - RAID on mw1020 is OK: OK: no RAID installed [21:06:20] RECOVERY - Disk space on mw1020 is OK: DISK OK [21:06:31] RECOVERY - DPKG on mw1020 is OK: All packages OK [21:08:19] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection refused [21:12:58] !log spage synchronized php-1.22wmf6/extensions/EventLogging 'latest EventLogging' [21:13:09] Logged the message, Master [21:14:21] !log spage synchronized php-1.22wmf6/extensions/GuidedTour 'latest GuidedTour' [21:14:29] New review: Demon; "Seems to be mostly good. Two things:" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [21:14:29] Logged the message, Master [21:15:20] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [21:15:32] cmjohnson1: i've opened a few hardware tickets for eqiad the last few days [21:15:33] sorry [21:15:38] i keep breaking the servers [21:16:20] yeah i see that...i am on most of them...I already knew about mw1173...haven't created a ticket y [21:16:23] yet [21:16:43] but thx for keeping me busy ;-] [21:16:43] hehe [21:16:49] mwhahaha [21:16:49] RECOVERY - twemproxy process on mw1020 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:17:12] i'd say it's all part of my secret plan… but i can't think of how keeping you busy fits in [21:17:49] yeah..doesn't help you take over the world or anything [21:18:47] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/software] (master) -
https://gerrit.wikimedia.org/r/68561 [21:19:48] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/software] (master) - https://gerrit.wikimedia.org/r/68561 [21:20:14] Change abandoned: Hashar; "(no reason)" [operations/software] (master) - https://gerrit.wikimedia.org/r/68561 [21:21:49] there is a problem with jenkins [21:21:55] it tests things too fast [21:23:01] RECOVERY - NTP on mw1020 is OK: NTP OK: Offset -0.006588339806 secs [21:26:14] lesliecarr: have time to get the new 4500 (asw-c8-eqiad) updated and part of network fabric [21:26:33] sounds like an excellent idea for this afternoon :) [21:26:49] are you ready ? [21:26:51] yes [21:27:00] to see some fun, you can login to asw-c-eqiad.mgmt and do "monitor start messages" [21:27:13] and lots of show virtual-chassis status [21:27:21] hashar: how so tests things too fast? [21:27:32] let's attach one vcp cable and then plug in asw-c8 [21:27:50] already done [21:27:55] we talked about that the other day [21:28:01] ah yes [21:28:07] is it powered on ? [21:28:09] yes [21:28:11] oh it is :) [21:28:31] chrismcmahon: yeah we have some linters that are less than a second to run :) [21:32:21] oh shit :-/ [21:32:24] the 4550's require 12.x [21:33:22] it's a 4500 not 4550 [21:33:55] oh [21:33:56] whew [21:34:08] well at least 12.2 will support nonstop upgrades [21:35:38] cmjohnson1: doh so i have to upgrade the firmware …. 
can you attach a cable between its management ethernet port and a management switch [21:35:47] i'm going to give it a temporary ip address so i can upload the software [21:35:55] already there [21:36:10] and it is connected to console [21:36:26] oh cool [21:36:27] :) [21:36:36] i'm on its console, didn't realize you had the mgmt ethernet connection [21:36:40] yay you're prepared :) [21:37:28] it seems that Parsoid is getting errors from the API cluster [21:37:32] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.124 via cp1004.eqiad.wmnet (squid/2 [21:37:33] .7.STABLE9) to ()
[21:37:35] Error: ERR_SOCKET_FAILURE, errno (98) Address already in use at Thu, 13 Jun 2013 21:35:34 GMT [21:39:45] New patchset: Spage; "Add Campaigns extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [21:40:24] there are quite a few of those errors in the log [21:41:15] New review: Spage; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [21:41:44] most requests seem to be fine, could be a single broken backend [21:42:40] cmjohnson1: sigh, office wifi is slow, it's taking a while to upload ... [21:42:41] :( [21:42:58] boo [21:43:16] could somebody check the API cluster? [21:43:20] next time I can do it here if you want [21:43:45] or you need to be root? so maybe not [21:44:11] oh well you can upload it into /var/tmp [21:44:15] that is an excellent idea :) [21:46:23] who would be the right person to ping re API cluster issues? [21:48:04] all errors seem to be through cp1004.eqiad.wmnet [21:50:02] gwicke, best send a quick note to ops@ as well [21:50:16] notpeter, can you help? [21:50:30] yeah, will do [21:53:14] gwicke / Eloquence - the correct thing to do is first open a ticket with all relevant information. 
Then since it is urgent, ping either the person on RT duty in irc (and make sure they ack) and/or email the ops list referring to your ticket [21:54:41] !log LocalisationUpdate failed: git pull of extensions failed [21:54:50] Logged the message, Master [21:57:13] LeslieCarr: https://rt.wikimedia.org/Ticket/Display.html?id=5295&results=4d20b1233431a0dcf559b9c3eaa901db [21:57:22] thank you gwicke for the ticket [21:57:31] hmm, that link does not seem to work [21:57:41] https://rt.wikimedia.org/Ticket/Display.html?id=5295 [21:59:03] LeslieCarr: subbu just mentioned that he got the same error when testing with Parsoid locally [22:00:51] i'm restarting squid on that machine as it seems to have a lot of "Jun 13 21:40:34 cp1004 squid[1961]: commBind: Cannot bind socket FD 23 to *:0: (98) Address already in use" [22:01:51] gwicke: how does it look now ? [22:01:53] ottomata, paravoid: kafka 0.8 beta has been released http://people.apache.org/~joestein/kafka-0.8.0-beta1-candidate1/RELEASE_NOTES.html (via Snaps) [22:01:57] also, i can't help the local thing [22:02:46] LeslieCarr: that was just as an indicator that it is probably not an IP block of some production IP [22:03:01] LeslieCarr: so far it looks good, no new errors [22:03:03] !log LocalisationUpdate failed: git pull of extensions failed [22:03:11] Logged the message, Master [22:04:26] LeslieCarr: I think that fixed it, thanks! [22:05:12] :> [22:05:52] LeslieCarr: you might be able to answer a wee question I have :) [22:07:35] Are these squids (https://noc.wikimedia.org/conf/highlight.php?file=squid-labs.php), dedicated for labs traffic? :) [22:08:00] !log restarted squid on cp1004 due to "Jun 13 21:40:34 cp1004 squid[1961]: commBind: Cannot bind socket FD 23 to *:0: (98) Address already in use" [22:08:11] Logged the message, Mistress of the network gear. 
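The squid restart above cleared repeated `commBind: Cannot bind socket FD 23 to *:0: (98) Address already in use` errors. Errno 98 is `EADDRINUSE`, and when it shows up on *outbound* connections (binding to `*:0`, i.e. "any ephemeral port"), a common cause is ephemeral source-port exhaustion. A hedged, read-only sketch of two Linux-side checks (not taken from the log; just one plausible way to investigate):

```shell
# Hedged sketch (Linux-only): errno 98 is EADDRINUSE. When a proxy logs
# "commBind: Cannot bind socket ... *:0: (98) Address already in use" on
# outbound requests, a common suspect is ephemeral source-port exhaustion.

# How many ephemeral ports the kernel will hand out:
read lo hi < /proc/sys/net/ipv4/ip_local_port_range
echo "ephemeral port range: $lo-$hi"

# Sockets lingering in TIME_WAIT hold ports from that range:
if command -v ss >/dev/null 2>&1; then
    ss -tan state time-wait | tail -n +2 | wc -l
else
    echo "ss not available"
fi
```

A restart helps because it drops the leaked/lingering sockets, but if the root cause is port exhaustion the errors tend to come back under load.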
[22:08:51] New review: Ori.livneh; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [22:09:00] addshore: those are labs machine ip's [22:09:10] addshore: so, not sure what someone is doing with those labs machines [22:09:18] !log LocalisationUpdate failed: git pull of extensions failed [22:09:27] Logged the message, Master [22:10:31] back. is there still an issue? [22:11:07] notpeter: https://rt.wikimedia.org/Ticket/Display.html?id=5295 [22:11:09] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [22:11:16] I believe it is fixed [22:11:38] ok, cool [22:14:23] New review: Ori.livneh; "> Have you tested it already by hacking the local python file someplace?" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [22:14:42] ParsoidCacheUpdateJob: 2 queued; 207 claimed (20 active, 187 abandoned) [22:14:50] https://graphite.wikimedia.org/render/?width=1486&height=641&_salt=1371161654.988&from=-24hours&target=stats.job-insert-ParsoidCacheUpdateJob.count&target=stats.job-pop-ParsoidCacheUpdateJob.count [22:14:52] gwicke: heh [22:15:08] AaronSchulz: that was a nice sprint ;) [22:16:04] New review: Spage; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [22:16:16] New patchset: Spage; "Add Campaigns extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [22:17:03] !log Fixing perms on php-1.22wmf7, not group-writable [22:17:05] Reedy: ---^^ [22:17:10] Logged the message, Mr. Obvious [22:18:58] RoanKattouw: Add a chmod command to multiversion/checkoutMediaWiki after it's cloned mediawiki [22:18:58] New review: Jforrester; "Now mergeable; we should switch this on to get some test data ahead of real data read next week." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/68430 [22:19:06] Reedy: Will do [22:20:01] To which extent... 
https://gerrit.wikimedia.org/r/#/c/67274/ needs merging so ExtensionMessages are group writable by default [22:28:15] New patchset: Reedy; "Make mediawiki checkout folder 775" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68579 [22:28:21] RoanKattouw: ^ [22:28:58] Nice [22:29:08] I was about to submit something myself [22:29:14] Looks good but has trailing tabs [22:29:21] hah [22:30:09] New patchset: Reedy; "Make mediawiki checkout folder 775" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68579 [22:30:26] !log catrope Started syncing Wikimedia installation... : VisualEditor update to master, and adding Campaigns in wmf7 [22:30:35] Logged the message, Master [22:31:09] PROBLEM - Puppet freshness on srv293 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:10] New patchset: Reedy; "Don't create lib directory, we don't use it" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68580 [22:31:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68580 [22:31:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68579 [22:32:09] PROBLEM - Puppet freshness on srv294 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:09] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 22:33:09 UTC 2013 [22:33:17] Logged the message, Master [22:33:52] That's a lie [22:35:09] PROBLEM - Puppet freshness on db1048 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:06] Reedy: no it isn't [22:36:09] PROBLEM - Puppet freshness on aluminium is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amssq37 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amssq31 is CRITICAL: No 
successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amssq40 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq32 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq33 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq36 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq41 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:14] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 22:38:14 UTC 2013 [22:38:18] PROBLEM - Puppet freshness on amssq34 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq39 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq42 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq54 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:22] Logged the message, Master [22:39:25] !log catrope Finished syncing Wikimedia installation... : VisualEditor update to master, and adding Campaigns in wmf7 [22:39:33] Logged the message, Master [22:39:58] ^demon: buck should depend on ini4j afaik [22:40:19] <^demon> Maybe I missed a step? 
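The permissions fix merged above ("Make mediawiki checkout folder 775") boils down to a single `chmod` after the clone: without group write, other deployers hit `Permission denied` when syncing. A minimal sketch against a throwaway directory (the directory name mirrors the log but is purely illustrative here):

```shell
# Sketch of the fix: make a fresh checkout directory group-writable (775)
# so that any member of the deployment group can update or sync it.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/php-1.22wmf7"
chmod 775 "$tmpdir/php-1.22wmf7"      # rwxrwxr-x
stat -c '%a' "$tmpdir/php-1.22wmf7"   # prints 775
```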
[22:40:23] unless some build fnucked up [22:41:09] http://paste.debian.net/10255/ [22:41:26] <^demon> Hmm [22:41:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68430 [22:42:42] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 13 22:42:42 UTC 2013 [22:42:52] Logged the message, Master [22:44:03] ^demon: here is my build log: http://paste.debian.net/10256/ [22:46:00] ^demon: what version of libini4j-java do you have? [22:46:23] as I'm using debian here, there could be mismatches [22:47:32] !log catrope synchronized wmf-config/InitialiseSettings.php 'Enable EventLogging for VE on enwiki' [22:47:41] Logged the message, Master [22:47:46] <^demon> AzaToth: 0.5.2-SNAPSHOT, how useful. [22:47:48] mw1020: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ [22:47:59] same as I have [22:48:59] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 13 22:49:02 UTC 2013 [22:49:10] Logged the message, Master [22:49:28] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:51:05] RoanKattouw: I filed an RT ticket about mw1020 [22:51:22] !log catrope synchronized wmf-config/CommonSettings.php 'Enable EventLogging for VE on enwiki' [22:51:25] ori-l: Thanks, I was about to do that [22:51:30] Logged the message, Master [22:51:37] ori-l: BTW ---^^ should start receiving EE events for VE now [22:51:53] RoanKattouw: cool, i'll look [22:52:04] ori-l: Let me trigger one for you :) [22:53:34] ori-l: Opened [[North America]] in VE [22:54:38] RoanKattouw: I got a few events already, but not from [[North America]] [22:54:47] Hmm [22:54:52] What do I have to do to trigger events?
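The recurring `REMOTE HOST IDENTIFICATION HAS CHANGED` warning for mw1020 is what the earlier request (`ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R mw1020`) addresses: after a reinstall the host presents a new key, and the stale entry in the shared known_hosts file has to be purged. A self-contained sketch against a throwaway known_hosts file (hostnames and paths are illustrative, not the production ones):

```shell
# Reproduce the stale-host-key cleanup with a throwaway known_hosts file.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$tmpdir/hostkey"
printf 'mw1020 %s\n' "$(cut -d' ' -f1-2 "$tmpdir/hostkey.pub")" >  "$tmpdir/known_hosts"
printf 'mw1021 %s\n' "$(cut -d' ' -f1-2 "$tmpdir/hostkey.pub")" >> "$tmpdir/known_hosts"
# Drop every key recorded for mw1020; entries for other hosts survive
# (a .old backup of the file is written alongside it):
ssh-keygen -f "$tmpdir/known_hosts" -R mw1020
grep '^mw1021 ' "$tmpdir/known_hosts"
```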
[22:55:14] Oh, hah, I see [22:55:20] ResourceLoaderConfigVars [22:55:23] * RoanKattouw goes and touches startup.js [22:56:17] !log catrope synchronized php-1.22wmf6/resources/startup.js 'touch' [22:56:25] Logged the message, Master [22:57:33] anomie: What are you doing in your LD? Will you be running scap? [22:57:45] I ask because we just made a last-minute i18n change and need to scap again [22:58:18] RoanKattouw: I'm going to be updating LocalisationUpdate in wmf7 and then running l10nupdate against wmf7 to make sure it's fixed. [22:58:29] OK [22:58:38] (the bug was introduced in wmf7, so wmf6 is ok) [22:58:40] That sounds like it might do what I need [22:58:43] Oh, wait, wmf7 [22:58:50] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [22:58:54] I need an i18n change in .... well both really but primarily wmf6 [22:58:55] I can run l10nupdate against the whole thing [22:58:59] OK [22:59:06] I'll just put my change in and let you take it from there then [22:59:08] New review: Andrew Bogott; "Work in progress, do not merge!" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/68584 [23:02:25] anomie: OK I'm all done, changes have been pulled down on tin. Take it away [23:04:41] New review: Asher; "This change is missing an include of mediawiki::updatequerypages on a server in tampa. " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713 [23:05:04] cmjohnson1: success! [23:05:11] link up the second cable ? 
[23:05:13] woohoo [23:05:14] k [23:05:27] RoanKattouw_away: http://imgur.com/ESewgIQ [23:06:29] lesliecarr done [23:06:56] !log anomie synchronized php-1.22wmf7/extensions/LocalisationUpdate/LocalisationUpdate.php 'Update LocalisationUpdate to master in wmf7' [23:07:05] Logged the message, Master [23:07:27] !log anomie synchronized php-1.22wmf7/extensions/LocalisationUpdate/LocalisationUpdate.class.php 'Update LocalisationUpdate to master in wmf7' [23:07:35] Logged the message, Master [23:09:58] RoanKattouw_away: Does anything need to be synced out for your change, or just the l10nupdate run? [23:10:08] anomie: Probably just LU [23:10:11] Will check back after [23:10:16] Let me know when you're done [23:12:00] ok. Waiting on marktraceur to merge something quick before I run the l10nupdate. [23:13:17] !log mholmquist synchronized php-1.22wmf6/extensions/UploadWizard/includes/specials/SpecialUploadWizard.php 'Opera UploadWizard fix' [23:13:24] Logged the message, Master [23:13:59] One more [23:14:17] !log mholmquist synchronized php-1.22wmf6/extensions/UploadWizard/resources/mw.LanguageUpWiz.js 'Opera UploadWizard fix' [23:14:26] Logged the message, Master [23:14:59] * anomie starts l10nupdate [23:15:08] marktraceur: thanks [23:15:42] Oh, actually, it looks like it's probably working now [23:16:01] Sweet. 
[23:16:08] * marktraceur declares bug fixed [23:20:43] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 23:20:43 UTC 2013 [23:20:51] Logged the message, Master [23:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:24:21] New patchset: Ori.livneh; "Add Campaigns & CoreEvents extensions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [23:26:08] error: unable to unlink old '.gitmodules' (Permission denied) [23:26:29] Reedy: ^ [23:26:32] ori-l: Which dir? [23:26:35] I fixed that in wmf7 [23:26:39] (I hope) [23:26:41] RoanKattouw: /common/php-1.22wmf6 [23:26:45] ugh [23:27:02] Fixed [23:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:23] RoanKattouw: much obliged [23:28:23] New patchset: Ori.livneh; "Add Campaigns & CoreEvents extensions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [23:28:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [23:28:36] ori-l: Is anomie already done deploying? [23:28:48] (Sorry for my ignorance, I was on a video call) [23:28:57] RoanKattouw: he's done with wmf6 [23:29:11] OK [23:29:40] Meh looks like what he did didn't update the message [23:29:41] RoanKattouw: l10nupdate completed the first part (update wmf6), working on the second now (wmf7), then there's the "refresh ResourceLoader message cache" part. [23:29:48] Right [23:29:54] ori-l: Are you gonna scap for that new extension? [23:29:57] !log olivneh synchronized php-1.22wmf6/extensions/CoreEvents 'Adding CoreEvents extension to wmf6' [23:30:05] ... 
guess not [23:30:05] Logged the message, Master [23:30:06] RoanKattouw: wasn't going to [23:30:06] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 13 23:30:06 UTC 2013 [23:30:15] Logged the message, Master [23:30:16] I'll sync i18n after you're done then [23:30:37] !log olivneh synchronized php-1.22wmf6/extensions/Campaigns 'Adding Campaigns extension to wmf6' [23:30:45] Logged the message, Master [23:31:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [23:35:34] RoanKattouw: k; I'm waiting for the i18n update to finish on wmf7 to sync there [23:35:36] OK [23:35:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 13 23:35:36 UTC 2013 [23:35:44] Logged the message, Master [23:36:21] New patchset: Nemo bis; "Add Not Confusing (Max Klein)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68594 [23:36:24] RoanKattouw, ori-l : l10nupdate is done now. Also, yay it's fixed! [23:36:49] !log olivneh synchronized php-1.22wmf7/extensions/CoreEvents 'Adding CoreEvents extension to wmf7' [23:36:50] anomie|awayish: :) [23:36:57] Logged the message, Master [23:37:37] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [23:38:37] LeslieCarr: can I ask you a simple feed addition to the planet? 
https://gerrit.wikimedia.org/r/68594 [23:39:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68594 [23:39:58] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Enabling Campaigns & CoreEvents (1/4)' [23:40:06] Logged the message, Master [23:40:26] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling Campaigns & CoreEvents (1/4)' [23:40:34] Logged the message, Master [23:40:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:40:45] thanks! [23:40:54] !log olivneh synchronized wmf-config/extension-list 'Enabling Campaigns & CoreEvents (3/4)' [23:41:03] Logged the message, Master [23:41:23] !log olivneh synchronized wmf-config/CommonSettings.php 'Enabling Campaigns & CoreEvents (4/4)' [23:41:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [23:41:32] Logged the message, Master [23:42:19] RoanKattouw: done [23:45:56] So much lightning. [23:47:10] * Nemo_bis sprinkles some gasoline on Elsie's hair [23:52:12] RoanKattouw: .....I have one more config change, if there's time. [23:53:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [23:56:09] New patchset: Ori.livneh; "wmgUseCampaigns: default => true" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68600 [23:56:45] * ori-l goes for it [23:56:52] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68600 [23:58:40] !log olivneh synchronized wmf-config/InitialiseSettings.php 'I3728a4cd9' [23:58:48] Logged the message, Master