[00:00:11] the other one would be really low but not 0 [00:01:22] binasher: heh, 1e9 works [00:01:53] a very high value for server_failure_limit would be good, but probably just 1 for retry_timeout [00:02:15] that's what I was thinking, want to try that? [00:03:04] there's a dead timeout option too [00:03:53] paravoid: are you looking at libmemcached? [00:04:22] I did [00:04:31] I don't know what else to look [00:04:44] look into that is [00:05:33] what do you need? [00:05:43] the pecl extension doesn't have constants for everything, and is a bit out of sync with libmemcached [00:05:46] * AzaToth throws a http://paste.debian.net/ towards paravoid  [00:06:08] probably more out of sync now [00:07:00] (on an unrelated note, I'm glad you didn't use this: https://github.com/wuakitv/puppet-twemproxy/blob/master/manifests/install.pp ) [00:07:13] heh [00:07:20] New patchset: Aaron Schulz; "Tweaked memcached options as 0 is not valid." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68319 [00:07:26] https://graphite.wikimedia.org/render/?width=801&height=394&_salt=1371080615.671&target=Setup.php-memcached.tavg [00:08:09] so 12.5 vs. 3.75 vs. 0.75 or something [00:08:15] not bad :) [00:08:25] s/tavg/tp90 [00:08:41] good point [00:08:53] same pattern :) [00:08:59] that's pretty cool [00:11:22] we could abandon twemproxy, the libmemcached fix is significant enough [00:11:33] lol [00:11:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68319 [00:11:46] you had other improvements in mind too, didn't you? [00:11:54] with twemproxy I mean [00:12:07] eh, just the site not going down [00:13:06] that's always a nice goal [00:13:14] what specifically though? 
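The effect of the settings being tested above (server_failure_limit pushed to 1e9, retry_timeout dropped to 1) is that a flaky memcached server is never permanently ejected, only skipped for about a second after each failure. A toy Python model of that accounting; this is an illustrative sketch, not libmemcached's actual implementation, and all names here are made up:

```python
class ServerState:
    """Toy model of libmemcached-style per-server failure accounting.

    Illustrative sketch only: a server is ejected after `failure_limit`
    consecutive failures, and a recently-failed server is skipped until
    `retry_timeout` seconds have passed since the last failure.
    """

    def __init__(self, failure_limit=int(1e9), retry_timeout=1.0):
        self.failure_limit = failure_limit
        self.retry_timeout = retry_timeout
        self.failures = 0
        self.last_failure = None

    def record_failure(self, now):
        self.failures += 1
        self.last_failure = now

    def record_success(self):
        self.failures = 0
        self.last_failure = None

    def usable(self, now):
        if self.failures >= self.failure_limit:
            return False  # ejected -- with a limit of 1e9 this never triggers
        if self.last_failure is not None and now - self.last_failure < self.retry_timeout:
            return False  # inside the retry window: skipped, not ejected
        return True


s = ServerState()
s.record_failure(now=0.0)
print(s.usable(now=0.5))  # False: still inside the 1-second retry window
print(s.usable(now=1.5))  # True: retried again, never permanently dropped
```

With these values a downed host costs at most one failed attempt per second instead of being dropped from the pool, which is why "it didn't fail 1e9 consecutive times first" is the expected behaviour.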
:) [00:13:28] !log asher synchronized wmf-config/mc-eqiad.php 'large server_failure_limit, small retry_timeout for twemproxy test hosts' [00:13:37] Logged the message, Master [00:14:23] * Aaron|home hmms at https://graphite.wikimedia.org/render/?width=840&height=398&_salt=1371082426.763&target=MemcachedPeclBagOStuff.getMulti.tp90 [00:14:42] paravoid: better timeout handling of downed hosts and rehashing keys of ejected hosts [00:15:09] sounds nice [00:15:19] if we end up using it, I'll create a proper package and upload it to Debian [00:15:22] :-) [00:15:32] I assume there would be less packets in the air too, less to drop heh [00:16:16] hrm, getmulti times [00:16:59] same as get btw [00:17:35] it may just be the deferred connections showing up in the timing [00:17:52] that would make sense [00:17:54] presumably it no longer dumbly connections to everything anymore in startup [00:19:08] New patchset: Asher; "try twemproxy on all eqiad hosts" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68322 [00:20:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68322 [00:21:19] !log asher synchronized wmf-config/mc-eqiad.php 'trying twemproxy on all eqiad hosts' [00:21:29] Logged the message, Master [00:23:11] Aaron|home: still some occurrences of SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [00:23:35] not many though, odd [00:23:35] i'm pretty sure it didn't fail 1e9 consecutive times first [00:23:41] !log ceph readding osd.81, osd.125, osd.137, osd.141 after disk replacement (#5202, #5228, #5248, #5263) [00:23:49] Logged the message, Master [00:23:55] 2013-06-13 00:23:50.293977 mon.0 [INF] pgmap v8556272: 16760 pgs: 15476 active+clean, 1237 active+remapped+wait_backfill, 40 active+remapped+backfilling, 7 active+clean+scrubbing+deep; 44673 GB data, 138 TB used, 123 TB / 261 TB avail; 21536152/854606636 degraded (2.520%); recovering 701 o/s, 110MB/s [00:24:06] 144MB/s etc., not bad [00:24:15] 
binasher: right, I was wondering if there is some other criteria [00:24:30] Aaron|home: https://bugs.php.net/bug.php?id=60049 [00:24:34] not great either, but certainly much better than swift :) [00:25:02] binasher: you should merge https://gerrit.wikimedia.org/r/#/c/67316/ btw ;) [00:26:26] you don't want to reply to nikerabbit / double hash every key or add a str length check to wfMemcKey?! [00:27:51] New patchset: Asher; "disable persistent conns to twemproxy (test)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68324 [00:28:13] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68324 [00:28:19] New patchset: Dzahn; "remove Apache_site[no_default] line, this is now in webserver.pp and creates a duplicate definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68325 [00:28:59] !log asher synchronized wmf-config/mc-eqiad.php 'disable persistent conns to twmprxy' [00:29:06] Logged the message, Master [00:30:18] binasher: I see libmemcached does unix socket as well [00:30:33] maybe it could connect via unix socket to twemproxy? [00:30:35] twemproxy supports that as well [00:30:41] that could be better [00:30:55] well, we're down to < 1ms, so it doesn't get a lot better I guess [00:31:01] New review: Dzahn; "fix duplicate def." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/68325 [00:31:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68325 [00:31:14] but for now i'm going to head home.. think i should revert twemproxy for now? [00:31:38] what was what TimStarling said during one of the eqiad meetings? [00:32:05] i wonder if ruwiki represents most of the apache traffic [00:32:10] oh? [00:33:14] nope, just most of the SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY requests.. odd [00:33:46] non-latin1 key names? 
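For context on the key-check idea floated here: the memcached text protocol caps keys at 250 bytes and forbids whitespace and control characters, so a length/character check at the wfMemcKey level would catch keys that can corrupt the command stream. A sketch of such a check; the function name is illustrative, not MediaWiki's:

```python
MAX_KEY_LEN = 250  # memcached text-protocol key limit, in bytes

def is_valid_memcached_key(key: str) -> bool:
    """Return True if `key` is safe for the memcached text protocol:
    non-empty, at most 250 bytes once encoded, and free of whitespace
    and control characters (which would corrupt the command stream).
    """
    raw = key.encode("utf-8")
    if not raw or len(raw) > MAX_KEY_LEN:
        return False
    # Bytes <= 0x20 cover space and the C0 control chars; 0x7F is DEL.
    return not any(b <= 0x20 or b == 0x7F for b in raw)

print(is_valid_memcached_key("enwiki:page:12345"))  # True
print(is_valid_memcached_key("two words"))          # False: contains a space
print(is_valid_memcached_key("k" * 300))            # False: over 250 bytes
```

Note that non-latin1 key names are fine by these rules once UTF-8-encoded (bytes above 0x7F pass); the protocol only rejects whitespace/control bytes and over-length keys.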
[00:33:50] Ryan_Lane: @wikistats-01 - Finished catalog run in 54.25 seconds (side error: E: Couldn't find package libweb-scraper-perl, but doesn't break runs) [00:34:10] awesome. thanks [00:34:43] paravoid: could be.. there aren't any in the actual "SERVER HAS FAILED" messages, but a prior non-latin1 key could be triggering something [00:35:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:39] Aaron|home: opinion on whether or not i should revert before heading home? [00:36:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [00:36:27] * Aaron|home checks the logs [00:37:14] Ryan_Lane: oh, wow, somebody merged stuff in that but meant the other wikistats i believe :p [00:37:53] https://github.com/wikimedia/analytics-wikistats/tree/master/pageviews_reports/t [00:38:03] heh, that's so unrelated in this file:) [00:38:10] binasher: it can be left if people are around that know how to revert [00:38:20] I'm not planning to stay around [00:40:20] * paravoid is annoyed by the twemproxy vs. nutcracker duality [00:40:20] I mean the 1 second retry makes it unlikely to explode, and it is easy to revert [00:40:21] hmm, to email ops@ or just revert… [00:40:25] i like http://bit.ly/18z9F7m [00:40:26] gah just pick oneee [00:41:05] i'm just glad we aren't using twemcached, saying that would bug me [00:42:46] okay, packaging seems trivial [00:42:57] a single embedded library that is unmodified, so easily worked around [00:43:11] good copyright/license status, well documented [00:44:55] New patchset: Dzahn; "Revert "wikistats packages needed for Jenkins environment"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [00:47:48] New review: Dzahn; "Hey all, there must have been some confusion here about wikistats again. There are different project..." 
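For reference, the twemproxy (nutcracker) knobs discussed above — host ejection, retry timeout, and listening on a unix domain socket — all live in its per-pool configuration. A hypothetical nutcracker.yml sketch; the pool name, socket path, and values are illustrative, not the production config:

```yaml
mc-local:
  listen: /var/run/nutcracker/nutcracker.sock 0666  # unix socket instead of a TCP port
  hash: fnv1a_64
  distribution: ketama          # consistent hashing: keys of ejected hosts rehash to the rest
  auto_eject_hosts: true
  server_retry_timeout: 1000    # milliseconds before retrying an ejected server
  server_failure_limit: 3       # consecutive failures before ejection
  servers:
    - 10.0.0.1:11211:1
    - 10.0.0.2:11211:1
```

A unix-socket listener avoids a TCP round trip between PHP and the local proxy, which is why it comes up as a possible further improvement once per-request latency is already under a millisecond.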
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/60965 [00:49:48] New patchset: Dzahn; "Revert "wikistats packages needed for Jenkins environment"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [00:50:22] hey mutante, is there another way this can be resolved? [00:50:40] this will break unit-testing for wikistats [00:51:14] drdee: but it was never actually applied to any analytics machine that way [00:51:24] ok, mailed ops about how to revert, now taking off [00:51:25] drdee: all this did was apply it on an unrelated labs instance [00:51:50] mmmmmmmm, 1 sec [00:51:55] drdee: and..even there it cant find the packages [00:52:15] but.. i was about create one more patch set anyways, hold on [00:52:39] binasher: would be nice if graphite .count graphs hits/min or something [00:53:06] mutante: okay ignore me :) [00:59:21] New patchset: Dzahn; "Revert "wikistats packages needed for Jenkins environment"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [01:00:34] drdee: https://gerrit.wikimedia.org/r/#/c/68327/3 reverts it but also adds the same packages that were meant to be installed and puts it back in contint.pp just like before https://gerrit.wikimedia.org/r/#/c/68327/2/modules/contint/manifests/packages.pp [01:00:41] noone writes manpages anymore [01:00:57] just READMEs in markdown [01:01:48] thanks mutante! 
[01:01:52] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.001399517059 secs [01:03:02] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.006385445595 secs [01:03:06] New review: Dzahn; "confirmed all 3 packages were already installed on gallium, now puppet will also ensure this" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/68327 [01:03:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [01:07:20] drdee: np, all done and puppet is fine on both sides, my labs instance and gallium (jenkins) as well, it had the packages before, now puppet just ensures it [02:08:12] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 02:08:12 UTC 2013 [02:08:21] Logged the message, Master [02:10:32] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [02:14:58] !log LocalisationUpdate completed (1.22wmf5) at Thu Jun 13 02:14:58 UTC 2013 [02:15:10] Logged the message, Master [02:28:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 13 02:28:27 UTC 2013 [02:28:35] Logged the message, Master [04:53:36] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [04:53:36] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:12:08] PROBLEM - RAID on analytics1019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:18:01] New review: Mxn; "I haven?t gotten around to porting the Vietnamese IME script over to ULS yet, so there?s little effe..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/68146 [06:34:24] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:34:24] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:34:35] RECOVERY - Disk space on ms-be1 is OK: DISK OK [06:34:37] RECOVERY - swift-container-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:34:37] RECOVERY - DPKG on ms-be1 is OK: All packages OK [06:34:43] RECOVERY - swift-container-updater on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:34:44] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:34:53] RECOVERY - swift-object-auditor on ms-be1 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [06:34:53] RECOVERY - RAID on ms-be1 is OK: OK: State is Optimal, checked 1 logical device(s) [06:34:53] RECOVERY - swift-object-updater on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [06:35:03] RECOVERY - swift-account-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:35:03] RECOVERY - swift-object-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:35:03] RECOVERY - swift-object-server on ms-be1 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:37:10] apergos: that you? [06:37:17] yes [06:37:22] ah, cool [06:37:28] log? 
:) [06:37:34] RECOVERY - swift-account-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:37:42] RECOVERY - swift-account-reaper on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:37:48] nothing to log yet [06:37:55] what's not cool is ms-be9 [06:38:15] its disk usage is far below all the other hosts and I can't figure out why that is [06:38:42] when I check the obj replicators everywhere they all look fine [06:39:03] and yet its partitions are using 800gb or less while everywhere else its 1.1 or 1.2T, I forget [06:41:42] is it balanced? [06:41:49] the rings that is [06:42:29] they were the last time they went around. ms-be9 has a bunch of incoming network traffic (much more than the rest of the boxes) and I suppose that's why [06:42:36] but how it got that way, that's what's bugging me [06:42:40] ms-be9 is @ 66% [06:43:23] ah :-D I should have looked... but then why is it getting so much more incoming network traffic?
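A quick check of the figures in this exchange (ms-be9 at 66% of a 1.1–1.2T partition set, against the observed "800gb or less"):

```python
per_host_capacity_tb = 1.2   # "everywhere else its 1.1 or 1.2T"
used_fraction = 0.66         # "ms-be9 is @ 66%"
used_gb = per_host_capacity_tb * used_fraction * 1000
print(round(used_gb))        # 792 -- consistent with the observed "800gb or less"
```

So the usage number itself is in line with the reported percentage; the open question in the log is only where the extra incoming traffic comes from.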
that's actually what caught my eye [06:43:31] 66% of 1.1-1.2T is 800gb, so about right [06:43:46] it should be looking like all the rest, and while there is one disk that's a problem (and an open ticket) that can't account for it [06:43:58] looking [06:44:07] thanks, hopefully your eyes are shaarper [06:44:49] another set of eyes always helps [06:44:59] doesn't need to be sharper :) [06:45:12] also true :-) [06:45:32] (and they're not sharper, I pulled an allnighter again :) [06:45:40] woops [06:45:48] you gotta pull an all-dayer [06:45:52] heh [06:45:52] it's better for your sleep schedule [06:50:30] nothing that stands out [06:51:25] ok well I'm going to do what I was doing and ignore it for now but keep an eye out [06:51:27] thanks for looking [06:51:32] RECOVERY - NTP on ms-be1 is OK: NTP OK: Offset -0.0143879652 secs [06:52:00] back in a few (mus do dishes), hope you have gone sleep by then :-D [06:52:02] *must [06:52:06] *gone to [06:52:15] maybe I need sleep instead! [07:32:22] !log the last of the c2100s in the swift pool has been replaced (and new rings pushed), finally. now we "only" have the H310 controllers to replace... 
[07:32:31] Logged the message, Master [07:42:53] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:53] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:54] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:55] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours [07:42:55] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [07:42:56] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:56] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [07:42:57] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [07:45:27] reminder for me (to pass this on to cmjohnson later): https://wikitech.wikimedia.org/wiki/Swift/Deploy_Plan_-_R720xds_in_tampa [08:02:40] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.009769558907 secs [08:05:30] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [08:09:04] good morning [08:12:22] 'lo [08:13:44] morning [08:15:26] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [08:18:45] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server 
[08:31:45] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.01350402832 secs [08:32:46] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.01323068142 secs [08:34:27] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:55] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [08:42:57] hashar: any idea what is up with jenkins here? https://gerrit.wikimedia.org/r/#/c/68117/ [08:43:14] should VE simply increase its qunit timeout setting? [08:54:09] ori-l: possibly [08:54:23] ori-l: honestly I am not sure how qunit works :( [08:55:08] note that the job has always failed apparently. [08:55:35] I am not sure why it triggers two run as well [08:55:49] ori-l: are the qunit tests passing on your local install ? [08:56:55] hashar: i can't get them to run [09:01:34] New review: Akosiaris; "LGTM" [operations/debs/kafka] (debian) C: 2; - https://gerrit.wikimedia.org/r/68026 [09:06:43] New review: Nemo bis; "Thanks, this fixed LocalisationUpdate on Wikimedia projects for core. Extensions are still broken." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68110 [09:28:11] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [09:30:11] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [09:36:10] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [09:43:50] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [09:44:32] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [09:45:31] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [09:57:10] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [10:00:30] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [10:01:30] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [10:03:09] New patchset: Odder; "(bug 49358) Remove MoodBar from it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68352 [10:09:48] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [10:15:58] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [10:21:08] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [10:21:08] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [10:21:19] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [10:23:08] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [10:23:58] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [10:37:09] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [10:39:16] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.012 second response time [10:41:06] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 
OK - 1373 bytes in 0.005 second response time [10:44:23] New review: Mark Bergsma; "Comments inline." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [11:28:52] New review: Akosiaris; "Comments mostly inline. Biggest hurdle i see is the JDK7 requirement. " [operations/debs/buck] (master) C: -1; - https://gerrit.wikimedia.org/r/67999 [11:47:06] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [11:48:26] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [11:48:47] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [11:50:55] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [11:51:06] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [11:53:06] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [11:53:18] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [11:54:05] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [12:02:06] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [12:10:36] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [12:11:04] New patchset: Mark Bergsma; "Add FIXMEs to be dealt with when configuring the new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68368 [12:11:07] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [12:12:08] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [12:14:20] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [12:15:53] haha [12:15:59] due to a puppet bug, arrays are mutable [12:18:06] RECOVERY - Puppet freshness on ms-be1 is OK: puppet ran at Thu Jun 13 12:18:04 UTC 2013 [12:30:33] New patchset: Mark Bergsma; "Add warning note" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68373 [12:30:33] New patchset: Mark Bergsma; "Move (commented out) packages version class instance to ancestor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68374 [12:30:34] New patchset: Mark Bergsma; "Factor out Varnish logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68375 [12:31:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68368 [12:31:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68373 [12:32:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68374 [12:34:29] New patchset: Mark Bergsma; "Factor out Varnish logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68375 [12:36:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68375 [12:44:46] New patchset: Mark Bergsma; "Factor out addition of localhost IPs to ancestor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68379 [12:45:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68379 [12:57:54] New patchset: Mark Bergsma; "Factor out $varnish_directors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68382 [12:57:54] New patchset: Mark Bergsma; "Mummy says dashes in puppet names are bad!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68383 [12:58:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [13:04:18] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [13:04:35] New review: Peachey88; "> Mummy says dashes in puppet names are bad!" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68383 [13:04:38] New patchset: Mark Bergsma; "Add another level of class hierarchy, descend bits from a 1layer class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68384 [13:05:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68382 [13:06:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68383 [13:07:20] New patchset: Mark Bergsma; "Add another level of class hierarchy, descend bits from a 1layer class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68384 [13:08:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68384 [13:20:16] hey guys. sdb in neon is failing => sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed. mdadm has not yet detected it.... How do we handle this? ticket in rt ? [13:26:13] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [13:29:42] New patchset: Mark Bergsma; "Fix tier 2 bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68385 [13:29:52] akosiaris: yeah, a ticket in the RT queue for the respective datacenter [13:29:54] which is eqiad in this case [13:30:01] mark: ok thanx [13:31:12] New patchset: Mark Bergsma; "Fix tier 2 bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68385 [13:31:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68385 [13:32:42] New review: Diederik; "Ok" [operations/debs/kafka] (debian); V: 2 - https://gerrit.wikimedia.org/r/68026 [13:32:43] Change merged: Diederik; [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/68026 [13:33:37] New patchset: Mark Bergsma; "Typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68387 [13:34:13] Change merged: Mark 
Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68387 [13:40:23] New patchset: Mark Bergsma; "Fix tier 2 bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68388 [13:41:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68388 [13:46:28] New patchset: Mark Bergsma; "Provide a plain array of backend values" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68389 [13:47:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68389 [13:51:01] New patchset: Mark Bergsma; "Flatten harder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68390 [13:52:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68390 [13:56:08] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [13:57:19] New review: Andrew Bogott; "the latest output is here: https://dpaste.de/Do8x5/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [13:59:29] mark: ^ [13:59:54] ok [13:59:55] New patchset: Mark Bergsma; "Simply bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:00:23] doh I could have put the comments in the manifest instead hehe [14:00:43] *shrug* it was easy to follow as you did it [14:02:12] so mchenry needs to be added to relay_from_hosts [14:02:21] I see there's a hostlist for that but it's not included into relay_from_hosts [14:02:37] you can ditch it and include it directly instead if that's easier [14:02:53] hmm [14:03:03] perhaps it would be better to define a "mail relays" variable in some puppet manifest [14:03:05] and use that [14:03:12] instead of hardcoding mchenry [14:03:30] hmm wait [14:03:34] no this isn't needed [14:03:41] sorry [14:06:09] New review: Mark Bergsma; "(1 comment)" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:07:41] andrewbogott: I see in the original RT config that hostlist didn't really do anything [14:07:42] just ditch it [14:07:52] i'll run a diff now actually [14:08:58] ok, ditched [14:12:05] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:15:51] New review: Mark Bergsma; "Please change the acl_check_connect ACL from the original config to the following, in the template. ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:16:16] hehe [14:16:20] there's also no smarthost router yet [14:16:28] this template is clearly not very well tested yet ;) [14:18:47] New review: Mark Bergsma; "Add one last "router" which sends any remaining mail to the outbound mail relays (mchenry/sodium):" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:20:05] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68011 [14:20:20] andrewbogott: found a few more problems, added comments for them [14:20:24] but it should be nearly there [14:20:32] thanks [14:24:26] New patchset: Mark Bergsma; "Simply bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:24:58] the puppet parser is such a piece of shit [14:27:18] New patchset: Mark Bergsma; "Simplify bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:29:09] New patchset: Mark Bergsma; "Simplify bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:30:18] New patchset: Mark Bergsma; "Simplify bits VCL configuration options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:31:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68392 [14:34:36] 
enable_external_mail implies "This server does not accept external mail"? [14:43:15] mark, do you see that the smart router block you suggested is already present in the imap_delivery section of the template? Should I just duplicate it, or does that need to be reconciled somehow? [14:43:42] reconcile it [14:43:52] get it out of imap_delivery, make it a separate smarthosts parameter [14:44:38] And… leave it to future imap definitions to fill the array? [14:44:48] I don't see any existing uses of imap delivery [14:53:22] New review: Nemo bis; "Seems good. I noticed their lack some days ago. :)" [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/67203 [14:56:31] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [14:57:21] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [14:57:30] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.010 second response time [14:57:57] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [14:58:09] andrewbogott: imap is now done by sanger, also not puppetized yet [15:00:39] mark: Yeah, the change clearly doesn't break any existing systems, but it does change the behavior of enable_imap_delivery for future users [15:00:58] that's fine [15:01:08] New review: Andrew Bogott; "the latest: https://dpaste.de/fnqcH/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [15:01:11] that template has only been put to one use before [15:01:16] which is the mailman server [15:01:27] but it's been compiled as an aggregate config of mchenry, sanger, and the previous lists server [15:01:31] because a lot of stuff is shared [15:01:38] so we'll figure out the bugs for the remaining uses as we go ;) [15:02:11] New patchset: Mark Bergsma; "Setup the new Parsoid 
caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:04:15] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:07:12] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:17:53] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:23:33] New review: Mark Bergsma; "Looks pretty good. Only a few minor comments left." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [15:27:52] New patchset: Mark Bergsma; "Setup the new Parsoid caches as a single layer cluster instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [15:27:55] i keep using dashes [15:38:14] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [15:40:11] mark, I have a question re 68404 [15:40:24] oh? [15:40:40] to me it looks like you remove the front/backend separation [15:40:46] that's correct [15:40:49] doesn't that also remove the cache splitting? [15:40:52] yes [15:40:54] which is the point [15:40:55] hmm [15:41:03] I think cache splitting is very bad if you have just two boxes :) [15:41:07] a single box dies and you lose half your cache [15:41:14] so if one of the caches goes down, 100% is gone rather than 50% [15:41:15] and other than that it's just needless complexity [15:41:25] now both caches should get the same content [15:41:40] that may not work the way you currently refresh stuff [15:41:46] thanks for all your help, mark [15:41:49] but that's why I'm changing it now ;) [15:41:58] mark: how will one box get the copy of the other's cache? 
[15:42:02] it won't [15:42:07] we don't want to render pages twice [15:42:21] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [15:42:51] so far I thought that the 50/50 split was a good compromise between resilience and performance [15:44:01] if you want hashing anyway, wouldn't that be better implemented inside mediawiki? [15:44:21] mediawiki is not involved in this really, except as a client [15:44:29] yes, as a client [15:44:42] we also perform requests to the caches from Parsoid [15:44:44] !log reedy synchronized php-1.22wmf7 'Initial file sync' [15:44:53] that's what I mean [15:44:54] Logged the message, Master [15:45:01] ok, also in parsoid then ;) [15:45:20] (to me, "mediawiki" is "all that wiki platform stuff I stay away from" ;) [15:45:30] then we'd have to do failover etc both in MW and Parsoid [15:45:53] when that is currently nicely abstracted [15:46:00] !log reedy synchronized docroot [15:46:10] Logged the message, Master [15:46:17] "nicely" [15:46:29] i think this whole setup is fugly tbh, as varnish is used for storage [15:46:34] instead of for caching [15:46:42] mark: is your goal to have 100% of the cache in case one of the servers goes down? 
[15:46:42] i know it's temporary [15:46:58] no, my goal is to not have double the amount of requests to parsoid or mediawiki [15:47:09] er [15:47:13] right now we only send one request [15:47:13] 50% of all requests [15:47:20] yeah I understand [15:47:23] that is a problem with a one layer setup [15:47:41] but the hashing isn't that reliable either [15:47:43] we used to send two and then relied on request coalescing on the single backend [15:47:51] but that does not work with forced missed [15:47:54] *misses [15:48:00] no [15:48:03] i had an alternative idea for that btw [15:48:07] but didn't want to put that in yet [15:48:17] (just purge on hit & restart the request) [15:48:25] the frontends don't need caching anyway [15:48:40] you're really only using them for hashing eh [15:48:44] yup [15:48:51] that was a bit hidden in the config [15:48:56] request rates average around 20/s [15:48:57] the proper way would be to return (pass) on everything [15:49:04] peaks maybe 100/s [15:49:15] so 0.5% cpu usage on one core at most ;) [15:49:57] we don't currently purge really [15:50:28] for edits, we request the new url (including the oldid) and a header pointing to the predecessor version [15:50:38] i know [15:50:40] the cached predecessor DOM is then used to speed up re-rendering [15:50:44] let me demonstrate it with code then ;) [15:51:30] a purge for the predecessor is sent out then, but currently does not reach the backends [15:51:58] mark: I'm not sure which problem you are trying to fix [15:52:23] the ugly dual layer setup [15:52:59] I agree that it is not super-pretty [15:53:57] but I don't know of a way to do a single-layer setup without duplicate backend requests [15:56:06] gerrit-wm: ping [15:56:13] mark: got to run to catch a train, should be back in ~10 minutes [15:56:26] here's an alternative patchset [15:56:26] New patchset: Mark Bergsma; "Alternative way to refresh content with purging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413
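Mark's parenthetical "(just purge on hit & restart the request)" compresses into a few lines of Varnish 3 VCL. This is a sketch of the idea as stated in the discussion, not the contents of change 68413:

```vcl
sub vcl_hit {
    # Forced refresh: the client asked for a re-render (no-cache),
    # so evict the stale object and restart the transaction, which
    # then misses and fetches a fresh copy from the backend.
    if (req.http.Cache-Control ~ "no-cache") {
        purge;
        return (restart);
    }
}
```

In Varnish 3, `purge;` inside `vcl_hit` invalidates the object that just matched, so the restarted request cannot hit the stale copy again.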
[15:56:30] with dual layer [15:56:33] but working purging [16:00:58] !log reedy synchronized wmf-config/InitialiseSettings.php [16:01:01] Logged the message, Master [16:01:08] New patchset: Mark Bergsma; "Alternative way to refresh content with purging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413 [16:01:20] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [16:02:22] New patchset: Reedy; "Set testwikidatawiki logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68414 [16:02:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68414 [16:03:09] New patchset: Nemo bis; "Update gitweb/gitblit RSS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68415 [16:03:40] !log reedy synchronized w [16:03:47] Logged the message, Master [16:04:10] New patchset: Reedy; "Add symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68416 [16:04:14] New patchset: Mark Bergsma; "Alternative way to refresh content with purging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413 [16:04:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68416 [16:10:21] New patchset: Nemo bis; "Whitelist WMF blog and GitBlit feeds again on MediaWiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68417 [16:10:21] no ops on Friday?! OH NOES! [16:11:24] !log reedy Started syncing Wikimedia installation... : testwiki to 1.22wmf7 and build l10n cache [16:11:32] Logged the message, Master [16:19:14] New review: GWicke; "I agree that the front/backend setup is relatively complex, but on the other hand that complexity is..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68404 [16:23:12] gwicke: also see https://gerrit.wikimedia.org/r/68413 [16:23:35] i'm still undecided on the dual layer setup [16:23:41] on one hand it's good to avoid dual backend requests [16:24:04] but on the other hand it's better to have double the requests consistently all the time, than to have a massive amount of backend requests when one cache goes down [16:24:17] mark: on template refresh we'd eventually like to reuse extensions and images [16:24:19] because the first is much easier to capacity plan for [16:24:25] purging will make that hard [16:25:03] are you talking about 68413 now? [16:25:08] yes [16:25:14] i don't know how that relates [16:25:24] this is just what you're currently doing, except it actually removes the old object from the cache too [16:25:43] the purge & restart logic means that we can't use the current version to speed up template refreshes [16:26:22] how would you use the current version? [16:27:50] while processing a no-cache request we can request the current version from the cache [16:28:07] as another request you mean? [16:28:09] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.22wmf7 and build l10n cache [16:28:16] and then return an updated version [16:28:22] Logged the message, Master [16:28:25] yes, with only-if-cached header set [16:28:39] so WHILE you're doing a refresh [16:28:43] yup [16:28:47] you're doing ANOTHER request with only-if-cached? 
[16:28:57] not right now, but that is the plan [16:29:01] ok, perhaps a few comments in the VCL to that effect would be appropriate [16:29:27] if there is some other way to do this then that would be great too [16:29:50] ori-l: hashar: "You're wrong" [16:30:03] maybe if you do some awful hack with setting a specific ttl [16:30:12] but yuck [16:30:26] ori-l: hashar: VE qunit doesn't time out, the http request served to phantomjs in there is broken [16:30:39] I don't like the fact that the no-cache refreshes are not cleared from storage [16:30:42] ori-l: With no qunit, jquery, mediawiki in the response. [16:30:50] that's what I was trying to fix for you [16:31:01] ori-l: So phantomjs (waiting for QUnit to appear and start dancing) "times out" [16:31:17] ori-l: Stop talking, start fixing. :-) [16:31:26] but I didn't get your extremely evil scheme to also request that object during that same request :P [16:31:36] (moving to #wikimedia-dev) [16:32:06] mark: parsoid can do another request to the cache, but it currently only does that for the predecessor version [16:32:35] for template / image updates, the idea is to request the current URL instead [16:32:47] i understand now [16:33:42] alright [16:34:12] it is a bit tricky, will try to amend http://www.mediawiki.org/wiki/User:GWicke/Minimal_performance_strategy_for_July_release a bit [16:36:07] New review: Mark Bergsma; "I agree that having double the amount of backend requests is not great. It might however still be be..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [16:36:37] New patchset: AzaToth; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [16:37:29] Change abandoned: Mark Bergsma; "Gabriel explained that this wouldn't work, because they want to be able to fetch the old object unde..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68413 [16:38:24] New review: AzaToth; "The bundled jars are there bacuase they don't exists in debian/ubuntu, or is outdated or didn't work..." [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [16:41:29] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:44:06] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:44:12] New patchset: Jforrester; "Enable the EventLogging integration for VisualEditor on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68430 [16:45:00] New review: Jforrester; "Depends on https://gerrit.wikimedia.org/r/#/c/68117/ which is not yet merged." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/68430 [16:45:27] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:46:12] New patchset: Mark Bergsma; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [16:46:19] i need many patchsets today [16:47:17] aww [16:51:52] mark: will return(pass) preserve the Cache-Control header? [16:52:26] it was being eaten by the frontend caches before I transferred it to bereq explicitly [16:53:18] gwicke: really? [16:53:28] yes [16:53:31] pass should preserve it, but I didn't test [16:53:50] hmm [16:53:57] that's something we'll want to check for the text cluster as well [16:54:46] Varnish normally tries hard to ignore client-side cache refreshing.. 
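Since, as mark put it, the frontends are "really only using them for hashing", the whole frontend role reduces to a hash director plus an unconditional pass, which also forwards client request headers such as Cache-Control to the backend untouched. A hedged sketch in Varnish 3 syntax (backend names and addresses invented):

```vcl
backend be1 { .host = "10.0.0.1"; .port = "80"; }
backend be2 { .host = "10.0.0.2"; .port = "80"; }

# Hash on the URL so both frontends pick the same storage backend
# for any given page.
director parsoid_hash hash {
    { .backend = be1; .weight = 1; }
    { .backend = be2; .weight = 1; }
}

sub vcl_recv {
    set req.backend = parsoid_hash;
    return (pass);    # never cache in this tier
}
```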
[17:00:16] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [17:09:16] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.012 second response time [17:09:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:14:08] drdee: [17:14:08] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics102%5B12%5D.eqiad.wmnet&mreg[]=kafka_network_SocketServerStats.ProduceRequestsPerSecond&z=large>ype=stack&title=kafka_network_SocketServerStats.ProduceRequestsPerSecond&aggregate=1&r=hour [17:14:10] looks ok, no? [17:16:40] yah, weird [17:22:52] New patchset: Ottomata; "No need to have a special 'BROKER_JMX_PORT' variable if kafka.default is only read by kafka.init" [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/68443 [17:23:09] Change merged: Ottomata; [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/68443 [17:31:32] New patchset: Aaron Schulz; "Increased the Parsoid job pipeline (to account for non-template edits)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68446 [17:31:47] notpeter: ^ [17:34:55] heya paravoid, you around? [17:35:00] yes [17:35:08] i'm repuppetizing kafka for the 0.8 package, just got an annoying thing i'm not sure what's best to do with [17:35:38] https://gerrit.wikimedia.org/r/#/c/50385/6/manifests/server.pp [17:35:42] see line 42 [17:35:49] where i'm inferring the broker_id from numbers in the node's hostname? 
[17:35:54] i don't really want to do that anymore [17:36:10] aha [17:36:11] i thought it was kinda slick, but i'm leaning against it right now [17:36:17] AaronSchulz: cool, will merge [17:36:33] and thinking about making broker_id not ahve a default, so you always have to pass it [17:36:56] but that would mean that we'd either need parameterized role classes, or a role class for each kafka broker [17:37:28] actually, i guess I could do like we did for zookeeper [17:37:37] with the hash... [17:37:44] i *do* need a list of brokers in 0.8 [17:37:47] (I didn't in 0.7.2) [17:38:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68446 [17:38:20] right [17:38:24] I was about to suggest that [17:38:29] it sounded all very similar to zookeeper [17:38:41] AaronSchulz: live [17:38:52] class { 'kafka': [17:38:52] brokers => { 'analytics1021' => 1, 'analytics1022' => 2 … } [17:39:08] hmmm ok thanks for the brain bounce I will work with that :) [17:39:15] ori-l, do you know what's up with the 'kubo' instance in the editor-engagement project? Puppet seems very sad there [17:39:15] (hehe, sometimes you just need a mirror!) [17:39:22] yep [17:40:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [17:43:13] AaronSchulz: , notpeter: thanks! 
[17:43:40] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:43:40] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:41] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [17:43:41] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:42] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours [17:43:42] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:43] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:43] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours [17:53:13] mark: just tested, Cache-Control is preserved when using pass [17:57:52] gwicke: that's what i suspected yeah [18:06:16] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity and wikipedia to 1.22wmf6 [18:06:24] Logged the message, Master [18:06:54] New patchset: GWicke; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [18:07:05] mark: I tweaked your patch minimally [18:07:18] New patchset: Reedy; "testwiki to 1.22wmf7 ahead of later deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68459 [18:07:18] New patchset: Reedy; "Wikipedia 
and Wikiversity to 1.22wmf6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68460 [18:08:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68459 [18:08:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68460 [18:09:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:10:47] orenwolf: when do you head down to sf ? [18:12:23] New review: Mark Bergsma; "No, I want to keep the VCL hook functions themselves out of these "common" include files, because th..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68428 [18:12:51] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki, testwikidatawiki and mediawikiwiki to 1.22wmf7 [18:12:59] Logged the message, Master [18:13:38] gwicke: yeah, but I didn't like those vcl_recv and stuff put in the common files [18:13:49] New patchset: Reedy; "test2wiki, testwikidatawiki and mediawikiwiki to 1.22wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68466 [18:14:08] because they all get concatenated and with more include files it gets confusing what's run in which order [18:14:24] andrewbogott: dunno. i'll take a look. [18:14:44] andrewbogott: can you check out https://gerrit.wikimedia.org/r/#/c/68135/ ? [18:14:48] ok, I can move the host header clamping to the front/backend too [18:14:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68466 [18:15:02] why do you need it in the backend? [18:15:11] all queries arrive on the frontends first, no? 
[18:15:16] otherwise debugging and backend purging is harder than necessary [18:15:53] well you can put it in a vcl_recv_common function then [18:15:57] and call that from both [18:16:08] at least then the flow of the code is clear [18:16:44] ok [18:19:56] LeslieCarr: Monday! [18:20:03] I'll be in all week [18:21:36] try the veal ? [18:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:24:51] New patchset: Pyoungmeister; "moving en, fr, de, and ja search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68470 [18:27:09] New patchset: GWicke; "Use pass to avoid caching in frontend, refactoring, explanatory comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [18:29:21] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68470 [18:31:24] Ryan_Lane: is it ok I fold operations/debs/gerrit into operations/gerrit? [18:32:12] I think our normal naming scheme for debs is to have them in operations/debs/ [18:32:16] so maybe the opposite? [18:32:46] !log py synchronized wmf-config/lucene-production.php 'moving search traffic to pmtpa for en, de, fr, and ja' [18:32:46] Ryan_Lane: I meant more to only have one repo [18:32:52] yep. that works [18:32:55] Logged the message, Master [18:32:55] the one I made isn't that good [18:36:36] Change abandoned: AzaToth; "moving it all to the other gerrit repo" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/68281 [18:40:02] New review: GWicke; "I have renamed the shared vcl_recv to vcl_recv_common and am now explicitly calling that from front/..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68428 [18:44:53] New patchset: Pyoungmeister; "moving more search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68480 [18:46:56] ori-l: That morebots change looks good… it won't go live until I repackage the bot though [18:47:10] (you tested it, I trust?) [18:48:31] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68480 [18:49:39] !log py synchronized wmf-config/lucene-production.php 'moving more search traffic to pmtpa' [18:49:47] Logged the message, Master [18:49:55] New review: Andrew Bogott; "This looks fine -- it'll take a while to trickle into production since I'll need to repackage the bot." [operations/debs/adminbot] (master) C: 2; - https://gerrit.wikimedia.org/r/68135 [18:50:01] Change merged: jenkins-bot; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [18:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:54:16] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:06] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:55:18] New patchset: Pyoungmeister; "moving search pools 4 and 5 to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68482 [18:57:45] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68482 [18:58:24] !log py synchronized wmf-config/lucene-production.php 'moving last of search traffic to pmtpa' [18:58:35] Logged the message, Master [19:05:40] !log kaldari synchronized php-1.22wmf7/extensions/Thanks/modules/ext.thanks.thank.js 'fix IE Thanks bug' [19:05:51] Logged the message, Master [19:05:59] ^demon: 
can you create branch wmf-debian onto gerrit? [19:06:46] or anyone [19:07:02] * AzaToth can never remember who can do what [19:08:02] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [19:08:22] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [19:08:26] why does parsoid go away on these boxes? [19:08:45] * apergos looks around for RoanKattouw_away boo, away [19:08:52] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [19:08:59] Ryan_Lane: poke [19:09:04] gwicke: ^^ [19:09:22] paravoid: whom can add a branch? [19:09:22] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [19:09:37] create a branch where? [19:09:39] I checked one of them earlier, nothing of use in the log as far as I could gell, nor atop.. just no parsoid. [19:09:43] Ryan_Lane: gerrit [19:09:44] oh [19:09:45] *tell [19:09:45] in gerrit [19:09:49] wmf-debian [19:09:50] I can. one sec [19:09:55] I think you should be able to as well [19:09:59] hmm [19:10:08] apergos: I don't have the rights to restart Parsoid directly [19:10:18] but can restart them by pushing out new code [19:10:22] well I think puppet must do so (something does) [19:10:27] I just wonder why they die [19:10:32] AzaToth: just gerrit? [19:10:47] ooohhhh [19:10:47] I can go restart them sooner but it would be nice to know the cause [19:10:49] Ryan_Lane: ! 
[remote rejected] wmf -> wmf-debian (prohibited by Gerrit) [19:11:00] now I see why you wanted to merge gerrit and operations/debs/gerrit [19:11:01] apergos: one possibility is that the thermal issues in dmesg are actually relevant [19:11:06] that makes a lot more sense :D [19:11:07] Ryan_Lane: yup ツ [19:11:23] well, shit [19:11:23] apergos: I personally doubt it, but am not sure [19:11:25] I thought I could [19:11:26] I can't [19:11:28] doubtful, there owuld be something else other than parsoid disappearing and everything else running fine [19:11:36] *nod* [19:11:40] seems I don't have permissions in it [19:11:43] daym [19:11:45] I think you'll need ^demon [19:11:52] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:53] oh [19:11:54] wait [19:11:55] nevermind [19:11:59] interface has changed [19:12:12] done [19:12:22] https://gerrit.wikimedia.org/r/#/admin/projects/gerrit,access [19:12:33] ensure => running, [19:12:47] so puppet will restart if it's gone (might take up to 1/2 hour or so) [19:13:30] AzaToth: added. let me know if you have issues [19:13:34] Ryan_Lane: which did you base wmf-debian on? [19:13:39] HEAD [19:13:41] oh [19:13:45] bas it on wmf [19:13:47] base [19:13:52] ah [19:13:53] sorry [19:13:55] np [19:14:04] deleting [19:14:27] done [19:16:13] hmm [19:16:28] back in a bit. lunch [19:16:31] gerrit changes ends up in #wikimedia-dev instead of here [19:16:40] New patchset: AzaToth; "adding gitreview file" [gerrit] (wmf-debian) - https://gerrit.wikimedia.org/r/68484 [19:16:41] New patchset: AzaToth; "Copying and modifying operations/debs/gerrit into gerrit" [gerrit] (wmf-debian) - https://gerrit.wikimedia.org/r/68485 [19:18:19] ^demon: ↑ [19:19:36] <^demon> AzaToth: Do you have maybe a readme or something I can follow? I'm an idiot when it comes to doing debian packaging. [19:19:45] <^demon> I'm totally wanting to try out your work though. [19:19:50] ^demon: git buildpackage [19:19:56] <^demon> Ha, ok. 
[19:20:21] though changelog needs to be changes, or use flags "-uc -us" to not sign it [19:20:43] or flag "-k gnupg_id" [19:20:44] ^demon, this was really helpful for me [19:20:53] http://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.html [19:21:32] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [19:21:49] ^demon: so "default/debian" branch is "wmf-debian" and "upstream" is "wmf" [19:22:19] ^demon: you need buck installed first though [19:22:32] which is still in review :-P [19:22:52] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [19:23:04] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [19:23:38] ^demon: https://gerrit.wikimedia.org/r/#/c/67999/ [19:24:02] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [19:24:10] ^demon: every time I work with java, I somehow hates java more and more, dunno why [19:26:32] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:27:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [19:28:22] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:28:24] <^demon> AzaToth: Is buck or git-buildpackage missing some dependencies? I had to install debhelper, javahelper and ant by hand. [19:30:20] <^demon> Ok, well got further. 
reports as BUILD SUCCESSFUL but I get the following: [19:30:33] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [19:31:04] <^demon> gbp:info: buck_0.0+g410fc434.orig.tar.gz does not exist, creating from 'upstream/0.0+g410fc434' [19:31:17] <^demon> fatal: Not a valid object name upstream/0.0+g410fc434 [19:32:52] ^demon: they are build dependices [19:33:12] hmm [19:34:16] ^demon: git checkout upstream [19:34:38] you need to have the upstream branch setup [19:35:28] !log dns update [19:35:37] Logged the message, Master [19:35:45] <^demon> AzaToth: I've got upstream listed under git branch. [19:35:52] <^demon> And I'm on master, with your change cherry picked on top [19:36:22] in buck right? [19:36:49] <^demon> Yep [19:37:39] ^demon: gbp:info: buck_0.0+g410fcf34.orig.tar.gz does not exist, creating from 'upstream/0.0+g410fcf34' [19:37:39] dpkg-buildpackage -rfakeroot -D -us -uc -i -I [19:37:43] wfm [19:38:06] ^demon: http://paste.debian.net/10217/ [19:39:11] <^demon> I cloned from wmf repo on this box, not github. [19:39:18] <^demon> So I've only got one remote. [19:40:17] $ git describe --tags upstream [19:40:17] 0.0+g410fcf34 [19:40:56] <^demon> fatal: no names found, cannot describe anything [19:41:04] oh [19:41:06] offcourse [19:41:25] I can't make a patchset for a tag [19:42:18] ^demon: just make a tag 0.0+g410fcf34 pointing to upstream (410fcf34) [19:43:04] PROBLEM - Host mc1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:00] !log rebooted mc1009 as a firedrill, chaos monkey ahoy! [19:44:08] Logged the message, Master [19:45:03] <^demon> AzaToth: Bah, so I've got the tag, git describe looks sane, but still same error about upstream/... [19:46:14] RECOVERY - Host mc1009 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [19:46:14] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:05] ^demon: can you paste the whole log? 
[19:47:14] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:17] <^demon> Not easily, I'm in a VM [19:47:23] <^demon> I'll screenshot [19:47:34] New review: Ori.livneh; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [19:47:35] I can copy paste in VM:s [19:48:04] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [19:49:55] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:04] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [19:53:04] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [19:56:44] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.027 second response time [19:57:27] are these parsoid nodes recovering on their own, or is somebody restarting parsoid? [19:58:43] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [19:59:09] ottomata: ! [19:59:14] that was fast :) [19:59:20] :) [19:59:23] heya hashar, you around? [19:59:27] ahh nope [19:59:29] so [19:59:30] this [19:59:31] 19:58:45 err: Could not parse for environment production: Syntax error at 'producer_properties_template'; expected ')' at /srv/ssd/jenkins/workspace/operations-puppet-kafka-pplint-HEAD/manifests/init.pp:45 [19:59:31] 19:58:45 err: Try 'puppet help parser validate' for usage [19:59:32] oh btw, did I tell you I have packaging for librdkafka pretty much ready? [19:59:37] oh awesome! [19:59:40] no cooOOoooOoL! [19:59:45] also ITPed in Debian? [19:59:51] ITPed? [19:59:54] intent to package [19:59:59] Oooo super awesome [20:00:00] = I'll upload it to Debian [20:00:22] I need to commit it somewhere I gues... 
[20:00:24] PROBLEM - NTP on mc1009 is CRITICAL: NTP CRITICAL: Offset unknown [20:00:27] yeah so um, jenkins parser thinks that commas at the end of arg lists in puppet parameters are invalid [20:00:29] not so! [20:01:07] <^demon> AzaToth: I'm spinning up a new VM in labs to do this on instead. Will be way easier for me. [20:01:07] hmmm [20:01:08] i mean [20:01:16] it works anyway…. puppet validate says no [20:01:24] and I like ending commas! [20:01:25] oh well [20:01:48] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:02:22] OOps i was missing a comma there [20:03:03] but ja paravoid, this was mostly already done for 0.7.2, i just had to adapt to changes for 0.8 [20:03:14] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:03:18] when is that going to be released? [20:03:22] whenver [20:03:23] haha [20:03:27] dunno! [20:03:28] soon? [20:03:29] never? [20:03:34] we're not going to use it until it is though [20:04:02] 0.8 is a huge step forward in functionality from 0.7.x. There are still some rough edges and likely some bugs, but we are feature complete and are beginning the process of rolling it out to various test clusters to see what happens. We wanted to make an early pre-release version available for the brave to try out even before all the documentation is up to date and production hardening is complete. Please let us know any problems you find. 
[20:04:09] is what https://cwiki.apache.org/KAFKA/kafka-08-quick-start.html says [20:04:12] not very encouraging [20:04:14] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:35] grrr: root 21153 0.0 0.1 54348 5276 pts/1 T Jun11 0:00 vim wmnet [20:05:02] this makes me want to track that account back to a bastion and kill all their sessions :-P [20:05:24] RECOVERY - NTP on mc1009 is OK: NTP OK: Offset 0.000244140625 secs [20:05:50] I'm updating parsoid [20:06:04] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.014 second response time [20:06:34] !log updated Parsoid to a33980cf [20:06:42] Logged the message, Master [20:06:44] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [20:08:02] New patchset: AzaToth; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [20:08:18] ^demon: there [20:08:43] I realize as long there's no version to speak about, we need to only look on the branches [20:10:10] New review: Ottomata; "My biggest waffle is weather to go with "log_dir" or "data_dir" for server.properties "log.dir". " [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:10:35] * drdee mmmmmm waffles, particularly stroop waffles [20:11:09] oh man I ate so many of those [20:11:31] drdee, is it true that the dutch like to put stroop waffles on top of their hot coffee or tea mug to warm up the gooey insides? [20:12:06] drdee: http://www.flickr.com/photos/72939801@N00/461001594/ [20:12:19] that aint no stroopwaffle! [20:12:20] ^demon: it works now? [20:12:42] nope it aint :) [20:12:51] drdee, is it true that it is a dutch tradition to use the first stroop waffle of every batch as a frisbee? [20:12:58] ottomata: yes we like that just as we like putting eggs in our beer [20:12:59] thought so. [20:13:58] is it true Dutch hasa sprinkles on their sandwiches? 
[20:14:03] New review: Faidon; "A few minor comments." [operations/puppet/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/50385 [20:14:34] AzaToth: yes that's true as well [20:14:47] * AzaToth pukes [20:15:03] yeah i know man [20:15:06] that is weiirrrd [20:15:07] !log reedy synchronized php-1.22wmf5/cache/ [20:15:08] !log kaldari synchronized php-1.22wmf6/extensions/Thanks/modules/ext.thanks.thank.js 'fix IE Thanks bug' [20:15:14] Logged the message, Master [20:15:22] Logged the message, Master [20:15:43] ^demon: ? [20:15:52] <^demon> AzaToth: Waiting for VM to finish building. [20:15:54] <^demon> Silly puppet [20:15:56] ok [20:16:10] I wanted to know if the new patch worked [20:16:25] i.e. did it not throw the fatal error? [20:17:21] !log olivneh synchronized php-1.22wmf7/extensions/UploadWizard/includes/specials/SpecialUploadWizard.php 'Ia09ace5dc: Fix UploadWizard's config variables' [20:17:29] Logged the message, Master [20:18:01] can someone run this on tin: ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R mw1020 [20:18:02] !log olivneh synchronized php-1.22wmf7/extensions/UploadWizard/resources/mw.LanguageUpWiz.js 'Ia09ace5dc: Fix UploadWizard's config variables (2/2)' [20:18:10] Logged the message, Master [20:18:18] i presume the host identification is not a security issue, since it's been logged here multiple times and no one seemed alarmed [20:18:29] * marktraceur looks at commons [20:19:09] Lol, they hacked something in. [20:27:36] marktraceur: doesn't Commons ever not do so? [20:28:03] Nemo_bis: There was a notification on my login page that said "UploadWizard is broken in Opera" [20:28:08] This made me sad [20:36:41] <^demon> AzaToth: Better. http://p.defau.lt/?lRD5QPjeFwjITMFTTulBZA [20:39:45] deploying EventLogging to wmf7 [20:40:12] !log spage synchronized php-1.22wmf7/extensions/EventLogging 'latest EventLogging' [20:40:19] Logged the message, Master [20:40:33] mw1020: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! [20:40:39] !log authdns update [20:40:47] Logged the message, Master [20:41:08] should I worry? [20:42:31] I hope not. [20:42:41] LeslieCarr: ^ (just because you're on RT) [20:42:53] hrm, lemme check mw1020 [20:43:09] spagewmf: is that via bast1001, fenari, etc ? [20:43:30] ssh -A tin, then I ran sync-dir [20:43:36] <^demon> AzaToth: Take that back, works fine when I got off labs /home and onto /. [20:44:11] thanks [20:46:33] ah [20:46:43] it was reinstalled but never repuppetized [20:46:45] that is why [20:46:58] thanks for pointing it out greg-g , i'll fix this now [20:47:03] !log repuppetizing mw1020 after a reinstall [20:47:10] Logged the message, Mistress of the network gear. [20:47:13] sweet, good deal. [20:47:15] <^demon> AzaToth: I think libargs4j-java still needs to be a dependency. [20:47:23] <^demon> `buck --help` complained about missing class. [20:47:58] !log spage synchronized php-1.22wmf7/extensions/GuidedTour 'latest GuidedTour' [20:48:05] Logged the message, Master [20:48:54] That sync-dir as well as mw1020 ssh fail, I had two [20:48:55] mw1171: ssh: connect to host mw1171 port 22: Connection timed out [20:48:55] mw1173: ssh: connect to host mw1173 port 22: Connection timed out [20:49:25] well, why'd you kill those? :p [20:49:29] lemme check those hosts [20:49:46] 1171 has broken hardware, known [20:50:17] thanks LeslieCarr. FWIW my Bible https://wikitech.wikimedia.org/wiki/How_to_deploy_code says "you can consider this completely normal" [20:50:59] !log mw1173 has bad memory, rt 5294 [20:51:06] Logged the message, Mistress of the network gear. [20:51:14] spagewmf: it is, though it never hurts to double check/open tickets with the bad machines [20:51:20] well i guess it can hurt, but within reason ;) [20:56:02] <^demon> AzaToth: Also, I wonder if upstream would take some of the debian/* stuff. 
Less for us to maintain :) [20:59:19] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.002 second response time [21:01:18] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 3275 bytes in 0.011 second response time [21:02:18] RECOVERY - Frontend Squid HTTP on cp1004 is OK: HTTP OK: HTTP/1.0 200 OK - 1293 bytes in 0.005 second response time [21:06:19] RECOVERY - RAID on mw1020 is OK: OK: no RAID installed [21:06:20] RECOVERY - Disk space on mw1020 is OK: DISK OK [21:06:31] RECOVERY - DPKG on mw1020 is OK: All packages OK [21:08:19] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection refused [21:12:58] !log spage synchronized php-1.22wmf6/extensions/EventLogging 'latest EventLogging' [21:13:09] Logged the message, Master [21:14:21] !log spage synchronized php-1.22wmf6/extensions/GuidedTour 'latest GuidedTour' [21:14:29] New review: Demon; "Seems to be mostly good. Two things:" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999 [21:14:29] Logged the message, Master [21:15:20] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [21:15:32] cmjohnson1: i've opened a few hardware tickets for eqiad the last few days [21:15:33] sorry [21:15:38] i keep breaking the servers [21:16:20] yeah i see that...i am on most of them...I already knew about mw1173...haven't created a ticket y [21:16:23] yet [21:16:43] but thx for keeping me busy ;-] [21:16:43] hehe [21:16:49] mwhahaha [21:16:49] RECOVERY - twemproxy process on mw1020 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:17:12] i'd say it's all part of my secret plan… but i can't think of how keeping you busy fits in [21:17:49] yeah..doesn't help you take over the world or anything [21:18:47] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/software] (master) -
https://gerrit.wikimedia.org/r/68561 [21:19:48] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/software] (master) - https://gerrit.wikimedia.org/r/68561 [21:20:14] Change abandoned: Hashar; "(no reason)" [operations/software] (master) - https://gerrit.wikimedia.org/r/68561 [21:21:49] there is a problem with jenkins [21:21:55] it tests things too fast [21:23:01] RECOVERY - NTP on mw1020 is OK: NTP OK: Offset -0.006588339806 secs [21:26:14] lesliecarr: have time to get the new 4500 (asw-c8-eqiad) updated and part of network fabric [21:26:33] sounds like an excellent idea for this afternoon :) [21:26:49] are you ready ? [21:26:51] yes [21:27:00] to see some fun, you can login to asw-c-eqiad.mgmt and do "monitor start messages" [21:27:13] and lots of show virtual-chassis status [21:27:21] hashar: how so tests things too fast? [21:27:32] let's attach one vcp cable and then plug in asw-c8 [21:27:50] already done [21:27:55] we talked about that the other day [21:28:01] ah yes [21:28:07] is it powered on ? [21:28:09] yes [21:28:11] oh it is :) [21:28:31] chrismcmahon: yeah we have some linters that are less than a second to run :) [21:32:21] oh shit :-/ [21:32:24] the 4550's require 12.x [21:33:22] it's a 4500 not 4550 [21:33:55] oh [21:33:56] whew [21:34:08] well at least 12.2 will support nonstop upgrades [21:35:38] cmjohnson1: doh so i have to upgrade the firmware …. 
can you attach a cable between its management ethernet port and a management switch [21:35:47] i'm going to give it a temporary ip address so i can upload the software [21:35:55] already there [21:36:10] and it is connected to console [21:36:26] oh cool [21:36:27] :) [21:36:36] i'm on its console, didn't realize you had the mgmt ethernet connection [21:36:40] yay you're prepared :) [21:37:28] it seems that Parsoid is getting errors from the API cluster [21:37:32] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.124 via cp1004.eqiad.wmnet (squid/2 [21:37:33] .7.STABLE9) to ()
[21:37:35] Error: ERR_SOCKET_FAILURE, errno (98) Address already in use at Thu, 13 Jun 2013 21:35:34 GMT [21:39:45] New patchset: Spage; "Add Campaigns extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [21:40:24] there are quite a few of those errors in the log [21:41:15] New review: Spage; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [21:41:44] most requests seem to be fine, could be a single broken backend [21:42:40] cmjohnson1: sigh, office wifi is slow, it's taking a while to upload ... [21:42:41] :( [21:42:58] boo [21:43:16] could somebody check the API cluster? [21:43:20] next time I can do it here if you want [21:43:45] or you need to be root? so maybe not [21:44:11] oh well you can upload it into /var/tmp [21:44:15] that is an excellent idea :) [21:46:23] who would be the right person to ping re API cluster issues? [21:48:04] all errors seem to be through cp1004.eqiad.wmnet [21:50:02] gwicke, best send a quick note to ops@ as well [21:50:16] notpeter, can you help? [21:50:30] yeah, will do [21:53:14] gwicke / Eloquence - the correct thing to do is first open a ticket with all relevant information. 
Then since it is urgent, ping either the person on RT duty in irc (and make sure they ack) and/or email the ops list referring to your ticket [21:54:41] !log LocalisationUpdate failed: git pull of extensions failed [21:54:50] Logged the message, Master [21:57:13] LeslieCarr: https://rt.wikimedia.org/Ticket/Display.html?id=5295&results=4d20b1233431a0dcf559b9c3eaa901db [21:57:22] thank you gwicke for the ticket [21:57:31] hmm, that link does not seem to work [21:57:41] https://rt.wikimedia.org/Ticket/Display.html?id=5295 [21:59:03] LeslieCarr: subbu just mentioned that he got the same error when testing with Parsoid locally [22:00:51] i'm restarting squid on that machine as it seems to have a lot of "Jun 13 21:40:34 cp1004 squid[1961]: commBind: Cannot bind socket FD 23 to *:0: (98) Address already in use" [22:01:51] gwicke: how does it look now ? [22:01:53] ottomata, paravoid: kafka 0.8 beta has been released http://people.apache.org/~joestein/kafka-0.8.0-beta1-candidate1/RELEASE_NOTES.html (via Snaps) [22:01:57] also, i can't help the local thing [22:02:46] LeslieCarr: that was just as an indicator that it is probably not an IP block of some production IP [22:03:01] LeslieCarr: so far it looks good, no new errors [22:03:03] !log LocalisationUpdate failed: git pull of extensions failed [22:03:11] Logged the message, Master [22:04:26] LeslieCarr: I think that fixed it, thanks! [22:05:12] :> [22:05:52] LeslieCarr: you might be able to answer a wee question I have :) [22:07:35] Are these squids (https://noc.wikimedia.org/conf/highlight.php?file=squid-labs.php), dedicated for labs traffic? :) [22:08:00] !log restarted squid on cp1004 due to "Jun 13 21:40:34 cp1004 squid[1961]: commBind: Cannot bind socket FD 23 to *:0: (98) Address already in use" [22:08:11] Logged the message, Mistress of the network gear. 
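The squid restart above cleared repeated `commBind: Cannot bind socket FD 23 to *:0: (98) Address already in use` errors. Errno 98 is `EADDRINUSE`, and when it shows up on *outbound* connections (binding to `*:0`, i.e. "any ephemeral port"), a common cause is ephemeral source-port exhaustion. A hedged, read-only sketch of two Linux-side checks (not taken from the log; just one plausible way to investigate):

```shell
# Hedged sketch (Linux-only): errno 98 is EADDRINUSE. When a proxy logs
# "commBind: Cannot bind socket ... *:0: (98) Address already in use" on
# outbound requests, a common suspect is ephemeral source-port exhaustion.

# How many ephemeral ports the kernel will hand out:
read lo hi < /proc/sys/net/ipv4/ip_local_port_range
echo "ephemeral port range: $lo-$hi"

# Sockets lingering in TIME_WAIT hold ports from that range:
if command -v ss >/dev/null 2>&1; then
    ss -tan state time-wait | tail -n +2 | wc -l
else
    echo "ss not available"
fi
```

A restart helps because it drops the leaked/lingering sockets, but if the root cause is port exhaustion the errors tend to come back under load.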
[22:08:51] New review: Ori.livneh; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [22:09:00] addshore: those are labs machine ip's [22:09:10] addshore: so, not sure what someone is doing with those labs machines [22:09:18] !log LocalisationUpdate failed: git pull of extensions failed [22:09:27] Logged the message, Master [22:10:31] back. is there still an issue? [22:11:07] notpeter: https://rt.wikimedia.org/Ticket/Display.html?id=5295 [22:11:09] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [22:11:16] I believe it is fixed [22:11:38] ok, cool [22:14:23] New review: Ori.livneh; "> Have you tested it already by hacking the local python file someplace?" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [22:14:42] ParsoidCacheUpdateJob: 2 queued; 207 claimed (20 active, 187 abandoned) [22:14:50] https://graphite.wikimedia.org/render/?width=1486&height=641&_salt=1371161654.988&from=-24hours&target=stats.job-insert-ParsoidCacheUpdateJob.count&target=stats.job-pop-ParsoidCacheUpdateJob.count [22:14:52] gwicke: heh [22:15:08] AaronSchulz: that was a nice sprint ;) [22:16:04] New review: Spage; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [22:16:16] New patchset: Spage; "Add Campaigns extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [22:17:03] !log Fixing perms on php-1.22wmf7, not group-writable [22:17:05] Reedy: ---^^ [22:17:10] Logged the message, Mr. Obvious [22:18:58] RoanKattouw: Add a chmod command to multiversion/checkoutMediaWiki after it's cloned mediawiki [22:18:58] New review: Jforrester; "Now mergeable; we should switch this on to get some test data ahead of real data read next week." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/68430 [22:19:06] Reedy: Will do [22:20:01] To which extent... 
https://gerrit.wikimedia.org/r/#/c/67274/ needs merging so ExtensionMessages are group writable by default [22:28:15] New patchset: Reedy; "Make mediawiki checkout folder 775" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68579 [22:28:21] RoanKattouw: ^ [22:28:58] Nice [22:29:08] I was about to submit something myself [22:29:14] Looks good but has trailing tabs [22:29:21] hah [22:30:09] New patchset: Reedy; "Make mediawiki checkout folder 775" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68579 [22:30:26] !log catrope Started syncing Wikimedia installation... : VisualEditor update to master, and adding Campaigns in wmf7 [22:30:35] Logged the message, Master [22:31:09] PROBLEM - Puppet freshness on srv293 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:10] New patchset: Reedy; "Don't create lib directory, we don't use it" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68580 [22:31:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68580 [22:31:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68579 [22:32:09] PROBLEM - Puppet freshness on srv294 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:09] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 22:33:09 UTC 2013 [22:33:17] Logged the message, Master [22:33:52] That's a lie [22:35:09] PROBLEM - Puppet freshness on db1048 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:06] Reedy: no it isn't [22:36:09] PROBLEM - Puppet freshness on aluminium is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amssq37 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amssq31 is CRITICAL: No 
successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on amssq40 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq32 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq33 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq36 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:18] PROBLEM - Puppet freshness on amssq41 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:14] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 22:38:14 UTC 2013 [22:38:18] PROBLEM - Puppet freshness on amssq34 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq39 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq42 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:18] PROBLEM - Puppet freshness on amssq54 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:22] Logged the message, Master [22:39:25] !log catrope Finished syncing Wikimedia installation... : VisualEditor update to master, and adding Campaigns in wmf7 [22:39:33] Logged the message, Master [22:39:58] ^demon: buck should depend on ini4j afaik [22:40:19] <^demon> Maybe I missed a step? 
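The permissions fix merged above ("Make mediawiki checkout folder 775") boils down to a single `chmod` after the clone: without group write, other deployers hit `Permission denied` when syncing. A minimal sketch against a throwaway directory (the directory name mirrors the log but is purely illustrative here):

```shell
# Sketch of the fix: make a fresh checkout directory group-writable (775)
# so that any member of the deployment group can update or sync it.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/php-1.22wmf7"
chmod 775 "$tmpdir/php-1.22wmf7"      # rwxrwxr-x
stat -c '%a' "$tmpdir/php-1.22wmf7"   # prints 775
```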
[22:40:23] unless some build fnucked up [22:41:09] http://paste.debian.net/10255/ [22:41:26] <^demon> Hmm [22:41:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68430 [22:42:42] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 13 22:42:42 UTC 2013 [22:42:52] Logged the message, Master [22:44:03] ^demon: here is my build log: http://paste.debian.net/10256/ [22:46:00] ^demon: what version of libini4j-java do you have? [22:46:23] as I'm using debian here, there could be mismatches [22:47:32] !log catrope synchronized wmf-config/InitialiseSettings.php 'Enable EventLogging for VE on enwiki' [22:47:41] Logged the message, Master [22:47:46] <^demon> AzaToth: 0.5.2-SNAPSHOT, how useful. [22:47:48] mw1020: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ [22:47:59] same as I have [22:48:59] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 13 22:49:02 UTC 2013 [22:49:10] Logged the message, Master [22:49:28] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:51:05] RoanKattouw: I filed an RT ticket about mw1020 [22:51:22] !log catrope synchronized wmf-config/CommonSettings.php 'Enable EventLogging for VE on enwiki' [22:51:25] ori-l: Thanks, I was about to do that [22:51:30] Logged the message, Master [22:51:37] ori-l: BTW ---^^ should start receiving EE events for VE now [22:51:53] RoanKattouw: cool, i'll look [22:52:04] ori-l: Let me trigger one for you :) [22:53:34] ori-l: Opened [[North America]] in VE [22:54:38] RoanKattouw: I got a few events already, but not from [[North America]] [22:54:47] Hmm [22:54:52] What do I have to do to trigger events?
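The recurring `REMOTE HOST IDENTIFICATION HAS CHANGED` warning for mw1020 is what the earlier request (`ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R mw1020`) addresses: after a reinstall the host presents a new key, and the stale entry in the shared known_hosts file has to be purged. A self-contained sketch against a throwaway known_hosts file (hostnames and paths are illustrative, not the production ones):

```shell
# Reproduce the stale-host-key cleanup with a throwaway known_hosts file.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$tmpdir/hostkey"
printf 'mw1020 %s\n' "$(cut -d' ' -f1-2 "$tmpdir/hostkey.pub")" >  "$tmpdir/known_hosts"
printf 'mw1021 %s\n' "$(cut -d' ' -f1-2 "$tmpdir/hostkey.pub")" >> "$tmpdir/known_hosts"
# Drop every key recorded for mw1020; entries for other hosts survive
# (a .old backup of the file is written alongside it):
ssh-keygen -f "$tmpdir/known_hosts" -R mw1020
grep '^mw1021 ' "$tmpdir/known_hosts"
```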
[22:55:14] Oh, hah, I see [22:55:20] ResourceLoaderConfigVars [22:55:23] * RoanKattouw goes and touches startup.js [22:56:17] !log catrope synchronized php-1.22wmf6/resources/startup.js 'touch' [22:56:25] Logged the message, Master [22:57:33] anomie: What are you doing in your LD? Will you be running scap? [22:57:45] I ask because we just made a last-minute i18n change and need to scap again [22:58:18] RoanKattouw: I'm going to be updating LocalisationUpdate in wmf7 and then running l10nupdate against wmf7 to make sure it's fixed. [22:58:29] OK [22:58:38] (the bug was introduced in wmf7, so wmf6 is ok) [22:58:40] That sounds like it might do what I need [22:58:43] Oh, wait, wmf7 [22:58:50] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [22:58:54] I need an i18n change in .... well both really but primarily wmf6 [22:58:55] I can run l10nupdate against the whole thing [22:58:59] OK [22:59:06] I'll just put my change in and let you take it from there then [22:59:08] New review: Andrew Bogott; "Work in progress, do not merge!" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/68584 [23:02:25] anomie: OK I'm all done, changes have been pulled down on tin. Take it away [23:04:41] New review: Asher; "This change is missing an include of mediawiki::updatequerypages on a server in tampa. " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713 [23:05:04] cmjohnson1: success! [23:05:11] link up the second cable ? 
[23:05:13] woohoo [23:05:14] k [23:05:27] RoanKattouw_away: http://imgur.com/ESewgIQ [23:06:29] lesliecarr done [23:06:56] !log anomie synchronized php-1.22wmf7/extensions/LocalisationUpdate/LocalisationUpdate.php 'Update LocalisationUpdate to master in wmf7' [23:07:05] Logged the message, Master [23:07:27] !log anomie synchronized php-1.22wmf7/extensions/LocalisationUpdate/LocalisationUpdate.class.php 'Update LocalisationUpdate to master in wmf7' [23:07:35] Logged the message, Master [23:09:58] RoanKattouw_away: Does anything need to be synced out for your change, or just the l10nupdate run? [23:10:08] anomie: Probably just LU [23:10:11] Will check back after [23:10:16] Let me know when you're done [23:12:00] ok. Waiting on marktraceur to merge something quick before I run the l10nupdate. [23:13:17] !log mholmquist synchronized php-1.22wmf6/extensions/UploadWizard/includes/specials/SpecialUploadWizard.php 'Opera UploadWizard fix' [23:13:24] Logged the message, Master [23:13:59] One more [23:14:17] !log mholmquist synchronized php-1.22wmf6/extensions/UploadWizard/resources/mw.LanguageUpWiz.js 'Opera UploadWizard fix' [23:14:26] Logged the message, Master [23:14:59] * anomie starts l10nupdate [23:15:08] marktraceur: thanks [23:15:42] Oh, actually, it looks like it's probably working now [23:16:01] Sweet. 
[23:16:08] * marktraceur declares bug fixed [23:20:43] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 13 23:20:43 UTC 2013 [23:20:51] Logged the message, Master [23:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:24:21] New patchset: Ori.livneh; "Add Campaigns & CoreEvents extensions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [23:26:08] error: unable to unlink old '.gitmodules' (Permission denied) [23:26:29] Reedy: ^ [23:26:32] ori-l: Which dir? [23:26:35] I fixed that in wmf7 [23:26:39] (I hope) [23:26:41] RoanKattouw: /common/php-1.22wmf6 [23:26:45] ugh [23:27:02] Fixed [23:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:23] RoanKattouw: much obliged [23:28:23] New patchset: Ori.livneh; "Add Campaigns & CoreEvents extensions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [23:28:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [23:28:36] ori-l: Is anomie already done deploying? [23:28:48] (Sorry for my ignorance, I was on a video call) [23:28:57] RoanKattouw: he's done with wmf6 [23:29:11] OK [23:29:40] Meh looks like what he did didn't update the message [23:29:41] RoanKattouw: l10nupdate completed the first part (update wmf6), working on the second now (wmf7), then there's the "refresh ResourceLoader message cache" part. [23:29:48] Right [23:29:54] ori-l: Are you gonna scap for that new extension? [23:29:57] !log olivneh synchronized php-1.22wmf6/extensions/CoreEvents 'Adding CoreEvents extension to wmf6' [23:30:05] ... 
guess not [23:30:05] Logged the message, Master [23:30:06] RoanKattouw: wasn't going to [23:30:06] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 13 23:30:06 UTC 2013 [23:30:15] Logged the message, Master [23:30:16] I'll sync i18n after you're done then [23:30:37] !log olivneh synchronized php-1.22wmf6/extensions/Campaigns 'Adding Campaigns extension to wmf6' [23:30:45] Logged the message, Master [23:31:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [23:35:34] RoanKattouw: k; I'm waiting for the i18n update to finish on wmf7 to sync there [23:35:36] OK [23:35:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 13 23:35:36 UTC 2013 [23:35:44] Logged the message, Master [23:36:21] New patchset: Nemo bis; "Add Not Confusing (Max Klein)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68594 [23:36:24] RoanKattouw, ori-l : l10nupdate is done now. Also, yay it's fixed! [23:36:49] !log olivneh synchronized php-1.22wmf7/extensions/CoreEvents 'Adding CoreEvents extension to wmf7' [23:36:50] anomie|awayish: :) [23:36:57] Logged the message, Master [23:37:37] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68317 [23:38:37] LeslieCarr: can I ask you a simple feed addition to the planet? 
https://gerrit.wikimedia.org/r/68594 [23:39:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68594 [23:39:58] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Enabling Campaigns & CoreEvents (1/4)' [23:40:06] Logged the message, Master [23:40:26] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling Campaigns & CoreEvents (1/4)' [23:40:34] Logged the message, Master [23:40:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:40:45] thanks! [23:40:54] !log olivneh synchronized wmf-config/extension-list 'Enabling Campaigns & CoreEvents (3/4)' [23:41:03] Logged the message, Master [23:41:23] !log olivneh synchronized wmf-config/CommonSettings.php 'Enabling Campaigns & CoreEvents (4/4)' [23:41:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [23:41:32] Logged the message, Master [23:42:19] RoanKattouw: done [23:45:56] So much lightning. [23:47:10] * Nemo_bis sprinkles some gasoline on Elsie's hair [23:52:12] RoanKattouw: .....I have one more config change, if there's time. [23:53:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [23:56:09] New patchset: Ori.livneh; "wmgUseCampaigns: default => true" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68600 [23:56:45] * ori-l goes for it [23:56:52] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68600 [23:58:40] !log olivneh synchronized wmf-config/InitialiseSettings.php 'I3728a4cd9' [23:58:48] Logged the message, Master