[00:21:59] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:44] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [00:22:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:05] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [00:34:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.776 seconds [01:05:02] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [01:08:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [01:41:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 244 seconds [01:42:30] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 282 seconds [01:48:40] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 650s [01:51:21] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [01:54:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:58:42] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 21 seconds [01:59:09] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 49s [02:06:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.684 seconds [02:09:21] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [02:16:51] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [02:30:30] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.576 second response time [02:38:27] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [03:04:24] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [03:04:53] * jeremyb stabs mw8 [03:31:06] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Mon Aug 20 03:31:04 UTC 2012 [03:31:39] did someone manually fix lvs5? 
[03:32:02] was not puppeting for over 24hrs [03:55:57] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [03:55:58] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [03:55:58] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [03:55:58] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [03:55:59] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [03:55:59] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [03:56:00] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [03:56:00] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:27:26] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [05:29:41] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:35:59] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [05:47:59] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [06:01:29] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:02:50] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [06:06:26] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [06:10:29] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:12:44] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [06:30:24] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [06:37:26] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [06:40:26] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [06:42:23] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:47:02] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [06:49:44] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:53:21] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [07:16:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [07:19:30] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:21:54] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [07:25:57] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:48:27] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [07:55:21] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:55:57] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [07:58:57] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [07:58:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [07:58:57] PROBLEM - 
Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [07:58:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [08:01:03] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:57] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:04:57] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [08:09:27] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [08:57:42] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [09:00:51] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [09:01:54] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [09:03:53] good morning [09:05:12] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [09:05:30] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:05:48] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:05:58] morning [09:06:09] sigh, or not [09:07:00] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 65507 bytes in 0.031 seconds [09:07:09] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52639 bytes in 0.029 seconds [09:07:59] I got the page for ipv6 but not ipv4 [09:08:03] good morning paravoid ;) [09:09:18] there are network spikes in bits caches, imagescalers, LVS [09:09:25] and we just lost LVS graphs for some reason [09:10:44] oh joy [09:12:19] and swift pmtpa [09:17:21] I think we can call ms-be6 dead already [09:17:43] I thought it was being worked on [09:17:57] worked on how? [09:18:41] 21:37 cmjohnson1: shutting ms-be6 down for hardware testing/replacing [09:18:44] that's from august 16 [09:19:04] it's the most recent log entry [09:19:45] there's also an open ticket for it [09:20:17] someone has the SOL open, is it you? [09:20:26] no [09:20:52] I"ll guess one of ben or chris [09:21:53] sigh, I'll leave it to them then [09:21:59] the scallars are definitely doing more work now, but I don't think it's a big deal [09:22:00] it's obviously nothing new [09:22:03] right [09:22:20] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:47] since bits seems to have settled down and we're not getting any further flapping [09:22:49] bits caches traffic out is half of what it was [09:22:56] esams that is [09:24:02] that's because ganglia can't reach cp3002 [09:24:22] ok, fixed [09:24:26] yep [09:24:54] !log restarted gmond in cp3002 [09:25:00] what made it die, I wonder [09:25:06] Logged the message, Master [09:25:47] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [09:26:31] 11T copied, 8 T left... bleah [09:28:11] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:28:11] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:31:38] oh fun [09:31:48] I know why LVS graphs are so fucked up [09:31:55] FUN! [09:32:02] # date [09:32:02] Mon Aug 20 09:26:31 UTC 2012 [09:32:09] they're 6 minutes off [09:32:38] how do they just happen to have a 6 minute drift? that's a lot [09:33:11] ntp's not running [09:33:27] in none of them [09:33:56] on any lvs? [09:34:08] * apergos wonders if that's intentional [09:35:36] ... [09:37:15] base, ganglia. 
no ntp::client [09:42:49] New patchset: Faidon; "Switch LVS servers to include standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20681 [09:43:06] I'm going to let mark review that, esp. the initcwnd part seems scary [09:43:25] yes, I would say he should check it first [09:43:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20681 [09:43:57] you could just include ntp::client separately. but either way he should give it the ok [09:46:06] I wonder what will happen if I just run ntpdate on the LVS servers :) [09:48:08] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [09:48:24] I'm fed up with mw8, I'm going to just shut it down [09:49:26] !log powering off mw8, faulty (#3425), has been flapping a lot [09:49:36] Logged the message, Master [09:50:41] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:13] so, the remaining puzzle is why imagescalers/swift have this increased traffic for almost the past hour [09:52:46] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [09:52:51] yeah, that's noteworthy all right [10:08:54] ok, it's going down [10:08:57] (by itself) [10:12:41] couldn't make out anything quick enough by looking at th elogs [10:14:10] and now it's settling as you say [13:19:22] mark: here? [13:29:58] paravoid: yes [13:30:13] heya :) [13:30:16] hi [13:30:33] so, lvs servers have their clocks off by 6-7 minutes [13:30:38] yes [13:30:39] not ntp::client is running on them [13:30:42] no [13:30:46] if you run ntpd on them, their performance more than halves [13:30:54] oh wow, really? [13:30:56] yes [13:31:00] well, it used to [13:31:04] perhaps not anymore on lucid or something [13:31:05] er [13:31:07] precise [13:31:12] haven't tested in a while [13:31:40] okay, I've submitted https://gerrit.wikimedia.org/r/20681 which includes ntp::client and more [13:31:52] that's why they don't include standard [13:31:53] manually run ntpdate in cron ? [13:32:04] yeah that would probably work [13:32:09] i think there's an rt ticket for that [13:32:14] is ntpdate safe? [13:32:21] yes [13:32:33] well [13:32:36] big jumps are never safe [13:32:37] but i mean [13:32:41] ntpdate in cron is safe [13:33:06] I meant ntpdate now, for a 6-7min jump :) [13:33:27] probably not safe [13:33:39] yeah, figured as much and didn't do it [13:33:40] i'm betting pybal will lockup or similar [13:34:11] use ntpdate with -B flag, which forces adjtime() sleewing [13:34:25] for 6-6mins skewing will take days [13:34:33] that's ok [13:34:53] but it may well be that adjtime() is what's halving the perf [13:35:04] although I think our lvs servers have a lot more headroom nowadays then they did back then [13:35:06] that's what I was going to say [13:35:15] so it's probably not really a problem [13:35:20] ntpdate -B is equivalent to ntpd, so no point [13:35:22] i just haven't tested it for lucid or precise iirc [13:35:27] either we should run ntpd or we shouldn't [13:35:36] okay, I could try it on one of them and see [13:35:52] you either hit the pps limit or you don't [13:35:53] what was the performance problem exactly? cpu load? [13:35:54] and probably you won't [13:36:04] no, just would start dropping packets earlier [13:36:20] but if you don't hit that threshold, not much of a problem [13:37:05] uh, okay, what do you suggest then? 
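A rough sketch of the two options being weighed above (cron'd ntpdate vs. slewing the offset with adjtime via ntpdate -B), with the slew-time arithmetic spelled out. The NTP server name is a placeholder, not the real WMF peer, and the cron schedule is an assumption:

```
# Check the current offset without touching the clock (-q = query only):
ntpdate -q ntp.example.org

# Option 1 (mark's suggestion): step the clock periodically from cron instead of
# running ntpd on the LVS hosts. Hypothetical /etc/cron.d/ntpdate-lvs entry:
#   17 * * * *  root  /usr/sbin/ntpdate -u ntp.example.org >/dev/null 2>&1

# Option 2: slew the existing offset with adjtime (ntpdate -B). At the usual
# ~0.5 ms/s maximum slew rate, a ~6.5-minute offset takes on the order of nine days:
echo '390 / 0.0005 / 86400' | bc -l    # offset [s] / slew rate [s/s] / seconds per day
```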
[13:37:32] i suggest, stop caring ;) [13:37:36] works well [13:37:38] ganglia graphs are all borked [13:38:06] that's why I started looking at it [13:38:11] you can do performance testing with lvs to see if the problem's still there [13:38:15] but that's quite a bit of work [13:38:27] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=LVS+loadbalancers+pmtpa&m=load_one&s=by+name&mc=2&g=load_report [13:38:30] i used a packet generator at the time [13:38:30] see that space on the right? [13:38:36] that's the time lag [13:38:38] hehe [13:39:01] i believe there were very many adjtime calls back then [13:39:11] perhaps ntp can be run in a way where it only does them once a minute or so [13:40:42] why don't we do a dist-upgrade, reboot & ntp sync on one and see if it's still an issue? [13:40:53] there's nothing to dist upgrade [13:40:55] they're precise already [13:40:58] Jeff_Green: they're precise already [13:41:02] heh [13:41:11] lvs2 has a stale kernel [13:41:23] does a regular 'upgrade' do kernel? [13:41:31] why would that fix it [13:41:36] this was like over 4 years ago [13:41:45] this minor security update is not gonna fix that issue [13:41:57] btw, another issue: because LVS include base and not standard [13:42:01] my bet is that it's been long fixed, but we should be doing security updates routinely [13:42:05] they don't get generic::tcpweaks aka initcwnd [13:42:10] should they? [13:42:16] not necessarily [13:42:29] generic::tcptweaks should be in base [13:42:33] there's no reason why it wouldn't be [13:42:34] but [13:42:41] there's no reason why it would help for lvs either [13:43:08] the lvs servers don't get in the middle of the tcp handshake, right? [13:43:16] indeed [13:43:22] right, so no effect at all [13:43:50] i need to move syslog on nfs1 to somewhere else before wednesday [13:43:58] what's wednesday? [13:44:04] moving /home to the netapp [13:44:06] on [13:44:09] oh rly? nice :) [13:44:47] or I can keep it on nfs for now, but on a separate partition then [13:44:55] but noone can login on nfs except roots [13:45:34] isn't that a good thing? :) [13:45:52] devs can't check for apache segfaults then [13:46:15] ah. I was thinking MW and fluorine [13:46:24] that's udp2lo [13:46:25] g [13:46:28] yeah yeah [13:46:39] I just didn't think of apache segfaults [13:46:49] thankfully it's not me who does this transition then [13:46:50] i'm not sure how important it is [13:46:59] how so? [13:47:09] I would have forgot about it [13:47:16] New patchset: Platonides; "Avoid having to wait 4 days when testing the WLM app." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20702 [13:47:21] one look at 'top' on nfs1 would have showed it ;) [13:47:44] the apache segfault/dev access thing I mean :) [13:48:21] speaking of apaches, we had big imagescaler/swift spike of traffic for about an hour [13:48:30] like 5 times the normal traffic [13:48:59] ok [13:49:11] and two alerts/pages with no definite cause yet [13:49:31] your input is very welcome [13:50:17] i saw an ipv6 one, hours after [13:51:19] 12:03 < paravoid> good morning [13:51:20] 12:05 <+nagios-wm> PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:51:23] 12:05 <+nagios-wm> PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:27] 12:05 <+nagios-wm> PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:29] clearly my fault [13:51:32] (for saying good morning) [13:56:21] yay! mw8's shut down! :) /me is catching up in scrollback ;) [13:56:46] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [13:56:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:56:47] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [13:56:48] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [13:56:48] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [13:56:49] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [13:56:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:56:55] woooo [14:04:55] fwiw, ntpdate -B ignores me [14:05:20] it prints "offset 258 sec" and does nothing [14:13:00] mark: I'm trying to separate legacy from reality, a little help if I don't bother you too much? [14:13:17] mark: /home/w/conf/squid/generated has a lot of yaseo files [14:13:29] and that's what the wiki page for Squids says too [14:13:35] am I looking at the completely wrong place? [14:17:40] yaseo is legacy [14:17:43] our old south korean cluster [14:17:55] I remember that [14:18:07] but is /home/w/conf/squid the canonical place for modifying squid configs? [14:18:12] yes [14:18:17] and is the wikitech squid page more or less accurate? [14:18:57] I saw yaseo references in both, hence by doubt [14:18:58] i believe so [14:19:09] okay, thanks [14:19:24] I'm trying to prepare for switching squids to swift [14:19:45] except the "current clusters" stuff it's pretty accurate [14:19:54] it hasn't changed much in the last 5 years or so [14:19:57] :) [14:20:14] so the generated/*yaseo* are cruft that I can safely rm [14:20:32] yes [14:21:04] thanks [14:39:05] apergos: around? 
[14:39:18] I'm looking at the squid->swift originals for tonight [14:39:24] I'm here [14:39:27] yes? [14:39:29] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - The [14:39:38] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - The [14:39:40] that's in 2 hours and a bit, right? [14:39:51] yeah [14:40:13] so, when I hit ms-fe.svc.pmtpa.wmnet originals I get 401 [14:40:19] while thumbs return 200 [14:40:31] has Ben told you anything about this part? [14:40:50] I thought it was just squid that we had to change [14:41:04] that's also what I thought [14:41:07] so no, I don't know [14:41:33] uh, okay [14:42:29] I have no extra info from ben about any of this [14:43:29] Could someone please run as root on fenari: chgrp wikidev /home/wikipedia/common/php-1.20wmf10/cache/l10n [14:43:30] Thanks! [14:44:14] just the dir? [14:44:27] Yeah, it's got no files in it currently [14:44:29] done [14:44:49] The permission are somewhat confusing, but I know Roan had to fix it again last deployment, so I guess wmf9s should be "right" [14:44:50] Thanks [14:45:31] yep [15:08:26] morning paravoid. [15:08:32] hi Ben! [15:08:35] I'll be in in about an hour, and IIRC our window starts in 2 [15:09:08] you're not seeing the failures when mediawiki asks for originals because it knows about the sharding and the requests from MW don't go through rewrite.py. [15:09:10] yes, you remember correctly [15:10:14] if you look at the proxy config .erb and the role/swift.pp (and maybe proxy-server.conf on a front end) you'll see what I mean about the container list to shard. [15:10:28] rewrite.py takes it literally, so since only the -thumb containers are listed it's not sharding the -public containers. [15:10:47] (The same config exists in MW but it's listing wikis instead of containers, so shards all containers for that wiki) [15:11:33] ok, I' gotta get on the road. any last bits before I head out? [15:13:06] btw, feel free to prep a puppet change, if you feel like you get it... ;) [15:31:38] apergos: I'm doing the squid changes, want to give ^^^ a shot? [15:32:09] I'm still trying to understand the templates and the config files yet [15:32:33] New patchset: Pyoungmeister; "swapping keys for myself (py)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20725 [15:32:56] grep for shart_container_list [15:33:01] that's what needs changing [15:33:12] shard* ? [15:33:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20725 [15:33:19] yes, thanks [15:33:29] role/swift.pp:119 [15:33:35] and role/swift.pp:172 [15:34:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20725 [15:49:49] !log Upgrading JUNOS on asw2-d3-sdtpa to 11.4R2.14 [15:49:58] Logged the message, Master [15:57:59] New patchset: SPQRobin; "(bug 34817) Enable WebFonts on Burmese Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20727 [15:58:16] maplebed: hi [15:59:39] hi paravoid! [16:02:56] paravoid: did you decide to stage the container shard listing change? [16:03:03] or shall I do that nw? 
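A rough illustration of the sharding maplebed describes above: MediaWiki addresses sharded containers directly, while rewrite.py only shards containers named in its shard list, which is why -public and -temp need to be listed alongside -thumb. The shard computation below is an assumption for illustration only (first two hex digits of md5(filename), matching the a/a9 prefix in upload URLs); the authoritative logic lives in rewrite.py and MediaWiki's Swift backend:

```
# Hypothetical sketch of how a sharded commons container name is derived.
name="Example.jpg"
shard=$(printf '%s' "$name" | md5sum | cut -c1-2)
echo "wikipedia-commons-local-public.${shard}"   # originals
echo "wikipedia-commons-local-thumb.${shard}"    # thumbnails
echo "wikipedia-commons-local-temp.${shard}"     # staging uploads (per the discussion above)
```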
[16:03:14] I was looking at squid, so I told apergos [16:04:05] 18:56 -!- apergos [~ariel@wiktionary/ArielGlenn] has quit [Read error: Operation timed out] [16:04:19] that was 8 minutes ago, so I guess we shouldn't wait [16:04:25] I don't see it in the open changes in gerrit, [16:04:29] so I'll say it didn't happen. [16:04:50] !log Upgrading JUNOS on asw2-a5-eqiad to 11.4R2.14 [16:04:59] Logged the message, Master [16:05:13] should we just append wikipedia-commons-local,wikipedia-de-local,... [16:05:20] is that what needs to happen? [16:05:29] no, the list must contain the full (unsharded) container names [16:05:32] or do we need to explictly list all the shards? [16:05:44] so not just -local but -local-public and -local-temp [16:06:08] oh right, -public [16:06:13] is -temp actually used? [16:06:24] yeah. [16:06:44] is -temp uploadwizard staging, etc.? [16:07:00] aaron would have a more reliable answer but I think so. [16:08:16] maplebed: are you doing it? [16:08:19] yes. [16:08:26] oh okay [16:08:33] in other news, I think I'm done with the squid changes [16:08:39] they're not deployed obviously [16:09:11] can you put a diff in /tmp/ on fenari for me to review? [16:09:20] New patchset: Bhartshorne; "adding public and temp containers to the shard list since mediawiki expects all three, not just thumbs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20729 [16:09:31] and I'd appreciate the same for ^^^ [16:09:35] (a review that is) [16:10:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20729 [16:12:20] I wonder if we should do ^^^ differently... [16:12:33] that huge list certainly does not look like DRY [16:12:47] potentially, but I think not at the moment. [16:13:16] well, actually, I suppose it wouldn't be too much work to do the -thumb, -public, and -temp in rewrite.py... [16:14:26] New review: Faidon; "Looks good, for now." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/20729 [16:14:33] maplebed: +2ed but not merged [16:14:40] k. [16:14:42] tnx. [16:14:45] maplebed: as for squid, diff -urp deployed/ generated/ [16:15:28] now also in /tmp/bensqdiff.diff [16:15:29] funny how we caught that 20' apart [16:20:33] it took me this long to figure out which shards needed to be added [16:20:41] of course you guys are long since done with that [16:20:55] apergos: https://gerrit.wikimedia.org/r/#/c/20729/1/manifests/role/swift.pp if you want to review. [16:21:02] I'm looking at it yeah [16:21:23] paravoid: did the squid config previously send everythingc to ms7 [16:21:30] or did it still go through a regex acl? [16:21:55] it send thumbs to swift, rest to ms7 [16:22:08] don't the deleted ones go in ther etoo? [16:22:16] btw, was the even scheduled for now? I thought it was for 40' from now [16:22:24] but I just got a reminder [16:22:37] ie wikipedia-commons-local-deleted etc [16:22:44] http://wikitech.wikimedia.org/view/Software_deployments is the authority. [16:23:14] apergos: they do, but deleted requets always come from mediawiki (never directly from a client) [16:23:24] so rewrite.py doesn't need to know how to shard them. [16:23:37] maplebed: that says 16:00 UTC< i.e. now. [16:23:47] do we imagine a future where we might write a client that needs it? (for testing or whatever) [16:23:57] maplebed: so, are you merging that? [16:24:03] paravoid: in a sec. 
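For reference, the squid-config review flow used above looks roughly like this; the regeneration step is not shown in the log and is left as a placeholder, and the deploy commands only appear later in the conversation:

```
cd /home/w/conf/squid
# ... edit the templates, regenerate into generated/ (tooling not shown in this log) ...
diff -urp deployed/ generated/ > /tmp/bensqdiff.diff   # stage a diff for review
./deploy sq51            # later: push to a single depooled squid for testing
# ./deploy cache pmtpa   # and, once reviewed, to the whole pmtpa cache cluster
```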
[16:24:19] apergos: I don't think so, since the ability to view deleted pictures requires being logged in and having the right privs. [16:24:55] so we would never run some cleanup job that polled swift directly [16:25:09] I guess if we do we can add that change then [16:25:14] apergos: if we did, it would be an authenticated job, therefore also skipping rewrite.py. [16:25:21] I see [16:25:43] paravoid: I'm a bit concerned about allowing *all* traffic that hits upload through to swift. [16:25:56] what do you mean? [16:26:15] I think I'd rather keep the regex acls so, for example, you can send regular swift api calls through it. [16:26:31] sorry. [16:26:36] *can't* send regular swift api calls. [16:27:17] we have no such regexp [16:27:48] we have one just for thumbs [16:28:04] I'm not sure if we should repeat the whole namespace in the squid conf [16:28:05] we do - it's the same as the one that rewrite.py uses to determine whether it should handle a request. [16:28:28] in squid I mean [16:29:00] I agree. Look at rewrite.p lines 249-252 [16:29:19] and the regular swift calls are on the same URLs, are they not? [16:29:34] so you could use DELETE with a header already... [16:29:45] with a token header [16:30:36] no, you couldn't. [16:30:47] the thumb acl wouldn't pass it through. [16:31:37] we're doing originals now though [16:31:52] and a url regexp is not enough to block api calls [16:32:37] all authenticated API calls start with the auth bits as defined in rewrite.py at those lines. [16:32:53] though I hate blacklists instead of whitelists, I believe that does catch them and could reject them at squid. [16:35:26] so, [16:36:12] acl swift_auth url_regex ^http://upload\.wikimedia\.org/(auth|AUTH).* [16:36:26] http_access deny swift_auth [16:36:30] is that what you suggest? [16:36:55] (my squid experience is very limited, don't assume I know what I'm doing) [16:37:11] yes for the lowercase. for the upper case, rewrite.py doesn't anchor it at the beginning, but I don't remember why. one sec while I check that. [16:37:26] (I think it's because python's startswith doesn't do character classes, but I just want to confirm) [16:37:53] http://wikitech.wikimedia.org/view/Swift/Hackathon_Installation_Notes#testing_the_object_store [16:38:02] it's not anchored; it's got v1/ in front of it. [16:38:07] for the uppercase stuff. [16:38:35] though rewrite's allows it to have the AUTH string anywhere in the URL [16:38:54] which is wrong? :) [16:39:11] well, it's more restrictive. [16:39:19] er? [16:39:30] if I name an image AUTH_[0-9a-f]... [16:39:39] then it would fail, yes. [16:40:00] so long as it has between 32 and36 hex chars after the AUTH_ [16:40:27] I think I did it that way because it's not always v1. [16:40:49] I'd like to keep it that way for now (in both squid and rewrite) [16:40:56] acl swift_auth url_regex ^http://upload\.wikimedia\.org/(auth|v[^/]+/AUTH).* [16:41:28] how about that? [16:42:08] what do you suggest? auth|.*AUTH_[0-9a-fA-F].*? [16:42:24] do you know if we can say {32,36} in squid's acl regex? [16:42:35] I don't [16:42:42] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [16:43:09] mark: do you know if we can use character set repetition syntax in squid's acl? eg [a-c]{3} meaning (3 of any a, b, or c)? [16:43:33] don't know offhand [16:43:46] paravoid: my test to look for curretnly used URL patterns: tcpdump on ms-fe1 | grepping for AUTH_. 
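A quick way to compare the two candidate patterns discussed above is to test them with egrep (the log later notes squid's url_regex is extended-regex style); whether a given squid build accepts the {32,36} interval still needs checking on a depooled host. The sample URLs are made up for illustration:

```
strict='AUTH_[0-9a-fA-F-]{32,36}'                              # token-shaped strings only
loose='^http://upload\.wikimedia\.org/(auth|v[^/]+/AUTH).*'    # the ACL proposed above
for url in \
  'http://upload.wikimedia.org/v1/AUTH_0123456789abcdef0123456789abcdef/c' \
  'http://upload.wikimedia.org/auth/v1.0' \
  'http://upload.wikimedia.org/wikipedia/commons/a/a9/AUTH_x.jpg'
do
  s=allow; l=allow
  echo "$url" | egrep -q "$strict" && s=block
  echo "$url" | egrep -q "$loose"  && l=block
  echo "strict=$s loose=$l  $url"
done
# The loose pattern blocks both API forms while still allowing the AUTH_x image name.
```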
[16:44:03] I did tail /var/log/syslog [16:45:23] which is ~600M already, not sure what will happen when we switch squids to it btw :) [16:45:47] anyway. have you found any patterns not caught by the above acl? [16:49:24] sadly I think AUTH_[0-9a-fA-F] is too liberal - there are valid images (that currently exist) that have AUTH_x (with just one alphanumeric). [16:49:40] grrr [16:49:47] the thing that makes rewrite's effective is it only matcthes 32-36 hex digits. [16:49:51] (and hyphens) [16:50:33] repeating: [16:50:34] acl swift_auth url_regex ^http://upload\.wikimedia\.org/(auth|v[^/]+/AUTH).* [16:50:39] anything wrong with that? [16:50:43] yeah, let's go with that one. [16:51:52] hmm according to some email in 2002 it's the same regex as egrep [16:52:08] extended regex [16:52:10] paravoid: is there currently one that already restricts only upload.wiykimedia.org to this logic? [16:52:21] not that I can see of [16:52:38] in which case I could send a bad Host: header and bypass that blacklist. [16:55:44] yuck [16:57:34] I don't like how we're figuring this out in the middle of our MW... [16:58:16] yes, prepping the change earlier would have been better. [16:58:23] we can postpone the window and keep going. [16:59:42] we could do urlpath_regex and block that [16:59:59] ^/auth etc. [17:00:11] yeah, that's good! that'll work. [17:00:37] hm, that may affect more than upload though, and that's bad [17:01:11] aha, I could add it conditionally [17:01:13] let's see. [17:01:14] the squid template has php conditionals that restrict it to thumbs. [17:01:16] err.. to upload. [17:01:35] yes [17:02:43] we don't have a good way to test something on one test squid, right? it's either on the production cluster or nothing [17:02:54] sure we do. [17:02:57] oh? [17:03:01] take one out of rotation (in pybal) [17:03:08] then the deploy command takes an individual host as an argument. [17:03:12] then we test usincg curl. [17:03:42] why not try {32,36} on one? [17:04:03] +1 [17:04:31] here's the list of squids: http://noc.wikimedia.org/pybal/pmtpa/upload [17:04:42] I'll take sq41 out of rotation. [17:05:32] I'm on sq51 already [17:05:39] ok, I'll take 51 out. [17:05:51] :-) [17:05:52] thanks. [17:07:41] ok, pybal conf saved; traffic should fall off soon. [17:09:14] I can't believe the squid page still says "feel free to check in your changes to RCS. " [17:09:16] geez [17:10:04] maplebed: see diff again [17:10:16] apergos: I'm afraid it's not just the page... there's an RCS/ dir there [17:10:29] there is [17:11:24] if you want a laugh you may look at the timestamps on the files in that dir [17:12:33] I saw the yaseo files, I guess that's enough :) [17:12:46] paravoid: that looks worth a shot to me. [17:13:02] have you pushed the change to swift.pp? [17:13:15] not yet. I'll do that now. [17:13:34] yeah, that's a prerequisite [17:14:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20729 [17:18:32] there's another thing I just saw that we need to exclude. ^/$lang/{graph,math,timeline}/ need to go to ms7. [17:18:33] we still have an acl ms4_thumbs and an acl ms5_thumbs in there, do we want those? [17:18:54] grumblegrumble [17:19:27] running puppet on ms-fe1 [17:20:09] what's lang/...? [17:20:26] (where $lang == /en, /it, /de, etc. [17:20:28] ) [17:20:52] (really ^/[^/]+/(graph|math|timeline)/.*) [17:20:57] what's that? [17:21:47] some extensions that I didn't realize are still hooked into NFS. 
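The depool-and-test procedure sketched above, in rough shell form. sq51 and the deploy command come from this log; the image path, headers and expected responses are illustrative assumptions, and note (as discussed further down) that disabling a host in pybal only stops frontend traffic, while other squids keep using its backend as a CARP peer:

```
# 1. In the pybal pool file for upload/pmtpa, set the host to enabled: False
#    (rather than deleting the line), then wait for frontend traffic to drain.
# 2. Push the new squid config to just that host:
cd /home/w/conf/squid && ./deploy sq51

# 3. Exercise the frontend (:80) and backend (:3128) instances directly:
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org/wikipedia/commons/a/a9/Example.jpg' | head -1
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org:3128/wikipedia/commons/a/a9/Example.jpg' | head -1

# 4. Swift API paths should now be refused by the new ACL:
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org/auth/v1.0' | head -1
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org/v1/AUTH_0123456789abcdef0123456789abcdef/x' | head -1
```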
[17:22:20] ok, puppet change deployed to ms-fe1; lemme test it [17:22:44] oh noes [17:22:44] what are they? how should I name the ACL? what comment should I put? [17:23:00] give me something :) [17:23:19] like he says, the math stuff, the timeline extension... [17:23:20] # math extension still requires NFS. send these to ms7 until we can fix that. 2012-08-20 -ben [17:23:22] :D [17:23:38] (or -paravoid if you want to take credit.) [17:23:39] ;) [17:23:40] this means [17:23:49] we *still* can't kill Solaris >_< [17:24:36] hmm, not seeing any /graph dirs [17:24:45] * AaronSchulz wonders where he remembered that from [17:24:57] test successful - I can fetch an original from ms- [17:25:02] ms-fe1 but not ms-fe2. [17:25:08] deploying puppet change to ms-fe2-4 [17:26:28] maplebed: added to squid.conf [17:26:32] apergos: for now, for now. soon.... [17:27:10] paravoid: looking. [17:27:30] maplebed: I get 200 instead of 401 for some random URLs I've been trying. [17:27:37] so, I confirm that the puppet change works. [17:27:42] \o/ [17:27:56] yay [17:28:09] do we wanna try {32,36} now? [17:28:26] should I deploy to sq51? [17:28:41] oh it hasnt gone out? ah ha [17:28:46] traffic hasn't dropped off. [17:28:51] I must have done something wrong. [17:29:15] maplebed: looks like you don't need /graph [17:29:27] mark: I made the change reflected in http://noc.wikimedia.org/pybal/pmtpa/upload but sq51 is still getting traffic. Do you know what step I'm missing? [17:29:40] yes [17:29:46] you should put enabled: False [17:29:48] not comment it out [17:29:51] ah. [17:30:10] done. [17:30:14] I've been told to never ever remove lines before leaving enabled=False for a while [17:30:20] that's not it [17:30:23] removing works as well [17:30:30] without removing, checks continue [17:30:43] http://ganglia.wikimedia.org/latest/?c=Upload%20squids%20pmtpa&h=sq51.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 clearly shows reduced traffic, but there's still a ton of requests flying by. [17:31:33] oh, wait, is tcpdump lying to me because of our network config? [17:32:00] no, that wouldn't account for the sustained 30MBps out. [17:33:09] well incoming requests claim to be about 0... so what is being sent out? [17:34:01] I see plenty of traffic between sq51 and ms-fe. [17:34:02] oh! [17:34:13] taking it out of lvs only takes it out of the frontend squid pool. [17:34:26] squids will still treat it as a peer for the backend squids. [17:34:47] right? [17:36:00] hrmph. [17:36:17] I think so too. [17:37:56] we could disable it in frontend.php and push that [17:38:08] although I dislike doing two changes in the same tree. [17:38:08] cachemgr.cgi would show us [17:39:01] paravoid: the squid config you have looks like it's ready to test by me. [17:39:04] I'm fairly sure that's the case. I have frontend.conf open [17:39:14] I also want to test the {32,36} thing [17:39:17] (re: backend squid pool) [17:39:41] why? there's no way ^/v[^/]+/AUTH.* is going to match any files. [17:39:50] I don't like encoding the same logic over and over across config and systems [17:40:05] what if we change the length of the token at some point? will we remember to change squid.conf too? [17:40:33] say, if the new swift version switches from sha1 to sha256 [17:41:24] so, should I run ./deploy sq51.wikimedia.org ? [17:41:31] or is it ./deploy sq51? [17:41:32] http://noc.wikimedia.org/cgi-bin/cachemgr.cgi [17:41:57] paravoid: yeah, ok. [17:42:13] which part? 
:) [17:42:43] stick to ^/v[^/]+/AUTH.* [17:42:54] okay [17:42:59] should I run ./deploy sq51? [17:43:40] we have 17' left, so please ack soon :) [17:44:51] maplebed: ahh, its "graphs/" by an old extension I wrote on wikinews ;) [17:44:59] not "graph" [17:45:01] let's deploy to sq51 and see what happens. [17:45:12] AaronSchulz: is it still in use? [17:45:21] surprisingly yes [17:45:22] graph -> graphs, fixed. [17:45:26] cool. [17:45:35] totally unneeded since AFT, but whatever [17:51:45] deployed on sq51 [17:51:58] ok, my curl test tests the frontend but not the back. [17:52:05] can I just put :3128 to test the backend? [17:52:15] yes [17:52:23] I just tried that for a random image and it seemed to work [17:52:33] access denied from both sq51 and 52 [17:52:50] ? [17:52:52] I tried from fenari. [17:55:11] so, it seems to work [17:55:17] agreed. [17:55:19] 5' left until the end of our window [17:55:22] do we deploy all? [17:55:30] looking at the headers, I also see that it doesn't say it comes from a sun server, [17:55:34] what other deploys are going on? [17:55:36] which means it actually got it from swift. [17:56:03] yeah, and there's an X-Object-Meta-Sha1base36 which is swift I think [17:56:10] robla AaronSchulz: is it ok for us to run a little bit over our window? [17:56:16] there's still a chance Swift e.g. might not be able to handle the load [17:56:17] +1 paravoid [17:56:32] rolling back is easy if swift falls over. [17:56:44] ok by me [17:56:49] AaronSchulz: would you ask robla? [17:57:03] Reedy: you ok with us rolling over our window by just a bit? [17:57:04] yes, I'm just saying that there are still risks involved, so we must have a window open [17:57:14] Yeah [17:57:19] There's little to do in this one [17:57:24] ok. [17:57:28] All the prep work was done earlier [17:57:45] * maplebed looks at tcpdump on sq51 [17:58:10] rob is ok with it [17:58:20] tnx. [17:58:37] so we should see some increase in traffic to swift from sq51 [17:59:04] that's exactly what I was thinking [17:59:16] but I don't see anything [17:59:43] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [17:59:43] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [17:59:43] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [17:59:43] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [18:00:00] there are a bunch of established conns to ms-fe [18:00:17] it's too small to be noticed, I think. [18:00:34] okay [18:00:38] well it's only back end requests so it shouldn't be much [18:00:53] so, ./deploy cache pmtpa [18:00:55] right? [18:01:35] one sec [18:02:04] I think I misread tcpdump for the math stuff. [18:02:58] it's project/lang/math/, not /lang/math. [18:03:01] damn. [18:03:07] one more change? [18:03:15] AaronSchulz: can you confirm that ^^^ [18:04:19] yeah, it's a sibling to the 0-9a-f dirs [18:04:33] AaronSchulz: that's not what I mean. [18:04:45] upload.wikimedia.org/wikipedia/en/math/d/a/9/da9ddfd0fd19xxxxx.png [18:05:12] that looks like a sibling to 'thumb' not the shard. [18:05:24] oh, but htat's ok. [18:05:28] that'll go in the public bucket? [18:05:31] yeah. [18:05:43] our regexp doesn't catch that. [18:05:43] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:05:53] I think that's ok. 
[18:05:55] maplebed: I was talking about how it was stored, not the url [18:06:12] paravoid: I think you're clear to delpoy to all the upload squids. [18:06:27] no I'm not. [18:06:31] /wikipedia/en/math/0/c/5/0c53a53c3f0d7b4cf0b3c58b723dc7b5.png [18:06:35] that fails on sq51 [18:06:52] really math was supposed to be in ms5 [18:07:06] but it switched to ms7 by accident when math became an extension [18:07:19] hahaha [18:07:20] nice [18:07:46] our regexp doesn't catch that and it's not in swift [18:07:51] should we just adapt the regexp? [18:08:09] and at this point sq51's cache needs clearing too I think [18:08:17] how do those get written? (math images) [18:08:20] New patchset: Ottomata; "Updating java.pp for Precise." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20741 [18:08:39] maplebed: you're lagging... [18:08:44] sorry. [18:08:57] I think the math stuff will end up in the public bucket. [18:09:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20741 [18:09:08] it doesn't work as it is [18:09:20] the above URL is a real URL that should work and it currently 404s. [18:09:21] ok, so we should adapt the regex. [18:09:26] * AaronSchulz wonders how math works at all now [18:09:35] that is what I was asking [18:09:49] ^/[^/]+/[^/]+/(graphs|math|timeline)/.* [18:09:51] sorry, I didn't paste a complete URL - the xxxxxs need to be replaced. [18:10:04] ok, I guess I see [18:10:24] url path is "//upload.wikimedia.org/wikipedia/en/math", fs path is "/mnt/upload6/wikipedia/en/math" [18:10:29] but you're right; it doesn't work. [18:10:36] okay, now it does [18:10:39] so yes +1 paravoid's regex. [18:10:50] it's already built and deployed on sq51 [18:10:54] and tested. [18:11:07] New patchset: Ottomata; "Updating java.pp for Precise." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20741 [18:11:26] my tests confirm. [18:11:36] can you confirm that graph/math/timeline are the full extent of it? [18:11:47] no; I only see math. [18:11:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20741 [18:11:50] I feel a bit uneasy discovering corner cases 10' past our window [18:11:50] I trust aaron on that part. [18:12:09] okay. [18:12:13] so, deploy? [18:12:21] +1. apergos? [18:12:42] paravoid: the math/timeline stuff has been known for months [18:12:58] not to me. [18:13:03] the ext-dist stuff... how does that get served? [18:13:11] what's ext-dist? [18:13:14] heh [18:13:52] apergos: what's ext-dist [18:14:09] extension distributor [18:14:23] does that come from upload? [18:14:24] where we checkout extensions from VCS and package them [18:14:56] sigh, I hate waiting. [18:15:01] you and me both. [18:15:11] a process on fenari writes to /mnt/upload6/ext-dist and /mnt/upload6/private/ExtensionDistributor/mw-snapshot [18:15:17] http://wikitech.wikimedia.org/view/ExtensionDistributor this. [18:15:30] it's stored over on ms7 yeah [18:15:33] and served from there [18:15:37] example URL on upload.wikimedia.org? [18:16:00] * AaronSchulz chuckles at $wgMathCheckFiles [18:16:01] those don't go through upload.wm.o so look slike we are good (but that must be solved before solaris can go away) [18:16:11] https://upload.wikimedia.org/ext-dist/AkismetKlik-master-2145603.tar.gz [18:16:13] ok, so apergos +1 deploy? [18:16:18] so last question [18:16:22] images on private wikis? [18:16:27] Reedy: that's a 404. 
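A quick egrep sanity check of the exception regex settled on above (the log notes squid's url_regex is egrep-style), against paths taken from this discussion; which squid directive the template applies it with is left to the config itself:

```
re='^/[^/]+/[^/]+/(graphs|math|timeline)/.*'
for p in \
  '/wikipedia/en/math/0/c/5/0c53a53c3f0d7b4cf0b3c58b723dc7b5.png' \
  '/wikipedia/en/timeline/some-chart.png' \
  '/wikipedia/commons/a/a9/Example.jpg' \
  '/ext-dist/AkismetKlik-master-2145603.tar.gz'
do
  echo "$p" | egrep -q "$re" && echo "ms7    $p" || echo "swift  $p"
done
# math and timeline paths stay on ms7; ordinary originals (and ext-dist) fall
# through to swift under this rule.
```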
[18:16:27] okay, at this point I think we should just abort. [18:16:30] we have tested those and know they work? [18:16:35] maplebed: you probably hit on sq51, it works here [18:16:36] maplebed: WFM from here [18:16:48] heh. [18:17:02] it seems that we're really unprepared and keep finding cases that fail, I think we should just abort for now. [18:17:06] +1 abort. [18:18:03] paravoid: you're unstaging the changes to squid.conf.php? [18:18:06] we can't use a whitelist? [18:18:15] yes I am. [18:18:53] don't we just want stuff like site/lang/[0-9a-f]/[0-9a-f]{2}/... ? [18:19:06] for testing private images we can use officewiki (where we all have accounts) I guess [18:19:07] meh [18:19:14] reverted on sq51 [18:19:39] anyone knows if I can safely clean sq51's cache? [18:19:53] I have never done that, no idea [18:19:59] I think cleaning an individual upload squid is ok. [18:20:05] (and really, it has to be ok.) [18:20:51] (this list of ever-expanding exceptions is one reason that russ had such an uphill battle getting this crap done) [18:22:12] !log cleaning sq51's cache, poisoned after swift changes staging [18:22:22] it sure takes a while, it has me worried... [18:22:22] Logged the message, Master [18:22:43] the box is gonna be hurtin for a bit. [18:22:45] so it goes. [18:25:46] ok, when that's done I'll put it back in LVS. [18:25:53] for now, time to write a deploy postmortem. [18:25:59] AaronSchulz: robla Reedy you're clear to start your delpoy. [18:26:01] sorry for running over. [18:26:09] no prob, and thanks [18:26:16] I don't think LVS has anything to do with it, that's the frontend cache [18:26:29] I've stopped the backend squid, so it shouldn't get any requests from frontends [18:27:01] PROBLEM - Backend Squid HTTP on sq51 is CRITICAL: Connection refused [18:27:11] (ignore that) [18:27:19] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:19] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:26] but not that [18:27:37] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:43] uh oh [18:27:44] ruh roh. [18:28:06] AaronSchulz: reedy robla - nevermind. something's unhappy that needs fixing before you start. [18:28:13] you said it was safe, didn't you? [18:28:13] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:19] killing sq51? [18:28:22] I'd already finished :D [18:28:29] er [18:28:40] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:41] paravoid: yes. though doesn't mean it's true. :P [18:28:44] that's all the image scalars [18:28:47] or scalers [18:28:58] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:16] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:17] that's right, doh [18:29:18] ya kinda have to be careful before you give someone like Reedy the all-clear :) [18:29:30] http://ganglia.wikimedia.org/latest/?c=Image%20scalers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:29:34] robla: I'd done all the hard work earlier ;) [18:30:06] hey dooodleeees, i need a couple of puppet things reviewed [18:30:10] one is eaaassyy [18:30:13] ottomata: fire now, sorry. [18:30:15] the other takes some reading [18:30:18] come back later? [18:30:18] oh! 
[18:30:20] didn't realize [18:30:23] sorry, ok thanks bye [18:30:41] I'm confused though, why would clearing one squid trigger the image scalers? the thumbs are all in swift. [18:30:54] I don't think they're related [18:31:06] we had an imagescaler spike in the morning too [18:31:08] (our morning) [18:31:12] orly? [18:31:18] wasn't that bad though [18:31:28] You can get new thumbnails, they're just sloow [18:31:35] looked different than this [18:31:40] yeah, look at the "day" of the graph above [18:31:45] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:31:50] these have a bunch of cpu wait [18:31:59] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies does show what I expected - an enormous increase in thumb requests with a spike in 200s (cuz most of them are found) [18:34:24] it's interesting that the earlier one was msotly cpu and this one is mostly iowait. [18:35:07] so, wait. we took a backend squid out of rotation [18:35:26] which means that a lot of thumb requests are not in the cache now [18:35:40] could it be that they're not in swift either so they get to be regenerated? [18:35:44] yup. [18:35:55] but that's surprising, given how long swift has been serving thumbs. [18:36:17] though for images that haven't been changed in *forever* it makes sense, I suppose. [18:36:26] since squid cache time is 80% * last_mod_date [18:36:46] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.109 second response time [18:37:22] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.558 second response time [18:37:26] okay, this is bad because sq51's cache is going down the drain as we speak [18:37:49] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [18:38:00] but recoveries are coming in... [18:38:17] I think it's ok, it's going to be the most frequently requested ones (that were gone) that get regenerated first [18:38:34] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64719 bytes in 6.572 seconds [18:38:35] although I am watching the load on ms5 [18:38:43] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [18:38:43] the load doesn't seem to go down terribly fast [18:39:46] really need a three level hash for these projects (directories are too dang large) [18:39:46] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.167 second response time [18:40:02] apergos: on ms5? [18:40:03] getting better now [18:40:04] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [18:40:05] not 7? [18:40:13] on ms5 [18:40:25] from WM writing to it? [18:40:26] don't worry about it, soon it will cease to be a problem [18:40:36] what's also interesting is that as soon as I put sq51 back up, we'll have this again :) [18:40:38] no, from MW fetching from it (and writing to swift?) [18:41:35] I guess the reads are the issue. 
but it's a complete guess with no real basis [18:41:44] load < 7 [18:42:10] RECOVERY - Backend Squid HTTP on sq51 is OK: HTTP OK HTTP/1.0 200 OK - 468 bytes in 7.879 seconds [18:42:24] and there goes he load again :-D [18:42:30] *the [18:42:40] eh [18:42:53] that's not good, dd still running [18:43:25] looking at the scalers, our canary in the coal mine [18:44:00] might be outa th woods [18:44:52] paravoid: (for later) I'm surprised the clean uses dd instead of just reformatting the partition. wouldn't that be faster? [18:45:04] maplebed: have a look at swift's latency. it got unusually high [18:45:09] what do you mean by "reformatting"? [18:45:25] oh, right, it doesn't use a filesyst.m. nevermind. [18:45:39] yes [18:45:50] paravoid: the latency includes requests that go back to the scalers, so it makes sense that it should rise. [18:46:18] I stopped sq51's squid again, let's see. [18:46:31] PROBLEM - Backend Squid HTTP on sq51 is CRITICAL: Connection refused [18:46:39] is the dd still not done? [18:46:59] no [18:47:03] it's 150G, takes a while [18:47:04] bah [18:47:47] okay, I'm looking at one of the image scalers [18:47:50] load 10-11 or so [18:48:05] a lot of apache processes, but very few spontaneous "convert" processes [18:48:25] I'm not sure that the load is from the actual convert process and not swift being slow to read/write [18:49:03] the put latency didn't rise much though, only the 404 latency. [18:49:05] well we're pretty done with wait on ms5, it's doing a bit more work than before but that's all [18:50:19] load of 10 is a bit much [18:51:02] but manageable [18:51:51] (on the scalers) [18:52:19] New patchset: Cmjohnson; "updating public key for cmjohnson" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20745 [18:53:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20745 [18:54:09] paravoid: did the dd finish? load on sq51 is now half cpu instead of all io [18:54:40] it did, I'm now verifying that the squid that run in the meantime didn't write anything [18:56:06] New patchset: Ottomata; "misc/statistics.pp - installing python-yaml for gerrit-stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20746 [18:56:44] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20746 [18:58:07] bwaaaa [18:58:53] New patchset: Ottomata; "misc/statistics.pp - installing python-yaml for gerrit-stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20746 [18:59:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20746 [19:03:52] paravoid: looks like your test finished? [19:06:01] RECOVERY - Backend Squid HTTP on sq51 is OK: HTTP OK HTTP/1.0 200 OK - 460 bytes in 0.005 seconds [19:06:37] yeah [19:07:20] so, swift was not very filled up with thumbs, was it? [19:07:31] it should have been [19:07:31] judging from what one squid did to it, and the 404 rate that I see in the graph [19:07:48] hi [19:08:03] hi mark [19:08:34] ganglia shows sq51 refilling its cache nicely. [19:08:48] so what's the result now? [19:08:57] mark: aborted [19:09:16] and recovering from the mess we produced. [19:09:23] flushed swift down the drain? 
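For context, the cache clean being discussed amounts to zeroing the backend squid's raw, filesystem-less cache partition while the backend instance is stopped; the device path and init script name below are placeholders, not taken from the log:

```
/etc/init.d/squid stop                 # stop the backend instance first
dd if=/dev/zero of=/dev/sdX3 bs=1M     # zero the ~150G raw cache partition; this takes a while
/etc/init.d/squid start
```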
[19:10:20] in short: we used sq51 as staging without removing traffic from it, several upload URLs corner cases where incorrectly directed at swift and 404ed, [19:10:40] aborted the config push, flushed sq51's cache to flush those incorrect 404s, [19:11:06] then several thumbs were not in the cache anymore and apparently not in swift either, image scalers load spiked [19:11:13] kind of recovering now [19:11:14] * maplebed puts sq51's frontend back in rotation [19:11:15] New review: Dzahn; "note this will not remove the old key if had been deployed before. you would have to really ensure i..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/20745 [19:12:15] maplebed: accurate description of what happened? [19:12:20] apergos: too [19:12:20] yup. [19:12:45] w/o removing back-end traffic [19:13:20] and I'm not really convinced that these image scalers have increased load due to gs/imagemagick [19:13:24] it did not help that ms5 was struggling for a bit there too [19:13:45] paravoid: when an image scaling requets comes in, MW checks to see if it already has the thumb. [19:14:05] if they were on ms5 but not in swift, much of that load could have been just fetching from nfs and writing to swift and returning to the client. [19:15:35] if it was in sq51's cache, it'll be in esams caches as well [19:15:44] and pmtpa frontends [19:16:22] I forgot about the esams side. ugh [19:17:12] New review: Pyoungmeister; "update to an existing resource. I think this is ok." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/20745 [19:17:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20745 [19:17:39] mark: can we afford flushing all frontend pmtpa/esams caches? [19:18:36] no [19:18:53] thought so... [19:19:05] even one at a time? [19:19:18] what's our TTL again? [19:21:05] paravoid: for the back end, at least, 4hrs or 80% of the time since change. [19:21:06] * paravoid is glad it was a very small amount of traffic that 404ed. [19:21:23] I'm not sure what the 404 TTL is [19:22:17] hah, 5 minutes [19:22:17] I need some food before my meeting in 8m. be back in a bit. [19:22:19] haha [19:22:25] lol [19:22:33] for upload, 0 for the rest [19:23:03] okay, so we're good [19:23:15] and that means we didn't have to flush sq51 at all. [19:25:34] ok well know we know [19:28:40] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:28:40] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:35:23] okay, dinner then off [19:35:26] see you tomorrow. [19:35:30] see ya [19:35:35] oh wait before you go [19:35:39] * paravoid waits. [19:35:46] how did you dtermine that the running squid had or had not written anything? 
[19:35:51] dd [19:35:54] | hd [19:39:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:37] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:37] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:59] dangit [19:40:21] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:27] huuuuge spike [19:40:48] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:51] see io util for ms5 [19:40:56] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [19:41:42] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:18] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:03] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.144 second response time [19:43:39] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.399 second response time [19:43:39] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.966 second response time [19:43:57] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64719 bytes in 3.368 seconds [19:43:57] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.694 second response time [19:43:57] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.919 second response time [19:43:58] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.432 second response time [19:53:14] ah hey [19:53:20] sorry just saw your message cmjohnson1 [19:53:21] um [19:53:34] oh yes, i think he told me about ssh-agent exec bash [19:53:44] but I just did a buncha research and figured out how to make everything work nicely [19:53:45] what's up? [20:07:29] so [20:07:38] do we have any idea how many 404s we're talking about? [20:09:23] maplebed: could you write a summary to the ops list? [20:16:12] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:21] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:57] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64719 bytes in 0.183 seconds [20:17:42] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [20:17:51] sigh [20:17:56] not liking it [20:18:03] so [20:18:05] mark: from what I can see from our squid conf, 404 timeout is 5 minutes, so we're safe on that front [20:18:18] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [20:18:20] the extra image scaler load is just because sq51 got purged? [20:18:23] yes [20:18:32] and the majority of those images are not in swift? 
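The "dd | hd" check mentioned at the top of this exchange, spelled out: hd collapses runs of identical lines into a "*", so a freshly zeroed partition that nothing wrote to reads back as a single row of zeros. The device path is a placeholder:

```
dd if=/dev/sdX3 bs=1M count=512 2>/dev/null | hd | head
# expected output for an untouched, zeroed partition:
# 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
# *
# 20000000
```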
[20:18:33] and swift is missing thumbs apparently [20:18:40] not just a few if I see this [20:18:48] that's the hypothesis [20:18:59] I don't know how to verify this [20:19:50] Ben was telling me that MW prefers to copy the thumb from ms5->swift (if found) instead of rescaling it [20:20:26] yes [20:20:38] so it all comes down to ms5's throughput [20:20:45] i'm wondering what the current "hit rate" for thumbs in swift is [20:20:46] repopulating sq51 should be done though. I don't know why it's still spiking. [20:21:09] no [20:21:11] why would it be done? [20:21:13] I want to find out if this isn't something closer to the 1199px thing [20:21:22] and somebody's requesting a bunch of new thumbs. [20:21:29] what's the "1199px thing"? [20:21:58] I think sq51's done catching up (as much as it needs to) because its traffic and CPU patterns have returned to where they were before we purged it. [20:22:00] it might come down to increased pressure on ms7 as well [20:22:19] sure, it's not full, but it's also now writing tons [20:26:02] !log fixing typo in pdns-templates/wmnet and reloading nameservers [20:26:11] Logged the message, Master [20:27:47] would be nice if we had a graph for successful fetches from ms5 as well [20:41:57] i'm checking one image scaler, and I see about one thumb scale request every 2 seconds [20:54:59] so I recant what I said before about checking ms5 before regenerating. Now that MW reads come from swift, it checks swift before regenerating. [20:55:24] (which, since swift passed the request along, is obviously going to fail) [20:55:45] I noticed that [20:55:50] in a tcpdump I did [20:55:59] a lot of HEAD for thumbs that 404 that is [20:56:07] so my comment about checking ms5 and not actually recrunching the thumb was false. [21:02:52] New patchset: Dzahn; "add wikimedia theme files and logo for planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20820 [21:03:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20820 [21:03:52] when is ms5 used for anything now (except to stash a copy)? [21:04:03] I have the same question about ms7 for that matter [21:04:16] I think ms5 is only used to write copies [21:04:22] I think its only traffic comes from MW. [21:05:06] New patchset: Dzahn; "add wikimedia theme files and logo for planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20820 [21:05:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20820 [21:07:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20820 [21:07:54] what about ms7? [21:08:03] sorry, got distracted. [21:08:08] no worries [21:08:10] ms7 is getting writes from MW [21:08:20] and getting reads from everything that hits upload/ that's not thumbs. [21:08:28] or originals? [21:08:29] so mediawiki reads can cause image scale requests? [21:08:45] mark: no; MW reads bypass rewrite.py. [21:08:49] ok [21:08:50] well now it's still doing originals, true [21:09:31] ok, I know things are still fragile but I gotta go to sleep (and I am not doing anything useful here anyways) [21:12:09] ok, I'm going to try and wiki up what we need to get right to get this thing done tomorrow. [21:13:03] http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_switch_originals for those of you following along. [21:13:13] actually... [21:13:20] I'm going to start in etherpad instead. [21:13:22] better brainstorming.
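The tcpdump observation above (lots of HEAD requests for thumbs that then 404) can be reproduced on a swift frontend with something along these lines; the interface name and port are assumptions and would need adjusting:

    # print request lines for HEAD requests against thumb paths
    tcpdump -l -A -s 0 -i eth0 'tcp dst port 80' 2>/dev/null | grep --line-buffered 'HEAD .*thumb'

A steady stream of these, answered with 404s, would support the hypothesis that MediaWiki is checking swift for thumbnails that are simply not there.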
[21:13:43] http://etherpad.wikimedia.org/Swift-Switch-Originals [21:13:47] ok. good luck [21:14:07] I'll leave those tabs open. [21:14:09] night [21:14:15] g'night. [21:16:19] is there any problem with enabling uploads to regular user accounts on testwiki? (https://gerrit.wikimedia.org/r/#/c/20702/1) [21:27:25] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20702 [21:35:18] wow [21:35:22] I'm so glad we aborted. [21:35:26] pybaltestfile in that list [21:35:28] jesus [21:35:35] I'm not sure it's actually used. [21:35:42] of course it is [21:36:01] I saw frontend squids using it [21:36:12] eh? [21:36:22] yeah [21:36:32] cache_peer 208.80.152.51 parent 3128 0 no-query no-digest connect-timeout=5 login=PASS carp weight=5 monitorurl=http://upload.wikimedia.org/pybaltestfile.txt monitor [21:36:36] timeout=30 [21:36:47] I think the "portal" stuff is old and unused (but best to verify) [21:37:04] yeah yeah I know, I was going to bed. but I really wanted to make some progress on some code... [21:37:07] and we don't get pybal's intelligence wrt "removed too many, won't remove more" there either afaik [21:37:10] that's just sleep talking. [21:37:17] so this was probably going to kill everything instantly [21:37:21] yes, the monitor scripts use that [21:37:35] but it didn't occur to me that they used the regular url [21:37:36] heh [21:38:10] pybaltestfile.txt is used by both squid and pybal [21:38:23] and possibly even nagios [21:38:34] I see you're looking at everything in upload/ [21:38:35] good idea [21:38:37] its contents have "don't delete it" [21:38:42] that means DO ADD IT :) [21:38:45] yeah, the config push would have "removed" that [21:38:55] and pybal would be smart enough to not remove every squid out there [21:39:00] but squid frontends wouldn't be so kind [21:39:05] nope [21:39:09] sorry, on the phone. brb. [21:39:23] i think this may not happen tomorrow either [21:39:36] i think I want some more eyes and thinking on this [21:40:20] i believe we're stablish now, and i'm sleepy as well and going to bed [21:41:58] in fact I just found my bug but know I won't fix it right, til I get some shut-eye [21:42:08] so for reals this time, good night :-D [21:42:21] good night [21:42:42] we can change the pybal stuff to monitoring/pybaltest.txt, which does exist in swift. [21:43:41] well, sorta. the full URL is in lvs.pp under swift. [21:43:50] huh. [21:44:01] and don't forget varnish [21:44:03] mark: are you going to bed too or was that good night to apergos ? [21:44:07] yes i'm going [21:44:16] unless there's an outage now [21:44:20] nope. [21:44:21] but i think it's stable enough [21:44:25] but I do have something to think on for later. [21:44:36] do put stuff on the etherpad and/or email [21:44:40] we'll all be looking at it again tomorrow [21:44:45] k. [21:45:40] I'm leaving too [21:45:50] ok, cya. [21:46:03] I'll keep going on that etherpad and switch it to wiki if it feels done-ish. [21:46:05] sorry, it's late and I've been working for many hours [21:46:39] I'll have a look tomorrow on the pad/wiki [21:47:13] let me know where the canonical place will be when you're done with that [21:48:10] start at the etherpad; I'll disambiguate from there. [22:41:20] Thehelpfulone: https://en.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=485 :/ [22:41:24] csteipp: ^ [22:42:00] We traced it back to most likely being caused by a widdit.com extension [22:46:33] hoo: good to know [22:47:16] csteipp: But I couldn't reproduce it myself... 
they only load their adware parts in certain cases [22:47:46] I had to manipulate stuff with Fiddler to get them to load anything on wikipedia at all... [22:48:41] Hm... well the nice thing is the filter seems to be catching these [22:48:55] csteipp: Yes... but the users edit anyway [22:49:09] but I don't really want to set the filter to disallow :/ [22:50:38] Yeah, but at least there's a list of them, so cleanup should be easy... [22:50:58] Yes... but only the big wikis got their own filters [22:51:02] I'll see if Tim can remember the exact plugin when he gets back [22:51:02] we really need global ones [22:51:09] that's just another use case [22:51:11] It's in gerrit :) [22:51:27] Saw it, but I'm not into its code enough to review :/ [22:52:18] csteipp: Did you make that depend on CentralAuth? [22:53:17] No, no dependency. It works better if centralauth is used (then it will automatically do per-user-id throttling), but not necessary to use it. [22:53:51] mhm, cause I planned to add a way to make the filters check for global groups (especiall global bots) [22:53:56] * especially [22:54:19] seems like I need user rights then... might be slow a bit, though :/ [22:54:26] Oh... hmm. That would be difficult [22:55:22] Yes... I could just use User::getRights as array, but I'm not sure how that affects performance [22:55:56] So you want to have a filter with something like ("bot" in global_user_group) .... [22:56:18] Yes, or (even better, but I got performance worries): "bot" in user_rights [22:56:40] Ah, yeah, checking specific rights might have a problem. [22:57:18] It may already check global groups, for accounts that are authenticated with CentralAuth though... it will check groups on $wgUser, which should be the CentralAuthUser [22:57:18] User::getRights probably is already lazy loaded on edit, so that shouldn't be too much of a problem [22:57:56] csteipp: I don't think so [23:03:57] New patchset: Dzahn; "add template files, reduce code length, have some file permission defaults, do not specify path if equal to resource name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20837 [23:04:38] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20837 [23:06:21] about to run scap [23:12:05] New patchset: Dzahn; "add template files, reduce code length, have some file permission defaults, do not specify path if equal to resource name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20837 [23:12:59] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20837 [23:23:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20837 [23:57:42] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:57:43] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [23:57:43] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [23:57:44] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [23:57:45] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [23:57:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [23:57:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
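Returning to the AbuseFilter discussion from earlier in the evening: the condition being proposed would look roughly like the following in AbuseFilter's rule syntax. Here global_user_groups stands in for the global-group variable that was being discussed and does not exist yet; the added_lines match on the widdit.com domain is only an illustration, not necessarily what the existing filter checks; user_groups, added_lines and rlike are standard filter syntax:

    /* skip trusted bots, locally or (if such a variable existed) globally */
    !("bot" in user_groups) &
    !("bot" in global_user_groups) &
    added_lines rlike "widdit\.com"

A variant checking a specific right, such as "bot" in user_rights, reads the same way syntactically, but as noted in the conversation it could be more expensive to evaluate on every edit.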