[00:21:59] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:44] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [00:22:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:05] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [00:34:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.776 seconds [01:05:02] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [01:08:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [01:41:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 244 seconds [01:42:30] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 282 seconds [01:48:40] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 650s [01:51:21] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [01:54:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:58:42] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 21 seconds [01:59:09] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 49s [02:06:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.684 seconds [02:09:21] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [02:16:51] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [02:30:30] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.576 second response time [02:38:27] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [03:04:24] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [03:04:53] * jeremyb stabs mw8 [03:31:06] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Mon Aug 20 03:31:04 UTC 2012 [03:31:39] did someone manually fix lvs5? 
[03:32:02] was not puppeting for over 24hrs [03:55:57] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:55:57] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [03:55:58] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [03:55:58] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [03:55:58] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [03:55:59] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [03:55:59] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [03:56:00] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [03:56:00] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:27:26] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [05:29:41] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:35:59] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [05:47:59] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [06:01:29] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:02:50] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [06:06:26] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [06:10:29] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:12:44] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [06:30:24] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [06:37:26] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [06:40:26] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [06:42:23] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:47:02] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [06:49:44] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:53:21] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [07:16:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [07:19:30] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:21:54] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [07:25:57] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:48:27] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [07:55:21] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:55:57] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [07:58:57] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [07:58:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [07:58:57] PROBLEM - 
Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [07:58:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [08:01:03] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:57] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:04:57] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [08:09:27] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [08:57:42] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [09:00:51] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [09:01:54] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [09:03:53] good morning [09:05:12] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [09:05:30] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:05:48] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:05:58] morning [09:06:09] sigh, or not [09:07:00] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 65507 bytes in 0.031 seconds [09:07:09] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52639 bytes in 0.029 seconds [09:07:59] I got the page for ipv6 but not ipv4 [09:08:03] good morning paravoid ;) [09:09:18] there are network spikes in bits caches, imagescalers, LVS [09:09:25] and we just lost LVS graphs for some reason [09:10:44] oh joy [09:12:19] and swift pmtpa [09:17:21] I think we can call ms-be6 dead already [09:17:43] I thought it was being worked on [09:17:57] worked on how? [09:18:41] 21:37 cmjohnson1: shutting ms-be6 down for hardware testing/replacing [09:18:44] that's from august 16 [09:19:04] it's the most recent log entry [09:19:45] there's also an open ticket for it [09:20:17] someone has the SOL open, is it you? [09:20:26] no [09:20:52] I"ll guess one of ben or chris [09:21:53] sigh, I'll leave it to them then [09:21:59] the scallars are definitely doing more work now, but I don't think it's a big deal [09:22:00] it's obviously nothing new [09:22:03] right [09:22:20] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:47] since bits seems to have settled down and we're not getting any further flapping [09:22:49] bits caches traffic out is half of what it was [09:22:56] esams that is [09:24:02] that's because ganglia can't reach cp3002 [09:24:22] ok, fixed [09:24:26] yep [09:24:54] !log restarted gmond in cp3002 [09:25:00] what made it die, I wonder [09:25:06] Logged the message, Master [09:25:47] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [09:26:31] 11T copied, 8 T left... bleah [09:28:11] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:28:11] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:31:38] oh fun [09:31:48] I know why LVS graphs are so fucked up [09:31:55] FUN! [09:32:02] # date [09:32:02] Mon Aug 20 09:26:31 UTC 2012 [09:32:09] they're 6 minutes off [09:32:38] how do they just happen to have a 6 minute drift? that's a lot [09:33:11] ntp's not running [09:33:27] in none of them [09:33:56] on any lvs? [09:34:08] * apergos wonders if that's intentional [09:35:36] ... [09:37:15] base, ganglia. 
no ntp::client [09:42:49] New patchset: Faidon; "Switch LVS servers to include standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20681 [09:43:06] I'm going to let mark review that, esp. the initcwnd part seems scary [09:43:25] yes, I would say he should check it first [09:43:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20681 [09:43:57] you could just include ntp::client separately. but either way he should give it the ok [09:46:06] I wonder what will happen if I just run ntpdate on the LVS servers :) [09:48:08] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [09:48:24] I'm fed up with mw8, I'm going to just shut it down [09:49:26] !log powering off mw8, faulty (#3425), has been flapping a lot [09:49:36] Logged the message, Master [09:50:41] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:13] so, the remaining puzzle is why imagescalers/swift have this increased traffic for almost the past hour [09:52:46] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [09:52:51] yeah, that's noteworthy all right [10:08:54] ok, it's going down [10:08:57] (by itself) [10:12:41] couldn't make out anything quick enough by looking at th elogs [10:14:10] and now it's settling as you say [13:19:22] mark: here? [13:29:58] paravoid: yes [13:30:13] heya :) [13:30:16] hi [13:30:33] so, lvs servers have their clocks off by 6-7 minutes [13:30:38] yes [13:30:39] not ntp::client is running on them [13:30:42] no [13:30:46] if you run ntpd on them, their performance more than halves [13:30:54] oh wow, really? [13:30:56] yes [13:31:00] well, it used to [13:31:04] perhaps not anymore on lucid or something [13:31:05] er [13:31:07] precise [13:31:12] haven't tested in a while [13:31:40] okay, I've submitted https://gerrit.wikimedia.org/r/20681 which includes ntp::client and more [13:31:52] that's why they don't include standard [13:31:53] manually run ntpdate in cron ? [13:32:04] yeah that would probably work [13:32:09] i think there's an rt ticket for that [13:32:14] is ntpdate safe? [13:32:21] yes [13:32:33] well [13:32:36] big jumps are never safe [13:32:37] but i mean [13:32:41] ntpdate in cron is safe [13:33:06] I meant ntpdate now, for a 6-7min jump :) [13:33:27] probably not safe [13:33:39] yeah, figured as much and didn't do it [13:33:40] i'm betting pybal will lockup or similar [13:34:11] use ntpdate with -B flag, which forces adjtime() sleewing [13:34:25] for 6-6mins skewing will take days [13:34:33] that's ok [13:34:53] but it may well be that adjtime() is what's halving the perf [13:35:04] although I think our lvs servers have a lot more headroom nowadays then they did back then [13:35:06] that's what I was going to say [13:35:15] so it's probably not really a problem [13:35:20] ntpdate -B is equivalent to ntpd, so no point [13:35:22] i just haven't tested it for lucid or precise iirc [13:35:27] either we should run ntpd or we shouldn't [13:35:36] okay, I could try it on one of them and see [13:35:52] you either hit the pps limit or you don't [13:35:53] what was the performance problem exactly? cpu load? [13:35:54] and probably you won't [13:36:04] no, just would start dropping packets earlier [13:36:20] but if you don't hit that threshold, not much of a problem [13:37:05] uh, okay, what do you suggest then? 
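A rough sketch of the two options being weighed above (cron'd ntpdate vs. slewing the offset with adjtime via ntpdate -B), with the slew-time arithmetic spelled out. The NTP server name is a placeholder, not the real WMF peer, and the cron schedule is an assumption:

```
# Check the current offset without touching the clock (-q = query only):
ntpdate -q ntp.example.org

# Option 1 (mark's suggestion): step the clock periodically from cron instead of
# running ntpd on the LVS hosts. Hypothetical /etc/cron.d/ntpdate-lvs entry:
#   17 * * * *  root  /usr/sbin/ntpdate -u ntp.example.org >/dev/null 2>&1

# Option 2: slew the existing offset with adjtime (ntpdate -B). At the usual
# ~0.5 ms/s maximum slew rate, a ~6.5-minute offset takes on the order of nine days:
echo '390 / 0.0005 / 86400' | bc -l    # offset [s] / slew rate [s/s] / seconds per day
```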
[13:37:32] i suggest, stop caring ;) [13:37:36] works well [13:37:38] ganglia graphs are all borked [13:38:06] that's why I started looking at it [13:38:11] you can do performance testing with lvs to see if the problem's still there [13:38:15] but that's quite a bit of work [13:38:27] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=LVS+loadbalancers+pmtpa&m=load_one&s=by+name&mc=2&g=load_report [13:38:30] i used a packet generator at the time [13:38:30] see that space on the right? [13:38:36] that's the time lag [13:38:38] hehe [13:39:01] i believe there were very many adjtime calls back then [13:39:11] perhaps ntp can be run in a way where it only does them once a minute or so [13:40:42] why don't we do a dist-upgrade, reboot & ntp sync on one and see if it's still an issue? [13:40:53] there's nothing to dist upgrade [13:40:55] they're precise already [13:40:58] Jeff_Green: they're precise already [13:41:02] heh [13:41:11] lvs2 has a stale kernel [13:41:23] does a regular 'upgrade' do kernel? [13:41:31] why would that fix it [13:41:36] this was like over 4 years ago [13:41:45] this minor security update is not gonna fix that issue [13:41:57] btw, another issue: because LVS include base and not standard [13:42:01] my bet is that it's been long fixed, but we should be doing security updates routinely [13:42:05] they don't get generic::tcpweaks aka initcwnd [13:42:10] should they? [13:42:16] not necessarily [13:42:29] generic::tcptweaks should be in base [13:42:33] there's no reason why it wouldn't be [13:42:34] but [13:42:41] there's no reason why it would help for lvs either [13:43:08] the lvs servers don't get in the middle of the tcp handshake, right? [13:43:16] indeed [13:43:22] right, so no effect at all [13:43:50] i need to move syslog on nfs1 to somewhere else before wednesday [13:43:58] what's wednesday? [13:44:04] moving /home to the netapp [13:44:06] on [13:44:09] oh rly? nice :) [13:44:47] or I can keep it on nfs for now, but on a separate partition then [13:44:55] but noone can login on nfs except roots [13:45:34] isn't that a good thing? :) [13:45:52] devs can't check for apache segfaults then [13:46:15] ah. I was thinking MW and fluorine [13:46:24] that's udp2lo [13:46:25] g [13:46:28] yeah yeah [13:46:39] I just didn't think of apache segfaults [13:46:49] thankfully it's not me who does this transition then [13:46:50] i'm not sure how important it is [13:46:59] how so? [13:47:09] I would have forgot about it [13:47:16] New patchset: Platonides; "Avoid having to wait 4 days when testing the WLM app." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20702 [13:47:21] one look at 'top' on nfs1 would have showed it ;) [13:47:44] the apache segfault/dev access thing I mean :) [13:48:21] speaking of apaches, we had big imagescaler/swift spike of traffic for about an hour [13:48:30] like 5 times the normal traffic [13:48:59] ok [13:49:11] and two alerts/pages with no definite cause yet [13:49:31] your input is very welcome [13:50:17] i saw an ipv6 one, hours after [13:51:19] 12:03 < paravoid> good morning [13:51:20] 12:05 <+nagios-wm> PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:51:23] 12:05 <+nagios-wm> PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:27] 12:05 <+nagios-wm> PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:29] clearly my fault [13:51:32] (for saying good morning) [13:56:21] yay! mw8's shut down! :) /me is catching up in scrollback ;) [13:56:46] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [13:56:46] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [13:56:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:56:47] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [13:56:48] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [13:56:48] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [13:56:49] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [13:56:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:56:55] woooo [14:04:55] fwiw, ntpdate -B ignores me [14:05:20] it prints "offset 258 sec" and does nothing [14:13:00] mark: I'm trying to separate legacy from reality, a little help if I don't bother you too much? [14:13:17] mark: /home/w/conf/squid/generated has a lot of yaseo files [14:13:29] and that's what the wiki page for Squids says too [14:13:35] am I looking at the completely wrong place? [14:17:40] yaseo is legacy [14:17:43] our old south korean cluster [14:17:55] I remember that [14:18:07] but is /home/w/conf/squid the canonical place for modifying squid configs? [14:18:12] yes [14:18:17] and is the wikitech squid page more or less accurate? [14:18:57] I saw yaseo references in both, hence by doubt [14:18:58] i believe so [14:19:09] okay, thanks [14:19:24] I'm trying to prepare for switching squids to swift [14:19:45] except the "current clusters" stuff it's pretty accurate [14:19:54] it hasn't changed much in the last 5 years or so [14:19:57] :) [14:20:14] so the generated/*yaseo* are cruft that I can safely rm [14:20:32] yes [14:21:04] thanks [14:39:05] apergos: around? 
[14:39:18] I'm looking at the squid->swift originals for tonight [14:39:24] I'm here [14:39:27] yes? [14:39:29] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - The [14:39:38] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - The [14:39:40] that's in 2 hours and a bit, right? [14:39:51] yeah [14:40:13] so, when I hit ms-fe.svc.pmtpa.wmnet originals I get 401 [14:40:19] while thumbs return 200 [14:40:31] has Ben told you anything about this part? [14:40:50] I thought it was just squid that we had to change [14:41:04] that's also what I thought [14:41:07] so no, I don't know [14:41:33] uh, okay [14:42:29] I have no extra info from ben about any of this [14:43:29] Could someone please run as root on fenari: chgrp wikidev /home/wikipedia/common/php-1.20wmf10/cache/l10n [14:43:30] Thanks! [14:44:14] just the dir? [14:44:27] Yeah, it's got no files in it currently [14:44:29] done [14:44:49] The permission are somewhat confusing, but I know Roan had to fix it again last deployment, so I guess wmf9s should be "right" [14:44:50] Thanks [14:45:31] yep [15:08:26] morning paravoid. [15:08:32] hi Ben! [15:08:35] I'll be in in about an hour, and IIRC our window starts in 2 [15:09:08] you're not seeing the failures when mediawiki asks for originals because it knows about the sharding and the requests from MW don't go through rewrite.py. [15:09:10] yes, you remember correctly [15:10:14] if you look at the proxy config .erb and the role/swift.pp (and maybe proxy-server.conf on a front end) you'll see what I mean about the container list to shard. [15:10:28] rewrite.py takes it literally, so since only the -thumb containers are listed it's not sharding the -public containers. [15:10:47] (The same config exists in MW but it's listing wikis instead of containers, so shards all containers for that wiki) [15:11:33] ok, I' gotta get on the road. any last bits before I head out? [15:13:06] btw, feel free to prep a puppet change, if you feel like you get it... ;) [15:31:38] apergos: I'm doing the squid changes, want to give ^^^ a shot? [15:32:09] I'm still trying to understand the templates and the config files yet [15:32:33] New patchset: Pyoungmeister; "swapping keys for myself (py)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20725 [15:32:56] grep for shart_container_list [15:33:01] that's what needs changing [15:33:12] shard* ? [15:33:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20725 [15:33:19] yes, thanks [15:33:29] role/swift.pp:119 [15:33:35] and role/swift.pp:172 [15:34:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20725 [15:49:49] !log Upgrading JUNOS on asw2-d3-sdtpa to 11.4R2.14 [15:49:58] Logged the message, Master [15:57:59] New patchset: SPQRobin; "(bug 34817) Enable WebFonts on Burmese Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20727 [15:58:16] maplebed: hi [15:59:39] hi paravoid! [16:02:56] paravoid: did you decide to stage the container shard listing change? [16:03:03] or shall I do that nw? 
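A rough illustration of the sharding maplebed describes above: MediaWiki addresses sharded containers directly, while rewrite.py only shards containers named in its shard list, which is why -public and -temp need to be listed alongside -thumb. The shard computation below is an assumption for illustration only (first two hex digits of md5(filename), matching the a/a9 prefix in upload URLs); the authoritative logic lives in rewrite.py and MediaWiki's Swift backend:

```
# Hypothetical sketch of how a sharded commons container name is derived.
name="Example.jpg"
shard=$(printf '%s' "$name" | md5sum | cut -c1-2)
echo "wikipedia-commons-local-public.${shard}"   # originals
echo "wikipedia-commons-local-thumb.${shard}"    # thumbnails
echo "wikipedia-commons-local-temp.${shard}"     # staging uploads (per the discussion above)
```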
[16:03:14] I was looking at squid, so I told apergos [16:04:05] 18:56 -!- apergos [~ariel@wiktionary/ArielGlenn] has quit [Read error: Operation timed out] [16:04:19] that was 8 minutes ago, so I guess we shouldn't wait [16:04:25] I don't see it in the open changes in gerrit, [16:04:29] so I'll say it didn't happen. [16:04:50] !log Upgrading JUNOS on asw2-a5-eqiad to 11.4R2.14 [16:04:59] Logged the message, Master [16:05:13] should we just append wikipedia-commons-local,wikipedia-de-local,... [16:05:20] is that what needs to happen? [16:05:29] no, the list must contain the full (unsharded) container names [16:05:32] or do we need to explictly list all the shards? [16:05:44] so not just -local but -local-public and -local-temp [16:06:08] oh right, -public [16:06:13] is -temp actually used? [16:06:24] yeah. [16:06:44] is -temp uploadwizard staging, etc.? [16:07:00] aaron would have a more reliable answer but I think so. [16:08:16] maplebed: are you doing it? [16:08:19] yes. [16:08:26] oh okay [16:08:33] in other news, I think I'm done with the squid changes [16:08:39] they're not deployed obviously [16:09:11] can you put a diff in /tmp/ on fenari for me to review? [16:09:20] New patchset: Bhartshorne; "adding public and temp containers to the shard list since mediawiki expects all three, not just thumbs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20729 [16:09:31] and I'd appreciate the same for ^^^ [16:09:35] (a review that is) [16:10:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20729 [16:12:20] I wonder if we should do ^^^ differently... [16:12:33] that huge list certainly does not look like DRY [16:12:47] potentially, but I think not at the moment. [16:13:16] well, actually, I suppose it wouldn't be too much work to do the -thumb, -public, and -temp in rewrite.py... [16:14:26] New review: Faidon; "Looks good, for now." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/20729 [16:14:33] maplebed: +2ed but not merged [16:14:40] k. [16:14:42] tnx. [16:14:45] maplebed: as for squid, diff -urp deployed/ generated/ [16:15:28] now also in /tmp/bensqdiff.diff [16:15:29] funny how we caught that 20' apart [16:20:33] it took me this long to figure out which shards needed to be added [16:20:41] of course you guys are long since done with that [16:20:55] apergos: https://gerrit.wikimedia.org/r/#/c/20729/1/manifests/role/swift.pp if you want to review. [16:21:02] I'm looking at it yeah [16:21:23] paravoid: did the squid config previously send everythingc to ms7 [16:21:30] or did it still go through a regex acl? [16:21:55] it send thumbs to swift, rest to ms7 [16:22:08] don't the deleted ones go in ther etoo? [16:22:16] btw, was the even scheduled for now? I thought it was for 40' from now [16:22:24] but I just got a reminder [16:22:37] ie wikipedia-commons-local-deleted etc [16:22:44] http://wikitech.wikimedia.org/view/Software_deployments is the authority. [16:23:14] apergos: they do, but deleted requets always come from mediawiki (never directly from a client) [16:23:24] so rewrite.py doesn't need to know how to shard them. [16:23:37] maplebed: that says 16:00 UTC< i.e. now. [16:23:47] do we imagine a future where we might write a client that needs it? (for testing or whatever) [16:23:57] maplebed: so, are you merging that? [16:24:03] paravoid: in a sec. 
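For reference, the squid-config review flow used above looks roughly like this; the regeneration step is not shown in the log and is left as a placeholder, and the deploy commands only appear later in the conversation:

```
cd /home/w/conf/squid
# ... edit the templates, regenerate into generated/ (tooling not shown in this log) ...
diff -urp deployed/ generated/ > /tmp/bensqdiff.diff   # stage a diff for review
./deploy sq51            # later: push to a single depooled squid for testing
# ./deploy cache pmtpa   # and, once reviewed, to the whole pmtpa cache cluster
```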
[16:24:19] apergos: I don't think so, since the ability to view deleted pictures requires being logged in and having the right privs. [16:24:55] so we would never run some cleanup job that polled swift directly [16:25:09] I guess if we do we can add that change then [16:25:14] apergos: if we did, it would be an authenticated job, therefore also skipping rewrite.py. [16:25:21] I see [16:25:43] paravoid: I'm a bit concerned about allowing *all* traffic that hits upload through to swift. [16:25:56] what do you mean? [16:26:15] I think I'd rather keep the regex acls so, for example, you can send regular swift api calls through it. [16:26:31] sorry. [16:26:36] *can't* send regular swift api calls. [16:27:17] we have no such regexp [16:27:48] we have one just for thumbs [16:28:04] I'm not sure if we should repeat the whole namespace in the squid conf [16:28:05] we do - it's the same as the one that rewrite.py uses to determine whether it should handle a request. [16:28:28] in squid I mean [16:29:00] I agree. Look at rewrite.p lines 249-252 [16:29:19] and the regular swift calls are on the same URLs, are they not? [16:29:34] so you could use DELETE with a header already... [16:29:45] with a token header [16:30:36] no, you couldn't. [16:30:47] the thumb acl wouldn't pass it through. [16:31:37] we're doing originals now though [16:31:52] and a url regexp is not enough to block api calls [16:32:37] all authenticated API calls start with the auth bits as defined in rewrite.py at those lines. [16:32:53] though I hate blacklists instead of whitelists, I believe that does catch them and could reject them at squid. [16:35:26] so, [16:36:12] acl swift_auth url_regex ^http://upload\.wikimedia\.org/(auth|AUTH).* [16:36:26] http_access deny swift_auth [16:36:30] is that what you suggest? [16:36:55] (my squid experience is very limited, don't assume I know what I'm doing) [16:37:11] yes for the lowercase. for the upper case, rewrite.py doesn't anchor it at the beginning, but I don't remember why. one sec while I check that. [16:37:26] (I think it's because python's startswith doesn't do character classes, but I just want to confirm) [16:37:53] http://wikitech.wikimedia.org/view/Swift/Hackathon_Installation_Notes#testing_the_object_store [16:38:02] it's not anchored; it's got v1/ in front of it. [16:38:07] for the uppercase stuff. [16:38:35] though rewrite's allows it to have the AUTH string anywhere in the URL [16:38:54] which is wrong? :) [16:39:11] well, it's more restrictive. [16:39:19] er? [16:39:30] if I name an image AUTH_[0-9a-f]... [16:39:39] then it would fail, yes. [16:40:00] so long as it has between 32 and36 hex chars after the AUTH_ [16:40:27] I think I did it that way because it's not always v1. [16:40:49] I'd like to keep it that way for now (in both squid and rewrite) [16:40:56] acl swift_auth url_regex ^http://upload\.wikimedia\.org/(auth|v[^/]+/AUTH).* [16:41:28] how about that? [16:42:08] what do you suggest? auth|.*AUTH_[0-9a-fA-F].*? [16:42:24] do you know if we can say {32,36} in squid's acl regex? [16:42:35] I don't [16:42:42] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [16:43:09] mark: do you know if we can use character set repetition syntax in squid's acl? eg [a-c]{3} meaning (3 of any a, b, or c)? [16:43:33] don't know offhand [16:43:46] paravoid: my test to look for curretnly used URL patterns: tcpdump on ms-fe1 | grepping for AUTH_. 
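A quick way to compare the two candidate patterns discussed above is to test them with egrep (the log later notes squid's url_regex is extended-regex style); whether a given squid build accepts the {32,36} interval still needs checking on a depooled host. The sample URLs are made up for illustration:

```
strict='AUTH_[0-9a-fA-F-]{32,36}'                              # token-shaped strings only
loose='^http://upload\.wikimedia\.org/(auth|v[^/]+/AUTH).*'    # the ACL proposed above
for url in \
  'http://upload.wikimedia.org/v1/AUTH_0123456789abcdef0123456789abcdef/c' \
  'http://upload.wikimedia.org/auth/v1.0' \
  'http://upload.wikimedia.org/wikipedia/commons/a/a9/AUTH_x.jpg'
do
  s=allow; l=allow
  echo "$url" | egrep -q "$strict" && s=block
  echo "$url" | egrep -q "$loose"  && l=block
  echo "strict=$s loose=$l  $url"
done
# The loose pattern blocks both API forms while still allowing the AUTH_x image name.
```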
[16:44:03] I did tail /var/log/syslog [16:45:23] which is ~600M already, not sure what will happen when we switch squids to it btw :) [16:45:47] anyway. have you found any patterns not caught by the above acl? [16:49:24] sadly I think AUTH_[0-9a-fA-F] is too liberal - there are valid images (that currently exist) that have AUTH_x (with just one alphanumeric). [16:49:40] grrr [16:49:47] the thing that makes rewrite's effective is it only matcthes 32-36 hex digits. [16:49:51] (and hyphens) [16:50:33] repeating: [16:50:34] acl swift_auth url_regex ^http://upload\.wikimedia\.org/(auth|v[^/]+/AUTH).* [16:50:39] anything wrong with that? [16:50:43] yeah, let's go with that one. [16:51:52] hmm according to some email in 2002 it's the same regex as egrep [16:52:08] extended regex [16:52:10] paravoid: is there currently one that already restricts only upload.wiykimedia.org to this logic? [16:52:21] not that I can see of [16:52:38] in which case I could send a bad Host: header and bypass that blacklist. [16:55:44] yuck [16:57:34] I don't like how we're figuring this out in the middle of our MW... [16:58:16] yes, prepping the change earlier would have been better. [16:58:23] we can postpone the window and keep going. [16:59:42] we could do urlpath_regex and block that [16:59:59] ^/auth etc. [17:00:11] yeah, that's good! that'll work. [17:00:37] hm, that may affect more than upload though, and that's bad [17:01:11] aha, I could add it conditionally [17:01:13] let's see. [17:01:14] the squid template has php conditionals that restrict it to thumbs. [17:01:16] err.. to upload. [17:01:35] yes [17:02:43] we don't have a good way to test something on one test squid, right? it's either on the production cluster or nothing [17:02:54] sure we do. [17:02:57] oh? [17:03:01] take one out of rotation (in pybal) [17:03:08] then the deploy command takes an individual host as an argument. [17:03:12] then we test usincg curl. [17:03:42] why not try {32,36} on one? [17:04:03] +1 [17:04:31] here's the list of squids: http://noc.wikimedia.org/pybal/pmtpa/upload [17:04:42] I'll take sq41 out of rotation. [17:05:32] I'm on sq51 already [17:05:39] ok, I'll take 51 out. [17:05:51] :-) [17:05:52] thanks. [17:07:41] ok, pybal conf saved; traffic should fall off soon. [17:09:14] I can't believe the squid page still says "feel free to check in your changes to RCS. " [17:09:16] geez [17:10:04] maplebed: see diff again [17:10:16] apergos: I'm afraid it's not just the page... there's an RCS/ dir there [17:10:29] there is [17:11:24] if you want a laugh you may look at the timestamps on the files in that dir [17:12:33] I saw the yaseo files, I guess that's enough :) [17:12:46] paravoid: that looks worth a shot to me. [17:13:02] have you pushed the change to swift.pp? [17:13:15] not yet. I'll do that now. [17:13:34] yeah, that's a prerequisite [17:14:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20729 [17:18:32] there's another thing I just saw that we need to exclude. ^/$lang/{graph,math,timeline}/ need to go to ms7. [17:18:33] we still have an acl ms4_thumbs and an acl ms5_thumbs in there, do we want those? [17:18:54] grumblegrumble [17:19:27] running puppet on ms-fe1 [17:20:09] what's lang/...? [17:20:26] (where $lang == /en, /it, /de, etc. [17:20:28] ) [17:20:52] (really ^/[^/]+/(graph|math|timeline)/.*) [17:20:57] what's that? [17:21:47] some extensions that I didn't realize are still hooked into NFS. 
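The depool-and-test procedure sketched above, in rough shell form. sq51 and the deploy command come from this log; the image path, headers and expected responses are illustrative assumptions, and note (as discussed further down) that disabling a host in pybal only stops frontend traffic, while other squids keep using its backend as a CARP peer:

```
# 1. In the pybal pool file for upload/pmtpa, set the host to enabled: False
#    (rather than deleting the line), then wait for frontend traffic to drain.
# 2. Push the new squid config to just that host:
cd /home/w/conf/squid && ./deploy sq51

# 3. Exercise the frontend (:80) and backend (:3128) instances directly:
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org/wikipedia/commons/a/a9/Example.jpg' | head -1
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org:3128/wikipedia/commons/a/a9/Example.jpg' | head -1

# 4. Swift API paths should now be refused by the new ACL:
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org/auth/v1.0' | head -1
curl -sI -H 'Host: upload.wikimedia.org' 'http://sq51.wikimedia.org/v1/AUTH_0123456789abcdef0123456789abcdef/x' | head -1
```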
[17:22:20] ok, puppet change deployed to ms-fe1; lemme test it [17:22:44] oh noes [17:22:44] what are they? how should I name the ACL? what comment should I put? [17:23:00] give me something :) [17:23:19] like he says, the math stuff, the timeline extension... [17:23:20] # math extension still requires NFS. send these to ms7 until we can fix that. 2012-08-20 -ben [17:23:22] :D [17:23:38] (or -paravoid if you want to take credit.) [17:23:39] ;) [17:23:40] this means [17:23:49] we *still* can't kill Solaris >_< [17:24:36] hmm, not seeing any /graph dirs [17:24:45] * AaronSchulz wonders where he remembered that from [17:24:57] test successful - I can fetch an original from ms- [17:25:02] ms-fe1 but not ms-fe2. [17:25:08] deploying puppet change to ms-fe2-4 [17:26:28] maplebed: added to squid.conf [17:26:32] apergos: for now, for now. soon.... [17:27:10] paravoid: looking. [17:27:30] maplebed: I get 200 instead of 401 for some random URLs I've been trying. [17:27:37] so, I confirm that the puppet change works. [17:27:42] \o/ [17:27:56] yay [17:28:09] do we wanna try {32,36} now? [17:28:26] should I deploy to sq51? [17:28:41] oh it hasnt gone out? ah ha [17:28:46] traffic hasn't dropped off. [17:28:51] I must have done something wrong. [17:29:15] maplebed: looks like you don't need /graph [17:29:27] mark: I made the change reflected in http://noc.wikimedia.org/pybal/pmtpa/upload but sq51 is still getting traffic. Do you know what step I'm missing? [17:29:40] yes [17:29:46] you should put enabled: False [17:29:48] not comment it out [17:29:51] ah. [17:30:10] done. [17:30:14] I've been told to never ever remove lines before leaving enabled=False for a while [17:30:20] that's not it [17:30:23] removing works as well [17:30:30] without removing, checks continue [17:30:43] http://ganglia.wikimedia.org/latest/?c=Upload%20squids%20pmtpa&h=sq51.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 clearly shows reduced traffic, but there's still a ton of requests flying by. [17:31:33] oh, wait, is tcpdump lying to me because of our network config? [17:32:00] no, that wouldn't account for the sustained 30MBps out. [17:33:09] well incoming requests claim to be about 0... so what is being sent out? [17:34:01] I see plenty of traffic between sq51 and ms-fe. [17:34:02] oh! [17:34:13] taking it out of lvs only takes it out of the frontend squid pool. [17:34:26] squids will still treat it as a peer for the backend squids. [17:34:47] right? [17:36:00] hrmph. [17:36:17] I think so too. [17:37:56] we could disable it in frontend.php and push that [17:38:08] although I dislike doing two changes in the same tree. [17:38:08] cachemgr.cgi would show us [17:39:01] paravoid: the squid config you have looks like it's ready to test by me. [17:39:04] I'm fairly sure that's the case. I have frontend.conf open [17:39:14] I also want to test the {32,36} thing [17:39:17] (re: backend squid pool) [17:39:41] why? there's no way ^/v[^/]+/AUTH.* is going to match any files. [17:39:50] I don't like encoding the same logic over and over across config and systems [17:40:05] what if we change the length of the token at some point? will we remember to change squid.conf too? [17:40:33] say, if the new swift version switches from sha1 to sha256 [17:41:24] so, should I run ./deploy sq51.wikimedia.org ? [17:41:31] or is it ./deploy sq51? [17:41:32] http://noc.wikimedia.org/cgi-bin/cachemgr.cgi [17:41:57] paravoid: yeah, ok. [17:42:13] which part? 
:) [17:42:43] stick to ^/v[^/]+/AUTH.* [17:42:54] okay [17:42:59] should I run ./deploy sq51? [17:43:40] we have 17' left, so please ack soon :) [17:44:51] maplebed: ahh, its "graphs/" by an old extension I wrote on wikinews ;) [17:44:59] not "graph" [17:45:01] let's deploy to sq51 and see what happens. [17:45:12] AaronSchulz: is it still in use? [17:45:21] surprisingly yes [17:45:22] graph -> graphs, fixed. [17:45:26] cool. [17:45:35] totally unneeded since AFT, but whatever [17:51:45] deployed on sq51 [17:51:58] ok, my curl test tests the frontend but not the back. [17:52:05] can I just put :3128 to test the backend? [17:52:15] yes [17:52:23] I just tried that for a random image and it seemed to work [17:52:33] access denied from both sq51 and 52 [17:52:50] ? [17:52:52] I tried from fenari. [17:55:11] so, it seems to work [17:55:17] agreed. [17:55:19] 5' left until the end of our window [17:55:22] do we deploy all? [17:55:30] looking at the headers, I also see that it doesn't say it comes from a sun server, [17:55:34] what other deploys are going on? [17:55:36] which means it actually got it from swift. [17:56:03] yeah, and there's an X-Object-Meta-Sha1base36 which is swift I think [17:56:10] robla AaronSchulz: is it ok for us to run a little bit over our window? [17:56:16] there's still a chance Swift e.g. might not be able to handle the load [17:56:17] +1 paravoid [17:56:32] rolling back is easy if swift falls over. [17:56:44] ok by me [17:56:49] AaronSchulz: would you ask robla? [17:57:03] Reedy: you ok with us rolling over our window by just a bit? [17:57:04] yes, I'm just saying that there are still risks involved, so we must have a window open [17:57:14] Yeah [17:57:19] There's little to do in this one [17:57:24] ok. [17:57:28] All the prep work was done earlier [17:57:45] * maplebed looks at tcpdump on sq51 [17:58:10] rob is ok with it [17:58:20] tnx. [17:58:37] so we should see some increase in traffic to swift from sq51 [17:59:04] that's exactly what I was thinking [17:59:16] but I don't see anything [17:59:43] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [17:59:43] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [17:59:43] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [17:59:43] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [18:00:00] there are a bunch of established conns to ms-fe [18:00:17] it's too small to be noticed, I think. [18:00:34] okay [18:00:38] well it's only back end requests so it shouldn't be much [18:00:53] so, ./deploy cache pmtpa [18:00:55] right? [18:01:35] one sec [18:02:04] I think I misread tcpdump for the math stuff. [18:02:58] it's project/lang/math/, not /lang/math. [18:03:01] damn. [18:03:07] one more change? [18:03:15] AaronSchulz: can you confirm that ^^^ [18:04:19] yeah, it's a sibling to the 0-9a-f dirs [18:04:33] AaronSchulz: that's not what I mean. [18:04:45] upload.wikimedia.org/wikipedia/en/math/d/a/9/da9ddfd0fd19xxxxx.png [18:05:12] that looks like a sibling to 'thumb' not the shard. [18:05:24] oh, but htat's ok. [18:05:28] that'll go in the public bucket? [18:05:31] yeah. [18:05:43] our regexp doesn't catch that. [18:05:43] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:05:53] I think that's ok. 
[18:05:55] maplebed: I was talking about how it was stored, not the url [18:06:12] paravoid: I think you're clear to delpoy to all the upload squids. [18:06:27] no I'm not. [18:06:31] /wikipedia/en/math/0/c/5/0c53a53c3f0d7b4cf0b3c58b723dc7b5.png [18:06:35] that fails on sq51 [18:06:52] really math was supposed to be in ms5 [18:07:06] but it switched to ms7 by accident when math became an extension [18:07:19] hahaha [18:07:20] nice [18:07:46] our regexp doesn't catch that and it's not in swift [18:07:51] should we just adapt the regexp? [18:08:09] and at this point sq51's cache needs clearing too I think [18:08:17] how do those get written? (math images) [18:08:20] New patchset: Ottomata; "Updating java.pp for Precise." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20741 [18:08:39] maplebed: you're lagging... [18:08:44] sorry. [18:08:57] I think the math stuff will end up in the public bucket. [18:09:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20741 [18:09:08] it doesn't work as it is [18:09:20] the above URL is a real URL that should work and it currently 404s. [18:09:21] ok, so we should adapt the regex. [18:09:26] * AaronSchulz wonders how math works at all now [18:09:35] that is what I was asking [18:09:49] ^/[^/]+/[^/]+/(graphs|math|timeline)/.* [18:09:51] sorry, I didn't paste a complete URL - the xxxxxs need to be replaced. [18:10:04] ok, I guess I see [18:10:24] url path is "//upload.wikimedia.org/wikipedia/en/math", fs path is "/mnt/upload6/wikipedia/en/math" [18:10:29] but you're right; it doesn't work. [18:10:36] okay, now it does [18:10:39] so yes +1 paravoid's regex. [18:10:50] it's already built and deployed on sq51 [18:10:54] and tested. [18:11:07] New patchset: Ottomata; "Updating java.pp for Precise." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20741 [18:11:26] my tests confirm. [18:11:36] can you confirm that graph/math/timeline are the full extent of it? [18:11:47] no; I only see math. [18:11:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20741 [18:11:50] I feel a bit uneasy discovering corner cases 10' past our window [18:11:50] I trust aaron on that part. [18:12:09] okay. [18:12:13] so, deploy? [18:12:21] +1. apergos? [18:12:42] paravoid: the math/timeline stuff has been known for months [18:12:58] not to me. [18:13:03] the ext-dist stuff... how does that get served? [18:13:11] what's ext-dist? [18:13:14] heh [18:13:52] apergos: what's ext-dist [18:14:09] extension distributor [18:14:23] does that come from upload? [18:14:24] where we checkout extensions from VCS and package them [18:14:56] sigh, I hate waiting. [18:15:01] you and me both. [18:15:11] a process on fenari writes to /mnt/upload6/ext-dist and /mnt/upload6/private/ExtensionDistributor/mw-snapshot [18:15:17] http://wikitech.wikimedia.org/view/ExtensionDistributor this. [18:15:30] it's stored over on ms7 yeah [18:15:33] and served from there [18:15:37] example URL on upload.wikimedia.org? [18:16:00] * AaronSchulz chuckles at $wgMathCheckFiles [18:16:01] those don't go through upload.wm.o so look slike we are good (but that must be solved before solaris can go away) [18:16:11] https://upload.wikimedia.org/ext-dist/AkismetKlik-master-2145603.tar.gz [18:16:13] ok, so apergos +1 deploy? [18:16:18] so last question [18:16:22] images on private wikis? [18:16:27] Reedy: that's a 404. 
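A quick egrep sanity check of the exception regex settled on above (the log notes squid's url_regex is egrep-style), against paths taken from this discussion; which squid directive the template applies it with is left to the config itself:

```
re='^/[^/]+/[^/]+/(graphs|math|timeline)/.*'
for p in \
  '/wikipedia/en/math/0/c/5/0c53a53c3f0d7b4cf0b3c58b723dc7b5.png' \
  '/wikipedia/en/timeline/some-chart.png' \
  '/wikipedia/commons/a/a9/Example.jpg' \
  '/ext-dist/AkismetKlik-master-2145603.tar.gz'
do
  echo "$p" | egrep -q "$re" && echo "ms7    $p" || echo "swift  $p"
done
# math and timeline paths stay on ms7; ordinary originals (and ext-dist) fall
# through to swift under this rule.
```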
[18:16:27] okay, at this point I think we should just abort. [18:16:30] we have tested those and know they work? [18:16:35] maplebed: you probably hit on sq51, it works here [18:16:36] maplebed: WFM from here [18:16:48] heh. [18:17:02] it seems that we're really unprepared and keep finding cases that fail, I think we should just abort for now. [18:17:06] +1 abort. [18:18:03] paravoid: you're unstaging the changes to squid.conf.php? [18:18:06] we can't use a whitelist? [18:18:15] yes I am. [18:18:53] don't we just want stuff like site/lang/[0-9a-f]/[0-9a-f]{2}/... ? [18:19:06] for testing private images we can use officewiki (where we all have accounts) I guess [18:19:07] meh [18:19:14] reverted on sq51 [18:19:39] anyone knows if I can safely clean sq51's cache? [18:19:53] I have never done that, no idea [18:19:59] I think cleaning an individual upload squid is ok. [18:20:05] (and really, it has to be ok.) [18:20:51] (this list of ever-expanding exceptions is one reason that russ had such an uphill battle getting this crap done) [18:22:12] !log cleaning sq51's cache, poisoned after swift changes staging [18:22:22] it sure takes a while, it has me worried... [18:22:22] Logged the message, Master [18:22:43] the box is gonna be hurtin for a bit. [18:22:45] so it goes. [18:25:46] ok, when that's done I'll put it back in LVS. [18:25:53] for now, time to write a deploy postmortem. [18:25:59] AaronSchulz: robla Reedy you're clear to start your delpoy. [18:26:01] sorry for running over. [18:26:09] no prob, and thanks [18:26:16] I don't think LVS has anything to do with it, that's the frontend cache [18:26:29] I've stopped the backend squid, so it shouldn't get any requests from frontends [18:27:01] PROBLEM - Backend Squid HTTP on sq51 is CRITICAL: Connection refused [18:27:11] (ignore that) [18:27:19] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:19] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:26] but not that [18:27:37] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:43] uh oh [18:27:44] ruh roh. [18:28:06] AaronSchulz: reedy robla - nevermind. something's unhappy that needs fixing before you start. [18:28:13] you said it was safe, didn't you? [18:28:13] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:19] killing sq51? [18:28:22] I'd already finished :D [18:28:29] er [18:28:40] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:41] paravoid: yes. though doesn't mean it's true. :P [18:28:44] that's all the image scalars [18:28:47] or scalers [18:28:58] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:16] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:17] that's right, doh [18:29:18] ya kinda have to be careful before you give someone like Reedy the all-clear :) [18:29:30] http://ganglia.wikimedia.org/latest/?c=Image%20scalers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:29:34] robla: I'd done all the hard work earlier ;) [18:30:06] hey dooodleeees, i need a couple of puppet things reviewed [18:30:10] one is eaaassyy [18:30:13] ottomata: fire now, sorry. [18:30:15] the other takes some reading [18:30:18] come back later? [18:30:18] oh! 
[18:30:20] didn't realize [18:30:23] sorry, ok thanks bye [18:30:41] I'm confused though, why would clearing one squid trigger the image scalers? the thumbs are all in swift. [18:30:54] I don't think they're related [18:31:06] we had an imagescaler spike in the morning too [18:31:08] (our morning) [18:31:12] orly? [18:31:18] wasn't that bad though [18:31:28] You can get new thumbnails, they're just sloow [18:31:35] looked different than this [18:31:40] yeah, look at the "day" of the graph above [18:31:45] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:31:50] these have a bunch of cpu wait [18:31:59] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies does show what I expected - an enormous increase in thumb requests with a spike in 200s (cuz most of them are found) [18:34:24] it's interesting that the earlier one was msotly cpu and this one is mostly iowait. [18:35:07] so, wait. we took a backend squid out of rotation [18:35:26] which means that a lot of thumb requests are not in the cache now [18:35:40] could it be that they're not in swift either so they get to be regenerated? [18:35:44] yup. [18:35:55] but that's surprising, given how long swift has been serving thumbs. [18:36:17] though for images that haven't been changed in *forever* it makes sense, I suppose. [18:36:26] since squid cache time is 80% * last_mod_date [18:36:46] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.109 second response time [18:37:22] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.558 second response time [18:37:26] okay, this is bad because sq51's cache is going down the drain as we speak [18:37:49] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [18:38:00] but recoveries are coming in... [18:38:17] I think it's ok, it's going to be the most frequently requested ones (that were gone) that get regenerated first [18:38:34] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64719 bytes in 6.572 seconds [18:38:35] although I am watching the load on ms5 [18:38:43] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [18:38:43] the load doesn't seem to go down terribly fast [18:39:46] really need a three level hash for these projects (directories are too dang large) [18:39:46] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.167 second response time [18:40:02] apergos: on ms5? [18:40:03] getting better now [18:40:04] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [18:40:05] not 7? [18:40:13] on ms5 [18:40:25] from WM writing to it? [18:40:26] don't worry about it, soon it will cease to be a problem [18:40:36] what's also interesting is that as soon as I put sq51 back up, we'll have this again :) [18:40:38] no, from MW fetching from it (and writing to swift?) [18:41:35] I guess the reads are the issue. 
but it's a complete guess with no real basis [18:41:44] load < 7 [18:42:10] RECOVERY - Backend Squid HTTP on sq51 is OK: HTTP OK HTTP/1.0 200 OK - 468 bytes in 7.879 seconds [18:42:24] and there goes he load again :-D [18:42:30] *the [18:42:40] eh [18:42:53] that's not good, dd still running [18:43:25] looking at the scalers, our canary in the coal mine [18:44:00] might be outa th woods [18:44:52] paravoid: (for later) I'm surprised the clean uses dd instead of just reformatting the partition. wouldn't that be faster? [18:45:04] maplebed: have a look at swift's latency. it got unusually high [18:45:09] what do you mean by "reformatting"? [18:45:25] oh, right, it doesn't use a filesyst.m. nevermind. [18:45:39] yes [18:45:50] paravoid: the latency includes requests that go back to the scalers, so it makes sense that it should rise. [18:46:18] I stopped sq51's squid again, let's see. [18:46:31] PROBLEM - Backend Squid HTTP on sq51 is CRITICAL: Connection refused [18:46:39] is the dd still not done? [18:46:59] no [18:47:03] it's 150G, takes a while [18:47:04] bah [18:47:47] okay, I'm looking at one of the image scalers [18:47:50] load 10-11 or so [18:48:05] a lot of apache processes, but very few spontaneous "convert" processes [18:48:25] I'm not sure that the load is from the actual convert process and not swift being slow to read/write [18:49:03] the put latency didn't rise much though, only the 404 latency. [18:49:05] well we're pretty done with wait on ms5, it's doing a bit more work than before but that's all [18:50:19] load of 10 is a bit much [18:51:02] but manageable [18:51:51] (on the scalers) [18:52:19] New patchset: Cmjohnson; "updating public key for cmjohnson" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20745 [18:53:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20745 [18:54:09] paravoid: did the dd finish? load on sq51 is now half cpu instead of all io [18:54:40] it did, I'm now verifying that the squid that run in the meantime didn't write anything [18:56:06] New patchset: Ottomata; "misc/statistics.pp - installing python-yaml for gerrit-stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20746 [18:56:44] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20746 [18:58:07] bwaaaa [18:58:53] New patchset: Ottomata; "misc/statistics.pp - installing python-yaml for gerrit-stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20746 [18:59:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20746 [19:03:52] paravoid: looks like your test finished? [19:06:01] RECOVERY - Backend Squid HTTP on sq51 is OK: HTTP OK HTTP/1.0 200 OK - 460 bytes in 0.005 seconds [19:06:37] yeah [19:07:20] so, swift was not very filled up with thumbs, was it? [19:07:31] it should have been [19:07:31] judging from what one squid did to it, and the 404 rate that I see in the graph [19:07:48] hi [19:08:03] hi mark [19:08:34] ganglia shows sq51 refilling its cache nicely. [19:08:48] so what's the result now? [19:08:57] mark: aborted [19:09:16] and recovering from the mess we produced. [19:09:23] flushed swift down the drain? 
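For context, the cache clean being discussed amounts to zeroing the backend squid's raw, filesystem-less cache partition while the backend instance is stopped; the device path and init script name below are placeholders, not taken from the log:

```
/etc/init.d/squid stop                 # stop the backend instance first
dd if=/dev/zero of=/dev/sdX3 bs=1M     # zero the ~150G raw cache partition; this takes a while
/etc/init.d/squid start
```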
[19:10:20] in short: we used sq51 as staging without removing traffic from it, several upload URLs corner cases where incorrectly directed at swift and 404ed, [19:10:40] aborted the config push, flushed sq51's cache to flush those incorrect 404s, [19:11:06] then several thumbs were not in the cache anymore and apparently not in swift either, image scalers load spiked [19:11:13] kind of recovering now [19:11:14] * maplebed puts sq51's frontend back in rotation [19:11:15] New review: Dzahn; "note this will not remove the old key if had been deployed before. you would have to really ensure i..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/20745 [19:12:15] maplebed: accurate description of what happened? [19:12:20] apergos: too [19:12:20] yup. [19:12:45] w/o removing back-end traffic [19:13:20] and I'm not really convinced that these image scalers have increased load due to gs/imagemagick [19:13:24] it did not help that ms5 was struggling for a bit there too [19:13:45] paravoid: when an image scaling requets comes in, MW checks to see if it already has the thumb. [19:14:05] if they were on ms5 but not in swift, much of that load could have been just fetching from nfs and writing to swift and returning to the client. [19:15:35] if it was in sq51's cache, it'll be in esams caches as well [19:15:44] and pmtpa frontends [19:16:22] I forgot about the esams side. ugh [19:17:12] New review: Pyoungmeister; "update to an existing resource. I think this is ok." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/20745 [19:17:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20745 [19:17:39] mark: can we afford flushing all frontend pmtpa/esams caches? [19:18:36] no [19:18:53] thought so... [19:19:05] even one at a time? [19:19:18] what's our TTL again? [19:21:05] paravoid: for the back end, at least, 4hrs or 80% of the time since change. [19:21:06] * paravoid is glad it was a very small amount of traffic that 404ed. [19:21:23] I'm not sure what the 404 TTL is [19:22:17] hah, 5 minutes [19:22:17] I need some food before my meeting in 8m. be back in a bit. [19:22:19] haha [19:22:25] lol [19:22:33] for upload, 0 for the rest [19:23:03] okay, so we're good [19:23:15] and that means we didn't have to flush sq51 at all. [19:25:34] ok well know we know [19:28:40] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:28:40] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:35:23] okay, dinner then off [19:35:26] see you tomorrow. [19:35:30] see ya [19:35:35] oh wait before you go [19:35:39] * paravoid waits. [19:35:46] how did you dtermine that the running squid had or had not written anything? 
[19:35:51] dd [19:35:54] | hd [19:39:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:37] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:37] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:59] dangit [19:40:21] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:27] huuuuge spike [19:40:48] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:51] see io util for ms5 [19:40:56] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [19:41:42] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:18] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:03] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.144 second response time [19:43:39] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.399 second response time [19:43:39] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.966 second response time [19:43:57] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64719 bytes in 3.368 seconds [19:43:57] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.694 second response time [19:43:57] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.919 second response time [19:43:58] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.432 second response time [19:53:14] ah hey [19:53:20] sorry just saw your message cmjohnson1 [19:53:21] um [19:53:34] oh yes, i think he told me about ssh-agent exec bash [19:53:44] but I just did a buncha research and figured out how to make everything work nicely [19:53:45] what's up? [20:07:29] so [20:07:38] do we have any idea how many 404s we're talking about? [20:09:23] maplebed: could you write a summary to the ops list? [20:16:12] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:21] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:57] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 64719 bytes in 0.183 seconds [20:17:42] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [20:17:51] sigh [20:17:56] not liking it [20:18:03] so [20:18:05] mark: from what I can see from our squid conf, 404 timeout is 5 minutes, so we're safe on that front [20:18:18] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [20:18:20] the extra image scaler load is just because sq51 got purged? [20:18:23] yes [20:18:32] and the majority of those images are not in swift? 
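The "dd | hd" check mentioned at the top of this exchange, spelled out: hd collapses runs of identical lines into a "*", so a freshly zeroed partition that nothing wrote to reads back as a single row of zeros. The device path is a placeholder:

```
dd if=/dev/sdX3 bs=1M count=512 2>/dev/null | hd | head
# expected output for an untouched, zeroed partition:
# 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
# *
# 20000000
```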
[20:18:33] and swift is missing thumbs apparently [20:18:40] not just a few if I see this [20:18:48] that's the hypothesis [20:18:59] I don't know how to verify this [20:19:50] Ben was telling me that MW prefers to copy the thumb from ms5->swift (if found) instead of rescaling it [20:20:26] yes [20:20:38] so it all comes down to ms5's throughput [20:20:45] i'm wondering what the current "hit rate" for thumbs in swift is [20:20:46] repopulating sq51 should be done though. I don't know why it's still spiking. [20:21:09] no [20:21:11] why would it be done? [20:21:13] I want to find out if this isn't something closer to the 1199px thing [20:21:22] and somebody's requesting a bunch of new thumbs. [20:21:29] what's the "1199px thing"? [20:21:58] I think sq51's done catching up (as much as it needs to) because its traffic and CPU patterns have returned to where they were before we purged it. [20:22:00] it might come down to increased pressure on ms7 as well [20:22:19] sure, it's not full, but it's also now writing tons [20:26:02] !log fixing typo in pdns-templates/wmnet and reloading nameservers [20:26:11] Logged the message, Master [20:27:47] would be nice if we had a graph for successful fetches from ms5 as well [20:41:57] i'm checking one image scaler, and I see about one thumb scale request every 2 seconds [20:54:59] so I recant what I said before about checking ms5 before regenerating. Now that MW reads come from swift, it checks swift before regenerating. [20:55:24] (which, since swift passed the request along, is obviously going to fail) [20:55:45] I noticed that [20:55:50] in a tcpdump I did [20:55:59] a lot of HEAD for thumbs that 404 that is [20:56:07] so my comment about checking ms5 and not actually recrunching the thumb was false. [21:02:52] New patchset: Dzahn; "add wikimedia theme files and logo for planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20820 [21:03:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20820 [21:03:52] when is ms5 used for anything now (except to stash a copy)? [21:04:03] I have the same question about ms7 for that matter [21:04:16] I think ms5 is only used to write copies [21:04:22] I think its only traffic comes from MW. [21:05:06] New patchset: Dzahn; "add wikimedia theme files and logo for planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20820 [21:05:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20820 [21:07:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20820 [21:07:54] what about ms7? [21:08:03] sorry, got distracted. [21:08:08] no worries [21:08:10] ms7 is getting writes from MW [21:08:20] and getting reads from everything that hits upload/ that's not thumbs. [21:08:28] or originals? [21:08:29] so mediawiki reads can cause image scale requests? [21:08:45] mark: no; MW reads bypass rewrite.py. [21:08:49] ok [21:08:50] well now it's still doing originals, true [21:09:31] ok, I know things are still fragile but I gotta go to sleep (and I am not doing anything useful here anyways) [21:12:09] ok, I'm going to try and wiki up what we need to get right to get this thing done tomorrow. [21:13:03] http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_switch_originals for those of you following along. [21:13:13] actually... [21:13:20] I'm going to start in etherpad instead. [21:13:22] better brainstorming.
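The tcpdump observation above (lots of HEAD requests for thumbs that then 404) can be reproduced on a swift frontend with something along these lines; the interface name and port are assumptions and would need adjusting:

    # print request lines for HEAD requests against thumb paths
    tcpdump -l -A -s 0 -i eth0 'tcp dst port 80' 2>/dev/null | grep --line-buffered 'HEAD .*thumb'

A steady stream of these, answered with 404s, would support the hypothesis that MediaWiki is checking swift for thumbnails that are simply not there.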
[21:13:43] http://etherpad.wikimedia.org/Swift-Switch-Originals [21:13:47] ok. good luck [21:14:07] I'll leave those tabs open. [21:14:09] night [21:14:15] g'night. [21:16:19] is there any problem with enabling uploads to regular user accounts on testwiki? (https://gerrit.wikimedia.org/r/#/c/20702/1) [21:27:25] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20702 [21:35:18] wow [21:35:22] I'm so glad we aborted. [21:35:26] pybaltestfile in that list [21:35:28] jesus [21:35:35] I'm not sure it's actually used. [21:35:42] of course it is [21:36:01] I saw frontend squids using it [21:36:12] eh? [21:36:22] yeah [21:36:32] cache_peer 208.80.152.51 parent 3128 0 no-query no-digest connect-timeout=5 login=PASS carp weight=5 monitorurl=http://upload.wikimedia.org/pybaltestfile.txt monitor [21:36:36] timeout=30 [21:36:47] I think the "portal" stuff is old and unused (but best to verify) [21:37:04] yeah yeah I know, I was going to bed. but I really wanted to make some progress on some code... [21:37:07] and we don't get pybal's intelligence wrt "removed too many, won't remove more" there either afaik [21:37:10] that's just sleep talking. [21:37:17] so this was probably going to kill everything instantly [21:37:21] yes, the monitor scripts use that [21:37:35] but it didn't occur to me that they used the regular url [21:37:36] heh [21:38:10] pybaltestfile.txt is used by both squid and pybal [21:38:23] and possibly even nagios [21:38:34] I see you're looking at everything in upload/ [21:38:35] good idea [21:38:37] its contents have "don't delete it" [21:38:42] that means DO ADD IT :) [21:38:45] yeah, the config push would have "removed" that [21:38:55] and pybal would be smart enough to not remove every squid out there [21:39:00] but squid frontends wouldn't be so kind [21:39:05] nope [21:39:09] sorry, on the phone. brb. [21:39:23] i think this may not happen tomorrow either [21:39:36] i think I want some more eyes and thinking on this [21:40:20] i believe we're stablish now, and i'm sleepy as well and going to bed [21:41:58] in fact I just found my bug but know I won't fix it right, til I get some shut-eye [21:42:08] so for reals this time, good night :-D [21:42:21] good night [21:42:42] we can change the pybal stuff to monitoring/pybaltest.txt, which does exist in swift. [21:43:41] well, sorta. the full URL is in lvs.pp under swift. [21:43:50] huh. [21:44:01] and don't forget varnish [21:44:03] mark: are you going to bed too or was that good night to apergos ? [21:44:07] yes i'm going [21:44:16] unless there's an outage now [21:44:20] nope. [21:44:21] but i think it's stable enough [21:44:25] but I do have something to think on for later. [21:44:36] do put stuff on the etherpad and/or email [21:44:40] we'll all be looking at it again tomorrow [21:44:45] k. [21:45:40] I'm leaving too [21:45:50] ok, cya. [21:46:03] I'll keep going on that etherpad and switch it to wiki if it feels done-ish. [21:46:05] sorry, it's late and I've been working for many hours [21:46:39] I'll have a look tomorrow on the pad/wiki [21:47:13] let me know where the canonical place will be when you're done with that [21:48:10] start at the etherpad; I'll disambiguate from there. [22:41:20] Thehelpfulone: https://en.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=485 :/ [22:41:24] csteipp: ^ [22:42:00] We traced it back to most likely being caused by a widdit.com extension [22:46:33] hoo: good to know [22:47:16] csteipp: But I couldn't reproduce it myself... 
they only load their adware parts in certain cases [22:47:46] I had to manipulate stuff with Fiddler to get them to load anything on wikipedia at all... [22:48:41] Hm... well the nice thing is the filter seems to be catching these [22:48:55] csteipp: Yes... but the users edit anyway [22:49:09] but I don't really want to set the filter to disallow :/ [22:50:38] Yeah, but at least there's a list of them, so cleanup should be easy... [22:50:58] Yes... but only the big wikis got their own filters [22:51:02] I'll see if Tim can remember the exact plugin when he gets back [22:51:02] we really need global ones [22:51:09] that's just another use case [22:51:11] It's in gerrit :) [22:51:27] Saw it, but I'm not into its code enough to review :/ [22:52:18] csteipp: Did you make that depend on CentralAuth? [22:53:17] No, no dependency. It works better if centralauth is used (then it will automatically do per-user-id throttling), but not necessary to use it. [22:53:51] mhm, cause I planned to add a way to make the filters check for global groups (especiall global bots) [22:53:56] * especially [22:54:19] seems like I need user rights then... might be slow a bit, though :/ [22:54:26] Oh... hmm. That would be difficult [22:55:22] Yes... I could just use User::getRights as array, but I'm not sure how that affects performance [22:55:56] So you want to have a filter with something like ("bot" in global_user_group) .... [22:56:18] Yes, or (even better, but I got performance worries): "bot" in user_rights [22:56:40] Ah, yeah, checking specific rights might have a problem. [22:57:18] It may already check global groups, for accounts that are authenticated with CentralAuth though... it will check groups on $wgUser, which should be the CentralAuthUser [22:57:18] User::getRights probably is already lazy loaded on edit, so that shouldn't be too much of a problem [22:57:56] csteipp: I don't think so [23:03:57] New patchset: Dzahn; "add template files, reduce code length, have some file permission defaults, do not specify path if equal to resource name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20837 [23:04:38] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/20837 [23:06:21] about to run scap [23:12:05] New patchset: Dzahn; "add template files, reduce code length, have some file permission defaults, do not specify path if equal to resource name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20837 [23:12:59] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20837 [23:23:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20837 [23:57:42] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [23:57:42] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:57:43] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [23:57:43] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [23:57:44] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [23:57:45] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [23:57:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [23:57:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
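Returning to the AbuseFilter discussion from earlier in the evening: the condition being proposed would look roughly like the following in AbuseFilter's rule syntax. Here global_user_groups stands in for the global-group variable that was being discussed and does not exist yet; the added_lines match on the widdit.com domain is only an illustration, not necessarily what the existing filter checks; user_groups, added_lines and rlike are standard filter syntax:

    /* skip trusted bots, locally or (if such a variable existed) globally */
    !("bot" in user_groups) &
    !("bot" in global_user_groups) &
    added_lines rlike "widdit\.com"

A variant checking a specific right, such as "bot" in user_rights, reads the same way syntactically, but as noted in the conversation it could be more expensive to evaluate on every edit.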