[00:03:14] New patchset: Bhartshorne; "taking stat1 out of a pseudo-analytics cluster. reverts changes 2377 and 2378" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2379 [00:03:40] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2379 [00:03:41] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2379 [00:03:51] maplebed: well we can make make stat1 the background server [00:03:58] and have a different frontend [00:04:00] sever [00:04:04] server [00:05:53] having databases not publicly facing makes me happier. [00:06:09] but also means it'll take a bit longer before you get mysql there because we'll need to move it to a private IP first. [00:07:14] ok [00:07:55] will doing that break anything that's being used on stat1 at the moment? [00:08:12] or does every service or person using stat1 have access to our internal network? [00:08:20] i don't think anybody is using stat1 except for me [00:09:14] ottomata: you've logged into the host recently. You ok with moving it to an internal (10.0/8) IP? [00:09:52] ja that's fine [00:10:04] i don't know if i have internal access, but i'll get it if I need it [00:10:05] maplebed:also check with erik z [00:10:20] he hasn't logged in (according to last) [00:10:36] can we just bind mysql to the internal interface? [00:10:43] and it doesn't have any services other than ssh... [00:10:46] then it will only be accessible internally [00:10:46] there is no internal interface. [00:10:48] oh [00:10:50] hmmm [00:10:50] :D [00:10:57] maybe just the socket? [00:11:04] turn off mysql networking? [00:11:23] errr... hmm. [00:11:30] either way is fine [00:11:33] i don't know if we need it right now [00:11:36] we probably would eventually [00:12:01] I'm not sure we have any db hosts that don't listen to the network. [00:12:32] not that we couldn't start, but I think I'd prefer to move it internal instead. [00:13:03] * maplebed goes to send mail. [00:24:54] PROBLEM - MySQL Idle Transactions on stat1 is CRITICAL: Connection refused by host [00:28:24] PROBLEM - mysqld processes on stat1 is CRITICAL: Connection refused by host [00:28:24] PROBLEM - MySQL Replication Heartbeat on stat1 is CRITICAL: Connection refused by host [00:30:00] maplebed: I'm pretty done for the day, but do you need something quick? [00:30:14] PROBLEM - MySQL Slave Delay on stat1 is CRITICAL: Connection refused by host [00:31:34] PROBLEM - MySQL Recent Restart on stat1 is CRITICAL: Connection refused by host [00:34:14] PROBLEM - Full LVS Snapshot on stat1 is CRITICAL: Connection refused by host [00:34:15] notpeter: nah, I got it covered. [00:34:24] PROBLEM - MySQL disk space on stat1 is CRITICAL: Connection refused by host [00:34:26] maplebed: okie dokie [00:35:05] gah [00:35:24] the core db classes aren't really intended for installing mysql on a one off dev box [00:35:34] PROBLEM - MySQL Slave Running on stat1 is CRITICAL: Connection refused by host [00:35:48] maybe make a misc mysql class for stat1 and anything like it [00:37:15] binasher: it's intended to store data to be used (at least initially) for the reportcard, so it's not exactly a dev box. [00:37:30] but it seems initially clear that it's going to lose its public IP [00:37:33] which makes things easier. [00:37:51] ok, so it will be replicated? (to stat2?) [00:38:05] it'd be nice if I could avoid making a db:misc class, but you're right, that might be easier. [00:38:13] well, if the data's important, I certainly think it should be replicated... [00:38:14] ;) [00:38:36] ottomata or drdee - do you intend to have (mysql) replicas of stat1? [00:40:52] New patchset: Ottomata; "device_pipeline.py - getting rid of debug() function, using main()." [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2380 [00:40:54] New patchset: Ottomata; "Hacky first work on loader classes." [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2381 [00:42:22] New patchset: Ottomata; "base.py - adding schema in comments. Got lots of work to do to make this prettier" [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2382 [00:44:13] New patchset: Bhartshorne; "backing out mysql from stat1 until it moves to an internal IP or something else" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2383 [00:44:52] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2383 [00:44:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2383 [01:24:15] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [01:24:15] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [01:27:37] New patchset: Lcarr; "Fixing lvs puppet config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2384 [01:28:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2384 [01:28:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2384 [01:29:49] i guess the fixing part was a bit hopeful there... [02:24:25] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1770s [02:28:15] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 2000s [02:34:52] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 3s [02:43:02] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 51s [08:06:51] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds [08:15:11] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [08:39:36] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [10:04:07] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds [10:06:37] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [11:34:52] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [11:34:52] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [11:37:51] New patchset: Mark Bergsma; "Add cp3001/cp3002 to the Bits caches esams aggregates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2385 [11:38:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2385 [11:38:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2385 [11:38:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2385 [11:40:56] New patchset: Mark Bergsma; "Use %sumfunc% instead of %sumsep%" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2386 [11:41:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2386 [11:41:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2386 [11:41:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2386 [11:48:37] New patchset: Mark Bergsma; "More coffee is needed. Add missing comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2387 [11:49:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2387 [11:49:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2387 [11:56:09] New patchset: Mark Bergsma; "Remove old knsq hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2388 [11:56:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2388 [11:56:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2388 [11:56:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2388 [12:19:03] New patchset: Mark Bergsma; "Add eqiad bits varnish servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2389 [12:19:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2389 [12:19:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2389 [12:19:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2389 [12:19:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2389 [12:44:08] New patchset: Mark Bergsma; "Add eqiad hosts to aggregates (incomplete)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2390 [12:44:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2390 [12:45:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2390 [12:45:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2390 [13:09:37] New patchset: Mark Bergsma; "Add eqiad servers as empty DS references are not allowed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2391 [13:10:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2391 [13:10:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2391 [14:05:18] New review: Diederik; "Ok." [analytics/reportcard] (otto/pipeline); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2381 [14:05:47] New review: Diederik; "Ok." [analytics/reportcard] (otto/pipeline); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2380 [14:05:47] Change merged: Diederik; [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2381 [14:05:47] Change merged: Diederik; [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2380 [14:06:09] New review: Diederik; "Ok." [analytics/reportcard] (otto/pipeline); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2382 [14:06:09] Change merged: Diederik; [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2382 [14:15:49] !log powercycling srv189 [14:15:51] Logged the message, Master [14:21:14] !log srv189 - shut down again, PCI Express Error, -> RT 2413 created [14:21:16] Logged the message, Master [14:23:37] ACKNOWLEDGEMENT - Host srv189 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn hardware problem - RT 2413 [14:29:48] New review: Dzahn; "basics for stat1 server" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2293 [14:42:11] New patchset: Dzahn; "add class for stat server per ezachte (RT 2162), apply on stat1, rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2293 [14:42:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2293 [14:42:49] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2293 [14:42:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2293 [14:50:58] New patchset: Dzahn; "ensure mountpoints exist (btw, puppet feature req. 5015)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2392 [14:51:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2392 [14:51:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2392 [14:51:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2392 [15:10:26] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] RECOVERY - check_minfraud1 on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 8643 bytes in 0.313 second response time [15:49:04] New patchset: Dzahn; "geoip support for stats1 per RT 2164" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [15:49:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2393 [15:51:39] New patchset: Dzahn; "geoip support for stats1 per RT 2164" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [15:57:38] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2393 [15:57:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [16:05:28] New review: Dzahn; "see inline comments, can be improved, but current version to start. manual changes applied for BZ 23..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2333 [16:05:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2333 [16:24:33] New patchset: Dzahn; "minor formatting in noc index.html" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2394 [16:24:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2394 [16:24:59] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2394 [16:25:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2394 [17:16:20] New patchset: Mark Bergsma; "Add node lists to role/cache.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2395 [17:23:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2395 [17:23:13] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2395 [17:23:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2395 [17:33:53] I'm about to bump swift to 50% [17:37:19] !log deployed new squid config to upload to direct 50% of thumbnail traffic to swift [17:37:21] Logged the message, Master [17:37:43] anyboy that wanst to watch the effect: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=v&vn=swift [17:37:45] :) [17:41:34] New patchset: Mark Bergsma; "Automatically generate squid list for eqiad text squids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2396 [17:42:36] New patchset: Mark Bergsma; "Automatically generate squid list for eqiad text squids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2396 [17:43:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2396 [17:43:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2396 [17:43:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2396 [17:47:42] New patchset: Mark Bergsma; "Actually include role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2397 [17:48:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2397 [17:48:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2397 [18:01:37] Is there anything Ops folk can do to look at this problem, re: access from Pakistan? https://bugzilla.wikimedia.org/show_bug.cgi?id=34256 [18:02:57] hexmode: as stated in the ticket, his ISP confirms it's just him. [18:03:11] before the ticket was created, we checked things like routing [18:03:23] I'd suggest he try with a different computer / browser [18:03:23] maplebed: could you resolve it invalid w/ a note? [18:03:30] make sure it's not adblock or something like that. [18:03:40] I suggested a different connection [18:03:51] like internet cafe [18:04:05] he says he has tried multiple computers on the same connection [18:04:16] do we have a wiki page on basic network diagnostics? [18:04:29] if we don't, we should [18:04:32] (i.e. ping wikipedia, ping bits.wikimedia, telnet to port 80, etc.) [18:04:54] * hexmode files an RT ticket to create a wiki page [18:04:54] there's a pretty straightforward set of steps you can use (ping, traceroute, telnet) to isolate network issues [18:05:47] New patchset: Mark Bergsma; "Attempt to generate RRD expression with erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2398 [18:10:37] New patchset: Mark Bergsma; "Attempt to generate RRD expression with erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2398 [18:12:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2398 [18:12:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2398 [18:15:11] New patchset: Mark Bergsma; "Wrong file name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2399 [18:16:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2399 [18:16:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2399 [18:17:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2399 [18:19:33] New patchset: Mark Bergsma; "Wrong function name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2400 [18:20:38] maplebed: so how many thumbs does swift have now? [18:20:49] compared to ms5... [18:21:24] hexmode: commented and closed invalid. [18:21:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2400 [18:21:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2400 [18:21:32] mark: I don't have a full count from ms5, [18:21:49] maplebed: tyvm [18:21:58] but swift currently has 16.47 million [18:22:05] (according to http://ganglia.wikimedia.org/latest/?c=Swift%20pmtpa&h=Swift%20pmtpa%20prod&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:22:06] ) [18:22:21] I'm wondering because load on ms5 has barely dropped [18:23:08] I've been wondering about that too. [18:23:30] my only thought is that it means that our squid cluster is large enough that the only requests going through to ms5 are actually falling through to the scalers. [18:23:35] (ms5 or swift) [18:23:41] yeah that's for sure [18:23:46] we have like a 98-99% hit rate [18:23:46] but that doesn't jive with the numbers either. [18:25:07] ok, food now... bbl [18:25:10] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=v&vn=swift does show, though, that the object change rate doubled at the same time as the access rate, and it's remained relatively consatnt [18:25:38] the last bit of interesting data is that the 404 hit rate is only slightly less than half the 200 hit rate, [18:25:48] that is curious [18:26:04] which implies that either half or all of the hits are actually falling through. [18:26:15] (annoyingly, I think the 200 hit rate also includes authentication requets) [18:26:29] * maplebed goes to verify [18:27:42] yeah, it does. [18:28:02] New patchset: Mark Bergsma; "Generate eqiad squid aggregate RPN as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2401 [18:28:35] tightning up the ganglia stats to ignore the auth stuff would be useful. [18:28:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2401 [18:28:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2401 [18:38:38] I just copied a version of the swift ganglia log parser that ignores swauth (the authentication stuff) in calculating its hit rate. Expect to see ms-fe1's numbers drop. [18:45:14] damn, that's a pretty severe change. [18:45:18] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&tab=v&vn=swift [18:46:03] the 404s are now 5x the 200s for GETs [18:47:53] ahh [18:48:51] it's also interesting to note the change in 200 avg response time ms-fe1 vs. ms-fe2 (ms-fe1 has the change, 2 doesn't) [18:49:15] avg response time is now closer to 35 rather than 18 (millisecs) [18:50:54] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [18:51:18] huh [18:57:46] New patchset: Bhartshorne; "changed swift stats gathering logic to ignore authentication requests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2402 [18:58:09] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2402 [18:58:09] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2402 [19:00:23] I wonder how that's going to be for 50% traffic [19:02:47] it's at 50% now. [19:02:52] oh [19:02:55] :D [19:03:11] went to 50% at 17:37UTC [19:03:14] but the ms5 load is unchanged? really? [19:03:23] that's really bizarre [19:03:46] pl.planet not working: https://bugzilla.wikimedia.org/34268 [19:03:48] well, the updated stats say 85% of all traffic to swift is falling through to ms5, [19:03:53] so I'm not terribly surprised. [19:04:14] hmm [19:04:55] and I do see a small drop in ms5's traffic at about the right time (http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=ms5.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS) [19:09:04] so wait, I obviously really missed something... *which* traffic is falling through to ms5? [19:09:38] the 404s. [19:10:45] sent mail with a pretty annotated picture. [19:10:53] so when will swift start to handle those? [19:11:31] nice pic [19:11:35] (nice annotations) [19:12:35] New patchset: Mark Bergsma; "Temporarily disable upload.eqiad while it doesn't exist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2403 [19:12:59] you mean when will swift fall back to the image generators rather than ms5? [19:13:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2403 [19:13:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2403 [19:13:26] the date's not set yet. [19:14:19] ah [19:15:25] ms5 currently looks like a 2/1 gets vs. calls to the backend [19:15:32] but I should probably look at those more closely again [19:15:49] to make sure the same thing isn't there - that there are 2 gets recorded for an image to go through to the backend. [19:30:23] New patchset: Mark Bergsma; "Refactor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2404 [19:31:15] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2404 [19:31:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2404 [19:42:04] New patchset: Mark Bergsma; "Move RPN generation into a varnish_rpn function" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2405 [19:42:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2405 [19:42:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2405 [19:44:10] New patchset: Mark Bergsma; "Add missing 'end'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2406 [19:44:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2406 [19:44:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2406 [19:50:16] New patchset: Mark Bergsma; "Let's not use erb tags inside ruby functions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2407 [19:50:45] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2407 [19:50:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2407 [20:06:44] New patchset: Mark Bergsma; "Convert all aggregate RPN lists to functions and puppet hashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2408 [20:07:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2408 [20:07:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2408 [20:19:52] New patchset: Mark Bergsma; "Support empty lists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2409 [20:20:13] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2409 [20:20:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2409 [20:20:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2409 [20:22:37] New patchset: Mark Bergsma; "Add missing 'api' key to decommissioned nodes hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2410 [20:24:10] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2410 [20:24:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2410 [20:33:16] New patchset: Mark Bergsma; "Allow difference between path and node list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2411 [20:33:41] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2411 [20:33:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2411 [20:40:10] New patchset: Mark Bergsma; "Correct esams host names" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2412 [20:40:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2412 [20:40:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2412 [20:40:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2412 [20:45:05] New patchset: Mark Bergsma; "knsq* use .knams.wikimedia.org still" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2413 [20:45:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2413 [20:45:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2413 [20:49:16] New patchset: Mark Bergsma; "More domain name corrections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2414 [20:49:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2414 [20:49:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2414 [20:57:28] New patchset: Mark Bergsma; "Fix graph stacking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2415 [20:58:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2415 [20:58:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2415 [21:12:06] New patchset: Andre Engels; "A new attempt, adding some of Diederik's comment, and a new file simplepipelines.py, containing two basic pipelines to show the value of a single variable or two variables seen as independent and dependent." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2416 [21:12:08] New patchset: Andre Engels; "ignoring a few files" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2417 [21:12:09] New patchset: Andre Engels; "Partly reacting to Diederik's comments; adding simplepipelines.py with some useful pipelines for the case when speed of programming is more important than beauty of output." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2418 [21:12:10] New patchset: Andre Engels; "Revert "Partly reacting to Diederik's comments; adding simplepipelines.py"" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2419 [21:35:47] New patchset: Andre Engels; "ignoring a few files" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2417 [21:46:09] bumping swift traffic to 75% in 15 minutes [21:46:16] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [21:46:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [22:01:58] !log deployed updated to upload squid.conf to send 75% of all thumbnail traffic to swift [22:02:00] Logged the message, Master [22:12:16] \o/ [22:13:59] wee [22:23:27] and no server kittens crying so far :o [22:51:46] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100% [22:54:43] maplebed: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2FTotals%2FService_times [22:54:54] * maplebed clicks [22:55:08] service times for uploads.wiki as seen from the squid layer [22:55:43] we don't have any more data than just today, though... :( [22:55:49] hopefully it will settle back down but i think the recent big spike correlates with going to 75%.. maybe just due to restarting squid? [22:56:21] or due to swift not having everything? [22:56:58] the first increase is going to 50%, the most recent going to 75%, yeah. [22:57:07] the times do match up. [22:57:14] torrus has been pretty unreliable but there is long term data - http://torrus.wikimedia.org/torrus/CDN?token=T30108&view=longterm-rrd-html [22:57:32] the big spkies in the year view generally map to ms5 trouble / outages i think [22:57:57] but base latency has been very stable at generally <= 50ms [22:58:09] swift is reporting that 200s (requests for which it already has the image) are being served in about 50ms, so when it's actually got lots of stuff, it looks ilke the times should go back to their historical average. [22:58:25] it's true though, that ~75% of all requetss to swift are falling back to ms5 at the moment. [22:58:26] i don't think we should worry about this right now but keep an eye on it over the next few days [22:58:45] thanks for the link - it's good to see it from both squid's and swift's perspective. [22:58:49] how does swift handle "has this object changed" requests? [22:59:09] what do you mean - HEADs? [22:59:38] or when a user changes an image [22:59:51] heads or gets with if-modified-since headers [22:59:55] the latter results in a mediawiki purging existing thumbnails. [23:00:01] does swift return 304's? [23:00:09] hmm. not sure. [23:00:54] yeah, I do see many 304s in the log [23:01:14] ok, that's good [23:01:31] we don't have data on the rate of 304's on the old system, do we? [23:02:04] no... I'm splitting apart some response codes in ganglia for swift but 304 isn't one of them. [23:02:07] depending on headers returned, its possible squid is hitting swift more often to check for object freshness, which could slow squid down some [23:02:10] so it'll fall into the 'other' category. [23:02:16] maybe we should wait a day or two though [23:02:19] to dig in more [23:02:24] binasher: Even if Swift doesn't 304, the caching layer in front of it (upload Squids) should 304 as well [23:02:51] RoanKattouw: yup, squid 304's [23:02:58] In fact, if you are indeed serving files with a 30-day s-maxage header, you shouldn't be seeing that many 304s coming from Swift directly [23:04:55] binasher: there's currently about 1.5qps coming in to swift that is GET requests that aren't 200, 201, 204, or 404. [23:05:03] do you think i should call out 304s specifically? [23:06:06] maplebed: what's the overall non-error qps? [23:06:44] all GETs are about 28 now. [23:06:57] 28qps [23:07:02] (per server) [23:07:26] hmm [23:07:38] how can i request an image from swift without going thru squid? [23:08:00] curl -o /tmp/foo -vvv -H "Host: upload.wikimedia.org" http://sq86.wikimedia.org:3128//wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/47px-Little_kitten_.jpg [23:08:09] oh wait. [23:08:10] sorry. [23:08:24] s/sq86.wikimedia.org/ms-fe.pmtpa.wmnet/ [23:08:29] and that curl should work. [23:08:34] i agree with RoanKattouw that switft shouldn't by 304'ing much at all, but those could be driven more by squid thinking objects are stale prematurely [23:08:56] I'll add it in. there're alraedy an obscene number of metrics coming out; this won't hurt much more. [23:09:37] it could also just be that the squid conf changes result in carp rebalancing though i wouldn't have expected so [23:09:40] any other response codes you think would be useful to see on tehir own? [23:10:03] this is my current list: 200', '201', '204', '304', '404 [23:10:20] (and that's for all HTTP methods, not just GET [23:10:21] ) [23:11:10] I suppose at this point I should probably just report all counts rather than only select codes + 'other'... [23:11:12] Is there a way you could group the rest under 'unexpected' or something? I was gonna suggest things like 403 or 500 but those really shouldn't be occurring [23:11:12] do you see anything in the 500-504 range in the logs? [23:11:29] RoanKattouw: all not listed there already go into the 'other' bucket. [23:11:58] OK [23:12:50] no, not seeing any 500s in the logs. [23:14:45] New patchset: Bhartshorne; "adding 304s to the list of interesting respnose codes for swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2420 [23:14:46] binasher: ^^^ [23:14:52] also, feel free to muck with it as well. [23:15:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2420 [23:15:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2420 [23:15:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2420 [23:16:22] huh [23:16:33] swift is not returning any caching related headers at all [23:17:03] which means the following squid rule is applied to everything [23:17:03] That sounds bad, intuitively. Are the Squids configured to auto-apply them? [23:17:06] refresh_pattern . 300 80% 5184000 stale-while-revalidate=21600 ignore-reload [23:17:36] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 1781 seconds [23:18:05] the 300 in that line == "time (in minutes), an object without an explicit expire time should be considered fresh." [23:18:26] and the 80% == "percentage of the objects age (time since last modification age) an object without explicit expire time will be considered fresh" [23:19:15] 5184000 = max time, but that wont' apply in plenty of cases [23:19:30] I don't get the 80% bit [23:19:49] oh, I think I do. [23:20:04] an object that is an hour old will be cached by squid for 48 minutes [23:20:04] if I get an object with a last-modified time of 10 months ago, it'll get an expiry time of 8 monhts? [23:20:28] huh. [23:20:35] 8 months or $max, whichever is smaller [23:20:45] so new objects are cached for 5 hours and then recached for 80% of the time since modification after that. [23:21:02] that actually seems like incredibly reasonable default behavior! [23:21:15] excepting, of course, that we explicitly purge, so it can be much more aggressive... [23:21:18] (in our case) [23:22:06] PROBLEM - MySQL Slave Running on db11 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant drop database centralauth: database doesnt exist on [23:22:08] i think our declared caching policy is "until explicitly purged or one month" [23:22:23] oh damn it [23:23:21] Yes [23:24:36] binasher: I don't see ms5 returning any caching headers either. [23:24:46] lovely [23:25:05] I guess thumb.php doesn't set those [23:25:07] ? [23:25:20] Or, well, the web server, for non-thumb files [23:26:02] the only difference I see between the ms5 and swift headers are the Server:nginx and Connection:keep-alive headers (swift doens't send either) [23:27:00] thump.php sends Last-Modified [23:27:12] Right, and that triggers the 80% behaviro [23:27:23] *thumb.php [23:27:24] But still, we can be more agressive [23:27:38] thumb-handler.php doesn't do anything either, for 200 responses at lesat [23:27:48] and swift carries on the same last-modified [23:28:01] so the change shouldn't effect squid behavior [23:28:04] It sends a strange one for 404 [23:28:16] header( 'Cache-Control: s-maxage=300, must-revalidate, max-age=0' ); // for 404 responses from thumb-handler.php [23:28:30] Caching a 404 response with must-revalidate makes no sense to me [23:28:33] You might as well not cache it at all [23:29:06] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay 0 seconds [23:29:09] that's what it's doing - explicitly disabling the cache [23:29:11] I mean Squid will probably be conformant and send an IMS, but there's no real reason to [23:29:17] isn't it? [23:29:23] Well, effectively yes [23:29:29] But for 500s it sends Cache-Control: no-cache [23:30:51] squid service times are elevated a bit but settled down [23:31:19] maplebed: when are you planning on going to 100%? [23:33:36] RECOVERY - MySQL Slave Running on db11 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:38:11] from looking at an individual squid, the increase in time seems to be just related to cache misses - http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq82.wikimedia.org%2Fbackend%2FService_times%2FHTTP_service_times - but the hitrate remained steady - http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fpmtpa%2Fupload%2Fsq82.wikimedia.org%2Fbackend%2FPerformance%2FHit_ratios [23:39:01] binasher: tomorrow morning. [23:39:33] that should be enough time to see squid service times return to normal [23:39:52] but something to double check before that deploy [23:40:24] will do. [23:41:03] fwiw, I do expect to see the times go up somewhat; swift is slower than ms5, even for hits. [23:41:59] but I would also like to see swift's rate of "cache miss" drop. [23:43:53] werdna: http://wikimedia.7.n6.nabble.com/Cartman-Gets-an-Anal-Probe-English-Wikipedia-s-featured-article-today-tp4371372p4372512.html [23:44:43] and there i was picturing ms5 as a lone donkey hauling a mile of shipping containers [23:45:08] lol [23:45:24] Pretty well-trained donkey it turns out