[00:00:26] (CR) Ori.livneh: [C: 2] serve graphite.wikimedia.org via misc-varnish [operations/puppet] - https://gerrit.wikimedia.org/r/98003 (owner: Ori.livneh)
[00:04:59] ori-l: I do, but am distracted by labs migration.
[00:05:01] And turkey
[00:05:31] hence "at some point"
[00:08:07] * andrewbogott nods
[00:08:41] (CR) Mattflaschen: "We also need to add the namespace to wgExemptFromUserRobotsControl so people can not force indexing manually using the __INDEX__ magic wor" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/97675 (owner: MZMcBride)
[00:08:45] TimStarling: do you think something like this would work well for collector? http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
[00:09:09] TimStarling: it has some nice implementations, like http://arma.sourceforge.net/docs.html#running_stat for C++
[00:10:09] Cold Turkey is a free productivity program that you can use to temporarily block yourself off of popular social media sites, addicting websites and games so that ..
[00:10:15] ori-l: could do
[00:10:51] I had domas add a field for the sum of squares, in the current collector, based on the idea of using it for variance, but it was never displayed in the web interface
[00:11:04] * Aaron|home finally has an ssd as the os drive for all his computers
[00:11:06] ah, I was wondering about that
[00:11:09] I'm not sure how useful variance would actually be for analysis
[00:11:31] you can use the technique to keep a running mean and stddev
[00:11:40] getting some of those overly tight screws out was a pita...I'm surprised my fingerprints still work
[00:12:27] Aaron|home: lenovo?
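The running mean and stddev technique mentioned above can be sketched roughly as follows. This is only an illustration of the online (Welford-style) update described on the linked Wikipedia page, not the collector's actual code; the class name RunningStat and its methods are hypothetical names picked for this example.

```python
import math


class RunningStat:
    """Keeps a running mean and variance without storing the samples.

    Hypothetical helper for illustration; uses Welford's online update.
    """

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the already-updated mean, which keeps the update numerically stable.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than two samples."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def stddev(self):
        return math.sqrt(self.variance())


stat = RunningStat()
for elapsed_ms in [2, 4, 4, 4, 5, 5, 7, 9]:
    stat.push(elapsed_ms)
print(stat.mean, stat.variance(), stat.stddev())
```

Compared with storing a plain sum and sum of squares, this form avoids subtracting two large, nearly equal numbers, at the cost of one extra field per counter.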
[00:12:41] no, some older clevo laptop
[00:12:50] this is a useful overview: http://www.johndcook.com/standard_deviation.html
[00:13:16] er, to keep a running mean, i meant
[00:14:02] I never thought to myself "damn, I really wish I knew what the standard deviation of this mean is"
[00:14:27] maybe I wanted a histogram on occasion
[00:14:56] it's not like execution times follow a normal distribution
[00:14:57] well, keeping a running histogram is simple, right?
[00:15:19] it is not as simple as standard deviation
[00:16:42] (PS1) Tim Starling: Update trusted-xff.cdb for I4fd360a6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98036
[00:17:16] hmmm...: http://www.cis.upenn.edu/~sudipto/mypapers/histjour.pdf
[00:18:19] do you have a good book (or some other resource) to recommend for working with profiling data?
[00:18:49] (CR) Reedy: "I do see a favicon displayed if I visit https://www.wikimedia.org/ though.." [operations/apache-config] - https://gerrit.wikimedia.org/r/91209 (owner: Reedy)
[00:19:21] i know the basics but i think i'm skimming along the surface a little
[00:20:06] no
[00:22:26] my knowledge of profiling mostly comes from using profilers to get performance improvements, rather than reading things
[00:23:40] there are lots of profilers around from which to borrow feature ideas, if that's what you're looking for
[00:25:17] no, i don't want to pile on features
[00:27:25] i mostly want to reproduce existing functionality in a reliable way and then add a javascript API that mimics wfProfile*, accumulates stats locally, and syncs them periodically
[00:28:43] and sampling, to avoid self-DDOS
[00:29:02] so that there's a consistent API for instrumenting code that works across PHP and JS
[00:30:12] sounds like an interesting project
[00:33:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[00:36:15] (PS1) Springle: depool slaves for package upgrade [operations/mediawiki-config] - 
https://gerrit.wikimedia.org/r/98037
[00:38:23] (CR) Springle: [C: 2] depool slaves for package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98037 (owner: Springle)
[00:38:31] (Merged) jenkins-bot: depool slaves for package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98037 (owner: Springle)
[00:39:58] !log springle synchronized wmf-config/db-eqiad.php 'depool slaves for package upgrade'
[00:40:13] Logged the message, Master
[00:41:24] (CR) Tim Starling: "It has " [operations/apache-config] - https://gerrit.wikimedia.org/r/91209 (owner: Reedy)
[01:02:04] !log started rsync of graphite data (~400gb) from professor.pmtpa to tungsten.eqiad
[01:02:18] Logged the message, Master
[01:16:12] (PS1) Springle: warm up slaves after package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98040
[01:16:43] (CR) Springle: [C: 2] warm up slaves after package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98040 (owner: Springle)
[01:17:53] !log springle synchronized wmf-config/db-eqiad.php 'warm up slaves after package upgrade'
[01:18:11] Logged the message, Master
[01:38:25] ori-l: hey
[01:38:54] ori-l: sorry for not packaging in udpprofile in time :(
[01:43:23] (CR) Legoktm: "@Mattflaschen: If I'm reading http://www.robotstxt.org/meta.html correctly, the rel="nofollow" will override the 'follow' in the meta tag." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/97675 (owner: MZMcBride)
[01:44:18] (CR) Legoktm: [C: -1] "Oh and -1 since it needs to be added to $wgExemptFromUserRobotsControl." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/97675 (owner: MZMcBride)
[02:09:09] (PS1) Legoktm: Set $wgCirrusSearchEnablePref = true if CirrusSearch is alternate [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98046
[02:09:38] !log LocalisationUpdate completed (1.23wmf4) at Thu Nov 28 02:09:38 UTC 2013
[02:09:57] Logged the message, Master
[02:14:48] (CR) Legoktm: Enable AbuseFilter block option on Wikidata (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98002 (owner: John F. Lewis)
[02:15:25] (PS2) Legoktm: Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98002 (owner: John F. Lewis)
[02:15:36] !log LocalisationUpdate completed (1.23wmf5) at Thu Nov 28 02:15:36 UTC 2013
[02:15:51] Logged the message, Master
[02:27:33] (PS1) Springle: slaves to full steam after package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98047
[02:28:01] (CR) Springle: [C: 2] slaves to full steam after package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98047 (owner: Springle)
[02:28:09] (Merged) jenkins-bot: slaves to full steam after package upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98047 (owner: Springle)
[02:29:31] !log springle synchronized wmf-config/db-eqiad.php 'slaves to full steam after package upgrade'
[02:29:46] Logged the message, Master
[02:40:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[02:42:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Nov 28 02:42:58 UTC 2013
[02:43:13] Logged the message, Master
[02:45:49] PROBLEM - Frontend Squid HTTP on sq80 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:46:58] (PS1) Faidon Liambotis: Bump puppet's TTL back to 1H [operations/dns] - https://gerrit.wikimedia.org/r/98049
[02:46:59] (PS1) 
Faidon Liambotis: Kill species subdomain for all but wiki{p,m}edia [operations/dns] - https://gerrit.wikimedia.org/r/98050
[02:47:00] (PS1) Faidon Liambotis: Kill localhost.* entries [operations/dns] - https://gerrit.wikimedia.org/r/98051
[02:47:01] (PS1) Faidon Liambotis: Various cleanups [operations/dns] - https://gerrit.wikimedia.org/r/98052
[02:47:02] (PS1) Faidon Liambotis: Indent wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/98053
[02:47:03] (PS1) Faidon Liambotis: Kill some old *.labs.wikimedia.org entries [operations/dns] - https://gerrit.wikimedia.org/r/98054
[02:47:04] (PS1) Faidon Liambotis: Switch *.{wap,mobile}.wikipedia.org to wikipedia-lb [operations/dns] - https://gerrit.wikimedia.org/r/98055
[02:47:59] (CR) Faidon Liambotis: [C: 2] Bump puppet's TTL back to 1H [operations/dns] - https://gerrit.wikimedia.org/r/98049 (owner: Faidon Liambotis)
[02:48:51] (CR) Faidon Liambotis: [C: 2] Kill species subdomain for all but wiki{p,m}edia [operations/dns] - https://gerrit.wikimedia.org/r/98050 (owner: Faidon Liambotis)
[02:49:39] RECOVERY - Frontend Squid HTTP on sq80 is OK: HTTP OK: HTTP/1.0 200 OK - 531 bytes in 0.083 second response time
[02:50:01] (CR) Faidon Liambotis: [C: 2] Kill localhost.* entries [operations/dns] - https://gerrit.wikimedia.org/r/98051 (owner: Faidon Liambotis)
[02:51:30] PROBLEM - Backend Squid HTTP on sq80 is CRITICAL: Connection refused
[02:52:32] (CR) Faidon Liambotis: [C: 2] Various cleanups [operations/dns] - https://gerrit.wikimedia.org/r/98052 (owner: Faidon Liambotis)
[02:53:55] (PS2) Faidon Liambotis: Indent wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/98053
[02:53:56] (PS2) Faidon Liambotis: Kill some old *.labs.wikimedia.org entries [operations/dns] - https://gerrit.wikimedia.org/r/98054
[02:53:57] (PS2) Faidon Liambotis: Switch *.{wap,mobile}.wikipedia.org to wikipedia-lb [operations/dns] - 
https://gerrit.wikimedia.org/r/98055
[02:54:28] (CR) Faidon Liambotis: [C: 2] Kill some old *.labs.wikimedia.org entries [operations/dns] - https://gerrit.wikimedia.org/r/98054 (owner: Faidon Liambotis)
[03:12:41] (PS1) Faidon Liambotis: Kill old sub-labs.wikimedia.org domains [operations/apache-config] - https://gerrit.wikimedia.org/r/98056
[03:12:42] (PS1) Faidon Liambotis: Kill wlm.wikimedia.org [operations/apache-config] - https://gerrit.wikimedia.org/r/98057
[03:12:43] (PS1) Faidon Liambotis: Add redirects for mobile/wap.wikipedia.org [operations/apache-config] - https://gerrit.wikimedia.org/r/98058
[03:14:19] (PS1) Springle: depool slaves for pakcage upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98059
[03:15:23] (CR) Springle: [C: 2] depool slaves for pakcage upgrade [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98059 (owner: Springle)
[03:16:28] !log springle synchronized wmf-config/db-eqiad.php 'depool slaves for package upgrade'
[03:16:44] Logged the message, Master
[03:17:34] (PS3) Faidon Liambotis: Indent wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/98053
[03:17:35] (PS3) Faidon Liambotis: Switch *.{wap,mobile}.wikipedia.org to wikipedia-lb [operations/dns] - https://gerrit.wikimedia.org/r/98055
[03:18:19] (CR) Faidon Liambotis: [C: 2] Indent wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/98053 (owner: Faidon Liambotis)
[03:20:43] Are we sending nxdomain?
[03:20:48] uh oh
[03:20:49] where?
[03:20:50] for what?
[03:20:58] I just merged a bunch of changes
[03:21:05] I messed up I guess
[03:21:05] I just got a page about DNS services.
[03:21:14] About nxdomain
[03:21:27] what does it say?
[03:21:58] ALERT! DNS: Nameserver error on 208.80.154.238: Non-existent domain.
[03:22:05] ALERT! DNS: Nameserver error on 208.80.154.238: Non-existent domain.
[03:22:13] Hm. That paste isn't sending
[03:22:18] ALERT! 
DNS: Nameserver error on 208.80.154.238: Non-existent domain.
[03:22:18] I got it
[03:22:19] (CR) Dzahn: "removing toolserver stuff - good. does it need replacemnt in toollabs? don't know. RT ticket this came from https://rt.wikimedia.org/Tick" [operations/apache-config] - https://gerrit.wikimedia.org/r/98057 (owner: Faidon Liambotis)
[03:22:21] Grr
[03:22:25] Ryan_Lane: I got it
[03:22:29] Oh, good
[03:22:36] I'm on a mobile client
[03:22:45] Ryan_Lane: I'm 99% it's a false positive
[03:22:51] Oh, ok, good
[03:22:55] * Aaron|home wishes https://gdash.wikimedia.org/dashboards worked
[03:23:19] Cause otherwise that would be a really bad change, since I'm pretty sure it checks for enwiki
[03:23:29] Err bad alert
[03:23:55] Ryan_Lane: it's not
[03:24:00] I changed it to localhost a while back
[03:24:04] and I killed localhost just now :D
[03:24:12] localhost.wikipedia.org that is
[03:24:22] (CR) Dzahn: [C: 1] Kill old sub-labs.wikimedia.org domains [operations/apache-config] - https://gerrit.wikimedia.org/r/98056 (owner: Faidon Liambotis)
[03:25:14] Oh, good. Ok, off again.
[03:25:18] Ryan_Lane: sorry...
[03:25:22] * Ryan_Lane waves
[03:25:25] and thanks
[03:25:33] No worries, just wanted to make sure things were good
[03:28:05] (CR) Dzahn: "added MaxSem and ArthurRichards, mobile should know" [operations/apache-config] - https://gerrit.wikimedia.org/r/98057 (owner: Faidon Liambotis)
[03:28:30] mutante: did you open the site? :)
[03:28:46] paravoid: yes :P
[03:28:54] it used to look different
[03:28:55] heh
[03:29:33] but WikiLovesMonuments app..i'm a user, so i think it should be replacement by something working
[03:29:48] or we should let them know at least
[03:29:59] done so by adding on gerrit
[03:32:02] user erfgoed should migrate to toollabs
[03:32:08] and have a nicer index
[03:32:39] Request is taking too long. Please retry in a while.
[03:32:42] watchmouse... 
[03:35:27] (PS1) Springle: repool slaves after package upgrade (lvm snapshot boxes only, LB=0) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98061
[03:36:09] (CR) Springle: [C: 2] repool slaves after package upgrade (lvm snapshot boxes only, LB=0) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98061 (owner: Springle)
[03:37:34] !log springle synchronized wmf-config/db-eqiad.php 'repool slaves after package upgrade, (lvm snapshot boxes only, LB=0)'
[03:37:38] (CR) Faidon Liambotis: [C: -1] "Why not a single VHost with ServerName m/ServerAlias zero?" [operations/apache-config] - https://gerrit.wikimedia.org/r/97115 (owner: Yurik)
[03:37:50] Logged the message, Master
[03:48:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[04:08:24] PROBLEM - Disk space on wtp1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=76%):
[04:10:44] PROBLEM - Parsoid on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:15:24] RECOVERY - Disk space on wtp1014 is OK: DISK OK
[04:16:34] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.007 second response time
[04:21:34] PROBLEM - Disk space on wtp1020 is CRITICAL: DISK CRITICAL - free space: / 96 MB (1% inode=76%):
[04:23:34] RECOVERY - Disk space on wtp1020 is OK: DISK OK
[04:31:44] (CR) Tim Starling: "You can do protocol-relative redirects now, as of I3adffd88. See how redirects.conf does it." 
[operations/apache-config] - https://gerrit.wikimedia.org/r/65443 (owner: Dzahn)
[04:41:25] PROBLEM - Disk space on wtp1013 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=76%):
[04:43:24] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused
[04:52:24] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.013 second response time
[04:52:24] RECOVERY - Disk space on wtp1013 is OK: DISK OK
[05:12:23] PROBLEM - Disk space on wtp1019 is CRITICAL: DISK CRITICAL - free space: / 95 MB (1% inode=76%):
[05:13:23] RECOVERY - Disk space on wtp1019 is OK: DISK OK
[05:13:46] (PS5) Yurik: for m.wikipedia.org and zero.wikipedia.org [operations/apache-config] - https://gerrit.wikimedia.org/r/97115
[05:15:20] yurik: :)
[05:15:24] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused
[05:15:58] paravoid, found a hardcoded redirect in there - kinda dangerous :)
[05:16:11] yeah
[05:16:23] (PS3) Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/97107
[05:16:24] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.009 second response time
[05:16:29] (CR) jenkins-bot: [V: -1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/97107 (owner: Yurik)
[05:21:53] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 320 MB (3% inode=76%):
[05:26:37] (PS1) Ori.livneh: Configure parser cache databases in db-$realm file [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98065
[05:26:49] springle: ^ i figured that would make your life a tiny bit easier
[05:27:15] oh!
[05:27:19] :)
[05:27:23] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:30:07] oh, parsoid... 
[05:31:14] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.005 second response time
[05:31:44] PROBLEM - Puppet freshness on sq80 is CRITICAL: No successful Puppet run for 5d 20h 13m 21s
[05:31:53] RECOVERY - Disk space on wtp1011 is OK: DISK OK
[05:37:12] (CR) Ori.livneh: [C: 2] Configure parser cache databases in db-$realm file [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98065 (owner: Ori.livneh)
[05:37:47] !log ori updated /a/common to {{Gerrit|Icdaa4c1b5}}: Configure parser cache databases in db-$realm file
[05:38:03] Logged the message, Master
[05:40:23] !log ori synchronized wmf-config/db-eqiad.php 'Icdaa4c1b5: Configure parser cache databases in db-$realm file (1/3)'
[05:40:38] Logged the message, Master
[05:41:04] !log ori synchronized wmf-config/db-pmtpa.php 'Icdaa4c1b5: Configure parser cache databases in db-$realm file (2/3)'
[05:41:18] Logged the message, Master
[05:41:52] !log ori synchronized wmf-config/CommonSettings.php 'Icdaa4c1b5: Configure parser cache databases in db-$realm file (3/3)'
[05:42:07] Logged the message, Master
[05:52:53] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=76%):
[05:54:33] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:09:55] RECOVERY - Disk space on wtp1018 is OK: DISK OK
[06:10:26] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.006 second response time
[06:14:20] (PS1) Ori.livneh: MySQL gmond module: handle server restarts [operations/puppet] - https://gerrit.wikimedia.org/r/98068
[06:14:25] PROBLEM - Disk space on wtp1021 is CRITICAL: DISK CRITICAL - free space: / 305 MB (3% inode=76%):
[06:19:45] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:20:35] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.006 second response time
[06:21:25] RECOVERY - Disk space on wtp1021 is 
OK: DISK OK
[06:22:45] PROBLEM - Disk space on wtp1016 is CRITICAL: DISK CRITICAL - free space: / 254 MB (2% inode=76%):
[06:28:05] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:54:49] RECOVERY - Disk space on wtp1016 is OK: DISK OK
[06:54:59] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.007 second response time
[07:04:39] PROBLEM - Disk space on wtp1024 is CRITICAL: DISK CRITICAL - free space: / 60 MB (0% inode=76%):
[07:08:02] PROBLEM - Parsoid on wtp1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:10:32] PROBLEM - Host sq80 is DOWN: PING CRITICAL - Packet loss = 100%
[07:10:42] RECOVERY - Disk space on wtp1024 is OK: DISK OK
[07:10:52] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.004 second response time
[07:11:22] RECOVERY - Host sq80 is UP: PING OK - Packet loss = 0%, RTA = 35.86 ms
[07:11:32] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Thu Nov 28 07:11:22 UTC 2013
[07:11:52] RECOVERY - Backend Squid HTTP on sq80 is OK: HTTP OK: HTTP/1.0 200 OK - 486 bytes in 0.078 second response time
[07:16:25] !log powercycled sq80
[07:16:41] Logged the message, Master
[07:25:47] what's wrong with it? 
[07:26:42] couldn't tell, couldn't log in from either console or mgmt
[07:27:11] looking at atop didn't show me anything useful there about what went out to lunch, haven't looked at syslog/dmesg yet
[07:27:43] okay
[07:27:51] it's tampa anyway :)
[07:27:54] yep
[07:28:12] (was just looking at ssl1, which we don't care about, I know, but this is a bit curious)
[07:28:34] nginx: [emerg] bind() to [2620:0:860:ed1a::c]:443 failed (99: Cannot assign requested address)
[07:28:41] I had a quick look yesterday, something looks misconfigured
[07:28:45] this is mobile-lb, it was never in tampa
[07:28:53] that turns out to be /etc/nginx/sites-enabled/mobilewikipedia yeah
[07:59:10] (PS1) Nemo bis: Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98073
[08:04:24] (PS1) Nemo bis: Remove ancient ArticleFeedbackTool v4 cruft [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98074
[08:06:48] (CR) Ori.livneh: [C: 1] Remove ancient ArticleFeedbackTool v4 cruft [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98074 (owner: Nemo bis)
[08:10:48] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=76%):
[08:12:48] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:17:18] PROBLEM - Disk space on wtp1012 is CRITICAL: DISK CRITICAL - free space: / 262 MB (2% inode=76%):
[08:18:22] (PS3) Matanya: varnish : lint clean up [operations/puppet] - https://gerrit.wikimedia.org/r/97910
[08:19:13] (CR) jenkins-bot: [V: -1] varnish : lint clean up [operations/puppet] - https://gerrit.wikimedia.org/r/97910 (owner: Matanya)
[08:22:18] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused
[08:22:24] (PS4) Matanya: varnish : lint clean up [operations/puppet] - https://gerrit.wikimedia.org/r/97910
[08:22:28] PROBLEM - Disk space on wtp1005 is CRITICAL: DISK CRITICAL - free space: / 334 MB 
(3% inode=76%):
[08:23:16] (CR) jenkins-bot: [V: -1] varnish : lint clean up [operations/puppet] - https://gerrit.wikimedia.org/r/97910 (owner: Matanya)
[08:25:38] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.009 second response time
[08:25:48] RECOVERY - Disk space on wtp1011 is OK: DISK OK
[08:26:28] RECOVERY - Disk space on wtp1005 is OK: DISK OK
[08:27:01] (PS5) Matanya: varnish : lint clean up [operations/puppet] - https://gerrit.wikimedia.org/r/97910
[08:27:18] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused
[08:27:18] RECOVERY - Disk space on wtp1012 is OK: DISK OK
[08:28:18] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.007 second response time
[08:28:18] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 947 bytes in 0.003 second response time
[08:29:31] !log /var/lib/parsoid/nohup.out on wtp 1005,11,12 was 6gb or more, causing / on these boxes to fill; moved it, restarted parsoid, removed it
[08:29:44] Logged the message, Master
[08:30:28] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:31:28] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[08:32:38] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 267 MB (3% inode=76%):
[08:33:20] paravoid: thank you for your very useful comment on the varnish lint stuff. 
[08:34:38] RECOVERY - Disk space on wtp1023 is OK: DISK OK
[08:34:54] !log and wtp1023
[08:35:08] Logged the message, Master
[09:08:19] PROBLEM - Disk space on wtp1015 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=76%):
[09:11:19] RECOVERY - Disk space on wtp1015 is OK: DISK OK
[09:29:31] (PS1) Matthias Mullie: (bug 57605) Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98076
[09:35:27] hmmm mlitn, can't you just merge https://gerrit.wikimedia.org/r/98073 ?
[09:36:15] (Abandoned) Matthias Mullie: (bug 57605) Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98076 (owner: Matthias Mullie)
[09:36:22] Nemo_bis: I had missed that one ;)
[09:38:18] ok, sorry for not adding you to reviewers
[09:38:29] (I didn't think it that urgent)
[09:44:50] Nemo_bis: it's probably not that urgent, no :p
[09:46:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[10:06:13] !log stack traces filling up parsoid nohup.out logs (several gigs in only a few minutes once the parsoid gets into that state), sample on wtp1010 in /var/lib/parsoid/nohup.out.errors
[10:06:29] Logged the message, Master
[10:08:04] Object.Util.tokensToString (/srv/deployment/parsoid/Parsoid/js/lib/mediawiki.Util.js:346:13
[10:08:11] need someone to look at this basically now
[10:08:41] Util.tokensToString, invalid token: undefined tokens: [ '',
[10:08:46] plus piles and piles of crud
[10:08:59] meh meant to ask that in -dev
[10:34:31] (PS1) Ori.livneh: Hack: filter out 'invalid token' from parsoid log [operations/puppet] - https://gerrit.wikimedia.org/r/98081
[10:34:44] apergos: that's one approach, probably not optimal
[10:34:50] gonna see if i can find the bug
[10:35:09] ah that's another possibility
[10:35:58] no, that filter won't work
[10:36:05] while it's one message it's split across many lines 
[10:36:10] see the sample in uh
[10:36:16] wtp1010 in /var/lib/parsoid/nohup.out.error
[10:36:59] Util.tokensToString, invalid token: undefined tokens is just the first line of it
[10:37:01] ori-l:
[10:37:08] cat: /var/lib/parsoid/nohup.out.error: No such file or directory
[10:37:22] s
[10:37:42] my bad copy paste from the log entry
[10:37:43] cats? (just kidding, i found it)
[10:37:46] :-P
[10:37:51] and a fine morning to you toooooo
[10:37:56] :)
[10:38:07] um, yeah, probably better to just not write a log
[10:38:18] yep for now
[10:39:04] 1 gb in about 9 mins
[10:39:30] give or take...
[10:41:02] i would just hack it locally
[10:41:05] (PS1) ArielGlenn: turn off logging for parsoid for now, was filling / [operations/puppet] - https://gerrit.wikimedia.org/r/98082
[10:41:09] oh
[10:41:22] well puppet would replace em
[10:41:38] ah yeah, for a second i thought this was in /srv/dep./parsoid
[10:41:46] nah
[10:42:10] (CR) ArielGlenn: [C: 2] turn off logging for parsoid for now, was filling / [operations/puppet] - https://gerrit.wikimedia.org/r/98082 (owner: ArielGlenn)
[10:42:45] (Abandoned) Ori.livneh: Hack: filter out 'invalid token' from parsoid log [operations/puppet] - https://gerrit.wikimedia.org/r/98081 (owner: Ori.livneh)
[10:44:41] (CR) Matthias Mullie: [C: 1] "Looks good; to be merged once anyone's willing to push it out immediately (to not confuse others deploying with an undeployed patch)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98073 (owner: Nemo bis)
[10:45:47] (CR) Ori.livneh: [C: -1] "Doesn't quite do the trick yet." 
[operations/puppet] - https://gerrit.wikimedia.org/r/98068 (owner: Ori.livneh)
[10:49:19] !log turned off logging for parsoid ( https://gerrit.wikimedia.org/r/#/c/98082/ ), old logs remain in place for folks to examine
[10:49:33] Logged the message, Master
[10:50:36] apergos: is there a bug or something for that, or shall i just attach RoanKattouw_away to that gerrit changeset?
[10:51:17] no bug report yet, I haven't searched bugzilla
[10:51:37] dups are cheap >.>
[10:51:55] :-D
[10:52:27] I'll tell the bugmeister you said that :-P
[10:53:02] I would prefer multiple dup reports compared to no reports at all
[10:53:11] for sure
[10:53:24] I'm pretty sure andre has seen me say that many times
[10:53:44] I'm just still cleaning up (waiting for puppet to go around instead of forcing it, doing restarts of parsoid after puppet since there's no refresh scheduled)
[10:54:30] there's https://bugzilla.wikimedia.org/show_bug.cgi?id=53723 which is somewhat related
[10:54:34] but not the same thing
[10:54:57] looking
[10:54:58] Our puppetization is not very complete yet, with log rotation for example still missing (which causes parsoid servers to hang after running out of disk space).' 
[10:55:12] oh
[10:55:15] yeah
[10:55:39] sept
[10:55:41] grrrr
[10:56:54] meanwhile, in vagrant: http://git.wikimedia.org/blob/mediawiki%2Fvagrant.git/HEAD/puppet%2Fmodules%2Fmediawiki%2Ftemplates%2Fparsoid.conf.erb :P
[10:57:03] but the bug report I meant was for the error I see in the logs
[10:57:34] uh huh
[10:59:42] yeah, that deserves a bug report
[11:00:18] which I will check and see if there is, and if not, file one, in a little while
[11:01:03] my thoughts are too scattered to write one, otherwise i would
[11:01:57] don't worry about that
[11:02:13] I just haven't done my usual morning routine things yet
[11:15:13] (PS1) Ori.livneh: Add Salt deployment module for mwprof [operations/puppet] - https://gerrit.wikimedia.org/r/98085
[11:19:21] (CR) Ori.livneh: [C: 2] Add Salt deployment module for mwprof [operations/puppet] - https://gerrit.wikimedia.org/r/98085 (owner: Ori.livneh)
[11:22:58] (PS1) Ori.livneh: Add mwprof build dependencies to mwprof role [operations/puppet] - https://gerrit.wikimedia.org/r/98086
[11:55:18] (CR) Ori.livneh: [C: 2] Add mwprof build dependencies to mwprof role [operations/puppet] - https://gerrit.wikimedia.org/r/98086 (owner: Ori.livneh)
[13:31:53] (CR) Manybubbles: [C: -1] "The problem with this is that we turn Cirrus on as a secondary to build the index. 
If users switch to Cirrus during the index build they'" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98046 (owner: Legoktm)
[14:00:01] (CR) Hashar: "apergos : make sure to ping Parsoid folks about it or they will wonder where the log went :D" [operations/puppet] - https://gerrit.wikimedia.org/r/98082 (owner: ArielGlenn)
[14:41:38] (CR) ArielGlenn: "Hashar: I already opened a bug on the underlying issue:" [operations/puppet] - https://gerrit.wikimedia.org/r/98082 (owner: ArielGlenn)
[14:59:09] (CR) Odder: "Not sure that's still needed, for two reasons: (1) Nobody merged it yet, and (2) No one (except me) commented on the bug since this patch" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016 (owner: TTO)
[15:05:05] (PS6) Addshore: Start wikidata puppet module for builder [operations/puppet] - https://gerrit.wikimedia.org/r/96552
[16:07:48] (CR) Steinsplitter: [C: 1] Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98073 (owner: Nemo bis)
[16:38:41] When is Mark returning from vacation?
[16:42:21] in about two more weeks iirc
[16:42:25] the 10th?
[16:42:26] something like that
[16:42:27] why?
[16:48:37] people keep referring to him for various issues :)
[16:50:32] if you are waiting for something from him you should ensure there's a bugzilla or RT ticket open rather than wonder when he's back ;)
[16:52:08] nah, this is more of a discussional thing
[16:57:11] how much disk space can be spared on the varnish boxes for local kafka message queueing in case the transatlantic link is down?
[16:57:19] 10G? 100G?
[16:58:07] < 10G with the current partitioning
[16:58:18] is 10G
[16:58:47] used is ~2G, plus say keeping it at 80% at most
[16:58:51] so maybe 5-6G
[16:58:54] okay. a rough estimate is that 10G is one hour of vk logs
[16:59:06] per box?
[16:59:39] for which boxes? 
:)
[17:03:42] upload boxes, doing 2kreqs/s, each json message is approx 1.5k (might be a bit high).
[17:05:24] just frontends I assume?
[17:06:34] paravoid: are there any other varnish boxes?
[17:06:41] 2k req/s sounds too small
[17:07:01] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=frontend.client_req&s=by+name&c=Upload+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[17:07:07] so, up to 6k req/s
[17:07:57] good graph
[17:08:00] so we have (text, bits, upload, mobile, parsoid) x (eqiad, esams, ulsfo, $newdc) clusters
[17:08:11] each of these clusters has multiple servers
[17:08:20] each server has two varnishd, one frontend and one backend
[17:09:16] so it's router -> LVS -> random frontend in-memory caching -> consistent (url hashing) backend with persistent SSD store
[17:09:23] for the simple case of eqiad
[17:09:38] esams is LVS -> frontend esams -> backend esams -> backend eqiad
[17:09:50] (ulsfo == esams for all intents and purposes)
[17:10:25] oh and there's also SSL terminators, which is LVS -> source ip-hashed SSL terminator -> LVS -> varnish frontend -> ...
[17:10:53] fun ain't it
[17:14:18] are you collecting logs from the backend varnishes?
[17:15:13] right now? no
[17:15:39] it probably doesn't make much sense, although it can be useful for debugging
[17:59:14] Snaps: are you familiar with ?
[18:00:49] * YuviPanda|away +2s all the things
[18:09:12] ori-l: ;)
[18:09:39] Snaps: I don't mean it patronizingly; a lot of the early analytics decisions were made before the requirements were grokked
[18:10:05] ori-l: are you bold ? 
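As a rough check of the numbers quoted above (2k req/s per upload box, about 1.5k per JSON message, up to 6k req/s on the Ganglia graph, and maybe 5-6G of spare disk), the arithmetic can be sketched as below. The sketch assumes "1.5k" means 1500 bytes and uses decimal gigabytes; the function names are hypothetical and exist only for this example.

```python
def bytes_per_hour(req_per_s, msg_bytes):
    """Log volume produced in one hour at a steady request rate."""
    return req_per_s * msg_bytes * 3600


def buffer_minutes(spare_bytes, req_per_s, msg_bytes):
    """How many minutes of messages fit into the spare disk space."""
    return spare_bytes / (req_per_s * msg_bytes) / 60


GB = 10**9  # decimal gigabyte, an assumption for this estimate

# 2k req/s at ~1.5 kB per message: about 10.8 GB per hour,
# in line with the "10G is one hour" estimate.
print(bytes_per_hour(2000, 1500) / GB)

# At the 6k req/s peak the same box would produce about 32.4 GB per hour.
print(bytes_per_hour(6000, 1500) / GB)

# 5 GB of spare space buffers roughly 28 minutes at 2k req/s
# and roughly 9 minutes at 6k req/s.
print(round(buffer_minutes(5 * GB, 2000, 1500), 1))
print(round(buffer_minutes(5 * GB, 6000, 1500), 1))
```

Under these assumptions the 5-6G that can be spared covers well under an hour of link outage per box, even at the lower request rate.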
[18:10:13] average: i'm italicized
[18:10:18] oh I see
[18:10:33] like the fields that are logged by default, they're more than we can ever actually analyze IMO
[18:10:52] and the exaggerated OMG EVERY BIT IS PRECIOUS approach to reliability
[18:11:45] if the esams - eqiad link goes down for an hour, it's OK to just lose an hour of data, as long as there's a clear indication in the logs that some data was lost, so it can be accounted for in the analysis phase
[18:12:21] okay, I follow you
[18:14:13] ori-l: oh, so how do you account for missing data ?
[18:14:19] interpolation ?
[18:14:25] average: yes, for example
[18:14:31] but there's a flip side to that coin; being technically bold which might be making sure no message is "ever" lost
[18:15:19] Snaps: mostly I'm just paying you a compliment and saying: it's clear that you're a very good software engineer, and if you have good intuitions about what sort of setup makes sense and is doable, you should feel free to propose them; it doesn't make sense to shackle you to requirements that may not have made sense anyway
[18:15:39] that's just my opinion, though
[18:17:46] * average is thinking that actual data would be cooler than doing interpolations
[18:18:56] * average is probably ignorant to what it would take to have 100.00% logs and no gaps/missing_data
[18:20:22] 100% is probably impossible ;)
[18:20:48] As with everything, there's usually a tradeoff
[18:23:48] a while back mark wrote that 'Also: we "lost" about 10 years of data, before we ever started logging and analysing anything at all. It was fine. It may not be optimal, it may not be professional, but we still became a top 5 website that way. Let's keep this in perspective.'
[18:24:38] 82% of all statistics are made up anyway
[18:25:10] well, I don't see this as license to be sloppy or imprecise
[18:25:48] ori-l: I think that went straight over your head
[18:25:58] nah, I know the quip
[18:26:05] lol
[18:26:34] but rather to accept occasional faults and outages as a fact of life for small big site like us and to handle them gracefully and properly
[18:26:51] * ori-l is rambling
[18:26:56] ori-l: It's a holiday!
[18:26:58] back to breakfast
[18:26:59] gtfo irc nao
[18:27:01] bye!
[18:27:05] ori-l: thanks, and I do appreciate that attitude. In this case it's not so much sticking to requirements (haven't seen many), but getting to know the platform and environment to know the current extents. (and not to piss off the greek ops division)
[18:27:28] Snaps: the greek ops division is very smart and very reasonable in my experience :)
[18:28:08] ok, off for real, have a nice evening / day / morning / something
[19:29:39] ori-l: make a commit?
[19:29:49] it's an uninitialized repo?
[19:29:56] I'm pretty sure that's an ignorable message
[20:20:27] Snaps: the conclusion is easy IMHO, if you have 10 GB logs per hour you're doing it all wrong; sample the logs, all problems solved.
[20:29:07] (PS1) Yuvipanda: toollabs: Add uwsgi to exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/98126
[20:30:34] Nemo_bis: Interesting. Sample everything always? Sample at congestion? Move sampling to producer side altogether (currently done at consumers)? Is producer side sampling fair? etc.
[20:36:03] Snaps: full logging is inherently wrong, so sampling will always be better ;) ; errors are not a problem if they are constant
[20:37:31] Nemo_bis: I disagree! :) and there is a lot of work being done on realtime stream processing which looks interesting (samza, storm, ..)
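To make the "sample on the producer side" option from the exchange above concrete, here is a minimal sketch of counter-based 1-in-N sampling. It is not how varnishkafka or the existing consumers work; the class and method names are hypothetical, and it only illustrates why the error stays bounded when the sampling rate is constant.

```python
# Hypothetical 1-in-N sampler: keeps every n-th message it is offered.
# Scaling the kept count by n gives an estimate of the total that is
# off by less than n messages, regardless of how long the stream runs.

class OneInNSampler:
    def __init__(self, n: int):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.seen = 0   # messages offered so far
        self.kept = 0   # messages that passed the sampler

    def offer(self, message) -> bool:
        """Return True if this message should be sent on, False if dropped."""
        keep = self.seen % self.n == 0
        self.seen += 1
        if keep:
            self.kept += 1
        return keep

    def estimated_total(self) -> int:
        """Estimate of how many messages were offered, from the kept count."""
        return self.kept * self.n


sampler = OneInNSampler(1000)
forwarded = [m for m in range(10_000) if sampler.offer(m)]
print(len(forwarded), sampler.estimated_total())
```

A counter treats every request equally, which speaks to the fairness question; sampling only under congestion would instead change the rate over time and need the rate recorded alongside the data.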
[20:37:56] * YuviPanda offers his agreement to Snaps
[20:38:39] really want real time data access for short intervals at least, over large intervals reasonable aggregation is fine (large intervals = up to the last day)
[20:38:49] for page statistics; sure, sampling is probably okay. But if there is a reliable event stream there are tons of interesting and probably pointless things one could do
[20:40:00] yeah I'm not thinking page views but performance and other monitoring
[20:40:37] (CR) Yuvipanda: [C: -1] "Okay, installing it automatically starts it at startup, which we don't want. Need to add a rule to not make that happen" [operations/puppet] - https://gerrit.wikimedia.org/r/98126 (owner: Yuvipanda)
[20:41:17] Snaps: one of the things I'd really love to have is a 'referrer' graph, which shows aggregate info on how many times a page was visited from another page
[20:41:27] Snaps: only from internal links
[20:41:35] so no privacy violations, and only aggregate
[20:41:50] Snaps: it'll help us do things like 'also read...' and stuff
[20:42:00] yeah. reactive low-turnaround recommendations
[20:42:12] yeah
[20:42:28] Snaps: is the analytics team doing pageview capturing now?
[20:42:40] I have no idea :)
[20:42:44] I thought that was taken off the table and deprioritized?
[20:42:45] ah :D
[20:42:46] right
[20:44:05] the fun starts when the full firehose and all the infrastructure and tools to analyze it are in place. That's when all the "hey, lets try this" awesomeness begins
[20:44:47] it should be easy to jack in to.
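The 'referrer' graph idea described above (aggregate counts of page-to-page visits, internal links only) could be prototyped along these lines. This is a rough sketch under stated assumptions: the (referer, url) event shape, the `INTERNAL_HOSTS` set, and the `/wiki/` path prefix are illustrative choices, not a description of any existing log format or tool.

```python
# Hypothetical aggregation of internal page-to-page transitions.
# Only counts are kept; no per-user or per-request data survives.

from collections import Counter
from urllib.parse import urlsplit

INTERNAL_HOSTS = {"en.wikipedia.org"}  # assumption: hosts treated as internal
PREFIX = "/wiki/"                      # assumption: article path prefix


def internal_edge(referer: str, url: str):
    """Return (from_page, to_page) for an internal link, else None."""
    ref, dst = urlsplit(referer), urlsplit(url)
    if ref.hostname not in INTERNAL_HOSTS or ref.hostname != dst.hostname:
        return None  # external, missing, or cross-site referer: ignore
    if not (ref.path.startswith(PREFIX) and dst.path.startswith(PREFIX)):
        return None
    return ref.path[len(PREFIX):], dst.path[len(PREFIX):]


def referrer_graph(events) -> Counter:
    """Count how often each internal (from_page, to_page) pair occurs."""
    counts = Counter()
    for referer, url in events:
        edge = internal_edge(referer, url)
        if edge is not None:
            counts[edge] += 1
    return counts


events = [
    ("http://en.wikipedia.org/wiki/Varnish", "http://en.wikipedia.org/wiki/Squid"),
    ("http://en.wikipedia.org/wiki/Varnish", "http://en.wikipedia.org/wiki/Squid"),
    ("http://example.com/", "http://en.wikipedia.org/wiki/Squid"),
    ("-", "http://en.wikipedia.org/wiki/Varnish"),
]
print(referrer_graph(events).most_common(1))
```

Dropping everything except the aggregated pair counts is what keeps this in line with the "no privacy violations, and only aggregate" constraint.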
[20:45:55] yeah
[22:19:44] (CR) MaxSem: "The API is already on tool labs: http://tools.wmflabs.org/heritage/api/api.php" [operations/apache-config] - https://gerrit.wikimedia.org/r/98057 (owner: Faidon Liambotis)
[22:24:10] (PS1) Edenhill: Correctly calculate escape buffer size [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/98134
[22:24:11] (PS1) Edenhill: Added %{VCL_Log:key}x support [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/98135
[22:24:12] (PS1) Edenhill: Tag column reader was used incorrectly [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/98136
[23:07:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000