[00:00:34] ori-l: btw, did you ever have a look at the effect of switching AF/ME to esams?
[00:01:24] paravoid: no, but i didn't forget about it either, will do it this weekend sometime
[00:01:33] weekend? come on :)
[00:01:51] ori-l: you're going to burn out hard :)
[00:02:02] and how
[00:03:30] ok, maybe next week
[00:03:52] (PS1) Ori.livneh: Omit parentheses from metric names [operations/puppet] - https://gerrit.wikimedia.org/r/83204
[00:04:28] last one?
[00:05:08] (CR) Faidon Liambotis: [C: 2] Omit parentheses from metric names [operations/puppet] - https://gerrit.wikimedia.org/r/83204 (owner: Ori.livneh)
[00:05:37] woot. now i just have to figure out what these numbers mean :P
[00:05:59] * paravoid hates the underscore in the name of the view
[00:06:36] ok, but you have to merge it
[00:06:41] how about the name itself?
[00:06:47] i'm not sure "perceived latency" is great
[00:07:02] I'd probably name it Navigation Timing or something
[00:07:06] yep
[00:07:23] I have a question though
[00:07:25] why ganglia?
[00:07:43] I mean, I like ganglia (fsvo like)
[00:07:56] but we've been using graphite for perf data so far
[00:08:06] I don't mind having it in ganglia too of course
[00:08:38] it's going into graphite too
[00:08:40] but
[00:08:42] (PS1) Ori.livneh: Ganglia: rename 'perceived_latency' view to 'Navigation Timing' [operations/puppet] - https://gerrit.wikimedia.org/r/83205
[00:08:52] 03:04 < ori-l> last one?
[00:08:54] liar!
[00:09:09] i deserve that
[00:09:16] (CR) Faidon Liambotis: [C: 2] Ganglia: rename 'perceived_latency' view to 'Navigation Timing' [operations/puppet] - https://gerrit.wikimedia.org/r/83205 (owner: Ori.livneh)
[00:09:37] and then I'll need to manually delete the old view
[00:09:40] with the mediawiki error graphing, i spent some time thinking about how we might do anomaly detection
[00:09:59] and came up with all sorts of complex solutions that would be difficult to implement
[00:10:15] but we get a lot of free anomaly detection from just slapping it in the channel topic and making it accessible to anyone interested
[00:10:28] lol
[00:11:51] i think you also see ganglia feature much more prominently in post-mortems for a related reason.. the fact that it's authenticated makes it just a shade less convenient to access and share links, but it's probably exactly the nudge required to push things in the direction of ganglia
[00:12:09] the fact that graphite requires authentication, i mean
[00:12:23] well, it doesn't
[00:12:30] the graphs don't :)
[00:12:49] right, but then it's less powerful than ganglia
[00:13:09] again, I don't mind
[00:13:13] i'm just afraid of overlap
[00:13:37] i think it's a legitimate worry, not really sure what to do about it
[00:13:45] nothing for now
[00:13:47] we'll see
[00:14:43] shame we didn't have that DNS graph before the migration
[00:14:56] the data is there, we just weren't graphing it
[00:15:06] just means we need to do a bit of manual work to plot it, but not that hard
[00:15:08] it's all on db1047
[00:15:18] hm
[00:15:20] that's interesting
[00:15:49] it's a shame ganglia doesn't let you spoof a timestamp, otherwise i'd back-fill
[00:16:03] you can always construct the rrd
[00:16:06] :)
[00:16:29] yeah, that's not a half-bad idea
[00:16:33] slightly crazy but in a good way
[00:16:37] heh
[00:17:19] ugh
[00:17:29] the fact that the title is followed by the unit of time without any separator is gross
[00:18:04] but i'm done tinkering with it for now
[00:18:10] thanks very much for the merges
[00:18:44] DNS data started showing up too
[00:18:46] we'll do more next week, don't worry :)
[00:19:49] that's what, ms?
[00:34:38] paravoid: yep (re: ms)
[01:11:58] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours
[02:01:31] !log LocalisationUpdate failed: git pull of extensions failed
[02:01:35] Logged the message, Master
[02:17:43] (PS3) TTO: Adjust reupload-own permissions for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80546
[02:18:00] (PS5) TTO: skwiktionary: Set site logo to local file [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80321 (owner: Danny B.)
[03:37:19] PROBLEM - MySQL Processlist on db1004 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 90 statistics
[03:38:19] RECOVERY - MySQL Processlist on db1004 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 6 statistics
[03:52:38] :(
[03:52:44] what's wrong with these messages
[03:53:48] it's fucked, no?
[03:55:00] well, it gives lots of 0-thread states
[03:55:02] that mean nothing
[03:55:08] as generally you will always see e.g. 0 locked threads
[03:55:23] and "unauthenticated" means either long distance connections that wikimedia has none of
[03:55:33] or dns failure, that doesn't happen if you're running with skip-name-resolve
[03:55:41] otoh, copy to table also is not much of a state in most web apps
[03:55:42] ergh.
[03:55:55] and "statistics" means "looking up data"
[03:56:11] * domas points to http://dom.as/2009/09/27/mysql-processlist-phrase-book/
[04:00:46] well,
[04:00:57] i was about to point you to the icinga check that generates that alert
[04:01:02] but git.wikimedia.org is 500ing
[04:01:49] !log git.wikimedia.org: HTTP 500; Varnish XID 196670102
[04:01:53] Logged the message, Master
[04:13:08] domas: anyways, it's part of pmp-check-mysql-processlist. the particular alert threshold that is being met appears to be this one: MAX="$(max "${UNAUTH:-0}" "${LOCKED:-0}" "${COPYIN:-0}" "${STATIS:-0}")"
[04:14:42] where UNAUTH = 'unauthenticated', LOCKED = sum of 'Locked', 'Waiting for table level lock', 'Table lock', COPYIN = '.*opy.* to.* table.*' and STATIS = 'statistics'
[04:39:33] is someone around who can build debian packages... I'd like to get operations/debs/LaTeXML to apt.wikimedia.org
[05:03:57] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: No successful Puppet run in the last 10 hours
[05:04:37] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:05:27] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[05:22:28] PROBLEM - RAID on db1053 is CRITICAL: CRITICAL: Degraded
[05:23:45] !log gitblit was returning errors for some urls with java.lang.NullPointerException, stopped and restarted it, seems ok now
[05:23:48] Logged the message, Master
[05:48:19] (PS1) TTO: Localized logo for hewikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/83218
[06:36:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:43:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[06:46:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:48:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[07:32:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:33:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.465 second response time
[07:39:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:42:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.384 second response time
[08:55:45] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 1653170 seconds
[09:23:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:24:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time
[10:33:13] what would be the easiest way to retrieve in the repos a list of all WMF "real" domains?
[10:38:27] prettiest I found so far is https://git.wikimedia.org/blob/operations%2Fdns.git/def7b5fb8be5b606482a60ae98f117b0f28983ef/templates%2Fwikimedia.org#L564
[10:40:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:42:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.442 second response time
[11:06:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:07:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.384 second response time
[11:12:46] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours
[12:33:06] (PS1) Umherirrender: $wgCaptchaWhitelist: whitelist also links with query or anchor [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/83225
[12:34:18] (CR) Umherirrender: "Untested, if escaping of regex is correct" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/83225 (owner: Umherirrender)
[13:10:29] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:12:19] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[13:27:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[14:31:28] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:32:28] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[14:53:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:54:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:03:58] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: No successful Puppet run in the last 10 hours
[15:12:48] /msg ChanServ HELP
[15:12:49] /msg ChanServ HELP op
[15:27:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time
[15:32:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[15:58:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:03:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.642 second response time
[16:27:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:28:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.102 second response time
[17:23:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:24:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.935 second response time
[17:55:17] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 33 statistics
[17:58:17] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 3 statistics
[18:00:18] heh, nice
[18:00:25] replication unbroken for nearly 500 days
[18:02:17] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 39 statistics
[18:04:17] RECOVERY - MySQL Processlist on db1002 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 7 statistics
[18:19:29] PROBLEM - MySQL Processlist on db1051 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 72 statistics
[18:22:39] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 146 seconds
[18:22:53] lol, msnbot is stupid
[18:23:21] doesn't it go without saying
[18:23:29] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[18:23:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:24:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[18:24:51] well
[18:24:58] it is indexing per-page move/deletion logs
[18:24:59] and what not
[18:25:06] I wonder where it picks the links for those
[18:34:26] domas: why not? it's linked from history and history links are not nofollow'ed, only rel="archives" whatever that means
[18:35:29] RECOVERY - MySQL Processlist on db1051 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 4 statistics
[18:37:17] (PS1) Ori.livneh: NavigationTiming StatsD instance: flush every 5 mins [operations/puppet] - https://gerrit.wikimedia.org/r/83230
[18:38:44] Nemo_bis: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&tab=v&vn=Navigation+Timing
[18:39:00] ori-l: good morning
[18:39:08] hey :)
[18:39:17] hopefully the patch above will smooth out the mobile data
[18:39:21] uh what's that pretty navigation sidebar
[18:39:35] it's the list of views; that's standard
[18:39:41] did you also find out how to make the eventlogging combined graphs in logscale
[18:40:07] it's easy to do in graphite
[18:40:32] before investing too much time in making it possible in ganglia i want to spend some time investigating the possibility of making more of graphite public
[18:40:47] gotta run, ttyl
[18:40:52] dinnertime for me
[18:41:10] bon appetit
[20:48:34] (PS1) Edenhill: Added README.md markdown file [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/83347
[21:13:03] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours
[21:37:47] (CR) Diederik: [C: 1] "Thanks Magnus!" [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/83347 (owner: Edenhill)
[23:13:11] PROBLEM - RAID on db1001 is CRITICAL: CRITICAL: Degraded
[23:32:46] (PS1) Jeroen De Dauw: Update some rss/atom feeds in the planet wikimedia config [operations/puppet] - https://gerrit.wikimedia.org/r/83348
[23:32:52] (CR) jenkins-bot: [V: -1] Update some rss/atom feeds in the planet wikimedia config [operations/puppet] - https://gerrit.wikimedia.org/r/83348 (owner: Jeroen De Dauw)
[23:35:50] (CR) Jeroen De Dauw: "wtf?" [operations/puppet] - https://gerrit.wikimedia.org/r/83348 (owner: Jeroen De Dauw)
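
Editor's note on the 04:13 exchange about pmp-check-mysql-processlist: the quoted MAX=... line suggests the check alerts on whichever of the four thread-state counts is largest, which would explain why "90 statistics" alone trips CRITICAL while the other three counts sit at 0. The sketch below is a rough illustration of that logic and not the plugin's source. The max helper body, the status_for function, and the WARN/CRIT values are assumptions made up for this note; only the MAX=... line and the counts come from the log.

```shell
#!/bin/sh
# Illustrative sketch only; not taken from pmp-check-mysql-processlist.

# Assumed helper: print the largest of the integer arguments.
max() {
    largest="$1"
    shift
    for n in "$@"; do
        if [ "$n" -gt "$largest" ]; then
            largest="$n"
        fi
    done
    echo "$largest"
}

# Hypothetical thresholds, chosen for illustration only.
WARN=16
CRIT=32

# Hypothetical mapping from the worst per-state count to a status word.
status_for() {
    if [ "$1" -gt "$CRIT" ]; then
        echo "CRITICAL"
    elif [ "$1" -gt "$WARN" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Counts from the 03:37:19 alert on db1004.
UNAUTH=0 LOCKED=0 COPYIN=0 STATIS=90

# The line quoted in the log: take the worst of the four state counts.
MAX="$(max "${UNAUTH:-0}" "${LOCKED:-0}" "${COPYIN:-0}" "${STATIS:-0}")"
echo "$(status_for "$MAX"): worst thread-state count is $MAX"
```

Because only the maximum is compared, the three zero counts in the alert text carry no information, which is the complaint raised at 03:55.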