[00:02:21] Ryan_Lane: hooft
[00:04:46] LeslieCarr: where is the htcp purge code?
[00:05:36] /usr/local/bin/udpmcast.py
[00:05:40] not AaronSchulz
[00:05:58] dunno, however that is not the important bit - important is finding the multicast relay
[00:05:59] at the moment
[00:06:07] in the future, it may become the important bit
[00:06:14] AaronSchulz: puppet/files/varnish/varnishhtcpd
[00:06:36] AaronSchulz: refreshLinks.* jobs aren't running at all
[00:06:49] ahh, my "find" command had . that messed it up
[00:07:09] binasher: where's the unicast -> multicast gateway?
[00:07:23] it's not documented and isn't puppetized
[00:07:42] and two of us are slowly going insane
[00:07:44] i am hoping with all my hoping ability that you know
[00:07:46] not so slowly
[00:08:03] you're not already insane?
[00:08:13] i don't know.. but i can probably find out
[00:08:14] just a sec
[00:08:22] oh yay thank you
[00:08:23] binasher: are you running jobs on fenari?
[00:08:25] Ryan_Lane: the one on oxygen?
[00:08:31] also, we will then put it in wikitech
[00:08:32] or is that something else?
[00:08:38] oxygen is for logging
[00:09:07] AaronSchulz: no, reedy is
[00:09:20] I had this same exact problem last time I tried to find it
[00:09:26] yeah
[00:09:30] then ma rk came on, fixed it in a few minutes and disappeared
[00:09:36] and it's still not documented or puppetized
[00:09:48] lame
[00:13:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:13:47] binasher: /etc/wikimedia-site is pmtpa there, how is it not writing to a read-only slave?
[00:13:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.069 second response time
[00:14:16] AaronSchulz: tim put the eqiad masters in slot 0 of db-pmtpa.php
[00:14:16] binasher: let us know if there's anything we can do to help - any searches, salt searches, etc
[00:14:26] AaronSchulz: i don't really like that at all
[00:14:32] he did for crons on hume but.. memcached consistency issues.
[00:14:45] and other reasons
[00:14:45] do we have a hume counterpart?
[00:14:50] we can just move everything on hume to eqiad
[00:14:52] it's all puppetized
[00:15:09] I just reimaged recently
[00:18:11] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 189 seconds
[00:18:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds
[00:19:47] LeslieCarr: well, there's an instance running on dobson
[00:20:06] that being an old server is promising
[00:20:11] * LeslieCarr scurries over and looks
[00:20:12] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[00:20:12] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds
[00:20:13] that wants to forward to hooft in esams
[00:20:32] that looks like the right group
[00:20:36] that's the right Z endpoint
[00:20:36] dobson doesn't appear to actually get the 239.128.0.112 mcast traffic though
[00:20:36] :)
[00:20:45] if oxygen is for logging then it's udp2log and analytics. but there is also "fluorine" using class { "role::logging::mediawiki"
[00:20:48] though it tries to join
[00:20:49] dobson.wikimedia.org > 239.128.0.112: igmp v2 report 239.128.0.112
[00:21:14] there is something insane going on with that freaking group
[00:21:22] but yay!
[00:21:23] huzzah~!
[00:21:31] thank you
[00:21:33] AaronSchulz: any idea re: refreshLinks jobs?
[00:21:35] no prob
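
[Editor's aside: since the relay hunted above was never documented or puppetized, here is a minimal sketch of what a multicast-to-unicast relay in the spirit of /usr/local/bin/udpmcast.py might look like. This is an illustration, not the actual script: the group comes from the igmp report above; the port (4827, the standard HTCP port, confirmed in the !log further down) and the esams hostname are assumptions.]

    import socket
    import struct

    MCAST_GROUP = '239.128.0.112'   # HTCP purge group from the igmp report above
    MCAST_PORT = 4827               # standard HTCP port (assumed here)
    UNICAST_TARGET = ('hooft.esams.wikimedia.org', 4827)  # illustrative FQDN

    # Receive socket: bind the HTCP port and join the purge multicast group.
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    rx.bind(('', MCAST_PORT))
    mreq = struct.pack('4sl', socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    # Transmit socket: plain UDP for the inter-datacenter unicast leg.
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    while True:
        datagram, _sender = rx.recvfrom(65535)  # one HTCP purge per datagram
        tx.sendto(datagram, UNICAST_TARGET)

[Note that a script like this can join the group (hence the igmp v2 report above) and still receive nothing if the routers' multicast forwarding state is stale - which matches the symptom here and the eventual fix of flushing the forwarding table.]
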
[00:25:18] Are they what is taking so long?
[00:25:57] Reedy: no, they aren't being run at all
[00:26:06] deja vu?
[00:26:17] oh, do you want to change how often those run?
[00:26:25] # add cron jobs - usage: @ (these are just needed monthly) (note: s1 is temp. deactivated) cronjob { ['s2@2', 's3@3', 's4@4', 's5@5', 's6@6', 's7@7']: }
[00:26:43] we never ran s1 per cron, it would not finish
[00:27:17] it should not run today.. hmm.. just the first week of a month, one cluster per day
[00:29:09] ooh
[00:29:54] ./manifests/misc/maintenance.pp class misc::maintenance::refreshlinks
[00:32:06] binasher: not really, still looking
[00:32:07] New patchset: Pyoungmeister; "coredb: this should make everything set to migrate es1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45916
[00:32:11] the pending job list cache key looks fine
[00:32:24] there are processes on mw1001
[00:32:36] nextJobDB seems to work
[00:33:01] i manually deleted that key from memcached at one point earlier
[00:33:08] they are running with --dfn-only if that is relevant at all
[00:33:15] php multiversion/MWScript.php runJobs.php enwiki --type refreshLinks is slow but works
[00:33:32] binasher: was it wonked?
[00:34:50] enwiki:jobqueue:refreshLinks:isempty contains the string "false"
[00:35:10] it didn't seem to be, just wanted to see it get replaced
[00:36:05] the only db query containing refreshLinks coming from the jobbers is "SELECT /* JobQueueDB::doIsEmpty */ 1 FROM `job` WHERE job_cmd = 'refreshLinks2' AND job_token = '' LIMIT 1"
[00:36:26] preilly: can you email a link?
[00:36:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45916
[00:37:58] AaronSchulz: sure
[00:38:21] AaronSchulz, dschoon: http://hackasurfa.yerdle.com/
[00:39:15] interesting.
[00:40:19] binasher: apparently running refreshLinks2 just aborts out almost all the jobs as duplicates
[00:40:35] our de-duping code has gotten much more efficient!
[00:55:21] binasher: didn't this happen once before?
[00:56:42] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[00:57:24] AaronSchulz: yep
[01:01:37] * paravoid is back
[01:01:43] gwicke: http://hackasurfa.yerdle.com/
[01:05:22] i'm taking off for a weekend in wine country, bye all
[01:05:32] heh
[01:08:12] dschoon: any interest?
[01:09:02] preilly: while hanging out with Adam Werbach would be cool, i have plans already :(
[01:09:35] dschoon: Do you know Adam?
[01:09:46] nope, but i know s&s
[01:10:11] dschoon: cool yeah Adam and I are friends too bad you're busy
[01:11:08] dschoon: maybe next time
[01:11:30] is paul arden still alive?
[01:11:38] clever dude, that one
[01:12:09] dschoon: Died: April 2, 2008
[01:12:33] yeah, i thought so :(
[01:12:58] his books were pretty great for being 50 pages of ad copy
[01:14:01] !log deactivating multicast for 1 minute in order to try and flush the multicast forwarding table
[01:14:14] Logged the message, Mistress of the network gear.
[01:16:02] preilly: https://gerrit.wikimedia.org/r/45802, https://gerrit.wikimedia.org/r/45800, https://gerrit.wikimedia.org/r/45798 … from earlier today
[01:16:23] so, it seems jobs-loop is stuck in the high priority loop
[01:16:36] wikis with no such jobs are still being listed as having them so it keeps checking...
[01:16:52] I can see that it does no jobs for them so they clearly have none
[01:17:21] yes!!!
[01:17:39] sucks for the low priority jobs
[01:17:48] I see no memcached errors in the logs
[01:18:19] huzzah!!!!!
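
[Editor's aside: a hedged sketch of the starvation condition just described - names invented, this is not the actual jobs-loop/nextJobDB code. The cached list of wikis with pending high-priority jobs goes stale, so the runner keeps iterating over wikis that report work but yield none, and the branch that would pick up the starved refreshLinks jobs is never reached.]

    HIGH_PRIORITY_TYPES = ['enotifNotify', 'htmlCacheUpdate']  # illustrative set

    def jobs_loop(pending_wikis, run_one_batch):
        while True:
            # pending_wikis() reads a cached list (cf. the
            # enwiki:jobqueue:refreshLinks:isempty key above); it can be stale.
            busy = pending_wikis(HIGH_PRIORITY_TYPES)
            if busy:
                for wiki in busy:
                    # Runs zero jobs when the queue is actually empty, but the
                    # stale cache entry keeps the wiki in the list anyway.
                    run_one_batch(wiki, HIGH_PRIORITY_TYPES)
                continue  # the low-priority branch below is starved
            for wiki in pending_wikis(None):
                run_one_batch(wiki, None)
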
[01:18:49] * AaronSchulz waits for more of Leslie's insightful comments
[01:18:57] rfaulkner: Change merged: preilly; [sartoris] (master) - https://gerrit.wikimedia.org/r/45802 Change merged: preilly; [sartoris] (master) - https://gerrit.wikimedia.org/r/45800 Change merged: preilly; [sartoris] (master) - https://gerrit.wikimedia.org/r/45798
[01:19:00] htcp purging is working
[01:19:12] see, insightful
[01:19:19] fucking hell that was ridiculous
[01:19:28] i took my inspiration from the IT crowd
[01:19:30] LeslieCarr: lame
[01:19:36] LeslieCarr: ha ha ha
[01:19:51] https://www.youtube.com/watch?v=nn2FB1P_Mn8
[01:20:07] AaronSchulz: commonswiki: 1627300
[01:20:09] heh
[01:20:28] LeslieCarr: you know the IT crowd boss is on Portlandia now right?
[01:20:56] LeslieCarr: Douglas Reynholm (Series 2-4) Played by: Matt Berry
[01:21:19] paravoid: what about it?
[01:21:26] !log htcp purging across datacenters now "works". dobson is now receiving purge requests on multicast group 239.128.0.112 port 4827 and transmitting them via udpmcast.py (started by rc.local) to hooft in esams
[01:21:37] Logged the message, Mistress of the network gear.
[01:21:44] LeslieCarr: sweet!
[01:21:45] AaronSchulz: nothing, it's funny how it's over 1.6 million
[01:25:03] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[01:25:04] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[01:25:04] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[01:25:04] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:25:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[01:25:04] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[01:25:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:25:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:31:01] leslieCarr - thanks!
[02:21:31] !log aaron synchronized php-1.21wmf8/includes/job/JobQueueDB.php
[02:21:42] Logged the message, Master
[02:28:47] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'logging'
[02:28:57] Logged the message, Master
[02:29:36] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[02:29:40] !log LocalisationUpdate completed (1.21wmf8) at Sat Jan 26 02:29:39 UTC 2013
[02:29:45] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[02:29:50] Logged the message, Master
[02:33:03] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms
[02:33:36] RECOVERY - Host ms-be1011 is UP: PING WARNING - Packet loss = 73%, RTA = 0.37 ms
[02:34:57] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[02:35:45] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[02:36:26] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[02:36:30] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms
[02:38:08] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: aawiki to 1.21wmf8
[02:38:18] Logged the message, Master
[02:38:37] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:41] Ryan_Lane: salt-minion's init script is broken
[02:38:42] Rather than invoking init scripts through /etc/init.d, use the service(8)
[02:38:45] utility, e.g. service S20salt-minion start
[02:38:47] initctl: Unknown job: S20salt-minion
[02:38:50] Since the script you are attempting to invoke has been converted to an
[02:38:53] Upstart job, you may also use the start(8) utility, e.g. start S20salt-minion * Starting Salt Minion [fail]
[02:39:06] Ryan_Lane: and I also still see it dying quite often in dmesg
[02:40:17] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[02:40:36] (ignore that)
[02:41:26] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[02:54:55] !log LocalisationUpdate completed (1.21wmf7) at Sat Jan 26 02:54:55 UTC 2013
[02:55:07] Logged the message, Master
[03:25:04] oomph
[03:49:59] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:36] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[04:12:56] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:53] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours
[04:17:53] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[04:19:59] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[04:20:00] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[04:20:53] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours
[04:21:57] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
[04:32:40] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:40] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:41] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[04:38:41] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[05:41:10] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours
[06:09:26] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[08:09:28] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours
[08:10:26] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:15:00] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:57:15] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:29] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:30] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:30] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[11:26:30] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:30] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:31] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:39:42] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[13:40:51] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:42:40] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[13:51:22] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[14:08:43] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[14:14:28] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:25] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours
[14:19:25] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[14:21:23] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[14:21:23] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[14:22:25] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours
[14:23:28] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
[14:34:19] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[14:37:19] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[14:37:19] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[14:40:19] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[15:42:06] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours
[16:10:21] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[17:37:26] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[17:37:38] Logged the message, Master
[17:38:05] New patchset: Reedy; "Push rest of closed wikis to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45947
[17:38:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45947
[17:38:56] New patchset: Reedy; "Remove old 1.21wmf5 symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45948
[17:39:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45948
[18:09:41] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours
[18:11:14] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:15:17] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:42:35] hi, anyone with logs access around? I just found out that some "smart" dev uses xmlfm api format (html-wrapped xml) >250/sec
[19:42:48] and that has been going on for a long time
[19:44:43] oh aye?
[19:45:02] Reedy: ?
[19:45:16] do you have logs access?
[19:45:18] Api logging is limited at best
[19:45:21] "logs" is vague
[19:45:39] there are no server logs for api?
[19:46:02] "server logs"
[19:46:29] oh wtf
[19:46:34] Who's made a mess of fluorine?
[19:47:18] * Aaron|home thinks of how to rewrite jobs-loop
[19:47:47] Reedy: what would you recommend to find out who misuses the api? 250/sec is heavy use - xml gets 500/sec. and considering all the regexing required to generate html-formatted xml, it's probably three times as expensive
[19:47:47] reedy@fluorine:/a/mw-log$ ls
[19:47:47] 1264.log fatalapi.log ns.php(3871):.log s.php(3871):.log
[19:47:47] 38808.log fatal.log Object(User),.log sql-bagostuff.log
[19:47:47] #46352.log fatapi.log oks::onGetUserPermissionsErrors(Object(Title),.log ssionsErrorsInternal('read',.log
[19:47:49] 720.log filebackend-ops.log oks.php(255):.log s('userCan',.log
[19:47:50] need to track and kill
[19:47:51] Aaron|home: ^^
[19:47:53] w
[19:47:56] t
[19:47:57] Reedy: that happened once before
[19:48:00] f
[19:48:02] do know who caused it
[19:48:07] *don't
[19:48:14] yurik: Best we log is something like 2013-01-26 19:47:30 mw1117 itwiki: API GET 151.49.78.23 151.49.78.23 T=6ms format=json action=opensearch search=Yorkshire%20( namespace=0 suggest=
[19:48:29] perfect, it's a good start
[19:48:39] Reedy: you don't log user-agents?
[19:49:02] need to grep for API GET without the "format" and without the "open search"
[19:49:03] MatmaRex: Not personally
[19:49:06] Reedy: i shouldn't have bothered implementing this in my lib, lol
[19:49:24] i seriously doubt someone is dumb enough to supply format=xmlfm
[19:49:37] reedy@fluorine:/a/mw-log$ tail -n 10000 api.log | grep -c xmlfm
[19:49:37] 0
[19:49:39] Great for starters
[19:50:07] of course - unfortunately the "smart" devs simply do not add the "format=" to their query
[19:50:32] there is a special header made specifically for them!!! We should translate it into every language
[19:50:43] LeslieCarr: whatever you consider "romantic"...
[19:50:46] MatmaRex: reedy@fluorine:/a/mw-log$ tail -n 10000 api.log | grep -c -v format
[19:50:46] 529
[19:50:51] arfgh
[19:51:03] yurik: 529/10000 don't specify a format
[19:51:10] Reedy: so you know jobs-loop has an obvious starvation condition?
[19:51:19] I think these logs might be sampled too
[19:51:23] * Aaron|home wonders who to whack
[19:51:41] my counts based on https://graphite.wikimedia.org/dashboard/ApiAllFormats -- 3000-4000 -- json, 500 -- xml, 250 - no format
[19:51:41] MatmaRex: http://p.defau.lt/?cpoh3K_eZsKQNocPPDhlgw
[19:52:12] Reedy: kill them all
[19:52:23] All what?
[19:52:29] no formatters :)
[19:52:34] Reedy: ons.php(3871):.log ?
[19:52:40] kidding, but we need to figure out who is the smart individual
[19:52:43] Someone fucked up the config
[19:52:58] could you send me that 500+ ?
[19:53:47] There's wikidata bots using no format parameter
[19:54:06] you kidding?!
[19:54:41] Nope
[19:54:42] 2013-01-26 19:52:47 mw1130 wikidatawiki: API POST Innocent_bot 78.73.94.165 T=205ms action=wbsetlabel id=q1994003 token= bot= language=nb value=Vid%C3%B6%C3%A5sen
[19:55:23] Reedy: i don't think it's them
[19:55:43] I didn't say it was
[19:55:48] It just made me puke
[19:56:16] MatmaRex: I think we log useragents in some of the logs somewhere
[19:57:38] yurik: a large quantity of the formatless queries are opensearch requests
[19:57:45] reedy@fluorine:/a/mw-log$ tail -n 10000 api.log | grep -v -c format
[19:57:46] 403
[19:57:46] reedy@fluorine:/a/mw-log$ tail -n 10000 api.log | grep -v format | grep -c opensearch
[19:57:46] 401
[19:57:56] goddamnit
[19:58:02] granted, not necessarily the same 10,000 log lines, but whatever
[19:58:07] For rough numbers it's fine
[19:58:07] so much hate
[19:58:09] hate hate hate
[19:58:11] yes, those we could skip - opensearch also overrides its formatter
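
[Editor's aside: the tallying done above with grep, as a small sketch that groups api.log lines by action and format in one pass. The field layout ("... API GET <ip> <ip> T=...ms key=value ...") is inferred from the samples pasted above, so treat the parsing as an assumption; lines that omit format= are bucketed under the expensive xmlfm default being complained about here.]

    #!/usr/bin/env python
    # Usage (hypothetical): tail -n 10000 /a/mw-log/api.log | python tally.py
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        # Collect the key=value tokens; extra tokens like T=6ms are harmless.
        params = dict(tok.split('=', 1) for tok in line.split() if '=' in tok)
        action = params.get('action', '(none)')
        fmt = params.get('format', '(default: xmlfm)')
        counts[(action, fmt)] += 1

    for (action, fmt), n in counts.most_common():
        print('%6d  action=%s  format=%s' % (n, action, fmt))
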
[19:58:19] Have you got a knife LeslieCarr?
[19:58:28] i am at home, too many kitchen knives
[19:58:36] we are looking probably for action=query with no params
[19:58:41] no format param
[19:58:48] tempting to get a flight to tampa
[19:59:07] or any other action that does not override formatter setting
[19:59:10] As long as you take a video camera..
[19:59:27] Reedy- It looks like a good fraction of the non-opensearch requests with no format are queries with no parameters at all from 10.64.17.3 and .6. Is that something testing "is the API up?" every few seconds?
[20:00:02] vl1018-eth1.lvs1003.wikimedia.org.
[20:00:05] LVS
[20:00:13] anomie: it wouldn't be testing - 250/sec based on graphite site
[20:00:19] at least i hope it wouldn't be
[20:00:30] Actually, aren't there IP webcams in tampa already?
[20:00:31] yurik- Most of those 250 are probably opensearch
[20:00:40] anomie: can't be
[20:00:41] or does opensearch not log there?
[20:00:44] nevermind
[20:00:49] unless there is a bug in profiling
[20:00:53] New patchset: Reedy; "Revert "Disble QueryPage updates on frwiki like enwiki"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45963
[20:01:54] Reedy- What's LVS?
[20:02:10] that stat really comes from the xmlfm formatter - meaning that formatter is being used for output. Accessing api.php without params would cause that
[20:02:22] but i seriously doubt we have that many devs :)
[20:02:46] anomie: load balancing
[20:03:07] Reedy: there are webcams
[20:03:19] Reedy- Why would load balancing be doing a GET on the API with no parameters?
[20:03:21] however i don't want them having video of the crime
[20:03:41] anomie: health check
[20:03:54] LeslieCarr- That's what I guessed. Thanks.
[20:04:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45963
[20:06:17] New review: Nemo bis; "https://bugzilla.wikimedia.org/show_bug.cgi?id=44348#c6 has some nice numbers from real-life update...." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33713
[20:07:19] LeslieCarr: funny - api.php actually caches the entire output of the help screen, which means most of the heartbeat only checks that the cache is up :)
[20:08:15] !log reedy synchronized wmf-config/InitialiseSettings.php
[20:08:25] Logged the message, Master
[20:08:58] yeah, that is exactly what the heartbeat does
[20:09:05] just checks that the cache is up
[20:09:06] * MatmaRex sometimes gets the impression Reedy is running the servers by himself
[20:09:39] more complex checks don't belong on the load balancing layer, they belong on other monitoring layers
[20:09:53] now, we could use some more complex checks ;)
[20:10:36] unit tests against servers
[20:13:08] * anomie wonders if it would be more efficient for it to query api.php?format=none, if it's only checking for a 200 response
[20:18:22] anomie: not sure what you mean
[20:19:24] yurik- If the heartbeat is just checking whether fetching api.php gives an HTTP 200 response, adding ?format=none would have it not bother formatting and outputting the text
[20:21:16] anomie: true, but we should also make it so that the help message is not returned
[20:57:49] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[21:00:50] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-vodaphone-india.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
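
[Editor's aside: anomie's format=none suggestion above, as a sketch of the probe being proposed - a 200 response still proves the API entry point is up without the server formatting the cached help page. The hostname, path, and standalone-script form are all placeholders; the real LVS health check is not implemented this way.]

    import urllib.request

    def api_is_up(host, timeout=2.0):
        # format=none asks the API to emit nothing, so a 200 means "alive"
        # without rendering the (cached) xmlfm help screen.
        url = 'http://%s/w/api.php?format=none' % host
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.getcode() == 200
        except OSError:  # URLError/HTTPError are OSError subclasses in Python 3
            return False
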
[21:28:03] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[21:28:04] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[21:28:04] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[21:28:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[21:28:04] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:28:04] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[21:28:04] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[21:28:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:29:06] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Sat Jan 26 21:28:50 UTC 2013
[23:23:01] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[23:24:58] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[23:53:01] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours