[00:01:51] RECOVERY - Disk space on ms-be7 is OK: DISK OK [00:02:54] !log maxsem synchronized php-1.23wmf9/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/105108' [00:03:14] Logged the message, Master [00:03:58] I'm done [00:04:27] woohoo [00:21:23] (03CR) 10Tim Starling: [C: 032] Set $wgULSFontRepositoryBasePath to protocol-relative URL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105115 (owner: 10Ori.livneh) [00:21:42] (03Merged) 10jenkins-bot: Set $wgULSFontRepositoryBasePath to protocol-relative URL [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105115 (owner: 10Ori.livneh) [00:23:23] !log tstarling synchronized wmf-config/CommonSettings.php [00:23:53] Logged the message, Master [00:25:05] TimStarling: thanks [00:28:34] ok, well it still aborts [00:31:04] it aborts the connection and then starts a new request for the same URL after 100ms [00:36:12] do you think we could check connection_aborted() during MW request shutdown, and log to a special channel if it is? [00:45:08] TimStarling: maybe, do you have any reason to suspect the problem is prevalent? [00:45:38] I haven't been able to reproduce it, so it could still be a browser bug [00:51:14] ori: what's up? [00:51:57] paravoid: can you be around if I flip the graphite CNAME to tungsten? [00:51:59] also, hi [00:52:13] happy new year [00:52:21] same to you :) [00:52:35] yes, I can be around [00:52:39] is everything ready? [00:53:03] no, not yet. how much longer do you think you'll be around? [00:53:54] half an hour maybe? [00:54:05] but I can stay for more [00:54:12] how can I help? [00:55:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.271 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [00:56:51] RECOVERY - MySQL Slave Running on db68 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:58:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [01:02:34] paravoid: argh, I need to edit all the graph definition files in files/graphite/gdash/dashboards to change all FooMetric -> MediaWiki.FooMetric [01:02:44] since they're namespaced under 'MediaWiki' in the new instance [01:03:04] sed -i ? [01:03:06] :) [01:03:16] (i know, error prone :P) [01:03:17] FooMetric can be anything, sadly [01:03:34] e.g.: :data => 'cactiStyle(substr(EditPage.*.tp90,1,2))' [01:03:39] ('EditPage' in this case) [01:04:11] paravoid: sorry, I forgot about that. Probably best that I just ping you tomorrow. [01:14:08] (03PS4) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [01:18:01] AaronSchulz: since you're setting it to 1 you might as well explicitly check for that value [01:18:55] I did that before and then didn't, meh [01:19:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 8.612 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [01:21:04] (03PS5) 10Aaron Schulz: Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 [01:21:05] meh [01:22:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [01:27:06] (03CR) 10Ori.livneh: [C: 032] Added a separate scap-rebuild-cdbs phase to scap [operations/puppet] - 10https://gerrit.wikimedia.org/r/105110 (owner: 10Aaron Schulz) [01:28:03] ori: https://gerrit.wikimedia.org/r/#/c/105021/ [01:28:06] eeaseh [01:30:15] !log reedy started scap: Fix Disambiguator hewiki magicwords [01:30:19] Reedy: why two separate messages? 
[01:30:47] also, why remove the code to print it to stdout? [01:31:02] Logged the message, Master [01:31:42] does it not print and assign? [01:32:30] "The -v option causes the output to be assigned to the variable var rather than being printed to the standard output." [01:32:31] srsly [01:33:10] it's like printf / sprintf [01:33:32] you could just change one line, the current dologmsg [01:34:39] well, no. print it to a var, echo the var, and interpolate it into the log msg format [01:39:12] (03PS2) 10Reedy: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 [01:42:41] mw103: mwdeploy is not in the sudoers file. This incident will be reported. [01:42:41] mw103: mwdeploy is not in the sudoers file. This incident will be reported. [01:42:41] mw103: Done [01:42:48] There's quite a few of these for different servers... [01:42:58] I'm really on santas naughty list now [01:52:18] (03PS1) 10Aaron Schulz: Do not try to run as mwdeploy in scap-2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105132 [01:52:36] yep. [01:52:53] Didn't even last a fortnight :( [01:53:24] ori: ^ stupid c/p error [01:54:21] Is my "localisation update" not going to have worked then? 
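[Editor's note] The `printf -v` behavior quoted above ("the output to be assigned to the variable var rather than being printed") can be sketched in a few lines of bash. This is a minimal illustration of the pattern being discussed, not the actual dologmsg code; the variable names and the sample elapsed time are invented here.

```shell
#!/bin/bash
# Illustrative only: format a scap duration once with printf -v, then reuse it.
elapsed=1888   # hypothetical elapsed seconds for a scap run
# printf -v assigns the formatted string to the named variable instead of
# printing it -- the printf/sprintf distinction mentioned above.
printf -v duration '%dm %ds' $(( elapsed / 60 )) $(( elapsed % 60 ))
# The one variable can now be echoed to stdout *and* interpolated into the
# IRC log message, so nothing is formatted twice.
echo "scap completed in ${duration}."
echo "!log scap completed in ${duration}"
```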
[01:55:19] it would have aborted out [01:55:21] (03PS3) 10Ori.livneh: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 (owner: 10Reedy) [01:55:29] anything after the failed sudo doesn't happen [01:55:48] (03CR) 10Ori.livneh: [C: 032] Do not try to run as mwdeploy in scap-2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105132 (owner: 10Aaron Schulz) [01:55:58] Right [01:56:00] Sigh [01:57:07] (03PS4) 10Ori.livneh: Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 (owner: 10Reedy) [01:57:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Make logmsgbot report scap length to irc channel, but not log it [operations/puppet] - 10https://gerrit.wikimedia.org/r/105021 (owner: 10Reedy) [01:57:24] !log reedy finished scap: Fix Disambiguator hewiki magicwords [01:57:39] scap completed in 31m 28s. [01:58:02] Logged the message, Master [02:01:50] Hmmm [02:02:03] Localisation update is due to start nowish.. [02:03:09] i ran puppet on tin [02:03:13] anything else i can do to help? [02:03:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.726 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:03:52] I sorta need to run scap again... But it's going to be running at a very similar time to localisation update [02:04:59] so the /upstream dir must have been synced, so I can just rebuild the cdbs [02:05:18] (03CR) 10Faidon Liambotis: "Our nameservers do not provide a recursion service anyway, so yes, these are all noops and this discussion is a orthogonal, indeed." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [02:05:57] paravoid: you aren't awake are you? 
[02:06:02] I am [02:06:11] it's absurdly late there ;) [02:06:28] I'm thinking of ways to move away from dns for ldap/puppet [02:06:34] were you afraid of someone hacking my gerrit account and posting code reviews? :) [02:06:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:06:43] AaronSchulz: WFM [02:07:08] I'm not sure I understand [02:07:12] I think it's done already, just hanging on the last few boxes [02:07:14] one sec [02:07:21] need to describe it :) [02:07:27] okay :) [02:07:38] I was thinking of storing puppet data in DNS, and writing an ENC that reads it from there [02:07:40] searchidx1001...always the last to the party [02:07:54] then writing everything into DNS using designate [02:08:51] Reedy: I didn't do wmf9 though, does that matter? [02:09:01] why? [02:09:20] because I'd like the web interface to stop writing puppet and dns info into ldap [02:09:41] ah [02:09:58] this would make it easier to open api access [02:10:21] no generic key/value openstack service yet I'm assuming :) [02:10:44] well, there is, kind of [02:10:51] it wouldn't really be much of a help there, though [02:11:02] AaronSchulz: Shouldn't really. localisation update will fix that one long before it's used on hewiki [02:11:15] since it really just provides the ability to create them in tenants [02:11:37] this needs to be globally accessible and tenant writable [02:11:59] I could write a nova plugin for puppet to write it into LDAP, too [02:12:12] it doesn't necessarily need to be designate [02:13:26] DNS can be extended to be a key/value for things that don't write a lot and don't need to be queried easily. [02:13:36] not sure if it's an insane idea or not, though :D [02:13:38] switching to designate for DNS sounds kind of obvious [02:13:50] yeah, we're doing that either way [02:13:56] using DNS for puppet classes... 
dunno, I'm not thrilled by the idea :) [02:14:02] * Ryan_Lane nods [02:14:17] how would you do variables? [02:14:24] arbitrary TXT records or something? [02:14:28] yep [02:15:56] extending nova client and server api is likely easy enough too, though [02:16:14] and then what? nova server api writing to ldap? [02:16:17] yep [02:16:44] I want to move away from LDAP for DNS because pdns's implementation sucks [02:17:12] and it's unmaintained [02:17:20] and there's an openstack service for it now [02:17:25] yep [02:17:32] which makes things way easier [02:17:32] that alone sounds a good enough reason to me [02:17:38] sounds like* [02:17:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 7.843 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:17:44] mhoover has a working prototylpe [02:17:55] !log LocalisationUpdate completed (1.23wmf8) at Fri Jan 3 02:17:55 UTC 2014 [02:17:56] do you know what is wrong with ^^^ btw? :) [02:18:05] speaking of labs DNS [02:18:06] has that been flapping? [02:18:09] yes [02:18:10] all day [02:18:14] I haven't investigated [02:18:18] Logged the message, Master [02:18:27] opendj has an issue of some variety [02:18:38] I'm betting something absurd is querying it poorly [02:18:39] and even the recovery is 7.8s, doesn't sound very recovered to me either [02:19:01] indeed [02:19:31] checking opendj on virt1000 [02:19:42] seems fine there [02:20:23] hm. still a problem on virt0 [02:21:23] Reedy / AaronSchulz: I'm about to head out. Is there anything else that I should hang around for? [02:21:23] (03PS1) 10Springle: sanitarium scripts built during Sep13 data leak event, based on .sql files in Asher's old $HOME. [operations/software/redactatron] - 10https://gerrit.wikimedia.org/r/105135 [02:21:29] hm. someone's doing a pretty expensive query really often [02:21:37] not for me, I'm heading out too [02:22:21] I'm hoping to head to bed soon.. 
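[Editor's note] A minimal sketch of the "arbitrary TXT records" idea floated above: key/value pairs that would live in DNS, parsed by an ENC into the YAML puppet expects. Everything here is hypothetical -- the record names, the `puppetclass=`/`puppetvar-` convention, and the heredoc standing in for `dig +short TXT` output are all invented for illustration.

```shell
#!/bin/bash
# Hypothetical ENC sketch: parse TXT-style records into ENC YAML.
# In a real setup $records would come from `dig +short TXT <instance record>`.
records='"puppetclass=role::labs::instance"
"puppetvar-cluster=tools"'
classes=""; params=""
while IFS= read -r rec; do
  rec=${rec//\"/}                 # strip the quotes dig prints around TXT data
  key=${rec%%=*}; val=${rec#*=}
  case "$key" in
    puppetclass)  classes+="  - ${val}"$'\n' ;;
    puppetvar-*)  params+="  ${key#puppetvar-}: ${val}"$'\n' ;;
  esac
done <<< "$records"
printf -- '---\nclasses:\n%sparameters:\n%s' "$classes" "$params"
```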
[02:22:21] (03CR) 10Springle: [C: 032] sanitarium scripts built during Sep13 data leak event, based on .sql files in Asher's old $HOME. [operations/software/redactatron] - 10https://gerrit.wikimedia.org/r/105135 (owner: 10Springle) [02:22:41] Reedy: I can stick around if there's something I could do to assist [02:23:04] we should really have a dedicated ldap server for wikitech/gerrit/other web interfaces [02:23:05] ori: Nothing to do, just been confirmed as fixed :) [02:23:12] wee. [02:23:19] so that labs instances can't affect it [02:23:23] bar some purging it seems [02:23:27] yeah definitely [02:23:39] but the discussion above sounds very appropriate too :) [02:23:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:24:47] between the two, I think I prefer nova writing to LDAP using the puppet/ldap schema, rather than trying to use domain records to encode puppetClass/puppetVar [02:24:58] agreed [02:25:09] either way I need to write a plugin for some service [02:33:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 9.447 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:36:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [02:47:35] !log LocalisationUpdate completed (1.23wmf9) at Fri Jan 3 02:47:34 UTC 2014 [02:47:53] Logged the message, Master [02:52:31] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.117 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [02:53:37] paravoid: ^^ well seems clearing that cron may have fixed the situation [03:03:17] (03PS1) 10Ryan Lane: Specify the command to remove for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105136 [03:08:16] (03CR) 10Ryan Lane: [C: 032] Specify the command to remove for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105136 (owner: 10Ryan Lane) [03:20:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jan 3 03:20:58 UTC 2014 [03:21:20] Logged the message, Master [03:23:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:24:07] !log start schema changes for bug 59236, indexing only, ipblocks ipb_parent_block_id [03:24:54] Logged the message, Master [03:33:40] (03PS1) 10Ryan Lane: Fix misplaced closing brace for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105137 [03:33:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 6.466 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [03:35:28] (03CR) 10Ryan Lane: [C: 032] Fix misplaced closing brace for manage-exports [operations/puppet] - 10https://gerrit.wikimedia.org/r/105137 (owner: 10Ryan Lane) [03:37:41] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:39:41] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 5.797 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [04:24:14] TimStarling: is there any use keeping the postConnectionBackoff stuff around? 
[04:27:34] !log added python-keystone-redis to apt repo [04:27:35] actually if getLagTimes is the only thing used, then that could just go in LB and LM can be deleted [04:27:50] Logged the message, Master [04:28:02] I guess postConnectionBackoff can go [04:28:32] it's a whole separate hierarchy that really only supports mysql [04:28:34] I figured it was better to reduce the max_threads on the server, and let the clients get a connection error and die [04:28:47] well, the idea was that more subclasses would be added to support other DBMSes [04:29:07] and, ideally, some other information system outside of MySQL [04:29:22] that doesn't require you to actually connect to the server to check whether it is overloaded [04:30:14] can't other replication systems have a meaningful getLag() implementation? I'd hope that any replica DB has some way to get the lag somehow. [04:30:39] so the only real use would be something that grabs async stats without a direct DB query [04:31:30] we kind of hack around that with $wgMemc, the add() locks, and using stale data while locked [04:31:40] yes, we have memcached already [04:31:58] ideally you don't want to have to connect to a server in order to decide whether or not to connect to it [04:32:04] (03PS1) 10Ryan Lane: Add redis support to keystone [operations/puppet] - 10https://gerrit.wikimedia.org/r/105139 [04:32:05] it is kind of inefficient [04:32:38] I imagined LoadMonitor being a client for some system which held information on all mysql servers [04:32:50] updated by regular polling [04:36:22] it would be nice to have a non-mysql specific one...actually the mysql one might be as long as getLag is implemented [04:37:28] yeah, once postConnectionBackoff() is removed, it is not really MySQL-specific [04:38:55] of course, the DatabasePostgres class for example, would need config flags for the type of replication (http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling#Introduction) [04:39:10] with mysql, there 
aren't so many choices [04:39:33] TimStarling: so when was that backoff disabled? [04:39:41] I see it commented out with --TS [04:43:59] after 2009 and before 2013 [04:46:31] PROBLEM - MySQL Slave Running on db1026 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: Error Deadlock found when trying to get lock: try restarting transac [04:46:39] pity the history of this repo wasn't imported into git from subversion [04:47:31] RECOVERY - MySQL Slave Running on db1026 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [04:48:39] looks like it was around 2011-06-23, r1905 in the old repo [04:50:29] it was during incident response on that day [04:50:49] https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_18#June_23 [04:51:01] 16:32 RoanKattouw: Site came back up instantly after Tim disabled max_threads [04:51:01] 16:28 logmsgbot: tstarling synchronized php-1.17/wmf-config/db.php 'disabled max threads' [05:00:58] haproxy nodes in front of slaves, perhaps. it can monitor both rep lag and outages [05:08:11] aaron@fluorine:~/mw-log$ grep -P -o 'Error connecting to .+ ' dberror.log | grep -P -o '\d+\.\d+\.\d+\.\d+' | sort | uniq -c [05:08:13] 4312 10.64.0.6 [05:08:14] 8743 10.64.16.23 [05:08:16] 718 10.64.16.29 [05:08:48] springle: I always wonder if there is some way to avoid that error spam (though it only matters to the user if all the slaves give it) [05:09:43] either the weights are off, the max connections on the server are too low, or there are too many CDN misses [05:10:59] db1002, db1034, and db1040 [05:14:44] max_connections is already high. imo it should be lower [05:15:24] well that or more slaves are needed [05:15:35] so 4 possibilities ;) [05:15:37] db1002 and 34 are both s2 and have same hardware/config. probably a spike [05:15:49] yes definitely. have ordered some [05:15:59] these flood of errors happen every day [05:16:23] springle: move some from tampa too? 
>:D [05:16:43] yep :) i have 9 coming from tampa ;) [05:16:44] hmm, that emoticon looks wrong in BZ [05:17:12] dastardly smiley face is just regular angry face [05:17:21] *CZ [05:37:40] query: SELECT /* EditPage::getLastDelete EvilFreD */ [05:37:59] silly that such a simple query falls over...must be those missing log indexes [06:57:42] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [06:58:52] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:21] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:31] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 2217.21 ms [06:59:51] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64254 bytes in 5.294 second response time [06:59:53] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 16%, RTA = 118.64 ms [07:00:21] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64390 bytes in 7.692 second response time [07:00:51] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:51] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64390 bytes in 5.604 second response time [07:04:21] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:03] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2161.02 ms [07:05:03] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2214.52 ms [07:05:11] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 2058.84 ms [07:05:32] RECOVERY - Host amssq48 is UP: PING WARNING - Packet loss = 37%, RTA = 111.89 ms [07:05:32] RECOVERY - Host amssq56 is UP: PING WARNING - 
Packet loss = 37%, RTA = 146.32 ms [07:05:41] RECOVERY - Host amssq62 is UP: PING WARNING - Packet loss = 37%, RTA = 118.10 ms [07:06:12] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64390 bytes in 0.655 second response time [07:06:22] !log Disabled OSPF3 on csw2-knams:xe-1/1/0.0 [07:06:31] PROBLEM - Packetloss_Average on erbium is CRITICAL: CRITICAL: packet_loss_average is 14.6136152381 (gt 8.0) [07:06:40] Logged the message, Master [07:11:01] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.32331555556 (gt 8.0) [07:16:31] RECOVERY - Packetloss_Average on erbium is OK: OK: packet_loss_average is 0.48323969697 [07:17:01] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.43920567568 (gt 8.0) [07:21:01] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.629932833333 [07:21:11] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [07:27:01] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.644574864865 [07:37:03] !log streber from mgmt console reports eth0 link down (hence it appears down to icinga and ganglia) [07:37:20] Logged the message, Master [07:37:52] no carrier... [07:38:07] * apergos looks around for a mark [07:39:26] mark, around? any ideas about streber no carrier? 
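[Editor's note] For context on the "no carrier" report: the kernel exposes the link state that the mgmt console was showing under /sys/class/net, so it can be confirmed from a shell on the host. A sketch that runs anywhere (it uses the loopback interface instead of streber's actual NIC, which is an assumption purely for portability):

```shell
#!/bin/bash
# Check link state from sysfs. "lo" is used here only so the sketch runs
# anywhere; on the affected host this would be eth0.
iface=lo
state=$(cat "/sys/class/net/${iface}/operstate")
echo "${iface} operstate: ${state}"
# On a NIC with no carrier, operstate reads "down"; the carrier flag itself
# lives at /sys/class/net/<iface>/carrier (0 when the link is down).
```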
[07:45:07] nm I see your email nw [08:09:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [09:07:07] goooood morning [09:07:24] heyo [09:10:26] (03CR) 10Hashar: [WIP] Kibana puppet class (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104172 (owner: 10BryanDavis) [09:16:32] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.88 ms [09:18:31] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:22:31] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.90 ms [09:25:11] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:27:31] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [09:28:01] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 264 MB (2% inode=72%): [09:31:01] RECOVERY - Disk space on wtp1023 is OK: DISK OK [09:31:51] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:51] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:33:31] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 26.89 ms [09:33:41] PROBLEM - Disk space on wtp1010 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=72%): [09:36:11] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [09:38:41] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:41] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time [09:49:31] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [09:49:41] RECOVERY - Disk space on wtp1010 is OK: DISK OK [09:50:51] !log restarted parsoid on wtp1010 and 1023, several gigs of logs full of "Error: Can't set headers after they are sent." 
from ServerResponse.OutgoingMessage.setHeader [09:51:09] Logged the message, Master [09:58:23] (03CR) 10saper: "Oh, very imporant reasons:" [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [11:12:13] (03PS1) 10Ori.livneh: gdash: Prefix all metric names with 'MediaWiki.' [operations/puppet] - 10https://gerrit.wikimedia.org/r/105163 [11:16:10] ori: friendly reminder that it is 3am and you should sleep :-D [11:16:52] allllllllmost [11:17:14] ori: and I resist the envy of bike shedding about prefixing metrics with MediaWiki :D [11:17:45] (03PS1) 10Ori.livneh: point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 [11:17:55] (03CR) 10jenkins-bot: [V: 04-1] point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 (owner: 10Ori.livneh) [11:18:36] bah [11:19:00] (03PS2) 10Ori.livneh: point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 [11:49:05] (03CR) 10Aklapper: [C: 031] "LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/103525 (owner: 10Dzahn) [12:14:30] !log Jenkins is showing failures for tests executed on integration-slave01 (remote file system failing) [12:14:46] Logged the message, Master [12:14:56] uhuh, morebots is here :) [12:43:48] !log Jenkins still unable to use integration-slave01 (restarted the node in labs, and disconnected/re-launched slave agent afterwards, too; no effect) [12:44:02] Logged the message, Master [13:30:21] PROBLEM - MySQL Processlist on db1009 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 92 copy to table, 36 statistics [13:34:21] RECOVERY - MySQL Processlist on db1009 is OK: OK 1 unauthenticated, 0 locked, 6 copy to table, 4 statistics [13:35:46] !log killed msnbot spike on s2 [13:36:04] Logged the message, Master [13:44:07] again? [13:45:49] msn o_O [13:56:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:57:01] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [14:08:31] PROBLEM - HTTP on aluminium is CRITICAL: Connection refused [14:11:31] RECOVERY - HTTP on aluminium is OK: HTTP OK: HTTP/1.1 302 Found - 557 bytes in 0.001 second response time [14:14:51] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:51] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [14:18:11] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [14:19:12] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.446 second response time [14:23:14] !log restarting Jenkins , some git plugin are misbehaving [14:23:30] Logged the message, Master [14:30:13] hi akosiaris - per https://wikitech.wikimedia.org/wiki/Server_access_responsibilities#SSH_key_access I need my key revoked. Do you want the fingerprint, the public key, or something else? [14:30:39] (the laptop is with WMF office IT for repair) [14:31:02] (I assume I should contact akosiaris because that's the nick I see for "on RT duty" in the /topic) [14:31:53] I need username and public key (whether the fingerprint or the entire key... does not matter for revocation) [14:32:24] akosiaris -- username: sumanah [14:32:43] after you issue a new one, that one you need to upload somewhere trusted and I will fetch it from there (office wiki for example) [14:32:44] akosiaris: fingerprint: 54:66:e6:49:fd:47:1e:16:19:d8:85:94:cd:61:d3:1c [14:32:47] akosiaris: got it [14:33:05] oh.. and welcome back!! [14:33:08] thank you akosiaris! [14:33:42] I'm sorry for not revoking the key IMMEDIATELY; the laptop was being carried via a trusted person to WMF OIT and so I thought the risk was ok, but then I read the wikitech wiki page and saw that it was not. 
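[Editor's note] The fingerprint exchange above can be reproduced with `ssh-keygen -l`, which prints a public key's fingerprint so both sides can confirm they mean the same key before revoking it. A throwaway-key sketch; the temp path and key comment are invented here:

```shell
#!/bin/bash
# Generate a disposable key and print the fingerprint an admin would compare.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -C 'example-laptop-key' -f "${tmpdir}/id_ed25519"
# -lf prints the fingerprint of the given (public) key file.
fp=$(ssh-keygen -lf "${tmpdir}/id_ed25519.pub")
echo "$fp"
rm -rf "$tmpdir"
```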
[14:35:38] (03PS1) 10Mark Bergsma: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 [14:36:09] (03PS1) 10Alexandros Kosiaris: Revoke sumanah keys after her request. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105186 [14:37:41] (03CR) 10jenkins-bot: [V: 04-1] Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [14:39:23] (03CR) 10jenkins-bot: [V: 04-1] Revoke sumanah keys after her request. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105186 (owner: 10Alexandros Kosiaris) [14:39:34] (03PS2) 10Mark Bergsma: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 [14:39:40] LOST ? [14:39:55] akosiaris: huh? [14:40:08] jenkins is voting LOST ... [14:40:13] oh, jenkins. Weird! [14:40:16] on all jobs... weird... [14:40:22] will force the commit [14:40:24] : !log restarting Jenkins [14:40:34] Just wait a few minutes, CR+2 it now and it'll be handled when it is ready. [14:40:42] i should stop Zuul probably :D [14:40:51] 17 minutes ago... does it take that long ? [14:40:54] (unless it needs to be deployed right now) [14:41:00] Jenkins takes a long time to reboot. [14:41:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revoke sumanah keys after her request. [operations/puppet] - 10https://gerrit.wikimedia.org/r/105186 (owner: 10Alexandros Kosiaris) [14:41:14] (03CR) 10jenkins-bot: [V: 04-1] Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [14:42:46] brainwane: I just merged the chage, key will be revoked from anywhere in the cluster in 30 mins top. Thanks for letting us know. [14:43:53] !log zapped /vol/root export from nas-1001-a [14:44:10] Logged the message, Master [14:44:16] thanks akosiaris - that seems sufficient to me [14:44:16] entirely? [14:45:19] mark: we can change it to another host if someone uses it. 
[14:45:32] occasionally to do netapp config changes [14:45:38] VERY annoying that you almost have to do that over nfs [14:45:46] but yeah, not on streber anymore ;p [14:45:50] I always use the CLI [14:46:00] does it have an editor? [14:46:20] or do you use that cat equivalent ;p [14:46:22] I think it does in 8.something something [14:46:30] but I don't use it [14:46:32] why can't they just ship vi or smt [14:48:16] anyway we can always set the export to another host when we need it [14:48:24] absolutely [14:53:25] !log jenkins restarted [14:53:43] Logged the message, Master [14:57:15] (03PS3) 10Hashar: Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [15:00:48] (03CR) 10Mark Bergsma: [C: 032] Fix indentation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105185 (owner: 10Mark Bergsma) [15:00:51] PROBLEM - MySQL Slave Running on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:41] RECOVERY - MySQL Slave Running on db1017 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [15:20:15] (03PS8) 10Physikerwelt: Add Mathoid module (TeX -> MathML / SVG conversion web service) [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 [15:23:20] (03PS1) 10Hashar: webproxies service entries in wmnet. [operations/dns] - 10https://gerrit.wikimedia.org/r/105189 [15:23:38] (03PS2) 10Hashar: webproxies service entries in wmnet. [operations/dns] - 10https://gerrit.wikimedia.org/r/105189 [15:24:51] (03CR) 10Mark Bergsma: [C: 032] webproxies service entries in wmnet. 
[operations/dns] - 10https://gerrit.wikimedia.org/r/105189 (owner: 10Hashar) [15:37:51] (03PS1) 10Mark Bergsma: Setup RANCID on netmon1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105191 [15:39:03] (03CR) 10Mark Bergsma: [C: 032] Setup RANCID on netmon1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105191 (owner: 10Mark Bergsma) [15:39:49] (03PS1) 10Tim Landscheidt: Tools: Install requested package python-pyexiv2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/105192 [15:47:18] (03PS1) 10Mark Bergsma: Add systemuser creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105193 [15:48:23] (03CR) 10Mark Bergsma: [C: 032] Add systemuser creation [operations/puppet] - 10https://gerrit.wikimedia.org/r/105193 (owner: 10Mark Bergsma) [16:04:47] (03PS1) 10Mark Bergsma: Add /etc/rancid/rancid.conf to config management [operations/puppet] - 10https://gerrit.wikimedia.org/r/105199 [16:14:15] !log reedy synchronized php-1.23wmf9/extensions/ [16:14:32] Logged the message, Master [16:15:22] !log reedy synchronized php-1.23wmf8/extensions/ [16:15:38] Logged the message, Master [16:22:23] (03PS1) 10Mark Bergsma: Add RANCID cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/105203 [16:23:09] (03CR) 10Mark Bergsma: [C: 032] Add /etc/rancid/rancid.conf to config management [operations/puppet] - 10https://gerrit.wikimedia.org/r/105199 (owner: 10Mark Bergsma) [16:23:27] (03CR) 10Mark Bergsma: [C: 032] Add RANCID cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/105203 (owner: 10Mark Bergsma) [16:24:16] (03CR) 10Faidon Liambotis: [C: 04-1] "scap shuffles the list to avoid overloading rsync servers, as far as I can see, so this won't produce the effect you want it to produce. I" [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [16:27:12] (03CR) 10Reedy: "Outage last night; waiting for redeployment of updated l10n cache files via sync-dir to fix the problem. 
EQIAD was last, so had to wait fo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/105006 (owner: 10Reedy) [16:28:14] Reedy: outage report? :) [16:28:33] commons and wikivoyage were broken ;) [16:28:40] and Wikidata [16:28:46] because of wikidata [16:29:11] Due to the somewhat stupid way we handle missing magic words [16:29:37] or, as in this case; we don't [16:29:52] can you send a detailed report via email or even document it on wikitech under the incident section? [16:30:16] Yeah [16:30:25] not too detailed, just what happened/what was the bug, when did it happen, why did it happen, what can we do to fix it :) [16:30:32] I'd had more than enough after the long deploy and unrelated database outage we had afterwards [16:30:52] :) [16:31:05] I understand, it's not urgent, but I was lacking context for your change [16:31:06] it was a fun day [16:31:21] oh the fun continued today, don't you worry [16:31:38] paravoid: it was literally a wtf, why do we do tampa first for anything like that [16:32:05] In theory, reversing the list would've reduced the time to deploy the fix by anything upto 50% [16:32:08] so sync-common-files takes the list as-is [16:32:18] yup [16:32:18] sync-common-file [16:32:21] line by line via dsh [16:32:25] although it is dsh -c [16:32:26] X at a time etc [16:32:29] so in theory, concurrent [16:32:43] scap shuffles, though [16:33:08] yeah because of tims proxy fan out niceness [16:35:35] (03CR) 10Faidon Liambotis: "Are we going to do both kibana.wm.org & logstash.wm.org?" [operations/dns] - 10https://gerrit.wikimedia.org/r/105105 (owner: 10BryanDavis) [16:37:05] paravoid: Would you rather have logstash as the name for the search GUI? I don't care [16:37:24] hi [16:37:27] are we going to do both? [16:37:45] logstash has a web intf too, does it not? [16:37:49] Hi and happy new year :) [16:37:54] I don't think so paravoid [16:37:56] oh yeah :) [16:38:07] Ah. 
It used to but has pretty much been deprecated [16:38:09] it used to at least [16:39:03] It had a really crappy UI and then the ruby rewrite of kibana (v2). Now it seems to be dead tech [16:39:54] So I was just planning on the kibana3 UI with logstash being backend middleware for parsing and shipping logs [16:40:15] shipping? [16:40:21] shipping how? [16:40:26] ocean going liners [16:40:52] Into elasticsearch and any other backends we end up wanting [16:41:48] log source -> logstash -> elasticsearch [16:42:06] log source being? [16:42:18] php app or syslog or ? [16:42:47] okay, "log shipping" confused me :) [16:42:58] lumberjack etc. [16:43:17] Ah. Yes. That is a possible but not probable use case for us [16:43:22] http://cookbook.logstash.net/recipes/log-shippers/ [16:43:51] I think logstash is a little heavy for running on most log generating nodes [16:44:12] yup [16:48:53] <^d> bd808: Heh, http://www.elasticsearch.org/blog/logging-elasticsearch-events-with-logstash-and-elasticsearch/ [16:50:48] (03PS1) 10Mark Bergsma: Fix rancid file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105209 [16:50:49] (03PS1) 10Mark Bergsma: Add RANCID .cloginrc file [operations/puppet] - 10https://gerrit.wikimedia.org/r/105210 [16:51:04] ^d: Feeding back in to the same cluster :) [16:51:39] I've done that before in testing but never considered it in production [16:51:45] <^d> It sounds insane! [16:51:46] <^d> :) [16:52:08] paravoid: just read _security..... [16:52:15] I know [16:52:16] <^d> bd808: Logging elastic (for cirrus) in logstash would be a good idea though, and easy.
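[The shipper role discussed above, logstash being too heavy to run on every log-generating node, hence lumberjack-style shippers, boils down to: follow a file, frame each new line as an event, and forward it to a central logstash input. A minimal illustrative sketch; the server name, port, and event fields are invented, and real shippers such as lumberjack add TLS, acks, and backpressure.]

```python
import json
import socket

def frame_event(line, source_host, log_type):
    """Wrap one raw log line in a minimal logstash-style event envelope."""
    return json.dumps({
        'message': line.rstrip('\n'),
        'host': source_host,
        'type': log_type,
    })

def ship(path, server=('logstash.example.org', 5043), log_type='apache'):
    """Follow a log file and forward each new line to a central collector.

    A real shipper would also handle TLS, reconnects, and acknowledgements.
    """
    sock = socket.create_connection(server)
    with open(path) as f:
        f.seek(0, 2)  # start at end of file, like tail -f
        while True:
            line = f.readline()
            if not line:
                continue  # a real shipper would sleep/poll here
            event = frame_event(line, socket.gethostname(), log_type)
            sock.sendall(event.encode('utf-8') + b'\n')
```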
[16:52:21] just before I left dA logstash was rolled out on all the web servers [16:52:37] with ES and kibana as the frontend (not sure what version) [16:52:59] didn't seem to be an issue to run in production, then again I don't know the performance details [16:53:06] (03CR) 10Mark Bergsma: [C: 032] Fix rancid file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105209 (owner: 10Mark Bergsma) [16:53:27] <^d> gi11es: We're explicitly having different clusters for logstash's elasticsearch and production search's elasticsearch. [16:53:36] <^d> :) [16:53:40] sounds sane [16:54:07] (03CR) 10Mark Bergsma: [C: 032] Add RANCID .cloginrc file [operations/puppet] - 10https://gerrit.wikimedia.org/r/105210 (owner: 10Mark Bergsma) [16:55:21] <^d> bd808: Do we have a preferred host to log to in the logstash setup? [16:55:53] <^d> (or, maybe we could put it behind lvs like we did with elastic10[nn] so clients don't care) [16:56:25] ^d: We haven't got that bit nailed down yet [16:57:02] There is going to be a redis input path that the php code will probably use [16:57:02] <^d> I think we should do lvs like we did on the other one. It makes things wayyyyy easier for the clients (and is pretty trivial to setup) [16:57:18] <^d> Then it won't matter what boxen is moved around or renamed. [16:57:52] That would be good. I'm always a fan of service names vs host names [16:58:04] lvs for? [16:58:08] <^d> logstash [16:58:11] for logging you mean? [16:58:17] <^d> ya [16:58:59] logging behind lvs? is that really necessary? :) [16:59:05] are you going to allow logstash usage from php-land? or will it be for purely traditional syslog entries? [16:59:21] I was hoping all clients will support writing to multiple servers [16:59:34] <^d> paravoid, mark: Maybe not necessary, I just threw the idea out there. 
[16:59:39] <^d> :) [16:59:39] gi11es: We will be logging from php for sure [16:59:45] gi11es: the first deployment is going to be mediawiki actually, syslog will come later :) [17:00:02] and I'm not even sure how are we going to do that yet tbh [17:00:07] awesome. is there a legacy equivalent? or is it the first time you do something like this? [17:00:08] logging is rather basic infrastructure, I'd rather not have that depend on something like lvs if we can avoid it [17:00:20] as logstash doesn't really have proper access controls [17:00:40] <^d> mark: Fair enough :) [17:00:41] and some syslog data are sensitive enough [17:01:05] <^d> Anyway, anything that's java-base we can easily get in elasticsearch. [17:01:16] <^d> log4j should *just work* with socket appender stuff. [17:01:18] gi11es: we have a plain old syslog writing to files now, plus at least one other specialized thing for syslog data [17:02:16] gi11es: We are basically trying to move beyond "ssh to fluorine and use grep" [17:02:24] and syslog was being written to by php? [17:03:13] There is a somewhat elaborate php -> udp -> files path for production application logs [17:05:07] stop me if I ask too many questions :) and are these oldschool application logs searchable/graphed? [17:05:43] currently, no, unless you count grep as "searchable" [17:05:49] :) [17:06:16] There are some "rate of message generation" graphs too but nothing very fancy [17:07:06] (03PS1) 10Mark Bergsma: Fix RANCID directory modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105212 [17:07:17] gi11es: You can play with the proof of concept system in labs: https://logstash.wmflabs.org/ [17:07:39] wow, yeah logstash + kibana is going to be like going from horse carriages to spaceships [17:08:20] that's the hope [17:09:03] whoaa bd808 kibana is like graphs for logstash? 
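[The "somewhat elaborate php -> udp -> files" path described above amounts to: MediaWiki emits one datagram per log line, prefixed with a channel name, and a collector appends each line to that channel's file (the grep-on-fluorine workflow). An illustrative sketch, not the actual udp2log source; the port and log directory are made up.]

```python
import socket

def split_channel(datagram):
    """A datagram is 'channel payload'; return the two halves."""
    channel, _, payload = datagram.partition(' ')
    return channel, payload

def collect(port=8420, log_dir='/tmp/mw-log'):
    """Receive datagrams and append each line to its channel's file."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', port))
    while True:
        data, _ = sock.recvfrom(65535)
        channel, payload = split_channel(data.decode('utf-8', 'replace'))
        with open('%s/%s.log' % (log_dir, channel), 'a') as f:
            f.write(payload + '\n')
```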
[17:09:15] it's a better web interface [17:09:24] initially in php, then ruby, now in-browser javascript [17:09:29] (03CR) 10Mark Bergsma: [C: 032] Fix RANCID directory modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/105212 (owner: 10Mark Bergsma) [17:09:33] developed by someone else than the logstash author [17:09:39] but then both authors were hired by elasticsearch [17:09:46] full circle [17:09:58] something we did that could be useful here is we moved chat logging to logstash [17:10:18] awesoome [17:10:28] there's a bunch of things we can do, but one at a time :) [17:10:35] uhhhh, how much data can we push to this thang? [17:10:43] (webrequest, ahem ahem?) [17:10:44] heheh :p [17:10:59] the possibilities are endless, but let's pace ourselves :) [17:11:01] heheh [17:11:29] we've been talking about this for over a year now [17:12:05] And hopefully we are within a few days of seeing something that works :) [17:28:23] hey opsen, can I turn your attention to https://rt.wikimedia.org/Ticket/Display.html?id=4028 ? :) [17:36:26] heyaa manybubbles, you've got elastic search icinga alerts up, right? [17:36:48] do you have any that come directly from JMX? or are they all just http curls ? [17:48:15] (03PS1) 10Edenhill: json: output NaN as null for '\!num' modifiers [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105216 [17:48:42] (03PS2) 10Edenhill: json: output NaN as null for '\!num' modifiers [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105216 [17:52:04] !log Jenkins downgrading git plugin from 2.0 to 1.5 , we might be hit by https://issues.jenkins-ci.org/browse/JENKINS-21057 [17:52:20] Logged the message, Master [17:54:57] (03CR) 10BryanDavis: "I was planning on only using the kibana interface served via apache. 
Logstash has the ability to serve a web interface but it just serves " [operations/dns] - 10https://gerrit.wikimedia.org/r/105105 (owner: 10BryanDavis) [18:04:09] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CRIT replication delay 308 seconds [18:04:09] PROBLEM - MySQL Slave Delay on db1037 is CRITICAL: CRIT replication delay 308 seconds [18:04:18] PROBLEM - MySQL Replication Heartbeat on db1051 is CRITICAL: CRIT replication delay 312 seconds [18:04:28] PROBLEM - MySQL Slave Delay on db1052 is CRITICAL: CRIT replication delay 330 seconds [18:04:28] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 330 seconds [18:04:38] PROBLEM - MySQL Slave Delay on db1051 is CRITICAL: CRIT replication delay 337 seconds [18:04:38] PROBLEM - MySQL Slave Delay on db1049 is CRITICAL: CRIT replication delay 337 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1052 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Replication Heartbeat on db1037 is CRITICAL: CRIT replication delay 336 seconds [18:04:39] PROBLEM - MySQL Slave Delay on db63 is CRITICAL: CRIT replication delay 338 seconds [18:04:40] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CRIT replication delay 339 seconds [18:04:48] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 342 seconds [18:04:48] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: CRIT replication delay 343 seconds [18:04:58] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 352 seconds [18:04:58] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 358 seconds [18:05:28] RECOVERY - MySQL Slave Delay on db1052 is OK: OK replication delay 140 seconds [18:05:38] RECOVERY - 
MySQL Slave Delay on db1051 is OK: OK replication delay 147 seconds [18:05:38] RECOVERY - MySQL Replication Heartbeat on db1052 is OK: OK replication delay -0 seconds [18:05:38] RECOVERY - MySQL Slave Delay on db63 is OK: OK replication delay 148 seconds [18:05:38] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay 149 seconds [18:05:45] s1 quite unhappy [18:05:55] blugh [18:06:04] I can guess why.. But it shouldn't be [18:06:08] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 72 seconds [18:06:15] Unless someone broke wfWaitForSlaves()... [18:06:18] RECOVERY - MySQL Replication Heartbeat on db1051 is OK: OK replication delay -0 seconds [18:06:23] -0 seconds! result [18:06:33] that's fast [18:06:38] RECOVERY - MySQL Slave Delay on db1049 is OK: OK replication delay 0 seconds [18:06:38] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay -1 seconds [18:06:48] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay -0 seconds [18:06:58] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay -0 seconds [18:07:04] getting reports in -tech [18:07:14] of lag [18:07:25] (03PS1) 10Odder: Add more settings related to page imports on hewikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105221 [18:07:28] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds [18:07:30] yeah, it'll take a little while for it to catch up [18:07:50] s/reports/drive-by report/ [18:07:56] wtfWaitForSlaves()... [18:07:58] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 123 seconds [18:08:38] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay -1 seconds [18:08:38] RECOVERY - MySQL Replication Heartbeat on db1037 is OK: OK replication delay -1 seconds [18:09:08] RECOVERY - MySQL Slave Delay on db1037 is OK: OK replication delay 0 seconds [18:10:24] [17:17:56] Hmm. 
It looks like it'll break DatabaseMysqlBase::masterPosWait() for non-fakeSlaveLag too. Yep, errors when I try to run update.php for wikis in the labs oauth project. [18:11:28] PROBLEM - MySQL Slave Delay on db67 is CRITICAL: CRIT replication delay 360 seconds [18:11:32] But that's master.. [18:11:47] Reedy: The problematic patch I was referring to didn't make it into wmf9 though, I don't think. [18:11:50] !log Jenkins downgrading git plugin client to 1.4.6 and restarting jenkins [18:11:51] sorry [18:11:58] anomie: Right [18:12:02] And enwiki is on wmf8 anyway [18:12:06] Logged the message, Master [18:12:13] Mostly checking as it was similar area [18:13:28] RECOVERY - MySQL Slave Delay on db67 is OK: OK replication delay 0 seconds [18:13:48] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay -1 seconds [18:16:28] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [18:20:28] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [18:21:24] paravoid: graphite dns is https://gerrit.wikimedia.org/r/#/c/105166/ [18:22:30] hmm [18:22:36] on a friday... [18:22:43] I wonder if I should press this button [18:23:40] friday [18:23:41] friday [18:24:17] ori: what about the gdash dashboards that you mentioned? [18:24:38] https://gerrit.wikimedia.org/r/#/c/105163/ [18:25:50] ouch [18:25:56] this must not have been fun to make [18:26:24] i was going to say... that was more manual than i care to admit [18:26:33] so [18:26:37] we didn't swap TTLs to 60s [18:26:52] which means that now we'll be in the awkward situation that traffic will gradually switch over the course of 1 hour [18:27:09] right, i figured just !log the fact that gdash will be wacky for a bit.
[18:27:22] It won't affect that many people [18:27:25] so, let's merge it, but puppetd --disable on professor so that gdash dashboards don't get picked up [18:27:38] gdash.wikimedia.org is already tungsten [18:27:56] oh right [18:28:52] i wonder if i could just move everything in professor under MediaWiki/ too [18:29:03] what do we need gdash.pmtpa.wmnet for? [18:29:27] i don't know, probably nothing [18:29:33] i just updated it since it was there [18:29:45] how about graphite.pmtpa.wmnet? log target? [18:31:05] it is not used in that way currently (thought it'd make sense to). i figured it was just a handy alias if graphite is failing and the on-duty person doesn't happen to remember which host it is on [18:31:14] nah [18:31:16] let's drop both [18:31:42] and let's make the graphite CNAME TTL to 600 for now [18:31:51] and I'll merge now [18:31:55] want to do it or should I? [18:32:39] i can update the patch, sure [18:33:01] is there more to it than just removing the lines? [18:33:08] nope [18:35:56] (03PS3) 10Ori.livneh: Point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 [18:36:20] !log aaron started scap: active Testing timing [18:36:37] Logged the message, Master [18:36:39] templates/apache/sites/graphite.wikimedia.org.erb: ProxyPass http://graphite.pmtpa.wmnet:81/ [18:37:09] yeah, that's all going away [18:37:16] but not for another hour [18:37:19] do you care? [18:37:56] well, i guess it doesn't hurt to change it [18:38:11] * ori updates that, too [18:38:12] do we write to both carbons atm? [18:38:18] (03CR) 10jenkins-bot: [V: 04-1] Point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 (owner: 10Ori.livneh) [18:38:37] jenkins is telling lies [18:41:04] ^d & manybubbles, will Cirrus fix http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx ? 
:P [18:41:18] it obviously doesn't matter much now and I won't hold this back, but what we generally do is set TTL to some low-value first (e.g. 1m), wait until the old TTL expires, swap while keeping the old TTL, wait to see if everything works, bump TTL back to 1h [18:41:42] MaxSem: see #wikimedia-dev [18:42:17] <^d> Search is hard. [18:42:33] !log aaron started scap [18:42:35] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Point graphite CNAME at tungsten in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/105166 (owner: 10Ori.livneh) [18:42:35] paravoid: no, let's do this right [18:42:41] heh [18:42:43] people inventing a dozen of similar looking quotes are harder [18:42:50] Logged the message, Master [18:42:54] and I worked around the graphite.pmtpa.wmnet issue, don't bothe [18:42:55] r [18:43:22] ok, thank you [18:43:26] (03PS1) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105228 [18:44:24] ok [18:44:26] gerrit is broken [18:44:29] yay [18:44:51] oh wait [18:44:59] scratch that, it was pebcak [18:45:11] ori: yeah, MW_SCAP_BETA makes do difference, it still does the same stuff [18:45:19] *no [18:45:38] it is probably not being correctly exported to the subshell in which one of the dependent scripts is executing [18:46:11] even for the local sync-common [18:46:31] maybe it's just old and set in its ways [18:46:53] probably why the --versions thing seemed to not have a effect sometimes [18:47:02] * AaronSchulz gets the python urge again [18:47:12] ori: seems to work here [18:47:17] https://gerrit.wikimedia.org/r/#/c/103107/ has effectively been abandoned by the GCI students, any admins willing to press the button? [18:47:40] * ori hugs paravoid [18:47:54] wee, thanks. 
i thought professor was going to bury me [18:47:58] oh wait [18:48:11] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [18:48:16] "no data" [18:48:49] hmm. [18:48:56] * ori investigates where reqstats.* come from [18:48:59] no MediaWiki. there [18:49:12] well, I haven't merged the gdash change yet [18:49:21] no, but I mean, this isn't mediawiki [18:49:41] it's the squid logs -> carbon [18:49:59] my $carbon_server = "10.0.6.30"; [18:49:59] my $carbon_port = 2003; [18:50:00] yay [18:50:55] oh, christ. let's not wait a week for that. [18:51:37] i'll update it so that new stats go to tungsten [18:51:42] already on it [18:51:44] ori: https://gerrit.wikimedia.org/r/#/c/103107/ abandon please? :-) [18:51:54] can you see if you can copy the data from professor? [18:52:34] yep [18:53:01] also https://gerrit.wikimedia.org/r/#/c/103355/ was dropped by the student :-( [18:53:45] (03PS1) 10Faidon Liambotis: Move sqstat (udp2log to carbon) to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/105230 [18:54:18] (03PS2) 10Faidon Liambotis: gdash: Prefix all metric names with 'MediaWiki.' [operations/puppet] - 10https://gerrit.wikimedia.org/r/105163 (owner: 10Ori.livneh) [18:54:25] (03CR) 10Faidon Liambotis: [C: 032 V: 032] gdash: Prefix all metric names with 'MediaWiki.' 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/105163 (owner: 10Ori.livneh) [18:54:33] (03PS2) 10Faidon Liambotis: Move sqstat (udp2log to carbon) to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/105230 [18:54:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Move sqstat (udp2log to carbon) to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/105230 (owner: 10Faidon Liambotis) [18:54:57] oh wait, this is wrong [18:54:58] damn [18:55:22] ori: doing sudo -u mwdeploy makes them fall off [18:55:35] paravoid: i know, i'm fixing [18:55:54] i didn't know about reqstats [18:57:04] MediaWiki.reqstatsedits.en_wikipedia_org.tp99, [18:57:05] also typo [18:59:34] (03PS1) 10Faidon Liambotis: Fix gdash reqstats dashboards [operations/puppet] - 10https://gerrit.wikimedia.org/r/105232 [18:59:37] ori: ^^ [18:59:48] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:00:40] ori: although ideally we'd put it under cdn.* or something [19:01:03] !log apparently jenkins is back up and happy. Had to revert the git plugin to a previous version ... [19:01:18] Logged the message, Master [19:01:35] oh, you're quick [19:02:17] should had caught it in review, sorry about that [19:02:19] (03CR) 10Ori.livneh: [C: 032] Fix gdash reqstats dashboards [operations/puppet] - 10https://gerrit.wikimedia.org/r/105232 (owner: 10Faidon Liambotis) [19:03:38] ori: did you find a way to copy the data? 
[19:04:54] sorry, my isp had a hickup [19:07:57] (03PS1) 10Faidon Liambotis: gdash: fix owner/group/mode for templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/105237 [19:08:49] (03CR) 10Faidon Liambotis: [C: 032] gdash: fix owner/group/mode for templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/105237 (owner: 10Faidon Liambotis) [19:09:03] ...and now graphite times out [19:09:04] thanks [19:09:24] https://graphite.wikimedia.org/render/?title=10%20Most%20Deviant%20API%20Methods%20by%20Call%20Rate%20log(2)%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(substr(mostDeviant(10,maximumAbove(MediaWiki.API.*.count,1)),0,2)) [19:09:29] for example [19:09:35] yeah, i got a 500 just now [19:09:45] don't fix everything, let me figure it out [19:09:56] lol, ok [19:11:28] (03PS1) 10Aaron Schulz: Fixed scap variable exporting [operations/puppet] - 10https://gerrit.wikimedia.org/r/105238 [19:11:36] (03PS1) 10Jgreen: add root@backup4001's ssh key to fundraising backupmover auth keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/105239 [19:11:56] ori: ^ [19:12:08] we're a little busy with breaking graphite :P [19:13:58] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:16:33] ori: hey [19:16:39] yep? [19:16:45] (03CR) 10Jgreen: [C: 032 V: 031] add root@backup4001's ssh key to fundraising backupmover auth keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/105239 (owner: 10Jgreen) [19:16:54] are you debugging? [19:17:04] yeah, a bit stumped, but give me a moment [19:17:14] it's uwsgi, did you find that? 
[19:18:34] (03CR) 10Ottomata: [C: 032 V: 032] json: output NaN as null for '!num' modifiers [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/105216 (owner: 10Edenhill) [19:18:38] the uwsgi log files report successful reqs [19:18:39] Snaps: thanks [19:18:45] I'll rebuild that on Monday and deploy it then [19:18:53] oh, no it doesn't [19:18:58] nope [19:19:04] and I also got the error myself [19:19:09]
uWSGI Error: Python application not found. Connection closed by foreign host. [19:19:31] (it's a 500, varnish thinks it's a transient error and it retries indefinitely, hence the timeout) [19:21:57] it's the graphite search index I think [19:22:08] it's _graphite:_graphite [19:22:16] and the app runs as www-data, so it can't write to it [19:22:35] ori: ^ [19:22:53] you just chowned it [19:23:00] I'll take that as a "yes" :) [19:23:18] hey [19:23:21] yeah, i just figured that out [19:23:49] that needs to be fixed more permanently, i'll take care of that [19:23:50] did you see the comment in the /usr/share/graphite-web/graphite.wsgi source? [19:24:32] sigh, no. that is a little obscure. [19:26:39] professor has 16 x 70G disks [19:26:42] lol [19:27:15] 73G 15k RPM SAS [19:27:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:28:11] i'm confused. is it just choking on the requests at the moment? [19:28:22] because memcached is empty or something? [19:28:59] Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util [19:29:02] sda 0.00 2.80 1.20 3550.00 7.20 14211.20 8.01 143.65 40.39 58.67 40.39 0.28 100.00 [19:29:05] 100% i/o [19:29:09] while rendering [19:29:46] and, presumably, flushing data to disk [19:29:56] I've seen professor go 100% i/o too at times [19:30:53] ganglia says tungsten is better iowait-wise [19:30:55] but why does reloading a graph after it has been generated make it hang? it should be caching, no?
[19:32:02] I don't know how graphite works tbh [19:32:04] 'tcpdump tcp port 11211' doesn't show any traffic, so i must have misconfigured it somewhere [19:32:43] (on an unrelated note, pypy with the version in precise doesn't seem like a great idea, it's really old and Debian explicitly didn't want to release with that version) [19:33:01] it won't be needed after today [19:33:13] !log Some migration pains while moving graphite from professor to tungsten; expect graphite & gdash flakiness [19:33:28] Logged the message, Master [19:34:12] another system with lots of files! [19:38:00] are you sure graphite uses memcache for graphs? [19:38:42] If set, enables the caching of calculated targets (including applied functions) and rendered images. [19:38:42] MEMCACHE_HOSTS: "If set, enables the caching of calculated targets (including applied functions) and rendered images. " [19:38:45] heh [19:38:46] yeah :) [19:39:45] found it [19:39:57] if MEMCACHE_HOSTS: CACHE_BACKEND = 'memcached://' + ';'.join(MEMCACHE_HOSTS) + ('/?timeout=%d' % DEFAULT_CACHE_DURATION) [19:40:01] that's django 1.2 [19:40:03] we run django 1.3 [19:40:10] which has moved to a CACHES dictionary [19:40:32] hm, django might have a backwards compatibility stanza though [19:41:17] if not settings.CACHES: legacy_backend = getattr(settings, 'CACHE_BACKEND', None) if legacy_backend: [19:41:32] how does graphite have 500G of data [19:41:34] heh [19:42:09] paravoid: tail -f /var/log/graphite-web/cache.log [19:43:17] domas: logging time series data forever for lots of things [19:43:30] with no rrd-like aggregation [19:43:54] heh [19:45:58] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [19:46:14] ori: found it [19:46:18] oh wait no [19:46:21] dammit :) [19:46:32] W11: Warning: File "render/datalib.py" has changed since editing started [19:46:52] i made the log output the port it's trying to connect on [19:47:57] for later: 'service uwsgi stop' 
leaves uwsgi-core instances running [19:49:19] is it supposed to be trying to hit the pickle_receiver_port? [19:49:55] netstat -nap |grep 2204 shows multiple TIME_WAIT, plus one established [19:51:52] wait [19:51:56] so it pops from connectionPool [19:52:45] if it errors out, it nevers puts the connection back again, which makes sense [19:55:34] File "/usr/lib/python2.7/dist-packages/graphite/render/datalib.py", line 191, in recv_exactly [19:55:37] data = conn.recv( num_bytes - len(buf) ) [19:55:40] timeout: timed out [19:55:42] there you go [19:55:55] that's why connections keep getting added to the pool [19:56:34] Fri Jan 03 19:47:06 2014 :: Retrieval of cactiStyle(substr(mostDeviant(10,maximumAbove(MediaWiki.API.*.count,1)),0,2)) took 219.277028 [19:56:37] Fri Jan 03 19:47:07 2014 :: Rendered PNG in 0.453549 seconds [19:56:40] Fri Jan 03 19:47:07 2014 :: Total rendering time 219.770963 seconds [19:56:43] 220 seconds [20:02:29] paravoid: templates/varnish/graphite.inc.vcl.erb sets explicit TTLs [20:02:57] ok? [20:03:03] paravoid: that's not the config that is active on tungsten; tungsten's is in templates/varnish/misc.inc.vcl.erb [20:03:09] I know [20:03:31] 220s to render a graph doesn't sound great in any case [20:03:35] do you suppose that would help? i guess not, yeah. [20:03:52] and there's no traffic on the memcached port on localhost [20:03:59] hm, now there is [20:04:05] but very minimal, still [20:06:49] the graphs are empty now, are you aware of this? 
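[The connectionPool behaviour traced above, connections are popped for a query and only returned to the pool on success, so a timed-out socket is discarded rather than reused, can be sketched as follows. This is a toy model, not the real graphite-web CarbonLinkPool, which additionally hashes each metric to a (host, port, instance) tuple.]

```python
import collections
import socket

class CarbonLinkPool(object):
    """Toy model of the pool discipline discussed above: a connection
    goes back into the pool only after a successful query; a socket
    that errored or timed out is closed and dropped instead of reused."""

    def __init__(self, host, port, timeout=0.5):
        self.host, self.port, self.timeout = host, port, timeout
        self.connections = collections.deque()

    def get_connection(self):
        # Reuse a pooled connection if one exists, else dial a new one.
        if self.connections:
            return self.connections.pop()
        return socket.create_connection((self.host, self.port), self.timeout)

    def query(self, request):
        conn = self.get_connection()
        try:
            conn.sendall(request)
            reply = conn.recv(65536)
        except socket.error:
            conn.close()               # broken socket: drop it, don't pool it
            raise
        self.connections.append(conn)  # success: return it to the pool
        return reply
```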
[20:06:59] ah that's the reqstats ones [20:10:06] let's increase CARBONLINK_TIMEOUT for starters [20:10:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [20:10:20] now it's 1s [20:10:33] so that times out, the cache lookup times out, and it's rendered every time [20:12:40] it times out even with 20 [20:12:56] so the cache isn't very much of a cache probably [20:13:18] something is badly misconfigured somewhere [20:13:22] performance should not be this bad [20:13:47] i want to investigate it more before modifying the configuration for longer timeouts [20:14:06] uhm [20:14:14] I think I found it... [20:14:28] i have both face and palm ready [20:14:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [20:15:47] fixed [20:16:34] ...... [20:17:49] (03PS1) 10Faidon Liambotis: graphite: use the correct port for carbonlink [operations/puppet] - 10https://gerrit.wikimedia.org/r/105250 [20:18:08] hm, maybe that's not right [20:18:24] well, the port is the right thing to do, but I'm looking if that hash is used elsewhere [20:19:13] yes it is, in carbon.conf [20:20:47] (03CR) 10Faidon Liambotis: [C: 04-1] "This is wrong as it changes carbon's [relay] as well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/105250 (owner: 10Faidon Liambotis) [20:22:50] ori: 1) the above, correct port for carbonlink, 2) copying reqstats data from professor, 3) using apache's mod_expires to set some more sensible TTLs (the 120/600s seemed ok to me) [20:23:16] no, wait [20:23:23] paravoid: [20:23:35] i moved reqstats/ backed to archived/, just move it one dir up [20:23:38] it is freshly rsync'ed [20:23:55] and let's try it out with the correct config before changing apache [20:23:56] oh and 4) http://graphite.wikimedia.org/ fails right away [20:24:05] why archived? 
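[The Django 1.2 to 1.3 cache-setting shim quoted earlier, side by side. The memcached host and timeout below are placeholders: graphite-web (written against Django 1.2) builds the single-string CACHE_BACKEND form, and Django 1.3's compatibility code derives the newer CACHES dictionary from it only when CACHES itself is unset.]

```python
# Placeholder values; graphite-web's settings module builds the legacy form.
MEMCACHE_HOSTS = ['127.0.0.1:11211']
DEFAULT_CACHE_DURATION = 60  # seconds

# Django 1.2 style, as constructed by graphite-web:
CACHE_BACKEND = 'memcached://' + ';'.join(MEMCACHE_HOSTS) + (
    '/?timeout=%d' % DEFAULT_CACHE_DURATION)

# The Django >= 1.3 equivalent that the compatibility shim produces:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': MEMCACHE_HOSTS,
        'TIMEOUT': DEFAULT_CACHE_DURATION,
    }
}
```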
[20:24:11] the hierarchy is the same, we didn't change it [20:24:33] i just moved it out of the way at one point because i suspected there was something about it that was causing this [20:24:42] the mismatch between the declared aggregation configs and the files themselves [20:24:44] and I did change CARBONLINK_HOSTS by hand and it just works now [20:24:54] (puppet is disabled for now) [20:26:25] do you grok the 'tip' box in https://graphite.readthedocs.org/en/latest/config-carbon.html ? [20:26:28] i'm very confused by it [20:28:46] I think it means that if you had a cache listening on 2003 & 2004 [20:29:05] you can move 2003 & 2004 to the relay, create some new port numbers for your cache and then set them to the relay's destination [20:29:24] it's poorly phrased in any case [20:29:34] which is how this is configured [20:29:37] correct [20:29:44] graphite-web was misconfigured [20:29:55] to talk with the pickle port, instead of the cache query port [20:30:13] carbon is just fine I think [20:30:37] (my puppet commit above is wrong, as it changes both, hence my -1) [20:30:54] (03PS2) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105228 [20:32:41] (03PS3) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105228 [20:34:34] paravoid: yes, you are right; http://rcrowley.org/articles/federated-graphite.html confirms too [20:34:45] i'll change local_settings.py.erb to grab the correct port number [20:37:39] it's called "cache query" for a reason I thought [20:42:21] don't be mean; i was mislead by advice that the two settings should exactly match [20:42:36] oh sorry, I didn't intend to be mean [20:42:44] I'm explaining my thought process [20:42:58] that it was more intuition because of that, rather than something else [20:43:41] well, also 11:49 is it supposed to be trying to hit the pickle_receiver_port? 
[20:43:46] but i didn't follow it [20:43:48] hehe [20:43:52] happens! [20:46:46] i need a couple more minutes for the erb, thanks for the patience [20:48:45] (PS1) Faidon Liambotis: graphite: use both virt0/virt1000 for AuthLDAPURL [operations/puppet] - https://gerrit.wikimedia.org/r/105381 [20:49:04] (CR) Faidon Liambotis: [C: 2] graphite: use both virt0/virt1000 for AuthLDAPURL [operations/puppet] - https://gerrit.wikimedia.org/r/105381 (owner: Faidon Liambotis) [20:49:42] (CR) Faidon Liambotis: [V: 2] graphite: use both virt0/virt1000 for AuthLDAPURL [operations/puppet] - https://gerrit.wikimedia.org/r/105381 (owner: Faidon Liambotis) [20:51:47] ori: another very tiny optimization that you might consider making is switching to pylibmc [20:52:14] i.e. apt-get install python-pylibmc + switch from MEMCACHE_HOSTS to Django 1.3's CACHES with the pylibmc backend [20:53:02] probably isn't worth it considering the memcache traffic I'm seeing atm [20:57:01] (PS2) Ori.livneh: graphite-web: use the correct port for carbonlink [operations/puppet] - https://gerrit.wikimedia.org/r/105250 (owner: Faidon Liambotis) [20:58:13] i don't have a VM provisioned for testing that at the moment [20:58:59] but I think it's correct. the DESTINATIONS setting was right; it's only graphite-web's CARBONLINK_HOSTS that needs to be changed [20:59:17] yes [20:59:21] that's my understanding as well [20:59:27] and it should match the relay's DESTINATIONS in terms of the order of carbon cache instances [21:00:18] it's a bit unfortunate that we infer CARBONLINK_HOSTS from DESTINATIONS, but we don't infer DESTINATIONS from the set of [cache:*] declarations [21:00:42] it might call for a graphite::carbon_cache puppet resource [21:00:42] I... was about to say this..
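The pylibmc suggestion would look something like this in graphite-web's local_settings.py — a hypothetical fragment assuming Django >= 1.3, a local memcached, and the python-pylibmc package; none of this is the actual config:

```python
# local_settings.py (hypothetical sketch)
# Replaces the legacy MEMCACHE_HOSTS list with Django's CACHES setting,
# using the libmemcached-based pylibmc backend instead of python-memcache.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': '127.0.0.1:11211',
        'TIMEOUT': 60,  # seconds to keep rendered data in memcached
    }
}
```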
[21:01:05] or it might call for going in the other direction and making things repetitive but non-magical [21:01:19] i prefer the former, as you might have guessed :P [21:01:52] because my taste for erb magic has never, ever bitten me [21:02:33] anyways, I think that change is fine [21:02:36] ok if I merge? [21:03:04] yes [21:03:20] (PS1) Faidon Liambotis: Add carbon-relay & statsd service aliases [operations/dns] - https://gerrit.wikimedia.org/r/105383 [21:03:28] ^^^ what do you think of that? [21:03:44] and before you ask me why it's carbon-relay and not carbon, we have a server named carbon... [21:04:14] I suppose restoring "graphite" would be equally good, although technically the protocol is carbon, not graphite [21:04:20] but I won't bikeshed, whatever you want :) [21:04:36] but I want to git grep tungsten in puppet & mediawiki-config and use service hostnames instead [21:05:01] the problem with having a statsd service alias is that statsd instances have a one-size-fits-all configuration for how data should be aggregated, which summary metrics should be computed, and which backends they should be routed to [21:05:29] what do you mean? [21:05:29] i worked around it by exploiting the fact that the configuration file is loaded and evaluated to hack in a plugin-like system into the puppet module [21:05:46] I want to use it in all the places we've hardcoded tungsten [21:05:46] but it's still the case that we probably want multiple instances running [21:06:06] i think we need more than one statsd instance on the cluster is what i'm saying [21:06:08] ah, I see [21:06:31] for example: https://github.com/etsy/statsd/blob/master/exampleConfig.js [21:06:39] the 'histogram' config [21:06:52] ugh [21:07:34] the bins for memcached latency are not going to be the same for page load timing [21:07:35] etc.
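For reference, the 'histogram' setting in etsy's statsd selects bins by a match on the metric name, but the whole table lives in the one per-instance config file — which is the one-size-fits-all problem being described. An illustrative fragment in the style of exampleConfig.js (the bin values are invented):

```js
{
  histogram: [
    // first match on the metric name wins
    { metric: 'memcached', bins: [ 1, 5, 10, 50, 'inf' ] },         // ms
    { metric: 'pageload',  bins: [ 100, 500, 1000, 5000, 'inf' ] }, // ms
    { metric: '', bins: [] }  // '' matches everything else: no histogram
  ]
}
```

Two services with different latency profiles that report to the same statsd instance are stuck sharing this one table, which is the argument for running an instance per service.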
[21:09:20] i would like to make it easier to specify aggregation policies by metric regex, like graphite, and submit that upstream, but that'll be a while [21:09:32] is jenkins dead? https://gerrit.wikimedia.org/r/#/c/105250/ [21:09:38] (CR) Ori.livneh: [C: 2 V: 2] graphite-web: use the correct port for carbonlink [operations/puppet] - https://gerrit.wikimedia.org/r/105250 (owner: Faidon Liambotis) [21:11:21] anyways, let me publicly embarrass you for a moment and acknowledge that you've been awesome & extremely helpful with the graphite work in general and the past hour in particular [21:12:09] shush [21:12:31] your change is broken [21:12:34] :P [21:13:09] no :a :b :c :d [21:13:10] really? it seems to have made the right change [21:13:17] ugh [21:13:24] so that's another reason it was broken before, too [21:13:25] -CARBONLINK_HOSTS = ["127.0.0.1:7102:a", "127.0.0.1:7202:b", "127.0.0.1:7302:c", "127.0.0.1:7402:d"] [21:13:26] +CARBONLINK_HOSTS = ["127.0.0.1:2104:a", "127.0.0.1:2204:b", "127.0.0.1:2304:c", "127.0.0.1:2404:d"] [21:13:34] -CARBONLINK_HOSTS = ["127.0.0.1:2104:a", "127.0.0.1:2204:b", "127.0.0.1:2304:c", "127.0.0.1:2404:d"] [21:13:37] +CARBONLINK_HOSTS = ["127.0.0.1:7102", "127.0.0.1:7202", "127.0.0.1:7302", "127.0.0.1:7402"] [21:13:40] is what I got [21:13:44] you probably ran puppet before it was merged [21:13:54] before you ran puppet-merge I suppose, because I did that [21:14:00] oh, right, ugh [21:14:11] well, it's your change, so your mistake, as far as the historical record is concerned [21:14:17] right! [21:14:17] i'll now swoop in for the fix [21:16:26] so, shall we put reqstats under CDN.* or something? [21:16:55] why CDN?
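The reason the dropped ':a'..':d' suffixes matter: both carbon-relay and graphite-web place a metric on a cache by consistent-hashing the (host, instance) pair, ignoring the port, so the instance names in CARBONLINK_HOSTS must agree with the relay's DESTINATIONS even though the ports differ. A toy Python illustration — this is not Graphite's actual ConsistentHashRing, just the shape of the idea:

```python
import hashlib

def pick_node(metric, nodes):
    """Toy stand-in for a consistent-hash ring: choose a node for a metric
    from (host, port, instance) tuples, hashing only host and instance."""
    def score(node):
        host, _port, instance = node
        return hashlib.md5(f"{host}:{instance}:{metric}".encode()).hexdigest()
    return min(nodes, key=score)

# The relay sees pickle ports, graphite-web sees cache query ports,
# but with matching instance names they agree on which cache holds a metric:
relay_view = [("127.0.0.1", 2104, "a"), ("127.0.0.1", 2204, "b")]
web_view   = [("127.0.0.1", 7102, "a"), ("127.0.0.1", 7202, "b")]
assert pick_node("reqstats.5xx", relay_view)[2] == pick_node("reqstats.5xx", web_view)[2]
```

In the real ring, if graphite-web's list drops or renames the instances, it hashes metrics differently from the relay and can query a cache that never received the datapoints.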
[21:17:06] because gdash says "(cdn)" before the graphs [21:17:07] (PS1) Ori.livneh: graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 [21:17:19] and I didn't feel like inventing a name [21:18:00] but I'm okay with anything else, I'm just saying that now that it's all neat with the mediawiki hierarchy, maybe it makes sense to not put reqstats in root [21:18:05] under root I mean [21:18:52] wouldn't that make it "cache:a" instead of "a"? [21:19:01] oh wait, no [21:19:22] (PS5) BryanDavis: Kibana puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/104172 [21:19:50] (CR) Faidon Liambotis: "Can we name it "instance" instead of "name"? (and probably "server" instead of "host", too, to match what the Graphite source uses)." [operations/puppet] - https://gerrit.wikimedia.org/r/105385 (owner: Ori.livneh) [21:19:59] I think it's odd to put reqstats in root, yeah. I don't like 'CDN', though I don't have a better idea. Can we defer that decision? [21:20:15] sure, I don't mind [21:20:21] as long as these graphs start working again, though :) [21:20:30] so it's going to have to be root for now [21:20:37] (they're very useful to me) [21:21:29] (PS2) Ori.livneh: graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 [21:22:00] (CR) Faidon Liambotis: [C: 2] graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 (owner: Ori.livneh) [21:22:07] i'll run puppet [21:22:13] (CR) Faidon Liambotis: [V: 2] graphite-web: Specify instance names in CARBONLINK_HOSTS [operations/puppet] - https://gerrit.wikimedia.org/r/105385 (owner: Ori.livneh) [21:22:24] ok [21:27:27] well, things look a lot better [21:27:35] (CR) BryanDavis: "Antoine created https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/kibana.
Someone with Push rights in that project still" [operations/puppet] - https://gerrit.wikimedia.org/r/104172 (owner: BryanDavis) [21:28:05] reqstats will take a bit to look right [21:28:16] ? [21:28:34] I ran puppet on emery but haven't restarted udp2log yet, so I don't think the change has taken effect [21:28:43] ah, i see [21:28:44] but I'd like to restore professor's data first [21:28:52] i did [21:28:54] oh [21:29:04] oh, hm [21:29:21] it still says nan? [21:29:36] not for everything, the 1wk have data [21:29:43] i must not have rsynced properly [21:29:55] re-syncing [21:31:07] ok, i have to run to the office. i'm still at home and i'm interviewing someone in half an hour. [21:31:15] eek [21:31:17] rsync is running. paravoid, ok if i take off from your perspective? [21:31:25] ok [21:31:57] thanks again (x100) for the help [22:09:59] ori: I'm restarting rsync, it was wrong [22:11:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:17:01] (PS1) Faidon Liambotis: gdash dashboards: s/white/black/ [operations/puppet] - https://gerrit.wikimedia.org/r/105392 [22:17:16] (CR) Faidon Liambotis: [C: 2] gdash dashboards: s/white/black/ [operations/puppet] - https://gerrit.wikimedia.org/r/105392 (owner: 10Faidon Liambotis) [22:17:28] (CR) Faidon Liambotis: [V: 2] gdash dashboards: s/white/black/ [operations/puppet] - https://gerrit.wikimedia.org/r/105392 (owner: Faidon Liambotis) [22:42:22] ah, the colour wasn't intentional :) [23:28:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [23:30:30] I thought that was fixed? ^^