[00:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:03:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [00:03:56] greg-g, chrismcmahon: OK, now I'm stumped. The files are there, but it still errors saying they don't exist. [00:04:07] #1. The files do exist [00:04:16] #2. It shouldn't be requesting them to start with [00:04:59] bizarre. [00:05:35] WTF [00:05:52] thanks mutante [00:06:52] New patchset: MaxSem; "Rm old debug group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55228 [00:07:38] maybe I need to sync-dir on it! [00:09:02] that should fix it [00:09:12] !log kaldari synchronized php-1.21wmf12/extensions/Thanks 'syncing Thanks ext' [00:09:18] Logged the message, Master [00:10:24] kaldari: No that's not it, I found it [00:10:30] That error is a literal string in a generated file [00:10:46] csteipp: as of 2012-07-20: db48 is the otrs master, db49 and db1048 slave from it. db1048 is the gerrit master, db1046 and db48 slave from it. [00:10:48] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [00:11:56] New review: Anomie; "> I have not used the multiversion shell script cause I am" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55058 [00:12:08] jeremyb_, are you sure OTRS shares with gerrit? [00:12:22] "Gerrit is installed on manganese in the prefix /var/lib/gerrit" [00:12:29] OTRS is on williams by itself I think? [00:12:32] Thehelpfulone: your point? [00:12:33] right [00:12:43] but neither one uses a DB on localhost [00:13:15] Chris' comment on Bugzilla was "It looks like there are other critical [00:13:15] services on that server/db, so the impact would be significant if an attack was [00:13:15] successful." - so I was looking for what else ran on williams [00:13:22] New review: Anomie; "Still, why not use the getRealmSpecificFilename function to resolve the dblist filename?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [00:13:48] RECOVERY - Puppet freshness on professor is OK: puppet ran at Fri Mar 22 00:13:45 UTC 2013 [00:14:00] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Fri Mar 22 00:13:50 UTC 2013 [00:14:11] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Fri Mar 22 00:14:05 UTC 2013 [00:14:50] RECOVERY - Puppet freshness on cp1026 is OK: puppet ran at Fri Mar 22 00:14:43 UTC 2013 [00:14:59] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Fri Mar 22 00:14:52 UTC 2013 [00:15:30] !log aaron synchronized php-1.21wmf12/includes/Title.php 'deployed cef327e945e2593fe0291880a7d1976cc5a2f248' [00:15:38] Logged the message, Master [00:15:38] RECOVERY - Puppet freshness on db66 is OK: puppet ran at Fri Mar 22 00:15:32 UTC 2013 [00:16:40] !log catrope synchronized wmf-config/ExtensionMessages-1.21wmf12.php 'Remove PHP errors' [00:16:48] Logged the message, Master [00:17:08] RECOVERY - Puppet freshness on colby is OK: puppet ran at Fri Mar 22 00:17:04 UTC 2013 [00:18:21] RoanKattouw: is it likely that we did something wrong during the deployment that caused this? [00:18:34] Well [00:18:38] Kind of [00:18:51] Are you behind the deployment of the Thanks extension? 
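The fix RoanKattouw is pointing at is the ExtensionMessages regeneration step from the deployment instructions, which has to be rerun (and committed) whenever a new extension such as Thanks is added to extension-list. A rough sketch of that step, assuming the mwscript wrapper and the mergeMessageFileList.php maintenance script; the flag names and paths here are from memory and not confirmed:

    # regenerate the per-branch message file list after adding the extension
    # to wmf-config/extension-list (paths are illustrative)
    mwscript mergeMessageFileList.php --wiki=aawiki \
        --list-file=/home/wikipedia/common/wmf-config/extension-list \
        --output=/home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf12.php

    # sanity-check the generated file before committing and pushing it out
    php -l /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf12.php
    sync-file wmf-config/ExtensionMessages-1.21wmf12.php 'regenerate for Thanks'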
[00:18:55] yep [00:19:02] Yeah then you did something wrong [00:19:20] I'm pretty sure the instructions tell you to regenerate ExtensionMessages-1.21wmfNN.php and check the results [00:19:34] And the results looked something like: [00:20:03] http://pastebin.com/NHUXmPbg [00:21:33] damn that paste link is slow [00:22:02] New patchset: Catrope; "Add Thanks to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55231 [00:22:15] kaldari: Also, that ---^^ [00:22:29] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55231 [00:22:55] Alright [00:23:08] Benny said he had done that [00:23:09] AaronSchulz: I'm being dragged out of here but I think things are reasonably stable now [00:23:15] He had [00:23:19] But it wasn't committed [00:23:22] ah [00:23:46] ok [00:24:03] OK, I'm out of here. If more things fall apart (given how today has gone I wouldn't be surprised), text me, phone number is on officewiki [00:24:09] RoanKattouw: Thanks for your help! [00:30:11] kaldari: I'm done too, but I'll look in on test2 early tomorrow. (and thanks for fixing PageTriage, that was an unexpected breakage) [00:30:33] thank you! [00:54:47] !log olivneh synchronized php-1.21wmf11/extensions/NavigationTiming 'NavigationTiming: log mobile mode' [00:54:55] Logged the message, Master [01:05:40] New patchset: Ori.livneh; "Correct entry-point in E3Experiments extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55232 [01:08:44] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55232 [01:11:03] !log olivneh synchronized wmf-config/CommonSettings.php 'Temporarily accepting either E3Experiments.php or Experiments.php as extension entry point to facilitate clean-up' [01:11:10] Logged the message, Master [01:17:54] !log olivneh synchronized php-1.21wmf11/extensions/E3Experiments 'Updating E3Experiment extensions's entry point' [01:18:02] Logged the message, Master [01:18:08] !log olivneh synchronized php-1.21wmf12/extensions/E3Experiments 'Updating E3Experiment extensions's entry point' [01:18:14] Logged the message, Master [01:19:42] New patchset: Ori.livneh; "Remove temporary workaround for E3 entry-point migration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55235 [01:20:40] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55235 [01:23:29] !log olivneh synchronized wmf-config/CommonSettings.php 'Remove temporary workaround for E3 entry-point migration (1/2)' [01:23:36] Logged the message, Master [01:24:14] !log olivneh synchronized wmf-config/extension-list 'Remove temporary workaround for E3 entry-point migration (2/2)' [01:24:20] Logged the message, Master [01:28:05] !log Finished correcting E3Experiments extension's entry point from 'Experiments.php' (which does not match the submodule name) to 'E3Experiments.php'. Updated extension-list, CommonSettings.php, and the extension itself in both wmf11 and 12. [01:28:11] Logged the message, Master [01:31:46] New patchset: Yurik; "Mobile dflt redir to RU for Vimpelcom Beeline (m & zero)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55236 [01:47:30] New review: Ori.livneh; "I don't think it would make sense (for me, anyway) to have this be a role that only specific machine..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [02:04:18] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [02:30:32] !log LocalisationUpdate completed (1.21wmf11) at Fri Mar 22 02:30:31 UTC 2013 [02:30:39] Logged the message, Master [02:32:38] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 185 seconds [02:33:22] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 230 seconds [02:37:40] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1043, oops' [02:37:48] Logged the message, Master [02:39:12] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:42:33] !log LocalisationUpdate completed (1.21wmf12) at Fri Mar 22 02:42:32 UTC 2013 [02:42:40] Logged the message, Master [02:45:45] PROBLEM - MySQL disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:45] PROBLEM - MySQL Recent Restart on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:46] PROBLEM - Full LVS Snapshot on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:02] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:02] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:36] PROBLEM - SSH on db1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:36] PROBLEM - MySQL Slave Running on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:36] PROBLEM - MySQL Idle Transactions on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:36] PROBLEM - mysqld processes on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
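For context on the next few entries: when a slave such as db1050 crashes, the usual first response is to pull it out of the MediaWiki load-balancer config and push that change out, which is what asher's !log lines below record. A minimal sketch of that workflow, with the deployment-checkout path as an assumption:

    # on the deployment host, in the common config checkout (path assumed)
    cd /home/wikipedia/common
    $EDITOR wmf-config/db-eqiad.php    # comment the crashed slave out of its shard's server list
    sync-file wmf-config/db-eqiad.php 'pulling db1050, crashed'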
[02:46:50] !log asher synchronized wmf-config/db-eqiad.php 'returning db1043' [02:46:57] Logged the message, Master [02:48:44] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1050, crashed' [02:48:51] Logged the message, Master [02:50:16] !log power cycling db1050, unresponsive on serial console [02:50:21] Logged the message, Master [02:51:57] !log on searchidx1001: started incremental indexer, apparently it died on March 21 at 02:08 when it ran at the same time as a cron job import [02:52:04] Logged the message, Master [02:53:22] PROBLEM - Host db1050 is DOWN: PING CRITICAL - Packet loss = 100% [02:58:03] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [03:01:03] New patchset: Asher; "db1050 died" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55241 [03:01:30] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55241 [03:02:33] PROBLEM - Host db1050 is DOWN: CRITICAL - Plugin timed out after 15 seconds [03:04:56] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay seconds [03:05:06] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:05:17] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay seconds [03:05:17] RECOVERY - SSH on db1050 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:05:17] RECOVERY - MySQL Slave Running on db1050 is OK: OK replication [03:05:17] RECOVERY - MySQL Idle Transactions on db1050 is OK: OK longest blocking idle transaction sleeps for seconds [03:05:37] RECOVERY - MySQL Recent Restart on db1050 is OK: OK seconds since restart [03:05:37] RECOVERY - Full LVS Snapshot on db1050 is OK: OK no full LVM snapshot volumes [03:05:37] RECOVERY - MySQL disk space on db1050 is OK: DISK OK [03:07:17] RECOVERY - mysqld processes on db1050 is OK: PROCS OK: 1 process with command name mysqld [03:18:36] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds [03:18:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds [03:21:26] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [03:21:27] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [03:53:17] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [03:53:35] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [03:54:40] Anyone awake that can do a security review of a small C++ program for me? [03:56:31] Tomorrow, then [04:09:13] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [04:10:44] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Fri Mar 22 04:10:39 UTC 2013 [04:37:41] New patchset: Yurik; "Rearranged and consolidated carrier detection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55242 [04:39:53] New patchset: Yurik; "(bug 46430) Mobile dflt redir to RU for Vimpelcom Beeline (m & zero)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55236 [04:40:33] New patchset: Yurik; "Rearranged and consolidated carrier detection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55242 [04:52:10] yuck, what a day [04:53:01] greg-g: heh [04:53:16] TimStarling: what's died mean? not in process list at all? 
should be pretty easy to get a nagios check for that [04:53:32] (searchidx1001) [04:53:53] (err, icinga [04:53:54] ) [04:54:01] https://xkcd.com/859/ [04:58:33] New review: Ori.livneh; "Well, the base directory names on stat1 and stat1001 ought to match, so I think we want the base nam..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [05:12:06] jeremyb_: it's in a shell script restart loop with set -e [05:12:24] the shell script runs the updater, which updates all wikis and then exits [05:12:41] then the shell script sleeps for 15 minutes then starts the index update again [05:13:13] I found the relevant entries in /var/log/account/pacct [05:13:32] the indexer finished, exited, then after 15 minutes, started again [05:13:59] then after 18 seconds, java exited, and the shell script exited immediately afterwards, so java probably exited with a non-zero exit status [05:14:10] which was fatal because of the set -e [05:14:48] so you can't just monitor for the presence of the java process since it will be absent for 15 minute periods [05:17:00] what would be ideal is an occasional check (say hourly) of the last modification timestamp of /a/search/indexes/status/enwiki [05:17:38] if it's more than, say, 12 hours, it would alert [05:17:44] more than 12 hours ago [05:19:01] TimStarling: (( $(date +%s) - $(stat -c %Y /a/search/indexes/status/enwiki) < 43200 )) [05:19:32] er, > [06:07:58] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours [06:10:58] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [06:10:58] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Puppet has not run in the last 10 hours [06:13:58] PROBLEM - Puppet freshness on cp1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db1017 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db45 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:14:00] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [06:14:00] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [06:14:58] PROBLEM - Puppet freshness on db1018 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db1021 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db53 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db67 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [06:15:00] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [06:15:01] PROBLEM - Puppet 
freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours [06:15:02] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [06:15:02] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [06:28:58] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [07:01:28] New review: Hashar; "Great! :-D" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55162 [07:01:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:53:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [07:54:03] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [08:04:06] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [08:04:37] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:06:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [08:55:07] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: CRIT replication delay 217 seconds [08:55:19] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 222 seconds [08:58:08] RECOVERY - MySQL Slave Delay on db57 is OK: OK replication delay 0 seconds [08:58:18] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [09:07:56] !log Recreating Solr index [09:08:02] Logged the message, Master [09:11:49] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds [09:12:00] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [09:15:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [09:15:59] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [09:28:42] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [09:28:42] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [09:28:42] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [09:28:42] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [10:11:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [10:15:21] PROBLEM - Puppet freshness on mw1093 is CRITICAL: Puppet has not run in the last 10 hours [10:15:21] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [10:16:22] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:25:23] PROBLEM - Puppet freshness on db1049 is CRITICAL: Puppet has not run in the last 10 hours [10:28:24] PROBLEM - Puppet freshness on mw1065 is CRITICAL: Puppet has not run in the last 10 hours [11:04:24] Hey there, could somebody please pass me ( aklapper@ ) the content 
for the last two weeks in Bugzilla's "audit_log" table? I'd appreciate it. [12:04:21] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [13:36:32] could somebody please pass me ( aklapper@wm ) the content for the last two weeks in bugzilla.wikimedia.org's "audit_log" table? [13:47:27] New patchset: Demon; "Consolidate gallium and formey replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55259 [13:58:18] hey guys, did any of you get the emails david and I sent to the & analytics lists yesterday? Subject:    Re: Packet loss on oxygen and locke now [13:58:36] we both got the 'Your message to Ops awaits moderator approval' reply from *-bounces [14:10:04] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [14:12:04] PROBLEM - Puppet freshness on mw1092 is CRITICAL: Puppet has not run in the last 10 hours [14:13:06] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [14:18:04] PROBLEM - Puppet freshness on mw62 is CRITICAL: Puppet has not run in the last 10 hours [14:19:00] New patchset: Ottomata; "puppetize haproxy (for brewster), the very basics (RT-4660)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [14:20:52] New patchset: Ottomata; "puppetize haproxy (for brewster), the very basics (RT-4660)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [14:24:54] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: Timeout while attempting connection [14:27:16] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:00] mark, you round? wanted to ask about gadolinium varnishncsa [14:30:10] yes [14:30:18] is gadolinium in pmtpa or eqiad? [14:30:22] in eqiad [14:30:32] and oxygen is too, right? [14:30:35] yes [14:30:39] all the element names are [14:30:42] ah ok [14:30:43] good to know [14:31:10] ok, we haven't heard from asher yet, but how about I just remove the extra varnishncsa instance and consume from the stream then [14:31:20] i've got 4 analytics boxes consuming from that stream too, and they've never had a problem [14:31:33] sounds good [14:31:45] how much is the sum of all log traffic now? [14:31:53] i've heard it was approaching 1 Gbps [14:32:02] oo, haven't checked [14:32:25] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 82.86 ms [14:37:41] i've received and read the packet loss emails [14:37:56] but I don't have much to say about it other than "so can we disable the logging on the nginx servers then?" :) [14:40:08] yeah that's what I want to do too, i guess we gotta here from analytics folks [14:45:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 199 seconds [14:45:04] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [14:45:16] mark, re how much traffic, if I watch ifconfig, I see about 0.4 Gbps, which is about what ganglia shows too [14:45:24] 50 M bytes/sec [14:45:39] right [14:45:42] but now it's not peak [14:45:54] above 60-70 it'll start to get problematic [14:46:03] packet loss :) [14:46:21] we def see packet loss during peak times [14:46:39] some of that may also be from the cache servers [14:46:44] is that due to the nic not being able to handle it? 
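The 0.4 Gbps / 50 MB/s figure quoted just above comes from watching interface counters; the same estimate can be taken straight from the kernel statistics. A small sketch (the interface name is an assumption):

    IFACE=eth0                                   # whichever NIC carries the log stream
    RX1=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    sleep 10
    RX2=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    BPS=$(( (RX2 - RX1) / 10 ))                  # bytes per second over the sample window
    echo "$(( BPS / 1000000 )) MB/s, $(( BPS * 8 / 1000000 )) Mbit/s"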
[14:46:46] who are also becoming quite congested [14:46:59] i think that's right, according to what erik z is noticing [14:46:59] that is from temporarily full traffic queues during traffic bursts [14:47:07] only portions of hosts show packet loss [14:47:15] not the entire stream [14:47:16] that's entirely possible [14:47:19] do you have some examples? [14:47:24] I was the one that said the approaching 1gbps [14:47:27] it was 70+ when I looked [14:48:04] see Erik Z's latest email, but I will share a google doc with you with his numbers... [14:48:10] hm, maybe that was locke [14:48:16] * paravoid looks again [14:48:32] oxygen max week is 61.9MB/s [14:48:41] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AtLjsFovAGuvdGpuUnhHLWtBZDlkaXdjM2hOTWVZOXc#gid=0 [14:48:44] month 63.7MB/s [14:48:59] locke month is 65.3MB/s [14:49:04] this is the monthly average difference between seq numbers by host group [14:49:13] ssls are way off, but we know why [14:49:35] because the module is broken [14:49:37] we know that for years [14:49:39] upload pmtpas are 20% off, but that might be due to the fact that they don't have much load ? [14:49:45] I wonder why we still collect those streams [14:49:48] good q! [14:49:49] yeah [14:50:00] we're waiting to here from analytics people now to see if we can just remove them [14:50:11] it doesn't work anyway [14:50:25] even after my patches, it bumps the seq number for every *field* [14:50:32] we have what, 10 fields? [14:50:45] that's 90% seq lost [14:50:58] iirc, haven't looked it at for a year [14:51:04] aye, its ok [14:51:15] i think we shoudl just remove it, since they are duplicate logs anyway, as we talked about in that RT ticket [14:51:23] from that spreadsheet I take that recently everything is quite ok except for ssl [14:51:24] am I wrong? :) [14:51:28] if people want SSL stats then we can start logging x-f-proto or just set up a new stream [14:51:43] I think that would be valuable [14:51:49] x-f-proto [14:51:49] naw, since these are large averages, if the number is off by more than a few from 1000 it can indicate something is wrong [14:51:56] see erik z's latest email [14:52:06] - upload squids series knsq* (knams) have 2-3% loss in Feb/Mar, [14:52:16] upload squids in pmtpa have 23% loss since Nov 2012 but their load has dropped to almost zero so that doesn't weigh in on Ganglia trend [14:52:31] not sure if this is significant: [14:52:32] - text squids have < 0.1% loss in recent months [14:53:02] check this too: [14:53:02] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=Packet+Loss+Average&vl=%25&x=&n=&hreg%5B%5D=(locke%7Cemery%7Coxygen)&mreg%5B%5D=packet_loss_average>ype=line&glegend=show&aggregate=1 [14:53:17] look at the yearly one [14:53:41] i'm pretty sure the diff in packet loss between emery/locke and oxygen is due to SSLs, since oxygen is not receiving logs from nginx [14:54:00] but, i have no idea why this started happening in october [14:54:03] yeah november is around when we moved upload to eqiad [14:54:31] aye, makes sense, so that 20% loss is probably not a worry [14:54:33] probably just error [14:55:01] but also, relevant to the high traffic convo [14:55:25] for the past few weeks we've been seeing spikes of loss during high load time on oxygen and locke sometimes [14:55:34] during the popedot? 
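Erik Z's loss percentages are derived from the per-host sequence counters that udp2log stamps on each line. A hedged sketch of the same calculation over a sampled log file, assuming the sending host is field 1 and its sequence number is field 2, and that the counter does not wrap inside the sample:

    awk '{
        if (!($1 in min)) min[$1] = $2
        max[$1] = $2; seen[$1]++
    }
    END {
        for (h in seen) {
            expected = max[h] - min[h] + 1
            printf "%s expected=%d seen=%d loss=%.2f%%\n",
                   h, expected, seen[h], 100 * (expected - seen[h]) / expected
        }
    }' sampled.log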
[14:55:41] yes, but more regular [14:55:46] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=Packet+Loss+Average&vl=%25&x=&n=&hreg[]=%28locke%7Cemery%7Coxygen%29&mreg[]=packet_loss_average>ype=line&glegend=show&aggregate=1 [14:56:02] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:56:03] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [14:56:50] those oxygen spikes started happening backin January and have only been getting worse [14:57:13] oxygen is running 17 different IP based filters for wikipedia zero, we've been meaning to turn them all into a single filter using the X-CS header [14:57:25] but Evan Rosen is not confident about using it yet [15:00:20] anyyyyyway [15:00:29] hopefully we'll take out ssls from the stream soon [15:00:41] I will make gadolinium use the multicast stream (although we should hear from asher on that too) [15:00:56] what does asher have to do with this anyway? [15:01:06] graphite? [15:01:06] he set up the multicast stream [15:01:23] he just told me not to use it since it would be a spof for the udp2log boxes [15:01:40] that's why there were gonna be two [15:02:13] we discussed this, remember? [15:02:19] you were gonna setup two boxes, one in pmtpa, one in eqiad [15:02:27] and everything would consume from those [15:02:45] sounds familiar…:) [15:03:36] if our udp2log boxes are in eqiad (or at least, oxygen and gadolinium), that is a spof for the 2 of them, which I think is why asher told me not to use it [15:03:49] yeah [15:03:52] but there should be one in pmtpa as well [15:04:11] but the eqiad udp2log boxes wouldn't be able to use it, woudl they? [15:04:25] (i thought mulitcast between the dcs was a little funky) [15:04:37] multicast in pmtpa is the biggest problem [15:04:40] not between DCs [15:04:52] but I think we agreed that in pmtpa we'd have udp2log just forward instead of multicast [15:04:54] or something like that [15:05:00] now, with 60 MB/s that's gonna be a bit of a problem [15:05:11] need 10G interfaces, or multiple GigE [15:05:38] ok ja, hm [15:06:21] well, unless you think otherwise, i'm not going to push on making changes right now other than using multicast stream on gadolinium, and eventually removing locke unicast altogether. [15:06:28] we should have the varnishncsa replacement be capable of sending to multiple senders [15:06:35] that would be cool [15:06:36] oxygen would be a spof for oxygen udp2log and gadolinium [15:06:38] but [15:06:41] i think its ok [15:06:46] gadolinium and oxygen run different filters anyway [15:06:50] and there's no redundancy there [15:06:57] so each one is its own spof anyway :p [15:07:08] yeah then it doesn't matter [15:07:49] plus, oxygen has been running the multicast stream flawlessly for how long now? long time. [15:07:55] if it dies, we'll scramble to get another one up [15:07:58] socat I think? 
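For reference, the socat relay being described is roughly a one-liner of this shape; the port and multicast group below are placeholders rather than the production values:

    # re-emit the unicast udp2log feed arriving on :8420 to a multicast group
    socat -u UDP4-RECV:8420 UDP4-DATAGRAM:239.128.0.112:8420

    # a consumer (another udp2log box, an analytics host) joins the group like so
    socat -u UDP4-RECV:8420,ip-add-membership=239.128.0.112:eth0 STDOUT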
[15:08:01] yeah [15:08:02] socat [15:08:08] I setup a multicast relay on some box a long time ago [15:08:12] I think asher used that [15:08:15] yeah, its socat [15:08:23] i was saying that its been running well for a long time [15:08:31] sure, there's not much to it ;) [15:08:35] sure it is due to fail sometime, but we'll just scramble when it happens [15:08:43] heh [15:08:46] traffic loggers fail all the time [15:08:50] every time I reboot a box [15:08:53] varnish comes up at boot [15:08:59] and varnishncsas don't start until the next puppet run [15:09:03] haha, yeah [15:09:05] no i mean [15:09:05] hah [15:09:06] they also memleak, get killed, restarted, etc [15:09:12] i just mean the socat instance [15:09:15] i know [15:09:16] just saying [15:09:18] oh oh [15:09:18] yeah [15:09:19] totally [15:09:24] you're not getting 100% anyway ;) [15:09:26] yeah [15:09:36] i've caught varnish running but not writing to shmlog [15:09:36] if/when we want to add another udp2log box in pmtpa, we can put effort into adding multicast there [15:09:41] so no logging [15:09:43] ha [15:09:57] or really, it was writing to a different shmlog, but it doesn't matter [15:10:06] things like that ;) [15:32:36] New patchset: Ottomata; "gadolinium now uses udp2log multicast relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55269 [15:40:16] heya, paravoid: theoretical puppet q for you [15:40:32] since i'm removing a varnishncsa logger [15:40:52] i thoguht i'd go ahead and make the varnish::logging define removable via an ensure => absent parameter [15:41:10] but, i think there's a chicken and egg issue with that [15:41:10] https://gist.github.com/ottomata/5222247 [15:41:27] the service has to require that the init.d file exists [15:41:33] in order to start it [15:41:39] but in the reverse scenario [15:41:55] the init.d file has to exist in order for puppet to stop the service (if ensure == absent) [15:42:16] but, if ensure is absent, and the service depends on the file resource [15:42:25] then puppet will remove the init.d file before it tries to stop the service [15:42:27] force the order after the definitions [15:42:31] hm [15:42:33] conditional? [15:42:36] either via Service[...] -> File[...] [15:42:41] or with the plussignment operator [15:42:58] if absent { service ==> file } else { file -> service } [15:42:58] ? [15:43:14] like Service[...] { require +> File[...] } [15:43:34] yeah, i thought of the order thing, just seems weird! [15:43:35] ok [15:43:39] do you have a prefernce? [15:43:41] no :) [15:43:43] k [15:43:51] i think I like plussignment better [15:46:47] hmmmm, actually, no, i think this doesn't work either, i mean [15:46:57] it will work in the case where the file had actually existed before [15:47:05] but if someone adds a new varnish machine [15:47:23] and ensure => absent [15:48:13] the service will be stopped without the init.d file in place [15:48:25] paravoid^ :) [15:58:25] mark, i've done frontend conf deployments about 3 times now, each time babysat by peter…I feel more confident about it, especially since this is such a small change [15:58:35] mind if I go ahead, and if I break things come screaming to you? :D [15:59:27] sorry, frontend what? [16:03:36] removing gadolinium [16:03:43] squid, varnishncsa, nginx [16:06:08] ok [16:08:38] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours [16:10:25] New patchset: Ottomata; "gadolinium now uses udp2log multicast relay." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55269 [16:10:56] gonna do the squids first [16:11:35] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [16:11:35] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Puppet has not run in the last 10 hours [16:14:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55269 [16:14:36] PROBLEM - Puppet freshness on cp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db1017 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db45 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:14:38] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [16:14:39] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [16:14:39] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db1021 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db1018 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db67 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db53 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [16:15:37] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [16:15:37] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [16:15:38] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours [16:15:38] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [16:15:40] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [16:15:40] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [16:17:59] mark, ~how many frontend varnish servers are there? (ones that would have been running gadolinium varnishncsa?) [16:18:40] 23 I think [16:20:32] mk, not so bad, i didn't do the ensure => absent thing I was talking with paravoid about because it would make puppet angry in some cases [16:20:49] so I might have to manually log into them and stop ncsa and remove teh init.d file :/ [16:21:06] is there a dsh group for them? is that dangerous to do with dsh? [16:23:26] puppet angry? 
:) [16:23:31] no there's no up to date dsh group for them [16:24:51] yeah, due to resource deps [16:25:01] if I remove the init.d file, the serivce won't stop [16:25:11] if i stop the service before removing the init.d file [16:25:20] puppet will error on new varnish instance (that don't have the init.d file) [16:25:27] because the service won't really exist [16:26:09] ok, i've not used dsh on the prod cluster at all, is there any documentation of how/where? can I updated it with a frontend varnish group? [16:26:16] or shoudl I just find a list of them and do it manually? [16:29:35] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [16:31:32] hey folks, we're doing some reorg on the Wikipedia Zero varnish configurations [16:31:41] i want to double-check something real quick with someone who knows varnish better :) [16:31:59] https://gerrit.wikimedia.org/r/#/c/55242/2/templates/varnish/mobile-frontend.inc.vcl.erb <- this changes some of our detection to first set some headers, then compare based on them [16:32:09] wanna make sure that actually works as expected before i go approving it :) [16:32:37] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 20.4659275 (gt 8.0) [16:32:48] New patchset: Demon; "Updating gerrit to 2.6-rc0-7-g6e5cc39" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/55271 [16:33:11] New review: Demon; "War can be obtained from: https://integration.wikimedia.org/nightly/gerrit/wmf/gerrit-2.6-rc0-7-g6e5..." [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/55271 [16:35:20] This is war. [16:35:22] New review: Brion VIBBER; "Looks ok to me, but I'd like someone with more Varnish experience to confirm that setting a header, ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55242 [16:35:52] war? what is it good for? [16:43:39] New review: Brion VIBBER; "Should work. :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55236 [16:48:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55236 [16:52:12] whee [16:52:51] New review: Mark Bergsma; "Yes, that works." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55242 [16:52:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55242 [16:53:36] thanks mark :D [16:54:22] If anybody from the ops team has two minutes, could you please pass me ( aklapper@wm ) the content for the last two weeks in bugzilla.wikimedia.org's "audit_log" table? thanks in advance... [17:02:24] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:02:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 0.143514 [17:05:21] andre__: are you on the ops mailing list? [17:05:46] andre__: I could send it there if you want (no one is in the office yet, not sure about remote workers right now) [17:06:42] greg-g, I am, but this something quick and dirty that does not feel like worth an email, really :) [17:06:51] * Damianz wonders how greg-g knows no one is in the office, if there is no one in the office.... inception... is the cat dead or alive in the box... [17:06:54] but thanks :) [17:07:10] Damianz: no one in ops, the lazy... err [17:07:18] ;) [17:07:21] I have two things that require ops that I'd love to sort out today still. 
Let's see if timezones will collide or not :P [17:09:28] greg-g: [17:09:29] um [17:09:37] there's some of us in the office [17:09:51] though now i am not sure if i should have admitted that ;) [17:10:08] plus, all the european folks are up early :) [17:10:22] New patchset: Mark Bergsma; "Remove hit_for_pass code in generic Wikimedia VCL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55277 [17:12:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55277 [17:13:10] LeslieCarr: oh, I didn't see you! :) [17:18:37] lesliecarr: i received a new uplink module today...not sure what to do w/it rt4456 [17:19:07] hi. deployment question - i just did a big varnish reconfig (merged), how will i know when it gets deployed? [17:19:13] cmjohnson1: an ex4500 uplink module ? [17:19:21] or an ex4200 ? [17:19:31] 4500 [17:19:40] * brion waves yurik  [17:19:56] * yurik has been waved [17:20:01] like for stacking ? [17:20:02] hrmm...nope lesliecarr [17:20:03] 4200 [17:20:09] oh [17:20:09] according to the ticket [17:20:10] :) [17:20:17] 4200 is for spare purposes [17:20:26] since we had all that drama after asw-c-eqiad's one went bad [17:20:50] * cmjohnson1 put's it back in the box for storage [17:20:55] hmm.. brion, whom should i ping re q above [17:21:15] mark did the merge, let's ask him [17:21:29] wasn't sure if mark is mark :) [17:23:54] yurik: it was merged and has been deployed by puppet around now [17:24:01] \o/ [17:24:04] yei! [17:24:13] i tested it on one box, it seemed fine [17:24:20] puppet should normally do the others within 30 mins, so about now [17:24:22] yurik: go ahead and do some spoofing tests, make sure it all looks right [17:24:24] mark, two q: is it possible for me to do that? [17:24:28] no [17:24:33] :) [17:24:34] i meant not to pro [17:24:36] production [17:24:40] but as a test [17:24:40] in labs, yes [17:24:54] is there an easy setup to do the whole deployment? [17:24:56] hashar is working on setting up mobile varnish in the beta cluster [17:25:00] it's not fully finished yet [17:25:03] super! [17:25:04] so I think you'll be able to soon [17:25:10] will i be able to fake my ip? [17:25:21] no... don't think so [17:25:24] i think we still don't have a labs setup for zero, or it's partially done…. yurik ask jcmish about that, she's our mobile qa person [17:25:37] this is very important for us IMO [17:25:55] otherwise i have no idea if i broke partners ... they won't be happy :) [17:26:09] ideally we should have some way to throw fake traffic at it and test yeah [17:26:39] unit tests too.... mmm [17:27:57] :) to dream the impossible dream [17:28:06] ok, so the state of the art - change puppet file and bug mark. Future goal - jcmish & hashar will set up a test env [17:28:29] sounds about right [17:28:52] s/mark/ops/ [17:28:54] RobH, mutante_away, and LeslieCarr are also folks who can push stuff. [17:29:03] although I'm the one who does most varnish stuff at present, indeed [17:29:04] there should probably be a list somewhere, i just ping random people until someone responds ;) [17:29:23] brion: why'd you out me? ;) [17:29:53] everyone in ops as a full time position has root. [17:30:02] that means they can push stuff, but not they should ;] [17:30:05] not true [17:30:19] hehe [17:30:36] i have root and should *definitely* not push things because i don't know what i'm doing there [17:30:51] paravoid: not true? someone is full time in ops without root? 
[17:31:00] i keep to the download server mostly :) [17:31:05] brion: I thought we took your shell away years ago? [17:31:14] RoanKattouw: sssssh :) [17:31:15] i was shocked too [17:31:18] haha [17:31:20] brion left, a few weeks before he was rehired I took away his shell [17:31:27] i figured if we gave it back to him he would quit [17:31:32] lol [17:31:33] then at some point we reenabled it and now he has root again too [17:31:35] more shell, mo problems [17:31:56] RobH: coren [17:31:58] yeah i try not to get mixed up in ops stuff cause it's a never-ending rabbit hole and i've got enough on my plate programming :) [17:32:02] danese was like, "please disable brion's shell, he doesn't want it anymore, it's a liability!" [17:32:02] RobH: and CT if you want to be precise :-) [17:32:03] I see [17:32:09] and I was like "uh ok if you insist" [17:32:11] paravoid: Ahh, coren [17:32:19] ok, well, 80% of ops has root. [17:32:30] i really shouldn't have root, though shell is useful for debugging sometimes [17:32:37] There is also exactly one person in most other team that has root [17:32:45] brion: if you really dont need it, we can remove your key out of roots authorized keys [17:32:47] (Tim in platform, Brion in mobile, and myself in features) [17:32:53] cuz honestly, if you dont need it or want, safer to yank. [17:32:56] We're moving towards a pattern here :) [17:33:00] (can leave you will full deployment like most devs) [17:33:03] RobH: otto is kind of platform too [17:33:04] well, not most, some. [17:33:10] coren what? [17:33:12] !log aaron synchronized php-1.21wmf12/includes/WikiPage.php 'deployed 1862a57aef08432e8b1194a54dc179649f5cae63' [17:33:17] tomasz has root too, surprisingly. [17:33:18] Coren: i mistakenly said all ops have root [17:33:20] Logged the message, Master [17:33:22] Ah. [17:33:24] but you and ct do not [17:33:36] tomasz is not surprising if you know where he came from ;) [17:33:47] but nowadays probably shouldn't have it anymore indeed [17:33:48] plus he was about at stillman [17:33:49] i only use root to set file permissions on download, can i get it limited to that server? or should we just fix the file permissions once and for all ;) [17:34:01] brion: we can give you sudo on that server alone [17:34:03] brion: yeah we should provide better methods for that ;) [17:34:06] which is slightly better [17:34:08] \o/ that'd be awesome [17:34:14] can you file an RT ticket with the things you'd need? [17:34:20] :) [17:34:21] yay [17:34:28] just leave me in wikidev and a sudo on dataset2 (or whatever download is now) [17:34:44] let's see if i still have my rt credentials [17:34:46] yeah that works [17:34:59] atleast then if your key is stolen its just bad [17:35:02] instead of very bad. [17:35:04] ;P [17:35:10] :) [17:35:24] yep [17:35:29] * LeslieCarr gives brion the crown of security! [17:35:35] i try to keep my laptop secure but you never know :) [17:36:35] tfinc: hi [17:37:58] https://rt.wikimedia.org/Ticket/Display.html?id=4798 [17:38:01] enjoy :D [17:38:50] greg-g: anyway, the sync finished minutes ago [17:39:16] brion: is there a reason mobile related files arent owned by wikidev group? [17:39:26] hmmmm [17:39:30] actually that might be simplest fix [17:39:33] cuz if they were, then you wouldnt need sudo at all. [17:39:41] cuz sudo to do file crap is pretty wide open sudo [17:39:44] lemme check quick [17:39:44] yeah [17:40:15] cuz yea, it seems your access needs are pretty much dev standard if we have the proper file permissions [17:40:23] dev-deploy standard. 
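The "fix the file permissions once and for all" option being discussed amounts to handing the release trees to the wikidev group; a sketch using the directories named a little further down in the log:

    cd /data/xmldatadumps/public
    sudo chgrp -R wikidev android iOS win8
    sudo chmod -R g+w android iOS win8
    # setgid on the directories so future uploads inherit the group
    sudo find android iOS win8 -type d -exec chmod g+s {} +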
[17:41:15] ok they're mostly not group-writable. hah! [17:41:15] i can fix that [17:41:38] coolness [17:42:49] once thats done, i think its as simple as moving your include user from the admins::roots to admins::mortals [17:42:53] i hate that group name [17:43:15] and yanking your key out of private repo copy of roots authorized [17:43:20] see they're things like this: :P [17:43:21] drwxr-xr-x 3 tfinc root 151 2012-12-18 23:13 iOS [17:43:49] so gotta beat up tomasz to upload proper permissions ;] [17:43:59] how are those files placed? [17:44:21] no not beat up tomasz, we should provide a proper method for that [17:44:31] yes, i was joking. [17:44:37] brion: there has been talk about a new download server for mediawiki at the platform meetup [17:44:42] probably we should coordinate that effort with you guys [17:44:45] hence the ;] [17:44:52] spiff [17:45:01] RobH: we scp them in [17:45:10] probably just have bad default umasks [17:45:27] ok i'm not touching the mediawiki releases so i don't have to touch that directory [17:45:47] ok i think i go them all -- android, iOS, and win8 in /data/xmldatadumps/public/ [17:45:55] *got [17:46:05] is the best solution here to set puppet to enforce the umask on directories [17:46:06] ? [17:46:17] * brion fixed the permissions on the relevant dirs [17:46:23] * RobH isnt sure what the common practice is, since it seems most sync scripts are what fix permissions  [17:46:33] brion: yep, but they will mess up again no? [17:46:37] ah [17:46:46] well if we don't upload as root it should go better :) [17:47:03] ok, then is that all ya needed and yer good to lose root now? [17:47:04] that whole uploads structure for the data dumps and other is pretty wacky [17:47:11] RobH: yep good to go [17:47:21] cool, I'll yank it and start on it. [17:47:22] oh lemme confirm quick i can log into mw* as myself, i think i should [17:47:36] if you cannot now (you can since yer included in roots) [17:47:45] you will be later (when included in mortals, who all have mwdeploy access) [17:48:01] brion@fenari:~$ ssh mw1044 [17:48:02] Permission denied (publickey). [17:48:10] why you gotta prove me wrong? [17:48:14] haha [17:48:22] and that's with -A forwarding [17:48:25] thats odd. [17:48:32] cuz i am identical in setup to you and can do it! [17:48:44] * RobH investigates [17:48:57] let's make sure i have the right mortal key in there [17:49:23] should be ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPe5ARdfajt7cDlcK6Fn3uFf5d5hvFdefqdr3L4Q2qeojQYioEvgcbZfVXRzpoSuPPx1cl/tDZCdfYityJiZWaE3T+gDZqYh/zO4M/JkiRp0vfnHKQeRbW7ledlitPKi9ZoEGE0e8FX17V9DNxnSolI3wBrEOOHxmBnnqS2Q04bM1/MRuMH/jxkcOWEp/SG5TOJtlSqKMAOrui7vU0gycQ9Kn6bwB0csuRA2IUwAnn07oVlCoBLR4nDTzj+iXF9j3aB2nyuZE0huXJM4ys3oL5CSDVTDow42vLyH4jwMlugxsgC2QBwUuCPLGz0uTVOvdFG5PstXBEWJnr6lL/0D13 brion@Hawkeye.local [17:49:33] key => "AAAAB3NzaC1yc2EAAAADAQABAAABAQDPe5ARdfajt7cDlcK6Fn3uFf5d5hvFdefqdr3L4Q2qeojQYioEvgcbZfVXRzpoSuPPx1cl/tDZCdfYityJiZWaE3T+gDZqYh/zO4M/JkiRp0vfnHKQeRbW7ledlitPKi9ZoEGE0e8FX17V9DNxnSolI3wBrEOOHxmBnnqS2Q04bM1/MRuMH/jxkcOWEp/SG5TOJtlSqKMAOrui7vU0gycQ9Kn6bwB0csuRA2IUwAnn07oVlCoBLR4nDTzj+iXF9j3aB2nyuZE0huXJM4ys3oL5CSDVTDow42vLyH4jwMlugxsgC2QBwUuCPLGz0uTVOvdFG5PstXBEWJnr6lL/0D13"; [17:49:39] yeah looks right [17:49:40] funky [17:49:53] hrmm [17:50:07] can't get into bast1001 either [17:50:14] but fenari lets me in fine [17:50:28] that is super odd. 
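A quick way to do the check RobH is about to make by hand, i.e. which machines actually received the user's key; the host names are just examples pulled from the conversation:

    for h in bast1001 mw1044 fenari; do
        printf '%s: ' "$h"
        ssh root@"$h" 'test -s /home/brion/.ssh/authorized_keys && echo key present || echo no key' \
            2>/dev/null || echo unreachable
    done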
[17:50:35] computers… man [17:50:41] no srsly [17:50:49] damn unix don't make no sense [17:50:54] right now your cluster account has identical permissions to mine [17:51:53] weeeeeeeeeird [17:52:08] i can root into bast1001 and su to brion :P:P [17:52:40] interesting. [17:52:46] So on the apaches, your home dir has no .ssh [17:52:52] thus no authorized key [17:53:08] aaaaah fun [17:53:20] am i in the wrong subgroup so it's not copying me around? [17:53:29] or are those on an nfs *shudder* [17:53:58] well, you have [17:53:59] if $enabled == "true" and $manage_home { [17:54:02] versus my [17:54:08] if $manage_home { [17:54:14] hmmmm [17:54:21] i compare to say reedy [17:54:25] who matches my just if $manage_home { [17:56:01] am i not enabled? :P [17:56:02] i could just change it [17:56:05] but you are enabled [17:56:08] so im not sure wtf [17:56:18] class brion inherits baseaccount { [17:56:18] $username = "brion" [17:56:18] $realname = "Brion Vibber" [17:56:19] does that vary based on host? or is it global? [17:56:19] $uid = 500 [17:56:21] $enabled = true [17:56:25] huh [17:56:27] is listed as a global in the admins.pp [17:56:31] haha uid 500 [17:56:35] nice eh? [17:56:46] so i could just change it to match mine [17:56:53] that's what i win for configuring the first tampa servers in '04 [17:56:56] but then i think im just glossing over a larger issue in the admins.pp setup. [17:57:13] if jeff wasnt on vacation i would totally ping him [17:57:23] he had to redo a lot of this for fundraising, so has a firm grasp [17:57:34] paravoid: You have any knowlege of how our admins.pp is structured? [17:57:55] holy shit this year will be the 10th anniversary of my conversion to mac os x [17:58:01] i got my powerbook in '03 [17:58:15] apple should buy you a black turtleneck [17:58:21] srsly [17:58:52] i have bought far too many macs since then :) [17:59:01] lemme see if these other enabled variable folks have ssh keys copied properly [17:59:55] example, gwicke [18:00:02] and he does indeed have ssh key copied to apaches [18:00:36] brion: can you try ssh'ing right now to bast1001 as brion ? [18:00:52] maybe with a -vv ? [18:01:06] Permission denied (publickey). [18:01:16] yea, i see failed pubkey [18:01:26] LeslieCarr: its going to fail [18:01:28] LeslieCarr: https://gist.github.com/brion/5223414 [18:01:32] his home direcotry has no authorized keys [18:01:36] no .ssh subdir. [18:01:38] ah [18:01:39] hrm [18:01:40] so [18:01:44] so his key isnt copying [18:01:46] * brion .sshudders in tterror [18:01:56] when it is actually copying for other users with same permission sets in admins.pp [18:02:13] (thus it shouldnt be the enabled flag thing) [18:02:16] maybe. [18:02:28] actually it's not copying for everyone [18:02:31] jeluf for example [18:02:44] Why is that? [18:02:44] or demon [18:03:06] i'm guessing same reason it's not for brion [18:03:25] * Damianz yells 'nobody knows' quietly [18:04:01] shows him added to users swith proper uid [18:04:17] LeslieCarr: so other folks know about this [18:04:20] and im not crazy, thats good. [18:04:27] but does anyone know why? =] [18:05:31] hrm, perhaps the manage_home flag ? 
[18:05:40] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/55271 [18:05:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55103 [18:05:58] LeslieCarr: look at gwicke user [18:06:05] his is identical syntax to brion's [18:06:09] and his key is in place [18:06:37] oh nm demon has his .ssh directory [18:06:45] not enough caffeine this morn [18:07:36] i dont wanna take away brion's root access since its the only access that works, heh. [18:08:59] PROBLEM - Puppet freshness on mw1077 is CRITICAL: Puppet has not run in the last 10 hours [18:09:33] well, he was at one time set to disabled [18:09:41] i wonder if that flag set something someplace that changing it back doesnt fix [18:09:59] PROBLEM - Puppet freshness on mw1104 is CRITICAL: Puppet has not run in the last 10 hours [18:10:19] so could just yank the enabled part and see if fixes, but that just is shot in dark. [18:10:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:38] LeslieCarr: whadda think? [18:10:47] you made mistake of offering a view, now you are involved, bwahahahahaa [18:11:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [18:12:17] haha [18:12:19] i'm looking through [18:12:22] tparscal also doesn't work [18:12:24] why is that ? [18:12:28] ahha [18:12:32] we never disabled trevor [18:12:35] so there goes my idea [18:12:41] tparscal if $enabled == "true" and $manage_home [18:12:42] its something else, so odd. [18:12:43] that's why [18:12:53] $manage_home isn't true on bast1001 because it doesn't have nfs [18:12:59] PROBLEM - Puppet freshness on mw1043 is CRITICAL: Puppet has not run in the last 10 hours [18:13:01] but look at gwicke! [18:13:12] hrm [18:13:18] gwicke y u mess this up ? [18:13:25] heh, he iddnt, his user is perfect! [18:13:32] all user accounts should strive to be like it. [18:13:34] hrm? [18:13:48] gwicke: we are having issues with brion's key not copying correctly across the cluster [18:13:59] PROBLEM - Puppet freshness on mw1080 is CRITICAL: Puppet has not run in the last 10 hours [18:14:00] and his account syntax is identical to yours, so your name came up [18:14:08] we're trying to de-escalate me from root to a normal user :) [18:14:18] as both accounts are handled the same in puppet syntax [18:14:18] ah, I don't have root [18:14:31] we never got rid of sara's access (i.e. user = false) [18:14:31] indeed [18:14:40] not my fault ;) [18:14:42] lesliecarr: are you going to be around for a little while? eq is troubleshooting the cross connect. [18:14:44] hrmm [18:14:50] little bit [18:14:56] i wonder if its something to actually do with what account you are listed under [18:15:01] 40 minutes i think before lappy is taken away [18:15:05] ie: brion in roots at end includes [18:15:08] but that shouldnt matter [18:15:16] okay [18:15:16] as it handles the roots users key copies the same as the mortals [18:15:31] only difference is roots user key is, independently of this file, inserted into the private repo [18:15:40] i thought. 
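The guard-clause difference under discussion (some accounts wrapped in "if $enabled == "true" and $manage_home", others only in "if $manage_home") is easy to eyeball with grep from a checkout of operations/puppet; the manifest path here is an assumption:

    grep -n 'inherits baseaccount' manifests/admins.pp | head
    grep -n -B2 -A6 'class brion inherits baseaccount' manifests/admins.pp
    grep -n 'manage_home' manifests/admins.pp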
[18:15:50] though there may be something else I am missing in that, hence this issue [18:15:59] New patchset: Lcarr; "making sara enabled = false since she has left" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55294 [18:16:00] PROBLEM - Puppet freshness on mw1003 is CRITICAL: Puppet has not run in the last 10 hours [18:16:08] LeslieCarr: I don't see any other roots with the enabled flag specifically [18:16:25] (i see it for disabled for roots no longer workign here) [18:16:40] yeah, enabled should automatically be true [18:16:47] but the home directory [18:16:48] hrm [18:16:53] i think that may be the ticket [18:17:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55294 [18:17:11] i dunno [18:17:12] i try removing it [18:17:16] looking at my user [18:17:19] i have if manage home [18:17:22] and my key is there. [18:17:26] same with gwic [18:17:51] haha you all thought de-rooting me would make your lives easier, not add more work ;) [18:19:00] PROBLEM - Puppet freshness on virt5 is CRITICAL: Puppet has not run in the last 10 hours [18:19:07] LeslieCarr: are you changing something? [18:19:14] i dont wanna merge conflict [18:19:21] so if you are i wont [18:19:23] just changed something [18:19:30] if you git pull [18:19:34] that should be upt o date [18:20:00] PROBLEM - Puppet freshness on mw1089 is CRITICAL: Puppet has not run in the last 10 hours [18:20:00] PROBLEM - Puppet freshness on mw59 is CRITICAL: Puppet has not run in the last 10 hours [18:20:59] PROBLEM - Puppet freshness on mw1129 is CRITICAL: Puppet has not run in the last 10 hours [18:23:13] YuviPanda: i wonder if these will not get deleted http://commons.wikimedia.org/wiki/File:Center_for_Sustainable_Landscapes_at_Phipps_Conservatory,_Pittsburgh,_Pennsylvania_-_16.jpeg [18:23:24] oops [18:23:26] wrong channel [18:23:37] hehe [18:23:59] PROBLEM - Puppet freshness on mw1141 is CRITICAL: Puppet has not run in the last 10 hours [18:24:36] greg-g - did Wikidata folks say when they need the review to be done for that search key rebuild? [18:24:50] interesting those are accurate issues but their motd's claim that puppet was run 7 minutes ago [18:25:19] woosters: within 2 weeks, I believe [18:25:22] ok. thks [18:25:56] speak of the devil... [18:26:13] Denny_WMDE: the review of the search key rebuild, you all needed that by....? [18:26:43] two weeks ago, as usual :) [18:26:48] but whenever it is ready is fine too [18:26:54] whatever comes first [18:26:58] woosters: ^ [18:27:18] got it [18:27:29] search has been broken now for a month or so, so a few days there or here won't hurt [18:30:00] PROBLEM - Puppet freshness on virt7 is CRITICAL: Puppet has not run in the last 10 hours [18:31:59] PROBLEM - Puppet freshness on mw1016 is CRITICAL: Puppet has not run in the last 10 hours [18:31:59] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Fri Mar 22 18:31:55 UTC 2013 [18:32:41] are we using libvmod for varnish? (variables) [18:32:43] https://github.com/varnish/libvmod-var [18:39:20] mutante: is wikpedia.org supposed to work? [18:39:28] " [18:39:29] We're already challenging the wikpedia.org domain name." [18:39:48] (Mike Godwin, 2009-01-07) [18:42:56] Nemo_bis: did not have a ticket for it, is not in DNS, but .. whois shows we own it. creating ticket to add and redirect it. thanks [18:46:22] mutante: how about wikispecies.org? 
whois services seem confused about it [18:46:57] and it was free in 2009-10-30 [18:47:13] heya paravoid, I need to manually remove the gadolinium varnishncsa i instances from the frontend varnish machines [18:47:28] i'm not sure how to find out which ones they are [18:47:33] mark said there wasn't a complete dsh group [18:47:42] i'm not sure where our dsh stuff is defined anyway [18:47:49] Nemo_bis: "We are in the process of obtaining wikispecies.org" that quote is just one day old [18:48:10] Nemo_bis: we have a ticket for that RT-4445 and it is being worked on [18:48:58] mutante: ah, weird :) [18:49:04] let me tell Sj [18:50:04] it's "definted" in a file on fenari called /etc/dsh/groups/ [18:52:01] ottomata: ^^ [18:52:26] binasher: i was considering starting a wikitech page with something like guidelines for developers who want code running in wmf production [18:52:56] can i steal some of your quotes and make an (incomplete) chart with the request latency (and the actual latency numbers blank for now) [18:53:36] LeslieCarr: ok, but make sure step 1) is rewrite mediawiki from scratch [18:53:44] haha [18:53:58] well i was thinking something that people will actually obey instead of what we want ;) [18:57:12] thanks LeslieCarr [18:57:18] how do I know which group I should look at? [18:57:23] i guess mark said it was incomplete anyway [18:57:25] hm, i guess puppet, eh? [18:59:47] o my goodness [18:59:57] role::cache::upload has a big if (eqiad) hack [18:59:57] haha [19:06:09] LeslieCarr, can you advise? [19:06:19] i need to remove a varnishncsa instance [19:06:32] i was gonna make puppet be smart about it, but it was hacky and would have caused problems [19:06:35] so i just removed the definition [19:06:40] now I need to go and remove the instance manually [19:06:58] i've got regexes from site.pp that I think would match all of the hosts I need to find [19:07:25] is there a better way to find these hosts rather than expanding the regex and checking if the matching hosts exist? 
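Since there is no complete dsh group for this, one rough way to answer ottomata's question is to ask every cache host directly whether it still runs a varnishncsa instance pointed at gadolinium. A minimal sketch using salt (the 'cp*' target glob is an assumption based on the cache hostnames seen in this log):

    # list cache hosts that still have a varnishncsa instance logging to gadolinium
    salt 'cp*' cmd.run "ps auxww | grep '[v]arnishncsa' | grep gadolinium"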
[19:14:42] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [19:14:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [19:17:13] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:43] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:44] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:45] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:45] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:45] PROBLEM - Apache HTTP on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:46] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:46] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:47] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:48] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:48] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:49] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:49] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:49] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:50] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1025 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:47] robh: notpeter ^ [19:18:59] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [19:18:59] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [19:18:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:18:59] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [19:18:59] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.159 second response time [19:18:59] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [19:19:00] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [19:19:00] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.152 second response time [19:19:01] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.138 second response time [19:19:02] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.144 second response time [19:19:02] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.157 second response time [19:19:03] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [19:19:03] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [19:19:03] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [19:19:04] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [19:19:04] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [19:19:05] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [19:19:05] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61130 bytes in 0.180 second response time [19:19:07] bleh [19:19:20] cmjohnson1: just pinging cuz you know we are about? 
[19:19:26] or did you do something to trigger that ;] [19:19:30] nope [19:19:56] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.303 second response time [19:19:56] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:19:56] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.094 second response time [19:19:56] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.094 second response time [19:19:56] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.198 second response time [19:19:56] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.380 second response time [19:19:57] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [19:19:57] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:19:58] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.107 second response time [19:19:58] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [19:19:59] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.083 second response time [19:20:00] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [19:20:00] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [19:20:01] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:20:01] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [19:20:02] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [19:20:02] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.992 second response time [19:20:02] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.091 second response time [19:20:12] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [19:20:27] really? [19:21:56] yeah....not sure what happened [19:22:31] not my doing :-P [19:26:42] Change abandoned: Krinkle; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54798 [19:29:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [19:29:16] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [19:29:16] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [19:29:16] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [19:29:53] paravoid: Can you push the package before the end of today? 
[19:32:09] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:18] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:32] !log removed unicast udp2logging to gadolinium; gadolinium will consume from multicast stream [19:32:38] Logged the message, Master [19:34:26] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [19:35:14] New patchset: Aklapper; "Comment query for urgent issues. Not working as expected yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55301 [19:39:35] New review: Hashar; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55259 [19:43:15] New review: Demon; "(1 comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55259 [19:48:10] New review: Krinkle; "(1 comment)" [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54691 [19:48:15] New patchset: Yurik; "Unified default language redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [19:48:26] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55259 [19:48:31] New review: Andrew Bogott; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [19:49:28] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [19:49:57] New review: Krinkle; "(1 comment)" [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54691 [19:51:46] New review: Yurik; "DO NOT MERGE until reviewed by dfoy & brion" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55302 [19:52:08] New patchset: Hashar; "erb expander for testing purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55304 [19:54:13] yurik: can't you vote CR-2 on patch sets ? [19:54:31] yurik: ah no, it is operations/puppet. Forget me :) [19:54:38] hashar: nope [19:54:52] hashar: are you the one setting up a test env for me? :) [19:55:12] yurik: probably not. What do you want to test? [19:55:13] * yurik badly needs a way to test puppets [19:55:22] ahhhh [19:55:30] basically i need a way to test that when i break the varnish script [19:55:33] so you can use a virtual instance on your local computer [19:55:35] i really really break it [19:55:42] or use a labs instance :) [19:55:56] hashar: sure, but it seems a bit complex at the time [19:56:09] we have a puppet class to let your instance fetch from a local directory [19:56:10] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [19:56:10] basically my goal - deploy puppets and pretend that i am coming from ip XXX [19:56:28] New review: Brion VIBBER; "Zero doesn't work on HTTPS because the certs are wrong (and if it did work, I don't think the carrie..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [19:56:38] hashar: the zuul_service check still doesnt work.. sigh.. for some reason the NRPE file was not updated on gallium .. i'm looking now..hrmm [19:56:56] mutante: did nagios-nrpe restart properly ? 
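For the "test my puppet changes before they hit production" goal discussed above, the usual route is a labs instance with the self-hosted puppetmaster class from the linked wikitech page. The iteration loop then looks roughly like this; the checkout path is an assumption about what that class configures:

    # on a self-hosted-puppetmaster labs instance: pull the change under test,
    # dry-run it, then apply it for real (checkout path may differ)
    cd /var/lib/git/operations/puppet && sudo git pull
    sudo puppet agent --test --noop
    sudo puppet agent --test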
[19:57:14] hashar: yes, but the config does not have the new command [19:57:16] mutante: btw if you know how to find out when a process has started, I am willing to learn :] [19:57:18] ah [19:57:38] there is a timer in the output of "ps" [19:57:46] START TIME [19:58:00] ah ps [19:58:06] ps aux [19:58:07] I will one day have to learn how to use it [19:58:34] If you don't feel like ps, just check the ctime on the pid dir in /proc heh [19:58:35] hashar: would love to meet w you next week (you are in SF, right?) to see how i can test varnish & puppets [19:58:36] Mar08 1:30 /usr/sbin/nrpe -c /etc/icinga/nrpe.cfg -d [19:58:40] mutante: so that looks a bit old [19:58:55] hashar: grep zuul /etc/nagios/nrpe_local.cfg [19:59:00] mutante: ah no nrpe does nothing but receiving the commands I guess so unrelated [19:59:40] the question is more why that file has not been updated after we changed both checkcommands templates, icinga AND Nagios even [19:59:44] yurik: hashar isn't in SF, he's in France [19:59:52] bummer [20:00:05] yurik: I would love to fly over to SF to meet you :-] [20:00:20] hehe, hashar, i think NYC is closer :D [20:00:33] yurik: but that would cost a bit of money and make my wife angry 8-) We can google hangout next week if you want :D [20:00:44] that's what i was thinking [20:00:58] ahh NYC even better timezone wise. Your mornings are my afternoon, that works well [20:01:11] true, but i'm in SF next week [20:01:26] unlucky [20:01:38] Damianz: ah /proc would work. thanks! [20:02:06] hashar: /etc/icinga/nrpe_local.cfg has the new command, /etc/nagios/nrpe_local.cfg does not :p [20:02:20] nagios is obsolete isn't it ? [20:02:25] hehe. hashar, just to give you a heads up - my goal is to have a setup where i can deploy new varnish settings and run a few unit? tests pretending to come from a set of IPs, checking the results [20:02:42] hashar: but.. no.. because that's the one it uses :p [20:02:47] apparently [20:02:52] mutante: I would reload / restart nrpe [20:03:33] mutante: /etc/icinga/nrpe_local.cfg is included by /etc/icinga/nrpe.cfg which is passed as a parameter to the /usr/sbin/nrpe currently running since Mar 08 [20:03:36] hashar: puppet does that [20:03:40] ah [20:03:41] notice: /Stage[main]/Nrpe::Packages/File[/etc/icinga/nrpe_local.cfg]/content: content changed '{md5}a7b3f5d6672f0e9471b076b729f0e884' to '{md5}1e1c9657962ef70b5859356c4b6393f1' [20:03:45] info: /Stage[main]/Nrpe::Packages/File[/etc/icinga/nrpe_local.cfg]: Scheduling refresh of Service[nagios-nrpe-server] [20:03:52] ah [20:03:56] so hmm [20:04:03] puppet is bugged ? :-]]]]]]]] [20:04:05] Is this a right channel to ask why wikinews.com does not redirect to wikinews.org? [20:04:22] odder: we are a non profit, use the .org :-] [20:04:29] odder: yea :p [20:05:09] yurik: you probably want to use a labs instance, though I am not sure how you will forge an IP :-] [20:05:10] dig wikinews.com [20:05:10] hashar: I was just going through the list of squatted domains from internal, and noticed this one is ours already, but it doesn't seem to redirect properly. 
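On the "when did this process start" question a little earlier, two quick ways alongside the ps output already pasted; the PID is only an example:

    # full start timestamp and elapsed time for a given pid
    ps -o pid,lstart,etime,cmd -p 1234
    # or, as suggested above, look at the /proc entry's timestamps
    ls -ld /proc/1234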
[20:05:16] * Damianz uses the other window [20:05:39] odder: confirmed, and thanks for reporting, turning into ops ticket [20:05:48] :-D [20:05:50] so far we just have .de and .org [20:07:28] odder: i can PM you a list of domains we have tickets for if you feel like comparing all [20:07:48] hashar: i was hoping varnish has some ip value somewhere that I can set from either the real IP or from some magic header, and later I will use that ip to check against the ip ranges with the ~ operator [20:08:01] the one i have was created by going through our DNS and checking which are not in Apache [20:08:09] but the ones that are not in our DNS might be missing [20:08:25] instead of client.ip ~ (acl list) [20:09:30] RECOVERY - Puppet freshness on virt7 is OK: puppet ran at Fri Mar 22 20:09:28 UTC 2013 [20:09:55] yurik: I have no idea :-] You might want to ask mark about it [20:10:27] mark is MIA [20:10:29] :) [20:10:32] yurik: I also know it exist a varnish test suite that could potentially fit your needs. [20:10:39] mutante: sure [20:10:45] oh! that might be good :) [20:10:51] yurik: well friday night in Europe. He must be having some beers with relatives :-] [20:11:07] great. both ops are on the other side of the planet :( [20:11:17] It's the good side [20:11:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [20:11:22] right [20:11:30] see, even puppet complains [20:13:06] yurik: note that I am not part of ops :-] [20:13:28] despite a common misconception that I am part of that team [20:13:54] hehe [20:15:19] hashar, ops-shmops, close enough [20:15:30] hashar just keeps jenkins in his box :D [20:16:22] PROBLEM - Puppet freshness on mw1093 is CRITICAL: Puppet has not run in the last 10 hours [20:18:42] hashar is honorary ops [20:19:25] when Asher joined, some people started to ask me to fix Innodb on some random sql server [20:19:41] cause I used to be known by the first name of "Ashar" [20:19:47] I must confuse everyone :( [20:21:41] hashar: I was confused at first when they hired Asher Feldman, I was wondering whether he and Ashar Voultoiz (sp?) were different people for a while [20:22:03] silly Roan [20:22:32] they better not hire another brion [20:22:32] RoanKattouw: you were not alone, at the berlin hackaton in 2011, some people congratulated me to have joined the foundation ;D [20:22:45] RoanKattouw: does moving test2wiki pages give you trouble? [20:22:48] brion: that is explicitly forbidden by HR [20:22:53] :) [20:23:19] AaronSchulz: RoanKattouw: that page move issue we had with wmf12, is it covered by a PHPUnit test? That would be nice to have [20:23:44] hashar: It was replag-specific [20:23:54] If you ran the unit tests with simulated replag you might get it [20:24:04] AaronSchulz: I haven't tried [20:24:23] brion: Hey we already have five Roberts and we used to have four Andrews, so .... ;) [20:24:40] you need roberts@lists.wm.org? [20:24:45] haha [20:24:58] namediversity@lists.wm.o [20:25:20] heh [20:26:21] PROBLEM - Puppet freshness on db1049 is CRITICAL: Puppet has not run in the last 10 hours [20:27:55] RoanKattouw: can you try? [20:28:22] PROBLEM - Puppet freshness on mw1065 is CRITICAL: Puppet has not run in the last 10 hours [20:28:55] RoanKattouw: ah replay :-/ We don't have a system to emulate that, niklas filled a but about it a few months ago. 
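The "pretend I'm coming from IP XXX" idea above would ultimately need a hook in the VCL itself (some header the config trusts instead of client.ip), but the black-box side of such a test is simple. Everything below is an assumption about how that hook might look, not how the production VCL behaves: the header name, test host and expected redirect are all illustrative.

    # hypothetical black-box check: ask a test cache to treat the request as
    # coming from a carrier IP and see whether the zero redirect fires.
    # X-Forwarded-For is only a stand-in for whatever header the VCL would trust.
    curl -sI -H 'Host: en.zero.wikipedia.org' \
         -H 'X-Forwarded-For: 203.0.113.10' \
         http://test-cache.example.org/ | grep -i '^Location:'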
[20:29:06] AaronSchulz: Sure [20:29:51] AaronSchulz: Successfully moved [[Test]] to [[Test2]] [20:29:57] Checking DB to verify it got it right [20:31:20] arghg [20:31:29] /usr/local/bin/sql is broken [20:31:43] It needs to run mwscript as Apache but wikiadmin_pass as a wikidev [20:31:43] So sudo'ing it doesn't work eitehr [20:33:02] RoanKattouw: stop breaking the maint scripts :-p [20:33:07] Whoa wtf [20:33:12] `which sql` works [20:33:37] It's probably still using /h/w/bin/sql ? [20:34:48] AaronSchulz: Yup looks fine [20:34:57] I moved it back on top of the redirect too, looks good [20:37:14] RoanKattouw: it works unless your name is Greg or Chris [20:37:14] haha [20:37:51] !log aaron synchronized php-1.21wmf12/includes/Title.php 'Reverted follow-up fixes too' [20:37:58] Logged the message, Master [20:37:58] update the bug title "Page moves totally broken for people named Greg or Chris" [20:46:09] !log aaron synchronized php-1.21wmf12/includes/WikiPage.php 'deployed e372b635ae6d6a589f6d87f14fe4c9452d7a1b4d' [20:46:15] Logged the message, Master [20:48:46] New patchset: Diederik; "Added domain referer info to blog query, and ignore search and preview urls." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55390 [20:51:10] !log aaron synchronized php-1.21wmf12/includes/Title.php 'deployed 0f4914bb4ff2d13cb124cad94dc5ded663123a18' [20:51:16] another [20:51:16] Logged the message, Master [20:52:51] \o/ [20:52:53] greg-g: https://gerrit.wikimedia.org/r/#/c/43445/ [20:54:24] thanks much, AaronSchulz [20:59:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [20:59:51] New patchset: Aaron Schulz; "Revert "Roll back all wikis to php-1.21wmf11 due to bug 46397"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55392 [21:00:37] New patchset: Yurik; "Unified default language redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [21:01:46] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55392 [21:03:16] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Set all non-Wikipedias back to 1.21wmf12 again. [21:03:23] Logged the message, Master [21:06:47] csteipp: https://gerrit.wikimedia.org/r/#/c/55389/ [21:06:52] New patchset: Ottomata; "Removing nginx from udp2log webrequest stream." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55394 [21:07:14] ottomata: thanks, merged the haproxy class [21:07:44] yay, danke! [21:08:25] ottomata: http://en.wiktionary.org/wiki/gern_geschehen [21:08:29] New review: Ottomata; "This is pending some more discussion, I'm just getting it ready. See RT 859" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55394 [21:09:01] gern geschehen [21:09:04] not heard that one! [21:09:36] it's kind of like a 3-way handshake. bitte, danke, gern geschehen :) [21:11:54] Is notpeter around ? [21:12:12] Someone needs to start the incrementalupdater on the search index box. [21:12:35] xyzram: You don't have shell access to the search cluster? [21:12:36] csteipp: rebased [21:13:03] xyzram: don't you have sudo rights as lsearch user ? [21:13:29] don't know the password for that user [21:13:42] hi notpeter [21:13:44] <^demon> We don't use passwords in prod for that, so no :) [21:13:45] what is the box name ? 
[21:13:49] summoned him [21:14:12] searchidx1001 [21:14:16] I guess any root could restart it [21:14:51] started [21:14:52] it is all about sudo -l -u lsearch; /a/search/lucene.jobs.sh incremental-update [21:14:57] you are the boss :-] [21:15:06] it's just: [21:15:06] root@searchidx1001:~# killall -g java [21:15:06] root@searchidx1001:~# /etc/init.d/lucene-search-2 start [21:15:07] root@searchidx1001:~# sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start [21:15:07] notpeter: would it make sense to convert that to an upstart job ? [21:15:20] so we could get: start lucene-inc-updater [21:15:29] and let upstart/puppet make sure it is always running [21:15:42] sure, could do [21:15:44] it needs to be started after the lucene daemon [21:15:53] but that could be done with dependencies [21:16:11] for the record, the commands can be found in https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [21:16:14] yeah I guess both puppet and upstart handle dependencies [21:16:25] that's the general restarting the indexer method [21:16:38] I mean, I'd be inclined to not use time to make that better [21:16:43] while you are around, search was broken in beta because lucene-search-2 service was not running on the search box [21:16:45] <^demon> Letting wikidev users sudo as lsearch might be nice, so we could restart that. [21:16:45] and put time into setting up something better like solr ;) [21:16:48] despite puppet having ensure => running :( [21:17:16] hashar: that's lame [21:17:21] ^demon: yep, makes sense [21:17:25] notpeter: well if you invest some time to make the current lucene search better, you will end up freeing more time to work on solr [21:17:48] fyi: 02:52 Tim: on searchidx1001: started incremental indexer, apparently it died on March 21 at 02:08 when it ran at the same time as a cron job import [21:18:04] hashar: sure. I guess I don't see kicking it from time to time as a big time investment [21:18:06] notpeter: maybe I will write the upstart job for the inc updater. [21:18:11] sure! [21:18:13] go for it :) [21:18:18] i like upstart [21:18:31] <^demon> Hmm, killing those labswikimedia wikis seems to have upset something too. [21:18:39] <^demon> Tons of "Error getting snapshot for index en_labswikimedia.hl java.lang.RuntimeException: Index en_labswikimedia.hl doesn't exist" [21:18:52] I have put in my hundreds and hundreds of hours keeping search from not falling over. I no longer have energy left to invest in it... [21:18:56] <^demon> (substitute your favorite labswikimedia wiki) [21:19:14] notpeter: yeah i can understand that [21:19:30] Shirley: creating a wiki is like creating a baby, yes you should have a good reason to create one, but if you don't for whatever reason, you should have an _extra_ good reason for killing one [21:19:35] notpeter: you need a new pet project :) [21:19:48] hashar: I have plenty :) [21:20:05] but it seems like i can't get rid of this one ;) [21:20:15] I was seeing those errors about labs indexes even before; not sure why our production box even knows about labs [21:20:48] notpeter: at least you are not alone now! I can provide some first level support [21:20:59] and xyzram + ^demon knows about the java side pretty well [21:21:02] installs otrscivicrmmingle on hashar1001 [21:21:14] <^demon> xyzram: They aren't new labs. They're old labs (which was sorta on-cluster) [21:21:21] yep! am grateful for people who speak java to be helping on this now :) [21:21:24] <^demon> Recently deleted. 
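Until the proposed upstart job lands, the guarantee the incremental updater needs is just "start it if it isn't running". A rough, cron-able version of that; the pgrep pattern is an assumption about what the updater's command line looks like once started:

    # keep the lucene incremental updater alive until upstart manages it
    # (pgrep pattern is an assumption about the process's command line)
    pgrep -f 'lucene.jobs.sh inc' >/dev/null || \
        sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start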
[21:21:26] mutante: we all have our nightmare projects it seems :-]]] [21:21:56] we need a lucene-task-force@wikimedia.org mailing list [21:21:59] hehe [21:22:04] s/lucene/search/ [21:22:15] ^demon: I don't see any references to labs in the config files anywhere [21:22:40] New review: Dfoy; "m.wikipedia.org should redirect to the landing page in the same manner that zero.wikipedia.org redir..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [21:22:40] <^demon> Yeah, I noticed. They do seem to be in the indexes/index/* directory though. [21:22:50] https://github.com/duckduckgo/duckduckgo [21:22:59] <- search that already uses wikipedia :) [21:23:22] http://duckduckhack.com/ [21:25:35] Labs before Labs. [21:26:16] <^demon> xyzram: Ah, I think I got it. It's from RMIMessengerImpl.getIndexTimestamp(). It's trying to get the status of the existing index, but fails when it tries to load the index id (since config doesn't know about it). [21:26:27] <^demon> Removing the indices should probably clear that up. [21:27:02] Ah, great! [21:27:53] <^demon> notpeter: Removing `find /a/search -type d -name '*labs*'` from searchidx1001 should stop the "index not found" errors. [21:32:54] New review: Ottomata; "Not a bad point. " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [21:33:24] notpeter: I think I have the upstart job :-]  Will test it out on beta [21:35:27] ^demon: yeah, those look pretty rm-able [21:35:35] !g I84292345f5e4135e03c524a02dd7eb34ee9119cb [21:35:35] https://gerrit.wikimedia.org/r/#q,I84292345f5e4135e03c524a02dd7eb34ee9119cb,n,z [21:36:54] RECOVERY - Puppet freshness on constable is OK: puppet ran at Fri Mar 22 21:36:44 UTC 2013 [21:37:44] ^demon: ran that on searchidx2 [21:37:49] the non-active one, and errors still be flooding [21:38:26] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:38:46] Maybe restart lsearchd in case it's still holding on to the file descriptors ? [21:39:08] <^demon> Probably is. [21:39:14] I did so [21:39:14] or [21:39:16] I stopped it [21:39:18] rm'd [21:39:19] and started [21:39:57] <^demon> Ah, missed a couple maybe. Shouldn't have done -type d. [21:40:01] !log killing nagios-nrpe on gallium and testing if puppet restores it properly [21:40:06] <^demon> `find . -name '*labs*'` [21:40:08] Logged the message, Master [21:40:23] <^demon> s/\./\/a\/search\// [21:40:54] and yeah [21:40:56] status [21:40:57] and links [21:41:15] New review: Hashar; "Testing it out on labs instance deployment-searchidx01" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55406 [21:41:18] <^demon> It's prolly those status files. [21:41:23] <^demon> If I had to guess. 
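Spelling out ^demon's suggestion above as the full cleanup on an index host: the stop/remove/start sequence is the one already used on searchidx2 in this exchange, and dropping -type d also catches the status and link files he points at below.

    # remove every leftover labswikimedia index artifact, not just directories,
    # with the daemon stopped so nothing holds the files open
    /etc/init.d/lucene-search-2 stop
    find /a/search -depth -name '*labs*' -exec rm -rf {} +
    /etc/init.d/lucene-search-2 start
    sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start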
[21:41:50] hashar: ignore gallium monitoring for a minute [21:42:03] mutante: sure :] [21:42:04] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [21:42:05] still erros :/ [21:42:17] |log jenkins died on gallium :( [21:42:32] hah [21:42:38] *grin* [21:42:57] soo, puppet does fix nagios-nrpe if it's stopped [21:43:05] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java -jar /usr/share/jenkins/jenkins.war [21:43:05] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [21:43:06] and, surprise, it runs as user icinga now [21:43:14] <^demon> notpeter: And of course that stacktrace is fucking useless. [21:43:14] and look at that zuul recovery [21:43:19] woo! [21:43:41] java: a dsl for turning xml files into large stacktraces ;) [21:43:48] hashar: wee, it was running as user "4294967295" again :/ [21:44:03] init script cant stop it then, just claims it does [21:44:15] ah [21:44:21] kill, let puppet fix it, runs as correct user, zuul check also fixed :p [21:44:37] you might want to dsh on all box to find out any nrpe process still running with a bad uid [21:44:52] good idea, ok [21:45:19] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gallium [21:48:57] mutante: great! [21:49:12] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:49:16] mutante: thank you very much [21:50:01] notpeter: I am going to quote you on "java: a dsl for turning xml files into large stacktraces ;)" [21:50:04] New patchset: Ottomata; "Adding puppet Limn module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [21:50:29] Your quip ' java: a dsl for turning xml files into large stacktraces ;)' has been added. [21:50:36] you are going to be famous in bugzilla ! [21:50:45] <^demon> Trying to chase a real stacktrace when all you were given was 2 useful lines + a load of crap -> not a friday funday. [21:51:13] hmm, Java is actually a dgl, there is nothing specific about it [21:52:15] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:52:15] RoanKattouw: why does git push origin wmf/1.21wmf12 give me a password prompt? [21:52:32] git remote says it uses the https url, hrm [21:52:38] git remote -v ? [21:53:05] AaronSchulz: On which machine? [21:53:12] hashar: ok :) [21:53:21] RoanKattouw: mine [21:53:26] hashar: yw, have nice weekend [21:54:41] New patchset: Pyoungmeister; "WIP: first bit of stuff for taming the mysql module and making the SANITARIUM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [21:55:24] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:55:26] mutante: I wish i had actual weekends [21:56:16] hashar: :/ [21:56:53] hashar: ouch, nrpe, i used salt to check..and it's like that .. on a whole bunch [21:57:28] you probably want to fill a RT and take care of it on monday :-] [21:57:41] +1:) [21:57:44] notice: Finished catalog run in 35.42 seconds  \O/ [21:57:47] my patch works [21:58:56] New review: Hashar; "I have deployed this patch on deployment-searchidx01.pmtpa.wmflabs and it is now taking care of the ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:59:22] New patchset: Pyoungmeister; "WIP: first bit of stuff for taming the mysql module and making the SANITARIUM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [21:59:44] notpeter: I have sprinted the upstart job for lucene job inc updater : https://gerrit.wikimedia.org/r/#/c/55406/ added you as a reviewer for it and I got the change deployed on the labsinstance. [22:00:14] notpeter: feel free to play with the instance if you wanna test it out :-] Would let us do something like: # start wmf-lucene-incupdate [22:01:00] I am off now, bed time 11pm [22:01:05] salt '*' cmd.run 'ps aux | grep 4294967295' :p [22:01:14] easy :-] [22:01:23] good night [22:01:31] hashar: cool! thank you! good night [22:02:09] mutante: if you know about salt, wikitech is waiting for you [ https://wikitech.wikimedia.org/wiki/Salt ] [22:02:21] now I am gone *wave* enjoy your weekend [22:02:43] heh, copy/paste from some etherpad that includes bitches in the title :) ok, waves [22:02:54] ;D [22:05:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [22:07:11] !log installing package upgrades on oxygen [22:07:17] Logged the message, Master [22:09:00] !log kill and restart nrpe on oxygen [22:09:07] James_F: stop dragging Roan [22:09:08] Logged the message, Master [22:09:14] PROBLEM - Puppet freshness on mw1085 is CRITICAL: Puppet has not run in the last 10 hours [22:10:13] notpeter: it looks like a lot of puppet agents are getting stuck with the new crons [22:10:17] like mw1085 for example [22:10:19] preciuse hosts [22:10:56] New review: Milimetric; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55390 [22:11:32] LeslieCarr: just looked at it [22:11:39] it was an old puppet agent that wasn't killed [22:12:28] !log killall nrpe via salt, then restart nagios-nrpe-server , runs as wrong user [22:12:34] Logged the message, Master [22:12:44] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: Connection refused by host [22:13:10] LeslieCarr: ^ it was running as 4294967295 pretty much everywehre :p [22:13:28] RECOVERY - MySQL Slave Running on db67 is OK: OK replication [22:13:35] but a simple kill and restart fixes it, no matter if doing manually or letting puppet start it [22:13:44] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay seconds [22:13:44] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay seconds [22:14:14] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [22:14:15] PROBLEM - Puppet freshness on mw35 is CRITICAL: Puppet has not run in the last 10 hours [22:14:15] PROBLEM - Puppet freshness on mw1158 is CRITICAL: Puppet has not run in the last 10 hours [22:14:33] fixed [22:14:40] mutante: thanks :) [22:14:45] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:46] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:47] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:47] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:49] maybe ishould have killed all the puppets a little more aggresively [22:15:04] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: 
PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:04] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:13] hrmm, let's fix those as well now [22:15:14] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:14] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:14] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:15] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:23] what is the correct number? [22:15:25] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:25] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:38] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:39] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:46] well [22:15:47] 3 [22:15:48] now [22:15:52] yesterday it was 4 [22:15:55] ottomata: you there? [22:16:15] PROBLEM - Puppet freshness on mw1118 is CRITICAL: Puppet has not run in the last 10 hours [22:18:24] RECOVERY - Puppet freshness on mw1118 is OK: puppet ran at Fri Mar 22 22:18:15 UTC 2013 [22:19:53] ja hiii [22:19:58] ah [22:19:59] right [22:20:01] on it [22:20:03] check_procs -w 4:4 -c 4:8 -C varnishncsa [22:20:06] yeah [22:20:07] fixing [22:20:08] cool, thanks:) [22:20:26] NRPE wasnt running right, that's why they all popped up at once now [22:20:32] after i restarted it with correct user [22:20:53] mutante: what did you run to kill puppet agents? [22:21:23] i did not kill puppet agents, i killed "nrpe" processes [22:21:45] kill: salt '*' cmd.run 'killall nrpe' [22:21:46] start: salt '*' cmd.run '/etc/init.d/nagios-nrpe-server start' [22:21:50] oh, gotcha [22:21:56] we were talking about different things [22:21:57] sorry [22:22:32] np:) [22:22:55] New patchset: Ottomata; "Only 3 varnishncsa instances running, fixing logging_monitor." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55419 [22:23:40] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55419 [22:23:56] edit war:) [22:24:34] did you merge that on sockpuppet? [22:24:35] mutante? 
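For reference, the old NRPE check quoted above expected four varnishncsa processes (-w 4:4 -c 4:8); with only three instances per cache host now, the fixed check from the merged change should read along these lines (the exact warn/crit ranges in the actual patch may differ):

    # varnishncsa process-count check scaled down from 4 expected instances to 3
    /usr/lib/nagios/plugins/check_procs -w 3:3 -c 3:6 -C varnishncsa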
[22:24:39] ja [22:24:41] oh ok [22:24:42] ha [22:24:42] cool [22:24:43] thank you [22:25:04] RECOVERY - Puppet freshness on mw1085 is OK: puppet ran at Fri Mar 22 22:24:55 UTC 2013 [22:26:50] last puppet run on random varnish cache box.. 250 minutes ago, running on cp1034 [22:27:38] ok, puppet changed the check, restarted nrpe, and still running as icinga as it should, looks good [22:27:57] icinga-wm: gimme a recovery [22:28:29] AaronSchulz: I should just let him starve? :-) [22:29:02] so it's dragging vs starving now? [22:29:03] !log aaron synchronized php-1.21wmf12/extensions/Translate 'deployed 0aab5984415a94bd214088767ff1de33ab57c4e0' [22:29:09] Logged the message, Master [22:37:47] New patchset: Pyoungmeister; "correcting inline_template for sanitarium db def" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55425 [22:38:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55425 [22:40:55] New patchset: Pyoungmeister; "also need correct typing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55426 [22:41:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55426 [22:41:45] RECOVERY - Puppet freshness on mw35 is OK: puppet ran at Fri Mar 22 22:41:40 UTC 2013 [22:48:35] New patchset: Pyoungmeister; "invoking mysql::config correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55428 [22:49:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55428 [22:51:03] notpeter: /usr/local/apache/common/wikiversions.dat has references to labs; do you know who uses that file ? [22:51:26] xyzram: I do not [22:51:29] sorry :( [22:51:45] It has a current timestamp on both index boxes. [22:51:49] xyzram: Every request does [22:51:53] Well, not directly [22:52:11] fenari:/h/w/common/wikiversions.dat [22:52:13] wikiversions.cdb is generated from wikiversions.dat , and on each request we check the .cdb to determine which version of MW should be run for the wiki that's being requested [22:52:45] xyzram: That's because the entire wmf-config dir is synced over to the index boxes, right? Or all of MediaWiki? [22:53:02] (The person who wrote the multiversion subsystem is AaronSchulz BTW) [22:53:25] RoanKattouw: But the dat file is handmade ? [22:53:33] -ish [22:53:58] It's human-editable but I think most of the time people use a script that you can tell "change all wikis matching this pattern to this version" [22:55:10] We're seeing a ton of errors in the lucene logs all related to the labs wikis; but none of the config files has any mention of labs at all; so I'm trying to track down how lucene even knows about labs. [22:55:35] <^demon> We need to remove any last vestigates of labswikimedia. [22:55:45] <^demon> If they're still in wikiversions.dat, we should prolly remove them. [22:56:34] de_labswikimedia php-1.21wmf11 * [22:56:41] readerfeedback_labswikimedia php-1.21wmf11 * [22:56:44] can do? want me to? [22:58:01] New patchset: Dzahn; "remove labs wikis from wikiversions.dat" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55430 [22:58:04] there [22:58:15] are they gone from all.dblist too? 
[22:58:30] otherwise sync-wikiversions will error out for sanity [22:58:39] pattern "labs" not found in all.dblist [22:59:11] there is also wikiversions-labs.dat [23:00:15] Looks like all.dblist at one time had these: de_labswikimedia [23:00:15] en_labswikimedia [23:00:15] flaggedrevs_labswikimedia [23:00:15] liquidthreads_labswikimedia [23:00:15] readerfeedback_labswikimedia [23:00:15] deleted.dblist:de_labswikimedia [23:00:15] deleted.dblist:en_labswikimedia [23:00:15] deleted.dblist:flaggedrevs_labswikimedia [23:00:23] they are in deleted.dblist [23:00:33] those are removed from DNS [23:00:42] shall i merge the above to remove from all.dblist? [23:01:10] I don't see them in all.dblist now. [23:01:28] sorry, i meand wikiversions.dat [23:01:30] meant [23:01:30] New patchset: Pyoungmeister; "fixing up a couple of incorrect variable references" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55432 [23:01:36] xyzram: https://gerrit.wikimedia.org/r/#/c/55430/1/wikiversions.dat [23:02:00] I don't see why not [23:02:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55432 [23:02:14] What's the intent of deleted.dblist ? Just a historical record ? [23:02:21] i just don't remember how to create the .cdb from the .dat [23:02:48] <^demon> `sync-wikiversions` [23:03:31] New review: Dzahn; "these have recently been killed and removed from DNS and apparently it caused problems in labs still..." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/55430 [23:03:32] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55430 [23:04:13] !log sync-wikiversions [23:04:17] wikiversions.cdb successfully built. [23:04:18] Copying wikiversions dat and cdb files to apaches... [23:04:19] Logged the message, Master [23:04:31] oh wait, forgot a merge on fenari :p [23:04:37] !log dzahn rebuilt wikiversions.cdb and synchronized wikiversions files: [23:04:43] Logged the message, Master [23:05:46] untracked files in /h/w/common/ ..meh [23:05:56] !log dzahn rebuilt wikiversions.cdb and synchronized wikiversions files: [23:06:02] Logged the message, Master [23:06:04] ok, done [23:06:13] just rsync errors on a single box: mw1209 [23:06:24] mw1209: rsync: mkstemp "/usr/local/apache/common-local/.wikiversions.cdb.AKTF2s" failed: Permission denied (13) [23:06:47] xyzram: did that change things? [23:07:43] 2013-03-22 23:07:24,684 [RMI TCP Connection(7709)-10.0.3.14] WARN org.wikimedia.lsearch.interoperability.RMIMessengerImpl - Error getting snapshot for index readerfeedback_labswikimedia [23:07:58] No change. [23:09:20] :/ i don't know where it would be looking for those snapshots [23:10:01] I know where it's looking, they are not there, the mystery is _why_ it is looking for those. [23:11:42] New patchset: Pyoungmeister; "have to call templates correctly..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55433 [23:12:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55433 [23:13:32] searching through all in /h/w/common in fenari for "labs", i just get "labswiki" in all-labs.dblist, then ones we just talked about in deleted.dblist, wikiversions.dat~ [23:13:45] readerfeedback is only in deleted.dblist .hrm [23:14:37] Yes, that's what I see -- pretty psychic Java code. [23:22:05] New patchset: Pyoungmeister; "oh yeah, need the datadir..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55434 [23:22:37] xyzram: Where are you running that code? Where is it getting wikiversions.cdb from? [23:23:37] searchidx2 and searchidx1001; looks like /usr/local/apache/common has those files [23:24:11] xyzram: Did they actually get updated there? [23:24:20] I think those hosts are in the dsh list [23:24:34] -rw-r--r-- 1 mwdeploy mwdeploy 23418 Mar 22 23:05 /usr/local/apache/common/wikiversions.dat [23:24:34] -rw-r--r-- 1 mwdeploy mwdeploy 79229 Mar 22 23:05 /usr/local/apache/common/wikiversions.cd [23:24:38] New patchset: Pyoungmeister; "oh yeah, need the datadir..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55434 [23:24:42] <^demon> They get updated via cron, not dsh. [23:24:48] Ooooh [23:24:49] <^demon> (Which I think is kinda silly, tbh) [23:26:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55434 [23:31:32] New patchset: Pyoungmeister; "testing multiple mysql instances per host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55439 [23:32:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55439 [23:32:59] !log restarting hadoop setting yarn.nodemanager.resource.memory-mb to 16G [23:33:04] oops, wrong logbot [23:33:05] Logged the message, Master [23:36:06] New patchset: Danny B.; "cswikinews: Set autopatrolled group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55441 [23:47:46] New patchset: Pyoungmeister; "let's see just how abusable puppet is...." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55442 [23:48:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55442 [23:50:31] New patchset: Dzahn; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [23:50:42] New review: Dzahn; "fix 59 x "tab character found"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [23:51:32] New patchset: Pyoungmeister; "insufficiently abusable, it would seem..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55443 [23:52:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55443 [23:53:00] New patchset: Dzahn; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [23:56:31] New patchset: Dzahn; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026
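The sweep described above, making sure nothing deployed still references the deleted labswikimedia wikis and then pushing the rebuilt wikiversions files, boils down to roughly this on the deploy host. The /home/wikipedia/common path is the expanded form of the fenari /h/w/common path mentioned earlier, and the sync message is only an example:

    # look for stragglers in the deployed config, then rebuild and push
    # wikiversions (the search index hosts pick the files up via cron)
    grep -rn 'labswikimedia' /home/wikipedia/common/*.dblist \
        /home/wikipedia/common/wikiversions.dat
    sync-wikiversions 'remove deleted labswikimedia wikis'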