[00:14:46] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[00:17:46] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[01:02:27] <icinga-wm>	 PROBLEM - MySQL Processlist on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:02:27] <icinga-wm>	 PROBLEM - MySQL InnoDB on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:02:36] <icinga-wm>	 PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:02:36] <icinga-wm>	 PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:02:36] <icinga-wm>	 PROBLEM - MySQL Recent Restart on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:03:17] <icinga-wm>	 RECOVERY - MySQL InnoDB on db1007 is OK: OK longest blocking idle transaction sleeps for 0 seconds  
[01:03:17] <icinga-wm>	 RECOVERY - MySQL Processlist on db1007 is OK: OK 0 unauthenticated, 0 locked, 5 copy to table, 0 statistics  
[01:03:26] <icinga-wm>	 RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds  
[01:03:26] <icinga-wm>	 RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay -1 seconds  
[01:03:27] <icinga-wm>	 RECOVERY - MySQL Recent Restart on db1007 is OK: OK 2244372 seconds since restart  
[01:04:46] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[01:08:25] <MaxSem>	 !log db1007 is having tough times due to special page updates
[01:08:42] <morebots>	 Logged the message, Master
[01:13:27] <icinga-wm>	 PROBLEM - MySQL InnoDB on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:13:27] <icinga-wm>	 PROBLEM - MySQL Processlist on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:14:26] <icinga-wm>	 RECOVERY - MySQL Processlist on db1007 is OK: OK 0 unauthenticated, 0 locked, 4 copy to table, 0 statistics  
[01:14:26] <icinga-wm>	 RECOVERY - MySQL InnoDB on db1007 is OK: OK longest blocking idle transaction sleeps for 0 seconds  
[01:16:36] <icinga-wm>	 PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:17:36] <icinga-wm>	 RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 19 seconds  
[01:18:57] <MaxSem>	 !log Killed a few long queries on db1007
[01:19:12] <morebots>	 Logged the message, Master
[01:20:27] <icinga-wm>	 PROBLEM - MySQL InnoDB on db1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[01:21:16] <icinga-wm>	 RECOVERY - MySQL InnoDB on db1007 is OK: OK longest blocking idle transaction sleeps for 0 seconds  
[01:21:27] <MaxSem>	 fuck
[01:22:12] <greg-g>	 MaxSem: did you save what the queries were?
[01:22:19] <MaxSem>	 yup
[01:22:26] <greg-g>	 I might have to run in a second
[01:22:28] <greg-g>	 good deal :)
[01:22:48] <MaxSem>	 Was about to paste them into email when noticed that there's another wikis that are slamming it now:P
[01:23:10] <greg-g>	 :/
[01:23:20] <greg-g>	 need any help?
[01:23:57] <MaxSem>	 yer any good with linux? how do I look up a process' args?
[01:24:04] <MaxSem>	 (assuming I can sudo as it)
[01:24:45] <greg-g>	 ps aux | grep someprocess
[01:25:09] <greg-g>	 I have to run, ping people if needed :)
[01:26:02] <MaxSem>	 howly fuck
[01:26:24] <MaxSem>	 even if I go on rampage killing those processes it'll just spawn moar
[01:32:32] <Reedy>	 Is it the cronjob/loop?
[01:33:15] <Reedy>	 MaxSem: ^
[01:33:20] <MaxSem>	 yup
[01:33:40] <MaxSem>	 aka mwdeploy 29076  0.0  0.0   4400   616 ?        Ss   01:00   0:00 /bin/sh -c /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Wantedpages > /home/mwdeploy/updateSpecialPages/s7@17-WantedPages.log 2>&1
[01:33:57] <Reedy>	 Killed on terbium
[01:34:16] <Reedy>	 Oh, there's moar
[01:35:18] <Reedy>	 Now they're gone
[01:35:27] <Reedy>	 reedy@terbium:~$ ps aux | grep -i special
[01:35:27] <Reedy>	 reedy     4536  0.0  0.0   9384   928 pts/4    S+   01:35   0:00 grep --color=auto -i special
[01:35:27] <Reedy>	 reedy@terbium:~$
[01:35:34] <MaxSem>	 yup
[01:35:43] <Reedy>	 !log Killed updateSpecialPages and related processes on terbium
[01:35:46] <Reedy>	 They need disabling
[01:35:53] <Reedy>	 We can't keep this happening
[01:36:00] <morebots>	 Logged the message, Master
[01:36:04] <Reedy>	 But they were fine using the "idle" tampa slaves
[01:37:37] <MaxSem>	 and I killed the remaining queries
[01:38:00] <Reedy>	 Wait for icinga-wm to catch up then
[01:38:12] * Reedy  blames Nemo_bis
[01:38:47] <MaxSem>	 also, like 5-6 queries were filesorting the page table on large wikis at the same time - probably they would've been faster if they weren't run in parallel
[01:40:50] <MaxSem>	 hmm - lag is zero but still there are no threads as if LB considered it deadly lagged
[01:41:06] <Reedy>	 Why were there so many processes running simultaneously?
[01:41:33] <MaxSem>	 to DOS it properly?XD
[01:41:36] <Reedy>	 Which I guess is the real issue
[01:41:48] <Reedy>	 mwdeploy 29092  0.0  0.0  12308  1492 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Mostlinked
[01:41:48] <Reedy>	 mwdeploy 29094  0.0  0.0  12308  1496 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Mostrevisions
[01:41:48] <Reedy>	 mwdeploy 29096  0.0  0.0  12308  1492 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Wantedpages
[01:41:48] <Reedy>	 mwdeploy 29104  0.0  0.0  12308  1496 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Fewestrevisions
[01:41:48] <Reedy>	 mwdeploy 29105  0.0  0.0  12308  1496 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Deadendpages
[01:41:50] <Reedy>	 mwdeploy 29132  0.0  0.0  12308  1492 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscriptwikiset updateSpecialPages.php s7.dblist --override --only=Ancientpages
[01:41:52] <Reedy>	 etc
[01:41:56] <MaxSem>	 yup
[01:42:07] <Reedy>	 Need staggering
[01:42:39] <Reedy>	 Noting that paste above was around 25-30% of the processes killed
[01:42:42] <MaxSem>	 we can easily run one of these per DB cluster
[01:44:11] <Reedy>	 Now, when are the next runs actually scheduled....
[01:45:46] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[01:46:05] <Reedy>	 Hmm
[01:46:16] <Reedy>	 Those s7 ones won't run till 17th December
[01:46:27] <MaxSem>	 *bzzt* Previous update run was unsuccessful, re-running after a 15 minutes wait.
[01:46:31] <MaxSem>	 :P
[01:46:46] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[01:47:41] <MaxSem>	 icinga-wm, I hate you too
[01:48:17] <UForgotten>	 don't shoot the messenger ;)
[01:48:47] <MaxSem>	 yeah?
[01:48:53] <MaxSem>	 what about my sleep?
[01:49:16] <MaxSem>	 sspeaking of sleep...
[01:49:18] * MaxSem  is going to bed
[01:49:28] <Reedy>	 The Australians should be awake ;)
[01:50:54] <MaxSem>	 not everyone is a fucking loser like me and you who's not having a week end in addition to not having night sleep:P
[01:51:03] <MaxSem>	 good night;)
[01:52:34] <Reedy>	 MaxSem: We can pipe icinga-wm to your mobile via SMS if you prefer
[02:02:02] <logmsgbot>	 !log LocalisationUpdate completed (1.23wmf3) at Sun Nov 17 02:02:01 UTC 2013
[02:02:15] <morebots>	 Logged the message, Master
[02:02:55] <logmsgbot>	 !log LocalisationUpdate completed (1.23wmf4) at Sun Nov 17 02:02:54 UTC 2013
[02:03:14] <morebots>	 Logged the message, Master
[02:08:53] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Nov 17 02:08:53 UTC 2013
[02:09:08] <morebots>	 Logged the message, Master
[03:07:54] <grrrit-wm>	 (03CR) 10TTO: "Please see I028589438e502bc1ca30f0148e71b706656331c4, of which this change is partly duplicative." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder)
[03:15:46] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[03:18:46] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[04:05:46] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[04:46:47] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[04:47:46] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[06:16:46] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[06:19:46] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[06:27:16] <icinga-wm>	 PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log,  have not been written in a critical amount of time.  For most logs, this is 4 hours.  For slow logs, this is 4 days.  
[06:29:16] <icinga-wm>	 RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active  
[07:06:46] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[07:47:46] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[07:48:46] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[07:56:42] <Nemo_bis>	 Reedy: they just need to be resorted
[08:01:55] <Nemo_bis>	 (commented on the bug)
[08:51:35] <grrrit-wm>	 (03PS1) 10Odder: (bug 56334) Namespace l20n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 
[08:54:31] <Nemo_bis>	 twkozlowski: 20?
[09:09:54] <grrrit-wm>	 (03CR) 10TTO: "Nice work." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder)
[09:11:22] <grrrit-wm>	 (03CR) 10Odder: "My apologies; I didn't realise you were working on $wgSitename, too." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder)
[09:12:39] <grrrit-wm>	 (03CR) 10Odder: "Sorry? l20 stands for 'localization' and I had to use an abbreviation to fit in 62 characters :-)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder)
[09:17:46] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[09:18:04] <grrrit-wm>	 (03PS5) 10TTO: Clean up wgSiteName in InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86418 
[09:20:00] <grrrit-wm>	 (03CR) 10TTO: "That would be l10n: I thought you had made a typo and was making a silly joke." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder)
[09:20:46] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[09:26:01] <grrrit-wm>	 (03PS2) 10Odder: (bug 56334) Namespace l20n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 
[09:38:12] <grrrit-wm>	 (03PS2) 10Odder: (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 
[10:07:46] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[10:48:46] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[10:49:46] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[12:18:46] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[12:21:46] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[12:40:59] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Take mark off the SMS list [operations/puppet] - 10https://gerrit.wikimedia.org/r/95859 
[12:42:35] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Take mark off the SMS list [operations/puppet] - 10https://gerrit.wikimedia.org/r/95859 (owner: 10Mark Bergsma)
[13:08:06] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[13:49:06] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[13:50:06] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[15:19:06] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[15:22:06] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[16:09:06] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[16:50:06] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[16:51:06] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[17:11:56] <twkozlowski>	 I've been getting quite a few HTTP 503 errors for some time while trying to add some categories with pywikipediabot
[17:17:29] <Steinsplitter>	 twkozlowski: https://bugzilla.wikimedia.org/show_bug.cgi?id=55219?
[17:18:26] <Steinsplitter>	 hm... 503 Service unavailable  :/
[17:28:01] <grrrit-wm>	 (03PS1) 10MaxSem: Serialize special page updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/95876 
[17:36:25] <matanya>	 paravoid: are you available?
[18:20:06] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[18:23:06] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[18:53:03] <twkozlowski>	 Hm, Wikisource is awfully slow for me right now.
[18:59:16] <Nemo_bis>	 twkozlowski: how about Commons? I've been uploading a 5 MB file for what looks like several minutes now
[19:00:01] <twkozlowski>	 Nemo_bis: Same.
[19:04:53] <twkozlowski>	 Also, I'm seeing blank pages after page loads.
[19:10:06] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[19:10:38] <Nemo_bis>	 you're right
[19:11:00] <Nemo_bis>	 !log The 5xx errors mega-spikes every ten minutes are back
[19:11:15] <morebots>	 Logged the message, Master
[19:12:56] <Nemo_bis>	 !log 20 % packet loss from Toolserver to bits-lb.esams.wikimedia.org
[19:13:09] <Nemo_bis>	 (that's usually not a good sign)
[19:13:10] <morebots>	 Logged the message, Master
[19:14:44] * MatmaRex  np U2 - Sunday Bloody Sunday
[19:14:54] <Nemo_bis>	 heh
[19:16:37] <Nemo_bis>	 icinga-wm is on the spot as always
[19:32:51] <jeremyb>	 Nemo_bis: i see similar issues from my local connection and from pmtpa. only the last hop though. so i guess broken internally @ esams
[19:32:54] <jeremyb>	 :(
[19:33:38] <jeremyb>	 i'm getting ~98% loss from here and ~15% from pmtpa
[19:34:02] <jeremyb>	 but it's taking different routes
[19:35:37] <icinga-wm>	 PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , enwiki (259877), Total (278383)  
[19:35:37] <icinga-wm>	 PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , enwiki (259877), Total (278383)  
[19:35:45] <jeremyb>	 to be clear, to the same hostname Nemo_bis was trying. bits-lb.esams.wikimedia.org
[19:36:15] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/95884 
[19:36:36] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/95884 (owner: 10Faidon Liambotis)
[19:37:27] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/95884 
[19:37:55] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/95884 (owner: 10Faidon Liambotis)
[19:39:00] <Nemo_bis>	 jeremyb: yes, mtr gives me 14 % packet loss at last hop to bits-lb.esams.wikimedia.org after 1000 attempts
[19:39:19] <Nemo_bis>	 (from Milan)
[19:39:42] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Fix TTL for gerrit's AAAA to match A [operations/dns] - 10https://gerrit.wikimedia.org/r/95885 
[19:40:00] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Fix TTL for gerrit's AAAA to match A [operations/dns] - 10https://gerrit.wikimedia.org/r/95885 (owner: 10Faidon Liambotis)
[19:40:40] <matanya>	 paravoid: please ping me when you are available. I'd like to talk about some design changes i need you wizdom about
[19:40:48] <paravoid>	 I am here but dealing with the outage
[19:41:08] <paravoid>	 partial outage, but still
[19:41:15] <Nemo_bis>	 yes we noticed :) thanks
[19:41:26] <Aaron|home>	 another outage?
[19:41:27] <matanya>	 when ever you have time. not urgent.
[19:41:33] <jeremyb>	 Aaron|home: see above
[19:41:34] <paravoid>	 Aaron|home: yeah...
[19:41:39] <jeremyb>	 danke paravoid
[19:45:48] <Aaron|home>	 aude: so exactly do end up trying to get output for the same rev ID and content but where the output needs to be different than the first parse that was just done?
[19:46:51] <jeremyb>	 Aaron|home: i wonder if there's a word missing from that sentence?
[19:47:46] <Aaron|home>	 *how exactly
[19:51:06] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[19:52:06] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[20:01:40] <grrrit-wm>	 (03PS1) 10Nemo bis: Make the monthly querypages updates not hit each cluster on the same day [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 
[20:02:26] <grrrit-wm>	 (03CR) 10Nemo bis: "As said on the bug this morning, I think Ib4c84101c7f04e8b0a96f4c05891f4d1b40154be will be more effective." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95876 (owner: 10MaxSem)
[20:03:00] <grrrit-wm>	 (03CR) 10Reedy: [C: 031] "LGTM as an improvement over spawning them all at the same time!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95876 (owner: 10MaxSem)
[20:04:14] <MaxSem>	 Nemo_bis, as a result you're running the same page update on the whole cluster at the same time
[20:04:24] <grrrit-wm>	 (03CR) 10Dereckson: "What about the $wmf variable in docroot/noc/db.php ($wmf = wmfClusters())?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93)
[20:04:34] <MaxSem>	 might not be the best idea
[20:04:58] <Nemo_bis>	 MaxSem: why not?
[20:05:09] <Reedy>	 => rand9)
[20:05:12] <Reedy>	 => rand()
[20:05:53] <Nemo_bis>	 rather, I suspect I may be adding duplicate cronjobs which make puppet cry
[20:06:45] <Reedy>	 You mean rather than them being updated?
[20:06:52] <Reedy>	 We should make a game of site outage or special page updates
[20:06:59] <Nemo_bis>	 ^^
[20:08:35] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Switch bits-lb to eqiad (away from esams) [operations/dns] - 10https://gerrit.wikimedia.org/r/95890 
[20:08:54] <Reedy>	 gotta shoot them all
[20:09:14] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Switch bits-lb to eqiad (away from esams) [operations/dns] - 10https://gerrit.wikimedia.org/r/95890 (owner: 10Faidon Liambotis)
[20:09:25] <matanya>	 Reedy: whey did you need unzip on tin?
[20:09:37] <Reedy>	 because someone gave me a zip file
[20:09:46] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Switch bits-lb to eqiad (away from esams) [operations/dns] - 10https://gerrit.wikimedia.org/r/95890 
[20:10:20] <Reedy>	 tin also felt left out as fenari already had it
[20:10:27] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Switch bits-lb to eqiad (away from esams) [operations/dns] - 10https://gerrit.wikimedia.org/r/95890 (owner: 10Faidon Liambotis)
[20:11:06] <paravoid>	 *sigh*
[20:11:13] <paravoid>	 that seems to do it
[20:11:34] <paravoid>	 back to just 98% CPU instead of pegged to 100%
[20:11:42] <matanya>	 Reedy: I'd like to add it to a role, so it won't be just thrown there, any idea which?
[20:11:45] <Nemo_bis>	 my ping looks better
[20:11:57] <paravoid>	 yup
[20:12:19] <paravoid>	 funny how that happened the day mark leaves :)
[20:13:03] <Nemo_bis>	 leaves what :O
[20:13:07] <MaxSem>	 did bits cache rate go down?
[20:13:08] <paravoid>	 (for vacation)
[20:13:13] <Nemo_bis>	 ah
[20:13:16] * Nemo_bis  phews
[20:13:17] <paravoid>	 MaxSem: I switched bits to eqiad
[20:13:39] <MaxSem>	 just wondering what happened
[20:14:09] <paravoid>	 the traffic increased organically, the esams LVS box's CPU had trouble keeping up with the load
[20:14:31] <paravoid>	 packets were delayed and some of them lost
[20:15:26] <MaxSem>	 mhm. having only the topmost architect as a DC technician is bound to cause problems:P
[20:15:38] <paravoid>	 we have smarthands
[20:16:54] <matanya>	 looks good from here too
[20:16:56] <MaxSem>	 can we count on them to install a bunch of new servers?;)
[20:17:11] <paravoid>	 it's a little more complicated than that
[20:17:36] <paravoid>	 the CPU load is unbalanced between the CPUs, I experimented a bit with tuning RPS settings
[20:17:41] <paravoid>	 with not much luck
[20:17:53] <paravoid>	 maybe using the second ethernet card would make a difference
[20:18:01] <paravoid>	 I can also use some of the other lvs boxes
[20:19:56] <paravoid>	 it's messy with puppet though, since either I'd need to rework on it a lot, or shuffle traffic around for eqiad too (which it's not a very good idea)
[20:26:39] <paravoid>	 http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=amslvs1.esams.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=network_report&c=LVS+loadbalancers+esams
[20:26:42] <paravoid>	 bits
[20:30:27] <grrrit-wm>	 (03CR) 10Nemo bis: "Quick math to avoid duplicate cronjobs panic (Ib4c84101): we run one page per cluster in each cron, 6*7=42; we call updatequerypages::cron" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 (owner: 10Nemo bis)
[20:31:08] <grrrit-wm>	 (03CR) 10Nemo bis: "* I0a5d8603" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 (owner: 10Nemo bis)
[20:32:22] <paravoid>	 Nemo_bis: btw, thanks for pointing out both the 5xx spike and the packet loss
[20:32:43] <paravoid>	 you did half the investigation, when I jumped in I found the root cause very quickly because of that
[20:32:46] <Nemo_bis>	 you're welcome; it was twkozlowski making me check it though :)
[20:33:14] <Nemo_bis>	 and I remembered what to check because that's what you had done last time, so
[20:33:41] * Nemo_bis  is just shell/script monkey as usual
[20:34:58] <Nemo_bis>	 paravoid: you may want to !log something if the crisis is over :)
[21:21:06] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours  
[21:24:06] <icinga-wm>	 PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours  
[21:46:30] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Switch European bits-lb & mobile-lb back to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/95950 
[21:46:56] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Switch European bits-lb & mobile-lb back to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/95950 (owner: 10Faidon Liambotis)
[21:55:43] <paravoid>	 damn
[21:55:46] <paravoid>	 peak hours are over
[21:55:51] <paravoid>	 I can't test my change :)
[22:11:06] <icinga-wm>	 PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours  
[22:52:06] <icinga-wm>	 PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours  
[22:53:06] <icinga-wm>	 PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours  
[23:11:18] <grrrit-wm>	 (03CR) 10Twotwotwo: "Tried and failed to submit this as a comment on gerrit, so here it is over e-mail:" [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63139 (owner: 10Twotwotwo)
[23:11:44] <grrrit-wm>	 (03CR) 10Twotwotwo: "Except obviously the "Tried and failed to submit this in Gerrit" tag on the comment is no longer accurate. :)" [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63139 (owner: 10Twotwotwo)
[23:26:11] <grrrit-wm>	 (03CR) 10MZMcBride: Make the monthly querypages updates not hit each cluster on the same day (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 (owner: 10Nemo bis)
[23:28:25] <grrrit-wm>	 (03CR) 10Nemo bis: Make the monthly querypages updates not hit each cluster on the same day (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 (owner: 10Nemo bis)
[23:36:20] <grrrit-wm>	 (03PS1) 10Tim Starling: Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 
[23:40:23] <grrrit-wm>	 (03CR) 10Springle: Make the monthly querypages updates not hit each cluster on the same day (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 (owner: 10Nemo bis)
[23:48:04] <grrrit-wm>	 (03PS1) 10Cmjohnson: fixing netboot.cfg for elastic [operations/puppet] - 10https://gerrit.wikimedia.org/r/95960 
[23:49:32] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] fixing netboot.cfg for elastic [operations/puppet] - 10https://gerrit.wikimedia.org/r/95960 (owner: 10Cmjohnson)
[23:51:56] <grrrit-wm>	 (03PS2) 10Nemo bis: Make the monthly querypages updates not hit each cluster on the same day [operations/puppet] - 10https://gerrit.wikimedia.org/r/95889 
[23:56:23] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Replace Linux RPS setting with a smart mechanism [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 
[23:56:25] <paravoid>	 ...my sunday evening...
[23:57:06] <springle>	 :)
[23:58:16] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Replace Linux RPS setting with a smart mechanism [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis)
[23:58:25] <paravoid>	 bleh